Environment

The robotic board game environment simulates a board game in which robots move around the board and deliver mail to their destinations. The environment is parameterized; the details are given below.

What does it look like?

Below is a rendered example of a game between four players, each with 2 robots. One move consumes one unit of battery. Each robot can make at most one move per turn.

Robotic board game example.

Game rule

The board game is inspired by real-world applications in which robots pick up mail and deliver it to a workstation. On their own turn, a player moves a robot a number of cells (the number is set by a parameter) and tries to deliver the required number of mails as soon as possible.

The following table describes the cells of the board on which players move their robots.

Cell description.
Color of the cell	Description
White	At the beginning of the game, robots are placed randomly on white cells.
Gray	A robot can move through a gray cell. Gray cells have no special properties.
Red	A robot cannot move to a red cell.
Yellow	Each yellow cell has a number written on it. Only a robot carrying the mail with the corresponding index can move to a yellow cell and drop off the mail. After dropping off the mail, the robot must leave the cell on its next turn.
Green	A robot can pick up a mail in a green cell. Only a robot without mail can move to a green cell. After picking up a mail, the robot must leave the cell on its next turn.
Blue	A robot can be charged in a blue cell. For every five steps, the robot consumes 1 battery unit. A fully charged robot has N = 10 battery units. A charging robot receives 1 battery unit for every step another robot takes. At the beginning of the game, all robots are fully charged. Only a robot with low battery can move to a blue cell. After charging to a high battery level, the robot must leave the cell on its next turn.
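The entry rules in the table can be summarized as a small predicate. The following is only a sketch, not the environment's actual implementation: the `Robot` class, the `can_enter` function and the `LOW_BATTERY` threshold are hypothetical (the table does not say what counts as "low" battery), and the color codes are the ones used in the board layout files. Cell occupancy and the "must leave on the next turn" rules are not modeled.

```python
from dataclasses import dataclass
from typing import Optional

# Assumption: the documentation does not define "low battery";
# this threshold is chosen only for illustration.
LOW_BATTERY = 3


@dataclass
class Robot:
    """Hypothetical robot state used only in this sketch."""
    mail: Optional[int] = None  # index of the carried mail, None if empty
    battery: int = 10           # a fully charged robot has 10 units


def can_enter(color: str, robot: Robot, target: int = 0) -> bool:
    """Return True if the robot may move onto a cell of the given color.

    `target` is the number written on a yellow cell (0 for other cells).
    """
    if color == "r":    # red: never enterable
        return False
    if color == "y":    # yellow: only with the matching mail
        return robot.mail is not None and robot.mail == target
    if color == "gr":   # green: only without mail
        return robot.mail is None
    if color == "b":    # blue: only with low battery
        return robot.battery <= LOW_BATTERY
    return True         # white and gray cells have no restriction
```

For example, under these assumptions `can_enter("y", Robot(mail=4), target=4)` is true, while a fully charged robot is refused entry to a blue cell.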

Game parameters

The robotic board game is parameterized by:

The layout of the board.
The required number of mails.
The number of players.
The number of robots per player.
Whether the robot battery is considered.
Whether a robot can move at most 1 step per turn or the number of steps is set randomly.
The maximum number of steps the environment (our game) can reach.

The layout of the board is set with .csv files, which are passed as arguments to the game constructor. Examples of these files are as follows:

Board layout example.

Type map	Configuration	Caption

Color Map

b,g,y,g,y,g,y,g,b
g,g,g,g,g,g,g,g,g
y,g,w,w,w,w,w,g,y
g,g,w,w,w,w,w,g,g
y,g,w,w,w,w,w,g,y
g,g,w,w,w,w,w,g,g
y,g,w,w,w,w,w,g,y
g,g,gr,g,gr,g,gr,g,g
g,g,r,g,r,g,r,g,g

“b” - blue
“r” - red
“w” - white
“g” - gray
“gr” - green
“y” - yellow

Target Map

0,0,4,0,7,0,5,0,0
0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,6
0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,8
0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,9
0,0,0,0,0,0,0,0,0
0,0,0,0,0,0,0,0,0

A non-zero number corresponds to the index of the mail that the location receives.
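To see how the two maps fit together, the sketch below parses the example files with Python's standard `csv` module and pairs every yellow cell with its mail index. It is an illustration only and does not use the `rbgame` package; the helper `read_grid` and the `drop_offs` dictionary are hypothetical names, and positions are given as (row, column) in file order, which may differ from the coordinate convention used by the environment.

```python
import csv
import io

# The example color map and target map from above, inlined as text.
COLORS_CSV = """\
b,g,y,g,y,g,y,g,b
g,g,g,g,g,g,g,g,g
y,g,w,w,w,w,w,g,y
g,g,w,w,w,w,w,g,g
y,g,w,w,w,w,w,g,y
g,g,w,w,w,w,w,g,g
y,g,w,w,w,w,w,g,y
g,g,gr,g,gr,g,gr,g,g
g,g,r,g,r,g,r,g,g
"""

TARGETS_CSV = """\
0,0,4,0,7,0,5,0,0
0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,6
0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,8
0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,9
0,0,0,0,0,0,0,0,0
0,0,0,0,0,0,0,0,0
"""


def read_grid(text):
    """Parse CSV text into a list of rows, skipping empty lines."""
    return [row for row in csv.reader(io.StringIO(text)) if row]


colors = read_grid(COLORS_CSV)
targets = [[int(value) for value in row] for row in read_grid(TARGETS_CSV)]

# Map each mail index to the (row, column) of the yellow cell receiving it.
drop_offs = {
    targets[r][c]: (r, c)
    for r, row in enumerate(colors)
    for c, color in enumerate(row)
    if color == "y"
}

for mail_index in sorted(drop_offs):
    print(mail_index, drop_offs[mail_index])
```

In this example every yellow cell carries a distinct non-zero index from 1 to 9, and every other cell has target 0.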

Other parameters can be passed directly to the constructor. For example:

from rbgame.game.game import RoboticBoardGame
game = RoboticBoardGame(
    colors_map='assets/csv_files/colors_map.csv',
    targets_map='assets/csv_files/targets_map.csv',
    # 10 mails to win
    required_mail=10,
    # red, blue, green and purple players
    robot_colors=['r', 'b', 'gr', 'p'],
    # each player has 2 robots
    num_robots_per_player=2,
    # battery is considered
    with_battery=True,
    # each robot can move at most one step per turn
    random_num_steps=False,
    # maximum environment step
    max_step=1000,
)

For more details, please access the API reference.

Observation space

The observation is a dictionary with two elements: 'observation', which is the usual reinforcement learning observation described below, and 'action_mask', which holds the legal moves and is described in the Legal Actions Mask section.

The observation of a single robot is a vector of size 4. It contains, in order, the x-coordinate, the y-coordinate, the mail index and the battery of that robot. All components are normalized before being passed to neural networks. The observations of all robots are concatenated to form the main observation. For each agent, the observation of the robot being controlled is placed first in the main observation vector, i.e. the first four components of the main observation are the observation of the controlled robot. This makes self-play possible: one agent can play as all robots because it always controls the robot described by the first four features of the main observation. In addition, it allows an agent to learn not only from its own transitions but also from the transitions of other agents.

For example, let \((x_i, y_i)\), \(m_i\) and \(b_i\) denote the coordinates, mail index and battery of the \(i\)-th robot respectively. For the agent that controls the first robot, the environment provides the vector:

\[\begin{split}\vec{o} = \begin{pmatrix} x_1 \\ y_1 \\ m_1 \\ b_1 \\ x_2 \\ y_2 \\ m_2 \\ b_2 \\ x_3 \\ y_3 \\ m_3 \\ b_3 \\ ... \end{pmatrix}\end{split}\]

For the agent that controls the second robot:

\[\begin{split}\vec{o} = \begin{pmatrix} x_2 \\ y_2 \\ m_2 \\ b_2 \\ x_1 \\ y_1 \\ m_1 \\ b_1 \\ x_3 \\ y_3 \\ m_3 \\ b_3 \\ ... \end{pmatrix}\end{split}\]

For the agent that controls the third robot:

\[\begin{split}\vec{o} = \begin{pmatrix} x_3 \\ y_3 \\ m_3 \\ b_3 \\ x_1 \\ y_1 \\ m_1 \\ b_1 \\ x_2 \\ y_2 \\ m_2 \\ b_2 \\ ... \end{pmatrix}\end{split}\]

and so on.
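The reordering shown in the vectors above can be sketched in a few lines. This is an illustration under stated assumptions, not the environment's code: `agent_observation` is a hypothetical helper, robots are indexed from 0, and the per-robot features are assumed to be already normalized, since the normalization scheme is not specified here.

```python
from typing import List, Sequence


def agent_observation(
    robot_features: Sequence[Sequence[float]], controlled: int
) -> List[float]:
    """Concatenate per-robot features with the controlled robot first.

    robot_features[i] is the (x, y, mail index, battery) vector of robot i.
    The remaining robots keep their original relative order.
    """
    order = [controlled] + [
        i for i in range(len(robot_features)) if i != controlled
    ]
    return [value for i in order for value in robot_features[i]]


# Three robots with made-up, already normalized features.
features = [
    [0.1, 0.2, 0.0, 1.0],  # robot 1
    [0.3, 0.4, 0.5, 0.9],  # robot 2
    [0.6, 0.7, 0.0, 0.8],  # robot 3
]

# Observation for the agent controlling the third robot (index 2):
# robot 3 comes first, followed by robots 1 and 2.
obs_third = agent_observation(features, controlled=2)
```

Because the controlled robot always occupies the first four components, the same policy network can be reused for every robot.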

Legal actions mask

The legal moves available to the current agent are found in the 'action_mask' element of the dictionary observation. The 'action_mask' value is a binary vector where each element of the vector represents whether the action is legal or not.
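A typical use of the mask is to restrict action selection to legal moves. The sketch below picks a uniformly random legal action from a dictionary observation; the function `sample_legal_action` is a hypothetical helper, and the example observation is made up rather than produced by the environment.

```python
import random
from typing import Dict, Optional, Sequence


def sample_legal_action(
    obs: Dict[str, Sequence], rng: Optional[random.Random] = None
) -> int:
    """Return a random action whose entry in obs['action_mask'] is 1."""
    rng = rng if rng is not None else random.Random()
    legal = [action for action, ok in enumerate(obs["action_mask"]) if ok]
    if not legal:
        raise ValueError("action_mask contains no legal action")
    return rng.choice(legal)


# Made-up observation: only actions 0 and 3 are legal.
example_obs = {
    "observation": [0.0] * 8,
    "action_mask": [1, 0, 0, 1, 0],
}
action = sample_legal_action(example_obs, random.Random(0))
```

A learned policy would apply the same idea by masking out the scores of illegal actions before choosing the best one.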

Action space

In this simulation, 5 discrete actions are available for each robot:

Possible actions.
Action ID	Action
0	Stand still. Charge if possible.
1	Move forward. Pick up or drop off if possible.
2	Move backward. Pick up or drop off if possible.
3	Move left. Pick up or drop off if possible.
4	Move right. Pick up or drop off if possible.
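When writing agents, it can help to give the action IDs readable names. The enumeration below is only a convenience sketch: the member names are hypothetical and not part of the package, and only the integer values come from the table above.

```python
from enum import IntEnum


class Action(IntEnum):
    """Readable names for the five discrete action IDs (names are illustrative)."""
    STAY = 0      # stand still, charge if possible
    FORWARD = 1   # move forward, pick up or drop off if possible
    BACKWARD = 2  # move backward, pick up or drop off if possible
    LEFT = 3      # move left, pick up or drop off if possible
    RIGHT = 4     # move right, pick up or drop off if possible


# IntEnum members compare equal to their integer IDs,
# so they can be passed wherever an action ID is expected.
chosen = Action.LEFT
```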

Reward

Even with the simplest parameters, the agent will not learn anything if the reward is too sparse, so let us use curriculum learning. Our reward system can be defined as follows:

Pick up a mail = 1
Drop off a mail = 5
Go to blue cell to charge = 1
Otherwise = -0.1, to encourage the agent to finish as soon as possible.
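The reward values listed above can be written as a lookup table. This is a minimal sketch, assuming the environment reports what happened in a step as one of the hypothetical event names `"pick_up"`, `"drop_off"` or `"charge"` (or `None` for any other step); the actual implementation may compute rewards differently.

```python
from typing import Optional

# Reward values from the list above; the event names are illustrative.
REWARDS = {
    "pick_up": 1.0,   # pick up a mail
    "drop_off": 5.0,  # drop off a mail
    "charge": 1.0,    # go to a blue cell to charge
}
STEP_PENALTY = -0.1   # every other step


def reward_for(event: Optional[str]) -> float:
    """Return the reward for one step given the event that occurred."""
    return REWARDS.get(event, STEP_PENALTY)


# A robot picks up a mail, takes two plain steps, then drops the mail off.
episode = ["pick_up", None, None, "drop_off"]
total = sum(reward_for(event) for event in episode)
```

For this short sequence the return is 1 - 0.1 - 0.1 + 5 = 5.8, so a shorter route between pick-up and drop-off yields a higher total.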

For more details, please access the API reference.