
Reinforcement learning in swarm-robotics for multi-agent foraging-task domain

Yogeswaran M.1, Ponnambalam S.G.2, Kanagaraj G.3

School of Engineering, Monash University Sunway Campus

Jalan Lagoon Selatan, Bandar Sunway, 46150, Selangor Darul Ehsan, Malaysia

Email: [email protected], [email protected], [email protected]

Abstract—The main focus of this paper is to study and develop an efficient learning policy to address the exploration vs. exploitation dilemma in a multi-objective foraging task in the swarm robotics domain. An efficient learning policy called FIFO-list is proposed to tackle this problem. The proposed FIFO-list is a model-based learning policy which can attain near-optimal solutions. In FIFO-list, the swarm robots maintain a dynamic list of recently visited states. States included in the list are barred from exploration by the swarm robots regardless of the Q(s, a) values associated with those states. The list is updated based on the First-In-First-Out (FIFO) rule, meaning the states that enter the list first exit the list first. A recently visited state remains in the list for a dynamic number of time-steps, determined by the desirability of the visited states.

Keywords-swarm robotics; swarm intelligence; reinforcement learning; foraging; Q-learning

I. INTRODUCTION

Swarm robotics is the study of how a large number of relatively simple, physically embodied robots can be designed such that a desired collective behavior emerges from the local interactions among the robots and the environment. It is inspired by the observation of social insects such as ants, termites, wasps and bees, which stand as fascinating examples of how a large number of simple individuals can interact to create collectively intelligent systems [1]. Swarm robotic systems have been a major research area for the past three decades.

The foraging task is one of the most widely used testbeds in the study of swarm robotics, as it has the closest ties to biological swarms. Foraging is the obvious task undertaken to demonstrate swarm intelligence for cooperative robotics [2]. Foraging can be defined as the act of searching for food or prey and collecting it to the nest, a concept borrowed from biological swarms. In the study of foraging swarm robots, food or prey is normally referred to as pucks and the nest as home. Foraging robots may take the form of single robots operating individually, or a swarm of robots operating collectively [3]. In swarm robotics, foraging is important for several reasons. First, it is a metaphor for a broad class of problems integrating exploration, navigation, object identification, manipulation and transport. Second, the foraging task is a canonical problem for the study of cooperative robotics.

Reinforcement learning (RL) is a machine learning technique in which a robot learns through trial and error to maximize the rewards received for taking particular actions in particular states over an extended period of time [4]. A reward signal is given to an RL robot to indicate the desirability of the action taken in the perceived state. The RL robot's choice of actions in response to states is called its learning policy [5]. A great deal of research has been carried out in RL using swarm robotic systems, applied to many domains, namely foraging, soccer, prey-pursuing and box-pushing.

In this paper, the modeled foraging task has multiple objectives. The swarm robots are expected to search for pucks in the environment and retrieve them to the home location while minimizing collisions, maximizing the number of pucks collected and maximizing the amount of exploration done in the environment within a fixed time frame. Using this setup, empirical analyses of well-known learning policies were conducted. An efficient and improved learning policy is also proposed to further enhance the learning efficiency and the overall performance of the swarm robots.

II. PRELIMINARIES

In this section, the multi-objective foraging task modeled for swarm robotic systems and the relevant parameters are defined more precisely. Swarm robots in this context refer to Khepera II mobile robots equipped with proximity sensors, gripper turrets and wireless communication turrets.

A. Multi-Objective Foraging Task

In a typical central-place foraging setup, the environment or search space contains more than a single puck to be collected and has a single collection point referred to as the home location. In the model, each robot in the swarm is always in one of the following modes: Searching, Grabbing, Homing or Deposit. In a typical foraging scenario, the swarm robots initiate in Searching mode. In Searching mode, the swarm robots move randomly or follow some strategy to navigate through the environment, using their sensors to locate and recognize the pucks.


The sensors used on the swarm robots should be able to recognize and distinguish between the object being searched for and the obstacles present in the environment. When a robot in the swarm finds a puck, it changes from Searching mode to Grabbing mode.

If the robot fails to find a puck, it remains in Searching mode until a puck is found. In Grabbing mode, the robot physically grabs the object and gets ready to transport it back to the home region, normally located at the center of the environment. As soon as the puck has been grabbed, the robot changes from Grabbing mode to Homing mode. In Homing mode, the robot must navigate with its collected puck to the home region. When the robot has successfully reached the home region, it changes to Deposit mode. In Deposit mode, the robot deposits the puck in the home region and then immediately changes to Searching mode, resuming its search for pucks in the environment.

B. Reinforcement Learning

Reinforcement learning (RL) is learning how to map situations or states to actions so as to maximize a numerical reward signal. An RL robot learns by interacting with the environment and observing the feedback it receives. This mimics the complex behavior of animals, which learn by obtaining rewards and avoiding punishments. The learning robots are not told which actions to take but instead must discover which actions yield the most reward through trial and error. Actions taken may affect not only the immediate reward but also the next situation and, through that, all subsequent rewards. These two characteristics, trial and error and delayed reward, are the two most important distinguishing features of RL [4].

In the RL framework, a swarm robot observes the state s_t it is currently in and then chooses and performs an action a_t based on an adopted stochastic rule called the learning policy. The robot receives an immediate reward r_t for the action taken, following a reward function R. Through this form of interaction with the environment, the swarm robots attempt to identify a value function that maps states to actions so as to maximize the long-term reward. To maximize the rewards received, the swarm robots must visit the states in the environment as many times as possible. However, such a level of exploitation of the states may lead to sub-optimal results and prevents the swarm robots from exploring the environment for better rewards. On the other hand, promoting too much exploration weighs heavily on the time constraint. This problem is commonly referred to as the exploration vs. exploitation dilemma (EED) [6], a core problem in RL.

C. Learning Policies

Learning policies play an important role in balancing the trade-off in the EED. However, it is hard to pinpoint the correct amount of exploration or exploitation needed, since there exist problems where a fully explorative or fully exploitative approach leads to the optimal solution. Learning policies that emphasize exploration degrade the performance of the swarm robots but increase the flexibility to adapt in a dynamic environment. On the other hand, learning policies that emphasize exploitation drive the learning process to locally optimal solutions.

Many learning policies have been introduced in the open literature to achieve a proper balance in the EED. However, the experiments conducted in the open literature use simple scenarios such as game-theoretic settings and small grid problems. The experiments reported in these papers are sufficient to distinguish the differences between the learning policies, but it is unclear whether the learning policies examined can scale up to realistic swarm robotic problems. Experiments on a more sophisticated search task, such as multi-objective foraging involving a swarm of robots, may yield different results. In this section, the learning policies adopted for RL problems are discussed.

The most basic way of exploring an unknown environment is to take actions randomly with uniform probability. This form of exploration is normally referred to as random search. Random search is a very primitive method of exploration and is not widely used at the current stage of development in RL. This learning policy is included in this paper to show the impact of pure exploration.

a_t = a_r  (1)

Following random search, the robots select a random action a_t without being influenced by the immediate reward r_t or the Q(s, a)_t values observed in the environment.

The greedy learning policy is the most commonly used exploration policy and is generally associated with standard RL algorithms.

a_t = a_g  (2)

Following the greedy policy, the robots select the action a_t with the highest Q(s, a)_t estimate among the available actions A_t. The action a_t considered by this learning policy may depend on actions made so far but not on future actions. A variation of the greedy learning policy is ε-greedy, which allows a certain degree of exploration with a small positive probability ε.

a_t = \begin{cases} a_r & \xi \leq \varepsilon \\ a_g & \text{otherwise} \end{cases}  (3)

High values of ε force the robots to explore more frequently. This prevents the robots from selecting the optimal action all the time, while giving them the ability to react rapidly to changes that take place in the environment. Low values of ε drive the robots to exploit the more optimal actions.
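
For concreteness, the following is a minimal Python sketch of the selection rules in Equations (1)-(3). The dictionary `q_values` mapping actions to Q(s, a) estimates and the function names are illustrative, not part of the paper's implementation; ε = 0.2 matches the best setting reported in Section V.

```python
import random

def select_random(q_values):
    # Equation (1): pick any available action uniformly, ignoring Q(s, a).
    return random.choice(list(q_values))

def select_greedy(q_values):
    # Equation (2): pick the action with the highest Q(s, a) estimate.
    return max(q_values, key=q_values.get)

def select_epsilon_greedy(q_values, epsilon=0.2):
    # Equation (3): explore with probability epsilon, exploit otherwise.
    if random.random() <= epsilon:
        return select_random(q_values)
    return select_greedy(q_values)

# Example: four actions available from the current state.
q_values = {"NORTH": 0.0, "SOUTH": 1.5, "EAST": -1.0, "WEST": 0.3}
print(select_epsilon_greedy(q_values))
```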


The Boltzmann distribution (BD) [7] is a learning policy that reduces the tendency for exploration as t increases. BD assigns a positive probability to every possible action a_t using Equation (5) and a parameter T called the temperature [8].

\varphi_k = Q(s, a_k)_t / T  (4)

P_k(s, a_k)_t = \frac{e^{\varphi_k}}{\sum_{k=1}^{K} e^{\varphi_k}}  (5)

T_{new} = e^{-dt} T_{max} + 1  (6)

Actions with high Q(s, a) values are assigned higher selection probabilities. T decreases as t increases, so as learning progresses the exploration tendency of the swarm robots reduces, leading to the exploitation of actions with high Q(s, a) values.
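
Below is a minimal sketch of the Boltzmann distribution policy in Equations (4)-(6), with the temperature schedule T_new = exp(-d·t)·T_max + 1 and the settings reported in Section V (T_max = 500, d = 0.009). Helper names are illustrative only.

```python
import math
import random

def boltzmann_probabilities(q_values, temperature):
    # Softmax over Q(s, a)/T; a high T flattens the distribution (Eqs. (4)-(5)).
    exps = {a: math.exp(q / temperature) for a, q in q_values.items()}
    total = sum(exps.values())
    return {a: e / total for a, e in exps.items()}

def select_boltzmann(q_values, t, t_max=500.0, d=0.009):
    temperature = math.exp(-d * t) * t_max + 1.0   # Equation (6)
    probs = boltzmann_probabilities(q_values, temperature)
    actions, weights = zip(*probs.items())
    return random.choices(actions, weights=weights, k=1)[0]

q_values = {"NORTH": 0.0, "SOUTH": 1.5, "EAST": -1.0, "WEST": 0.3}
print(select_boltzmann(q_values, t=1000))
```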

The simulated annealing (SA) learning policy was introduced in [9], which reports experimental results on a 22x17 puzzle problem.

\phi = (Q(s, a_r) - Q(s, a_g)) / T  (7)

a_t = \begin{cases} a_r & \xi \leq e^{\phi} \\ a_g & \text{otherwise} \end{cases}  (8)

Robots adopting the SA learning policy select an arbitrary action a_t and execute it with the probability defined in Equation (8); otherwise the robot selects and executes the optimal action a_g. T is set through trial and error. When T is high, the SA learning policy promotes more exploration. As T drops according to Equation (6), the SA learning policy reduces the exploration rate and drives the robots to exploit the more optimal solutions.
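
The following is a minimal sketch of the SA policy in Equations (7)-(8): a random candidate action a_r is accepted with probability e^phi, where phi = (Q(s, a_r) - Q(s, a_g))/T, otherwise the greedy action a_g is taken. The cooling schedule reuses Equation (6); parameter values and names are illustrative assumptions.

```python
import math
import random

def select_simulated_annealing(q_values, t, t_max=500.0, d=0.009):
    temperature = math.exp(-d * t) * t_max + 1.0          # Equation (6)
    a_g = max(q_values, key=q_values.get)                 # greedy action
    a_r = random.choice(list(q_values))                   # candidate random action
    phi = (q_values[a_r] - q_values[a_g]) / temperature   # Equation (7)
    if random.random() <= math.exp(phi):                  # Equation (8)
        return a_r
    return a_g

q_values = {"NORTH": 0.0, "SOUTH": 1.5, "EAST": -1.0, "WEST": 0.3}
print(select_simulated_annealing(q_values, t=500))
```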

Koulouriotis and Xanthopoulos [10] implemented the probability matching (PM) learning policy on multi-armed bandit tasks to address the EED, in comparison with several other learning policies.

P_{max} = 1 - (K - 1) P_{min}  (9)

P_k(s, a_k)_t = P_{min} + (1 - K \cdot P_{min}) \frac{Q(s, a_k)_t}{\sum_{k=1}^{K} Q(s, a_k)_t}  (10)

The PM learning policy computes each action's selection probability P_k(s, a_k)_t as the proportion of the action's Q(s, a) value to the sum of all Q(s, a) estimates. To enforce a sufficient amount of exploration, each selection probability is kept above a threshold P_min; otherwise, an action that is inefficient in the early time-steps would never be considered again. The selection probability of each action is also kept below P_max.
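
A minimal sketch of probability matching as defined in Equations (9)-(10) follows: each action's selection probability is proportional to its Q(s, a) value, kept between P_min and P_max. It assumes non-negative Q values and uses P_min = 0.1 as reported in Section V; names are illustrative.

```python
import random

def probability_matching(q_values, p_min=0.1):
    k = len(q_values)
    p_max = 1.0 - (k - 1) * p_min                      # Equation (9)
    total = sum(q_values.values())
    probs = {}
    for a, q in q_values.items():
        share = q / total if total > 0 else 1.0 / k    # uniform fallback
        p = p_min + (1.0 - k * p_min) * share          # Equation (10)
        probs[a] = min(max(p, p_min), p_max)
    return probs

def select_probability_matching(q_values, p_min=0.1):
    probs = probability_matching(q_values, p_min)
    actions, weights = zip(*probs.items())
    return random.choices(actions, weights=weights, k=1)[0]

q_values = {"NORTH": 0.2, "SOUTH": 1.5, "EAST": 0.0, "WEST": 0.3}
print(select_probability_matching(q_values))
```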

The adaptive pursuit (AP) learning policy was introduced in [11]. AP is similar to the PM learning policy in that it does not allow selection probabilities to fall below the threshold P_min or exceed the maximum value P_max. The AP learning policy increments the optimal action's probability towards P_max using Equation (11) and decrements all remaining actions' probabilities towards P_min using Equation (12).

P_g(s, a_g)_t = P_g(s, a_g)_t + \alpha (P_{max} - P_g(s, a_g)_t)  (11)

P_i(s, a_i)_t = P_i(s, a_i)_t + \alpha (P_{min} - P_i(s, a_i)_t)  (12)
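
The sketch below illustrates the adaptive pursuit update in Equations (11)-(12): the probability of the current greedy action is pursued towards P_max and all others towards P_min. The learning rate `alpha` is not reported in the paper and is an assumption here; P_min = 0.1 follows Section V.

```python
def adaptive_pursuit_update(probs, q_values, alpha=0.1, p_min=0.1):
    k = len(probs)
    p_max = 1.0 - (k - 1) * p_min          # bound from Equation (9)
    a_g = max(q_values, key=q_values.get)  # action with the highest Q(s, a)
    for a in probs:
        target = p_max if a == a_g else p_min
        probs[a] = probs[a] + alpha * (target - probs[a])   # Eqs. (11)-(12)
    return probs

# Example: start from uniform selection probabilities and update once.
q_values = {"NORTH": 0.2, "SOUTH": 1.5, "EAST": 0.0, "WEST": 0.3}
probs = {a: 0.25 for a in q_values}
print(adaptive_pursuit_update(probs, q_values))
```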

The R-Max learning policy, proposed by Brafman and Tennenholtz [12], maintains an approximate model of the environment. Initially, the model assumes that all actions in all states lead to the maximum reward. The model is updated each time a state becomes known. Using the R-Max learning policy, the robot always chooses the action that is optimal given its computed Q(s, a)_t values. The optimistic values of state-action pairs visited fewer than C times receive higher priority in selection, so when an action is to be taken, the robot is often encouraged to visit such state-action pairs.
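
The following is a simplified sketch of the optimistic-initialization idea behind R-Max: every state-action pair keeps the maximum reward value until it has been visited at least C times, after which the learned estimate is used. The values of R_MAX and C, the table layout, and the purely greedy post-threshold selection (the paper applies ε-greedy after C visits) are illustrative assumptions.

```python
from collections import defaultdict

R_MAX = 100.0   # assumed maximum reward, matching the +100 puck reward below
C = 5           # assumed number of visits before a state-action pair is "known"

q_learned = defaultdict(float)   # learned Q(s, a) estimates
visits = defaultdict(int)        # visit counts per (state, action)

def rmax_value(state, action):
    # Optimistic value until the pair has been tried C times.
    if visits[(state, action)] < C:
        return R_MAX
    return q_learned[(state, action)]

def select_rmax(state, actions):
    # Greedy with respect to the optimistic values, so under-visited
    # state-action pairs are naturally preferred.
    return max(actions, key=lambda a: rmax_value(state, a))

print(select_rmax((3, 4, 0), ["NORTH", "SOUTH", "EAST", "WEST"]))
```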

III. PROPOSED FIFO-LIST LEARNING POLICY

The FIFO-list maintains a list of the states visited during the last N_t time-steps. States included in the list are not considered as candidate states by the mobile robots, regardless of the Q(s, a)_t values associated with those states.

Each state that has just entered the FIFO-list remains in the list for a duration of N_t time-steps. The list follows the FIFO rule: when a new state s_1 enters the list, state s_N is pushed out and state s_{N-1} is shifted to position s_N. Assuming s_n are the states recorded in the FIFO-list and a_n are the actions leading to states s_n, the robot selects action a_t based on Equation (13).

a_t = a_g, \quad g \neq n \;\; \forall n  (13)

N_t = \begin{cases} N_t - 1 & Q(s, a)_t^{new} - Q(s, a)_t^{old} > 0 \\ N_t + 1 & Q(s, a)_t^{new} - Q(s, a)_t^{old} < 0 \\ N_t & \text{otherwise} \end{cases}  (14)

The FIFO-list is initialized with N_t empty states. In this paper, at time-step t = 0, N_0 is set to 7. The robot observes the available states S_t from the current state s_t. Actions leading to observed states in S_t that are listed in the FIFO-list are ignored, and the actions leading to the remainder of S_t are selected based on Equation (13). If all the observed states are listed in the FIFO-list, the action a_t is taken from Equation (2).


The FIFO-list avoids cycling through the previous sequence of actions a_n, which prompts the robots to explore aggressively. The size of the FIFO-list N_t changes based on Equation (14). After the action a_t is taken, if the updated Q(s, a)_t^{new} value is less than the Q(s, a)_t^{old} value, N_t is incremented by 1 to promote more exploration. If the updated Q(s, a)_t^{new} value is higher than the Q(s, a)_t^{old} value, N_t is decremented by 1 to promote more exploitation. The value of N_t is kept between N_min = 2 and N_max = 15.
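
The sketch below illustrates the proposed FIFO-list policy (Equations (13)-(14)): recently visited states are kept in a fixed-order list and excluded from selection, and the list size N_t adapts depending on whether the last update improved Q(s, a). Data-structure names are illustrative, and the mapping from actions to successor states is assumed to be available to the robot.

```python
from collections import deque

N0, N_MIN, N_MAX = 7, 2, 15

class FifoListPolicy:
    def __init__(self):
        self.n_t = N0
        self.visited = deque(maxlen=self.n_t)   # FIFO list of recent states

    def select(self, q_values, successors):
        # q_values: action -> Q(s, a); successors: action -> resulting state.
        # Equation (13): ignore actions leading to states in the FIFO list.
        allowed = {a: q for a, q in q_values.items()
                   if successors[a] not in self.visited}
        candidates = allowed if allowed else q_values   # fall back to Eq. (2)
        return max(candidates, key=candidates.get)

    def update(self, state, q_new, q_old):
        # Equation (14): shrink the list when the update improved Q(s, a),
        # grow it when the update worsened Q(s, a).
        if q_new > q_old:
            self.n_t = max(N_MIN, self.n_t - 1)
        elif q_new < q_old:
            self.n_t = min(N_MAX, self.n_t + 1)
        self.visited = deque(self.visited, maxlen=self.n_t)
        self.visited.append(state)   # oldest entry drops out automatically

policy = FifoListPolicy()
q = {"NORTH": 0.0, "SOUTH": 1.5, "EAST": -1.0, "WEST": 0.3}
succ = {"NORTH": (0, 4, 6), "SOUTH": (0, 4, 8), "EAST": (0, 5, 7), "WEST": (0, 3, 7)}
print(policy.select(q, succ))
```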

IV. PROBLEM MODELING

This section discusses the implementation of the multi-objective foraging task. The simulation was modeled and carried out using the Webots software, which was used to closely simulate and forecast real-world foraging tasks using Khepera II mobile robots (refer to Figure 1). The available actions A for the mobile robots at each observed state s_t are move NORTH, move SOUTH, move EAST and move WEST.

Figure 1. Simulation Environment modeled using Webots software.

Each mobile robot has a gripper module which is kept open with the servo at 0° so that the robot can detect pucks with the sensor located in the gripper. Puck availability G in the gripper, combined with the X and Z coordinates, defines the available states in the environment. The rewards observed in the states are recorded in a Q-table, and each mobile robot reads from and writes into its respective Q-table. The Q(s, a) values in the respective Q-tables are initialized to 0 at the beginning of the first run.
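
A minimal sketch of this state encoding and the per-robot Q-tables follows: a state is the tuple (G, X, Z), where G is 1 when the gripper holds a puck, and each robot keeps its own table of Q(s, a) values initialized to 0. The grid discretization of the X and Z coordinates is an assumption for illustration.

```python
from collections import defaultdict

ACTIONS = ["NORTH", "SOUTH", "EAST", "WEST"]

def make_q_table():
    # Q(s, a) defaults to 0 for every (state, action) pair, as in the paper.
    return defaultdict(float)

def make_state(gripper_has_puck, x_cell, z_cell):
    return (1 if gripper_has_puck else 0, x_cell, z_cell)

# Each of the 3 robots in the swarm owns its Q-table.
q_tables = {robot_id: make_q_table() for robot_id in range(3)}
state = make_state(False, x_cell=4, z_cell=7)
print(q_tables[0][(state, "NORTH")])   # 0.0 before any learning
```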

Before initializing the simulation, the Q(s, a) values in the Q-tables are set to 0 (except for R-Max). At the initialization stage, the time-step t is set to 0, and 10 pucks and 3 robots are placed at their respective coordinates in the environment. Puck availability G decides whether the robot wanders or performs homing. If the gripper is holding a puck, G is set to 1 and the mobile robot performs homing. If the gripper of the mobile robot is not holding a puck, G is set to 0 and the mobile robot learns to wander in the environment, continuing its search for pucks.

From the current state s_t, the robot observes the possible available actions A_t. Before action a_t is taken, each robot measures its distance from the neighboring robots. If the distance between robots is ≤ 300 mm, the robot takes the action a_t that moves it furthest from the neighboring robot, overriding the influence of the learning policy in the decision making. This method, called repulsion [13], is employed as part of the wander behavior to serve as an additional force preventing the robots from clustering within an area and to prevent collisions between the robots in the swarm.
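
The following is a minimal sketch of this repulsion rule: when a neighbor is within 300 mm, the learning policy's choice is overridden and the robot takes the action whose successor position lies furthest from that neighbor. The geometry helpers and the unit-step grid assumption are illustrative, not from the paper.

```python
import math

REPULSION_RANGE_MM = 300.0
STEP = {"NORTH": (0, -1), "SOUTH": (0, 1), "EAST": (1, 0), "WEST": (-1, 0)}

def select_with_repulsion(position, neighbor, policy_action):
    # position, neighbor: (x, z) coordinates; policy_action: the action
    # the learning policy would have chosen.
    if math.dist(position, neighbor) > REPULSION_RANGE_MM:
        return policy_action
    # Override: pick the move that maximizes distance to the neighbor.
    def dist_after(action):
        dx, dz = STEP[action]
        return math.dist((position[0] + dx, position[1] + dz), neighbor)
    return max(STEP, key=dist_after)

print(select_with_repulsion((0.0, 0.0), (200.0, 0.0), "NORTH"))  # -> "WEST"
```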

A reward r_t is given for the action a_t taken in the observed state s_t, drawn from the defined set of rewards R. The robots are given r_t = +100 for picking up a puck from the environment; this motivates the robots to exploit the state again in the future. r_t = +100 is given to a robot for dropping a puck in the home location, which leads the robot to approach the home location faster as learning progresses over time. r_t = -1 is given to the mobile robots for colliding in the environment, to prevent further collision occurrences as the learning process converges. r_t = -1 is given to a robot for wandering in the environment with a puck in the gripper; as t increases, the mobile robots are thereby motivated to approach the home location at a faster pace. r_t = -1 is also given if a robot wanders in the home location without a puck in the gripper, which forces the robot to leave the home location and move into the environment to search for pucks.
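
A minimal sketch of this reward function R is given below. The boolean event flags are illustrative names; the reward magnitudes (+100 and -1) are those given in the paper.

```python
def reward(picked_puck, dropped_puck_home, collided,
           wandering_with_puck, wandering_home_without_puck):
    if picked_puck:
        return +100.0                      # picking up a puck
    if dropped_puck_home:
        return +100.0                      # depositing a puck at home
    if collided:
        return -1.0                        # collision in the environment
    if wandering_with_puck:
        return -1.0                        # carrying a puck away from home
    if wandering_home_without_puck:
        return -1.0                        # loitering at home without a puck
    return 0.0                             # no rewarded event this time-step

print(reward(True, False, False, False, False))   # -> 100.0
```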

Q-learning [14] is one of the most widely used RL algorithms; it approximates the optimal action-value function Q(s, a)_t independently of the learning policy π being followed. The action values Q(s, a)_t are updated using Equation (15).

Q(s, a)_t^{new} \leftarrow Q(s, a)_t^{old} + \alpha r_t + \alpha \gamma Q(s, a)_{t+1} - \alpha Q(s, a)_t^{old}  (15)

where Q(s, a)_t is the estimated Q(s, a) value for the current state s_t at time-step t, and a_t, r_t are the action and reward at time-step t. The Q-learning algorithm [15] is considered an off-policy algorithm because its convergence does not rely on the learning policy π used in updating the value function. The value function is updated using the highest estimate of Q(s, a)_{t+1}, meaning the maximum Q(s, a) value of the observed available state-action pairs from the current state s_t. However, if a greedy learning policy is used with Q-learning, the learning converges similarly to an on-policy learning algorithm.
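
For completeness, here is a minimal sketch of the Q-learning update of Equation (15) applied to the per-robot Q-table of Section IV. The learning rate `alpha` and discount factor `gamma` are not reported in the paper and are assumptions here.

```python
from collections import defaultdict

ACTIONS = ["NORTH", "SOUTH", "EAST", "WEST"]

def q_learning_update(q_table, state, action, reward, next_state,
                      alpha=0.1, gamma=0.9):
    # Off-policy target: the maximum Q(s', a') over the available actions,
    # matching Equation (15) with its terms collected.
    best_next = max(q_table[(next_state, a)] for a in ACTIONS)
    old = q_table[(state, action)]
    q_table[(state, action)] = old + alpha * (reward + gamma * best_next - old)
    return q_table[(state, action)]

q_table = defaultdict(float)
s, s_next = (0, 4, 7), (0, 4, 6)
print(q_learning_update(q_table, s, "NORTH", 100.0, s_next))   # -> 10.0
```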

The respective Q(s, a) values of the robots are updated using Equation (15). If the action a_t taken leads to a collision in the environment, the robot moves back to its previous state s_t; otherwise the new state s_{t+1} is set as the current state s_t.


If t ≠ 30000, t is incremented by 1 and the process flow is repeated until all 10 pucks are collected or t = 30000. When either termination condition is met, t is reset to 0 and the number of runs is incremented by 1. In this paper, for each learning policy considered, 10 sets of 40 runs were conducted.

V. RESULTS AND DISCUSSION

The efficiency of the proposed FIFO-list learning policy and the other learning policies was evaluated by measuring their performance in searching for and retrieving the pucks, collision avoidance and the amount of area explored. Data from 10 sets of 40 runs for each learning policy are summarized in this section. The simulation was carried out with a swarm of 3 Khepera II mobile robots and a total of 10 pucks to be searched for and retrieved. Each run lasted 30000 time-steps. The number of time-steps and runs was set empirically, as the majority of the learning trials stopped progressing beyond these conditions.

Table I. Results summary.

Learning Policy                Pucks    Collisions    Exploration
Greedy                         5.92     64.88         72.70%
ε-greedy                       6.06     68.47         80.50%
Boltzmann Distribution (BD)    2.80     80.03         87.60%
Simulated Annealing (SA)       3.97     84.85         89.76%
Probability Matching (PM)      6.25     55.40         92.30%
Adaptive Pursuit (AP)          6.75     45.25         95.11%
R-Max                          6.40     132.50        98.20%
FIFO-List                      7.68     43.13         98.03%

All the learning policies performed well compared against random search. The greedy policy performed very well in overcoming collisions in the environment: although the collisions encountered average 64.88 per run, the greedy learning policy improves to almost zero collisions as the learning progresses. The number of pucks collected averages 5.92 per run.

The greedy learning policy allowed 72.70% exploration of the states in the environment. This is to be expected, as the greedy learning policy does not attempt actions with low Q(s, a) values. Therefore, once there is a collision, the state remains undesirable to the robot throughout the remainder of the simulation period and the upcoming runs. This obvious weakness contributed to the lack of exploration in the environment.

The parameter to be set for the ε-greedy learning policy is ε, the probability with which the learning robots explore. The best ε setting was found to be 0.2 among the range of values tested. With this parameter, the ε-greedy learning policy gathered an average of 6.06 pucks, and the number of collisions averages 68.47 per simulation run. By setting ε to 0.2, the exploration amount, at about 80.50%, surpassed that of the greedy learning policy. However, robots adopting the ε-greedy learning policy have no control over when or how to explore; the robots may therefore take a random action when an obvious action might lead to a better reward.

For the BD learning policy, the parameter T_max is set to 500 and the decay rate d is set to 0.009. These parameter settings allow the robots to perform marginally better in comparison with the other values tested. High temperatures cause the robots to select the available actions with near-equal probability, while as T approaches 0 the robots select their actions greedily. Selecting the temperature T is not as straightforward as selecting ε in the ε-greedy method, as the expected Q(s, a) values of the actions must be taken into consideration. However, these parameters can be fine-tuned to behave desirably according to the problem setup. The BD learning policy achieved 87.60% exploration, an average of 80.03 collisions and an average of 2.80 pucks gathered. The convergence of the BD learning policy was very poor for the time constraint fixed in this paper.

The SA learning policy [9] performed poorly, though better than the BD learning policy. This is because of the poor separability of the Q(s, a) values. The initial temperature T is set to 500. The SA learning policy enforces an additional amount of exploration ξ at the beginning of the learning process; this ratio is much higher for the Boltzmann distribution learning policy, whose exploration decreases as the temperature T decays over time. With the same temperature setting, the SA learning policy gathered an average of 3.97 pucks and 84.85 collisions, and achieved 89.76% exploration of the environment.

The PM learning policy [10] performed well compared to the SA and BD learning policies. The number of collisions averages 55.40 per simulation run and the number of pucks gathered averages 6.25 per simulation run, with 92.30% of the modeled environment explored. The PM learning policy maintains a minimum probability P_min, which allows all the states to be considered even when they carry low Q(s, a) values. This characteristic allows a decent amount of exploration in the environment. The main drawback of the PM learning policy is that, regardless of how low the Q(s, a) values of the observed states are, those states are still considered because of P_min. This factor leads to more collisions and diverts the robots from the actual goal of the experiment. This learning policy also cannot perform well when the differences between the available actions' Q(s, a) values are relatively small.

P_min is set to 0.1 for both the PM and AP learning policies. The AP learning policy [11] performed desirably well in reducing the number of collisions in the environment, with an average of 45.25 collisions per simulation run. An average of 6.75 pucks were gathered and 95.11% of the states in the environment were explored. The AP learning policy is an improvement over the PM learning policy. It maintains P_min similarly to PM, but the probabilities of states with lower Q(s, a) values are biased towards P_min while the state with the highest Q(s, a) value is biased towards P_max. This characteristic creates a distinct gap in probability between the states, allowing the state with the highest Q(s, a) value to have a higher probability of being selected. Hence, clear improvements can be observed in comparison with the PM learning policy in all the objectives set in this experiment. However, actions leading to states with low Q(s, a) values are still possible, albeit with lower probability than under the probability matching learning policy.

Although the R-Max learning policy [12] performed badly with an average of 132.50 collisions per simulation run, the number of collisions reduces significantly as the learning process converges. The R-Max learning policy maintains an approximate model of the environment at the beginning of the learning stage by initializing all the available states with the maximum reward. This characteristic forces the robots to visit the available states in the environment regardless of the Q(s, a) values of the states until the states have been visited C times, which leads to a high number of collisions at the beginning of the learning process. After visiting the available states C times, R-Max applies the ε-greedy learning policy to select among the available actions. The number of pucks collected averages 6.40 per simulation run, and 98.20% of the available states were explored, which is the highest in comparison with the other learning policies reported in this paper.

The FIFO-list learning policy outperformed all the other learning policies in collisions, with an average of 43.13 collisions per simulation run, and in the number of pucks gathered, averaging 7.68 pucks per simulation run. An average of 98.03% of the states in the environment were explored, which is slightly lower than the R-Max learning policy. The learning robots adopting the proposed method maintain a FIFO-list which prevents them from revisiting recently visited states for N time-steps. This attribute builds a perimeter covering a region of recently visited states in the environment, forcing the learning robots to venture further into the environment. It also allows the robots to avoid being stuck in a region with locally optimal rewards while promoting exploration. A higher N value promotes more exploration, but in return decreases the average number of pucks collected and increases the number of collisions in the environment. Once the learning process starts to converge, the list size N stays within the range of 3 to 7.

VI. CONCLUSION

This paper examined the effects of various learning policies on a multi-objective foraging task using swarm robots and RL techniques. The learning policies adopted from the open literature are greedy, ε-greedy, Boltzmann distribution, simulated annealing, probability matching, adaptive pursuit and R-Max. A comparative study was carried out on each of the learning policies, and an improved learning policy, FIFO-list, was introduced in the same problem domain. The performance of the learning policies was measured based on the number of pucks collected, collisions encountered and exploration achieved within the simulation period. The proposed FIFO-list learning policy outperformed all the other learning policies in the collection of pucks and the collisions encountered in the environment. However, the exploration rate of the proposed method ranks second to the R-Max learning policy, although the difference is minimal. Based on the observed results, it can be concluded that the proposed FIFO-list learning policy performs very well in comparison with learning policies in the literature for the multi-objective foraging task using swarm robots.

REFERENCES

[1] M. Brambilla, E. Ferrante, M. Birattari, and M. Dorigo, "Swarm robotics: A review from the swarm engineering perspective," IRIDIA, CoDE, Universite Libre de Bruxelles, Brussels, Belgium, Tech. Rep. TR2012-014, 2012.

[2] Y. Cao, A. Fukunaga, and A. Kahng, "Cooperative mobile robotics: Antecedents and directions," Autonomous Robots, vol. 4, no. 1, pp. 7–27, 1997.

[3] S. Sang-Wook, Y. Hyun-Chang, and S. Kwee-Bo, "Behavior learning and evolution of swarm robot system for cooperative behavior," in Advanced Intelligent Mechatronics, 2009. AIM 2009. IEEE/ASME International Conference on. IEEE, 2009, pp. 673–678.

[4] R. Sutton and A. Barto, Reinforcement Learning: An Introduction. Cambridge Univ Press, 1998, vol. 1, no. 1.

[5] C. Gaskett, D. Wettergreen, and A. Zelinsky, "Q-learning in continuous state and action spaces," Advanced Topics in Artificial Intelligence, pp. 417–428, 1999.

[6] O. Kroemer and J. Peters, "Active exploration for robot parameter selection in episodic reinforcement learning," in Adaptive Dynamic Programming and Reinforcement Learning (ADPRL), 2011 IEEE Symposium on. IEEE, 2011, pp. 25–31.

[7] Y. Wang, H. Lang, and C. de Silva, "A hybrid visual servo controller for robust grasping by wheeled mobile robots," Mechatronics, IEEE/ASME Transactions on, vol. 15, no. 5, pp. 757–769, 2010.

[8] D. Carmel and S. Markovitch, "Exploration strategies for model-based learning in multi-agent systems: Exploration strategies," Autonomous Agents and Multi-Agent Systems, vol. 2, no. 2, pp. 141–172, 1999.

[9] M. Guo, Y. Liu, and J. Malec, "A new Q-learning algorithm based on the metropolis criterion," Systems, Man, and Cybernetics, Part B: Cybernetics, IEEE Transactions on, vol. 34, no. 5, pp. 2140–2143, 2004.


[10] D. Koulouriotis and A. Xanthopoulos, "Reinforcement learning and evolutionary algorithms for non-stationary multi-armed bandit problems," Applied Mathematics and Computation, vol. 196, no. 2, pp. 913–922, 2008.

[11] D. Thierens, "An adaptive pursuit strategy for allocating operator probabilities," in Proceedings of the 2005 Conference on Genetic and Evolutionary Computation. ACM, 2005, pp. 1539–1546.

[12] R. Brafman and M. Tennenholtz, "R-max - A general polynomial time algorithm for near-optimal reinforcement learning," The Journal of Machine Learning Research, vol. 3, pp. 213–231, 2003.

[13] T. Balch, "The impact of diversity on performance in multi-robot foraging," in Proceedings of the Third Annual Conference on Autonomous Agents. ACM, 1999, pp. 92–99.

[14] C. Watkins and P. Dayan, "Q-learning," Machine Learning, vol. 8, no. 3, pp. 279–292, 1992.

[15] M. Kareem Jaradat, M. Al-Rousan, and L. Quadan, "Reinforcement based mobile robot navigation in dynamic environment," Robotics and Computer-Integrated Manufacturing, vol. 27, no. 1, pp. 135–149, 2011.
