2013 IEEE Symposium on Swarm Intelligence (SIS), Singapore
Reinforcement learning in swarm-robotics for multi-agent foraging-task domain
Yogeswaran M.1, Ponnambalam S.G.2, Kanagaraj G.3
School of Engineering, Monash University Sunway Campus
Jalan Lagoon Selatan, Bandar Sunway, 46150, Selangor Darul Ehsan, Malaysia
Email: [email protected], [email protected], [email protected]
Abstract—The main focus of this paper is to study and develop an efficient learning policy to address the exploration vs. exploitation dilemma in a multi-objective foraging task in the swarm robotics domain. An efficient learning policy called FIFO-list is proposed to tackle the above-mentioned problem. The proposed FIFO-list is a model-based learning policy which can attain near-optimal solutions. In FIFO-list, the swarm robots maintain a dynamic list of recently visited states. States that are included in the list are banned from exploration by the swarm robots regardless of the Q(s, a) values associated with those states. The list is updated on a First-In-First-Out (FIFO) basis, meaning the states that enter the list first exit the list first. Recently visited states remain in the list for a dynamic number of time-steps determined by the desirability of the visited states.
Keywords—swarm robotics; swarm intelligence; reinforcement learning; foraging; Q-learning
I. INTRODUCTION
Swarm robotics is the study of how large numbers of relatively simple, physically embodied robots can be designed such that a desired collective behavior emerges from the local interactions among the robots and the environment. It is inspired by the observation of social insects such as ants, termites, wasps and bees, which stand as fascinating examples of how a large number of simple individuals can interact to create collectively intelligent systems [1]. Swarm robotic systems have been a major research area for the past three decades.
The foraging task is one of the most widely used testbeds in the study of swarm robotics, as it has the closest ties to biological swarms. Foraging is the obvious task undertaken to demonstrate swarm intelligence in cooperative robotics [2]. Foraging can be defined as the act of searching for and collecting food or prey and carrying it to the nest, a concept borrowed from biological swarms. In the study of foraging swarm robots, food or prey is normally referred to as pucks and the nest as home. Foraging robots may take the form of single robots operating individually or a swarm of robots operating collectively [3]. Foraging is important in swarm robotics for several reasons. First, it is a metaphor for a broad class of problems integrating exploration, navigation, object identification, manipulation and transport. Secondly, the foraging task is a canonical problem for the study of cooperative robotics.
Reinforcement learning (RL) is a machine learning tech-
nique in which a robot learns through trial and error to
maximize rewards received for taking particular actions in
particular states over an extended period of time [4]. A reward signal is given to an RL robot to indicate the desirability of the action taken in the perceived state. An RL robot's choice of actions in response to states is called its learning policy [5]. A great deal of research has been carried out on RL using swarm robotic systems, applied to many tasks, namely foraging, soccer, prey-pursuing and box-pushing.
In this paper, the modeled foraging-task has multiple
objectives. The swarm robots are expected to search and
retrieve pucks in the environment to the home location
while minimizing collision, maximizing the amount of pucks
collected and maximizing the amount of exploration done
in the environment within a fixed time frame. Using this
setup, empirical analyses of well-known learning policies were conducted. An efficient and improved learning policy is
also proposed to further enhance the learning efficiency and
the overall performance of the swarm robots.
II. PRELIMINARIES
In this section, the multi-objective foraging task problem for swarm robotic systems is modeled and the relevant parameters are defined more precisely. Swarm robots in this context refer to Khepera II mobile robots that come with proximity sensors, gripper turrets and wireless communication turrets.
A. Multi-Objective Foraging Task
In a typical central place foraging setup, the environment
or search space contains more than a single puck to be col-
lected and has a single collection point referred to as home
location. As observed in the model, the robots in the swarm are always in one of the following modes: Searching, Grabbing, Homing or Deposit. For a typical foraging scenario, the swarm robots initiate in the Searching mode. In Searching mode, the swarm robots move randomly or follow some strategy to navigate through the environment, using their sensors to locate and recognize the pucks. The sensors
978-1-4673-6004-3/13/$31.00 ©2013 IEEE
used on the swarm robots should be able to recognize and
distinguish between the object that is being searched and
the obstacles present in the environment. When a robot in the swarm finds a puck, it changes from Searching to Grabbing mode.
In case the robot fails to find a puck, the robot will
remain in Searching mode until a puck is found. In Grabbing
mode, the robot physically grabs the object and gets ready
to transport it back to the home region located normally
at the center of the environment. As soon as the puck has
been grabbed the robot will change from Grabbing mode to
Homing mode. In Homing mode, the robot must navigate,
with its collected puck to a home region. When the robot
has successfully reached the home region it will change
to Deposit mode. In Deposit mode, the robot deposits the
puck in the home region, and then immediately changes to
Searching mode and hence resumes its search for the pucks
in the environment.
B. Reinforcement Learning
Reinforcement learning (RL) is learning how to map sit-
uations or states to actions to maximize a numerical reward
signal. A RL robot learns by interacting and observing the
feedback from the environment. This mimics the complex
behavior in animals on how they learn through obtaining
rewards and avoiding punishments. The learning robots are
not told which actions to take but instead must discover
which actions yield the most reward through trial and error.
Actions taken may affect not only the immediate reward
but also the next situation and, through that, all subsequent
rewards. These two characteristics - trial and error, and
delayed reward are the two most important distinguishing
features of RL [4].
In the RL framework, a swarm robot observes the state st it is currently in and then chooses and performs an action at based on an adopted stochastic rule called the learning policy. The robot receives an immediate reward rt for the action taken, following a reward function R. Through this form of interaction with the environment, the swarm robots attempt to identify a value function that maps states to actions so as to maximize the long-term reward. To maximize the rewards received, the swarm robots must visit the states in the environment as many times as possible. However, such a level of exploitation of the states may lead to sub-optimal results and prevents the swarm robots from exploring the environment for better rewards. On the other hand, promoting too much exploration weighs heavily on the time constraint.
This problem is commonly referred to as exploration vs.
exploitation dilemma (EED) [6], a core problem in RL.
C. Learning Policies
Learning policies play an important role in balancing the
trade-off in the EED. However, it is hard to pinpoint the correct amount of exploration or exploitation needed, since there exist problems where a fully explorative or a fully exploitative approach leads to the optimal solution. Learning policies that emphasize exploration degrade the performance of the swarm robots but increase the flexibility to adapt in a dynamic environment. On the other hand, learning policies that emphasize exploitation drive the learning process to locally optimal solutions.
Many learning policies have been introduced to achieve a
proper balance in the EED in the open literature. However, the experiments conducted in the open literature use simple scenarios such as game-theoretic settings and small grid problems. The experiments reported in these papers are sufficient to distinguish the differences between the learning policies, but it is unclear whether the learning policies examined can scale up to realistic swarm robotic problems. Experiments on a more sophisticated search task, such as multi-objective foraging involving swarm robots, may yield different results. In this section, the learning policies adopted in RL problems are discussed.
The most basic way of exploration in an unknown en-
vironment is to take actions randomly with uniform prob-
ability. This form of exploration is normally referred to as
random search. Random search is a very primitive method
of exploration and is not widely used in the current stage of
development in RL. This learning policy is also included in
this paper to show the impact of pure exploration.
at = ar (1)
The robot selects a random action at without being influenced by the immediate reward rt or the Q(s, a)t values observed in the environment.
Greedy learning policy is the most commonly used explo-
ration policy that is generally associated with standard RL
algorithms.
at = ag (2)
Following the greedy policy, the robot selects the action at with the highest Q(s, a)t estimate among the available actions At. The action at considered by this learning policy may depend on actions made so far but not on future actions. A variation of the greedy learning policy is ε-greedy, which allows a certain degree of exploration with a probability ε, a small positive value.
at = { ar    if ξ ≤ ε
     { ag    otherwise          (3)
High values of ε force the robots to explore more frequently. This prevents the robots from selecting the optimal action all the time, while giving them the ability to react rapidly to changes that take place in the environment. Low values of ε drive the robots to exploit the optimal actions more.
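As a minimal sketch (the action names and Q-values below are hypothetical, not from the experiments), the random, greedy and ε-greedy rules of Equations (1)-(3) might look like:

```python
import random

def epsilon_greedy(q_values, epsilon=0.2, rng=random):
    """Return a random action a_r with probability epsilon, else the
    greedy action a_g.

    q_values: dict mapping action name -> Q(s, a)_t estimate.
    """
    if rng.random() < epsilon:                  # explore
        return rng.choice(list(q_values))       # a_r, Equation (1)
    return max(q_values, key=q_values.get)      # a_g, Equation (2)

q = {"NORTH": 0.5, "SOUTH": 0.1, "EAST": 0.9, "WEST": 0.0}
print(epsilon_greedy(q, epsilon=0.0))  # epsilon = 0 degenerates to pure greedy: EAST
```

With epsilon=1.0 the same function degenerates to the random search of Equation (1).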
The Boltzmann distribution (BD) [7] is a learning policy
that reduces the tendency for exploration as t increases. BD assigns a positive probability to every possible action at using Equation (5) and a parameter T called the temperature [8].
ϕ = Q(s, a)t / T (4)

Pk(s, ak)t = e^ϕ / Σ(k=1..K) e^ϕ (5)

Tnew = e^(−d·t) · Tmax + 1 (6)
Actions with high Q(s, a) values are assigned higher probabilities. T decreases as t increases. As learning progresses, the exploration tendency of the swarm robots reduces, leading to exploitation of actions with high Q(s, a) values.
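A sketch of the Boltzmann selection probabilities and the temperature schedule of Equations (4)-(6), using the paper's settings Tmax = 500 and d = 0.009 as illustrative defaults:

```python
import math

def boltzmann_probs(q_values, T):
    """Action probabilities per Equations (4)-(5): softmax of Q(s, a)_t / T."""
    exps = [math.exp(q / T) for q in q_values]
    total = sum(exps)
    return [e / total for e in exps]

def decay_temperature(T_max, d, t):
    """Temperature schedule of Equation (6): T_new = e^(-d t) * T_max + 1."""
    return math.exp(-d * t) * T_max + 1

# At t = 0 the temperature is high (501) and the policy is near-uniform;
# as t grows, T approaches its floor of 1 and selection becomes greedy.
probs = boltzmann_probs([100.0, 10.0], T=decay_temperature(500, 0.009, 0))
```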
The simulated annealing (SA) learning policy was intro-
duced by [9] where the paper reports experimental results
on a 22x17 puzzle problem.
φ = (Q(s, ar) − Q(s, ag)) / T (7)

at = { ar    if ξ ≤ e^φ
     { ag    otherwise          (8)
Robots adopting the SA learning policy select an arbitrary action at and execute it with the probability defined in Equation (8). Otherwise the robot selects and executes the optimal action ag. T is set through trial and error. When T is high, the SA learning policy promotes more exploration. As T drops based on Equation (6), the SA learning policy reduces the exploration rate and drives the robots to exploit more optimal solutions.
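One possible reading of the SA selection rule of Equations (7)-(8), with helper names of our own choosing:

```python
import math
import random

def sa_select(q_values, T, rng=random):
    """Metropolis-style selection: draw a random action a_r and accept it
    with probability e^phi, phi = (Q(s, a_r) - Q(s, a_g)) / T; otherwise
    fall back to the greedy action a_g.  phi <= 0, so e^phi <= 1."""
    a_g = max(q_values, key=q_values.get)
    a_r = rng.choice(list(q_values))
    phi = (q_values[a_r] - q_values[a_g]) / T
    if rng.random() <= math.exp(phi):
        return a_r
    return a_g
```

A high T makes e^phi close to 1 for every action (frequent exploration); a low T makes the acceptance probability vanish for sub-optimal actions.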
Koulouriotis et al. [10] implemented the probability matching (PM) learning policy on multi-armed bandit tasks to address the EED, comparing it with several other learning policies.
Pmax = 1 − (K − 1) · Pmin (9)

Pk(s, ak)t = Pmin + (1 − K·Pmin) · Q(s, ak)t / Σ(k=1..K) Q(s, ak)t (10)
The PM learning policy computes each action's selection probability Pk(s, ak)t as the proportion of the action's Q(s, a) value to the sum of all Q(s, a) value estimates. To enforce a sufficient amount of exploration, the exploration rate is kept above a threshold Pmin. Otherwise, an action
which is inefficient in the early time-steps would never be
considered. The exploration rate for each action is also kept
below Pmax.
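Equations (9)-(10) might be sketched as follows, assuming non-negative Q(s, a) values with a positive sum (the function names are ours):

```python
def pm_probs(q_values, p_min):
    """Probability matching per Equation (10): P_min plus each action's
    share of the total Q mass, scaled by (1 - K * P_min)."""
    K = len(q_values)
    total = sum(q_values)
    return [p_min + (1 - K * p_min) * q / total for q in q_values]

def pm_pmax(K, p_min):
    """Upper bound on any single action's probability, Equation (9)."""
    return 1 - (K - 1) * p_min
```

By construction the probabilities sum to 1 and each stays in [Pmin, Pmax].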
Adaptive pursuit (AP) learning policy was introduced by
[11]. AP is similar to the PM learning policy in that it does not allow probabilities to fall below the threshold Pmin or to exceed the maximum value Pmax. The AP learning policy increments the optimal action's probability towards Pmax using Equation (11) and decrements all remaining actions' probabilities towards Pmin using Equation (12).
Pg(s, ag)t = Pg(s, ag)t + α(Pmax − Pg(s, ag)t) (11)
Pi(s, ai)t = Pi(s, ai)t + α(Pmin − Pi(s, ai)t) (12)
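A sketch of one AP update step per Equations (11)-(12); the numeric settings below are illustrative, with Pmax derived from Pmin via Equation (9):

```python
def ap_update(probs, greedy_idx, alpha, p_min, p_max):
    """Move the greedy action's probability toward P_max (Equation (11))
    and every other action's toward P_min (Equation (12)) at rate alpha."""
    return [p + alpha * ((p_max if i == greedy_idx else p_min) - p)
            for i, p in enumerate(probs)]

# Example: 4 actions, P_min = 0.1, so P_max = 1 - 3 * 0.1 = 0.7.
probs = [0.25, 0.25, 0.25, 0.25]
for _ in range(100):
    probs = ap_update(probs, greedy_idx=0, alpha=0.1, p_min=0.1, p_max=0.7)
```

When Pmax satisfies Equation (9), the probabilities keep summing to 1 after every update.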
The R-Max learning policy, proposed by Brafman et al. [12], maintains an approximate model of the environment. Initially, the model assumes that all actions in all states lead to the maximum reward. The model is updated each time a state becomes known. Using the R-Max learning policy, the robot always chooses the action that is optimal given its computed Q(s, a)t values. The optimistic values of state-action pairs visited fewer than C times give those pairs higher priority in selection. Therefore, when an action is to be taken, the robot will often be encouraged to visit such state-action pairs.
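The optimistic-initialization idea behind R-Max could be sketched as follows; the class and parameter names are ours, not from [12]:

```python
class RMaxPolicy:
    """Sketch: state-action pairs visited fewer than C times keep the
    optimistic value r_max, so the greedy choice is drawn toward
    still-unknown pairs."""

    def __init__(self, actions, r_max=100.0, C=5):
        self.actions, self.r_max, self.C = actions, r_max, C
        self.counts = {}   # (state, action) -> visit count
        self.q = {}        # (state, action) -> learned Q(s, a) estimate

    def value(self, state, action):
        if self.counts.get((state, action), 0) < self.C:
            return self.r_max          # pair still "unknown": stay optimistic
        return self.q.get((state, action), 0.0)

    def select(self, state):
        """Greedy over the optimistic values."""
        return max(self.actions, key=lambda a: self.value(state, a))

    def record(self, state, action, q_estimate):
        self.counts[(state, action)] = self.counts.get((state, action), 0) + 1
        self.q[(state, action)] = q_estimate
```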
III. PROPOSED FIFO-LIST LEARNING POLICY
The FIFO-list maintains a list of recently visited states for
the last Nt number of time-steps. States that are included in
the list are not considered as observed states by the mobile-
robots regardless of the Q(s, a)t value associated with the
states.
Each state that has just entered the FIFO-list will remain
in the list for a duration of Nt number of time-steps. The list
follows the FIFO rule. When a new state s1 enters the list,
state sN will be pushed out and state sN−1 will be shifted
to sN in the list. Assuming sn are the states recorded in the
FIFO list and an are the actions leading to states sn, the
robot selects action at based on Equation (13).
at = ag, g ≠ n ∀n (13)
Nt = { Nt − 1    if Q(s, a)t^new − Q(s, a)t^old > 0
     { Nt + 1    if Q(s, a)t^new − Q(s, a)t^old < 0
     { Nt        otherwise                           (14)
The FIFO list is initialized with Nt empty states. In this
paper, at time-step t = 0, N0 is set to 7. The robot observes
the available states St from the current state st. Actions
leading to the observed states St listed in the FIFO-list
are ignored. The actions leading to the remainder of St
are selected based on Equation (13). If all the observed
states are listed in the FIFO-list, the action at is taken
from Equation (2). The FIFO-list avoids cycling through
the previous sequence of actions an which prompts the
robots to explore aggressively. The size of the FIFO-list Nt changes based on Equation (14). After the action at is taken, if the Q(s, a)t^new value is lower than the Q(s, a)t^old value, Nt is incremented by 1 to promote more exploration. If the updated Q(s, a)t^new value is higher than the Q(s, a)t^old value, Nt is decremented by 1 to promote more exploitation. The value of Nt is kept between Nmin = 2 and Nmax = 15.
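A minimal sketch of the proposed FIFO-list policy, with helper names of our own (the paper itself gives no code):

```python
from collections import deque

class FIFOListPolicy:
    """Recently visited states are banned from selection; the list length
    N_t adapts per Equation (14), clamped to [N_min, N_max]."""

    def __init__(self, n0=7, n_min=2, n_max=15):
        self.n, self.n_min, self.n_max = n0, n_min, n_max
        self.visited = deque(maxlen=n0)

    def select(self, candidate_states, q_of_state):
        """Pick the highest-valued state not in the list (Equation (13));
        if every candidate is listed, fall back to the plain greedy choice."""
        allowed = [s for s in candidate_states if s not in self.visited]
        pool = allowed if allowed else list(candidate_states)
        return max(pool, key=q_of_state)

    def record_visit(self, state, q_new, q_old):
        # Equation (14): shrink after an improved Q (exploit more),
        # grow after a worse Q (explore more), unchanged otherwise.
        if q_new > q_old:
            self.n = max(self.n_min, self.n - 1)
        elif q_new < q_old:
            self.n = min(self.n_max, self.n + 1)
        self.visited = deque(self.visited, maxlen=self.n)
        self.visited.append(state)   # FIFO: oldest entry falls off first
```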
IV. PROBLEM MODELING
This section discusses the implementation of the multi-
objective foraging task. The simulation was modeled and
carried out using the Webots software. The Webots software
was used to closely simulate and forecast real world foraging
tasks using Khepera II mobile robots (refer to Figure 1). The
available actions A for the mobile-robots at each observed
state st are move NORTH, move SOUTH, move EAST and
move WEST.
Figure 1. Simulation Environment modeled using Webots software.
Each mobile-robot has a gripper module which is kept open with the servo at 0° to allow the mobile-robots to detect pucks with the sensor located in the gripper. Puck availability G in the gripper, combined with the X and Z coordinates, defines the available states in the environment.
The rewards observed in the states are recorded in a Q-table.
Each mobile-robot reads and writes into their respective Q-
tables. The Q(s, a) values in the respective Q-tables are
initiated with the value of 0 at the beginning of the first
run.
Before initializing the simulation, the Q(s, a) values in the
Q-tables are set to 0 (except for R-Max). At the initialization
stage, the time-step t is set to 0. 10 pucks and 3 robots are
placed in their respective coordinates in the environment.
Puck availability G decides whether the robot wanders or
performs homing. If the gripper is holding a puck, G is set
to 1 and the mobile-robot will perform homing. If the gripper
of the mobile-robot is not holding a puck, G is set to 0 and
the mobile-robot will learn to wander in the environment
and continue searching for the pucks.
From the current state st, the robot observes the possible
available actions At. Before action at is taken, each robot
measures its corresponding distance from the neighboring
robots. If the distance between two robots is ≤ 300 mm, the robot takes the action at that moves it furthest from the neighboring robot, overriding the influence of the learning policy in the decision making. This method, called repulsion [13], is employed as part of the wander behavior to serve as an additional force preventing the robots from clustering within an area and to prevent collisions between the robots in the swarm.
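The repulsion override described above might be sketched as follows; the function name, move encoding and step length are illustrative assumptions:

```python
import math

def repulsion_override(robot_xy, neighbors_xy, moves, threshold=300.0):
    """If any neighbor is within `threshold` mm, return the move whose
    endpoint is furthest from the nearest close neighbor, bypassing the
    learning policy; return None when no neighbor is close enough."""
    def dist(p, q):
        return math.hypot(p[0] - q[0], p[1] - q[1])

    near = [n for n in neighbors_xy if dist(robot_xy, n) <= threshold]
    if not near:
        return None   # learning policy keeps control

    def clearance(move):
        name, (dx, dy) = move
        end = (robot_xy[0] + dx, robot_xy[1] + dy)
        return min(dist(end, n) for n in near)

    return max(moves, key=clearance)[0]

# Hypothetical 50 mm step for the paper's four actions.
MOVES = [("NORTH", (0, 50)), ("SOUTH", (0, -50)),
         ("EAST", (50, 0)), ("WEST", (-50, 0))]
```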
A reward rt is given for the action at taken for the
observed state st from the set of rewards R defined. The
robots are given rt=+100 for picking up a puck from the
environment. This is to motivate the robots to exploit the
state again in the future. rt=+100 is given to the robot for
dropping the pucks in the home location. This will lead the
robot to approach the home location faster as the learning
progresses over time. rt=-1 is given to the mobile-robots for encountering a collision in the environment. This is done to prevent further collision occurrences as the learning process converges. rt=-1 is given to the robot for wandering in the environment with a puck in the gripper. As t increases, the mobile-robots will be motivated to approach the home location at a faster pace. rt=-1 is also given if the robot is wandering in the home location without a puck in the gripper. This forces the robot to leave the home location and return to the environment to search for pucks.
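The reward function R described above can be collected into a small lookup; the event labels are our own shorthand for the situations the paper enumerates:

```python
def reward(event):
    """Reward function R as described in the text (values from the paper)."""
    table = {
        "pick_up_puck":      +100,  # grabbing a puck in the environment
        "deposit_puck_home": +100,  # dropping a puck at the home location
        "collision":           -1,  # colliding in the environment
        "wander_with_puck":    -1,  # carrying a puck away from home
        "home_without_puck":   -1,  # lingering at home with an empty gripper
    }
    return table.get(event, 0)      # all other transitions: no reward
```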
Q-learning [14] is one of the most widely used RL algorithms; it approximates the optimal action-value function Q(s, a)t independently of the learning policy π being followed. The action values Q(s, a)t are updated using Equation (15).
Q(s, a)t^new ← Q(s, a)t^old + α[rt + γ·Q(s, a)t+1 − Q(s, a)t^old] (15)
where Q(s, a)t is the estimated Q(s, a) value for the
current state st at time step t, and at, rt, are the action
and reward at time step t. The Q-learning [15] algorithm
is considered to be an off-policy algorithm because the
convergence of the algorithm does not rely on the learning
policy π in updating the value function. The value function
is updated using the highest estimate of Q(s, a)t+1, meaning
the maximum Q(s, a) value of the observed available state-
action pairs Q(s, a)t+1 from the current state st. However, if a greedy learning policy is used with Q-learning, the learning converges similarly to an on-policy learning algorithm.
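A single Q-learning update following Equation (15), with illustrative α and γ values (the paper does not report the ones it used):

```python
def q_update(q_old, r_t, q_next_max, alpha=0.5, gamma=0.9):
    """One Q-learning step, Equation (15), with alpha factored out:
    Q_new = Q_old + alpha * (r_t + gamma * max_a' Q(s_{t+1}, a') - Q_old)."""
    return q_old + alpha * (r_t + gamma * q_next_max - q_old)

# Picking up a puck (r = +100) from a fresh state (all Q values still 0):
print(q_update(0.0, 100.0, 0.0))  # 50.0
```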
The respective Q(s, a) values of the robots are updated
using Equation (15). If the action at taken leads to a collision in the environment, the robot moves back to its previous state st. Otherwise the new state st+1 is set as the current state st. If t ≠ 30000, t is incremented by 1 and the process flow is
repeated until all 10 pucks are collected or t = 30000. When either of the termination conditions is met, t is reset to 0 and the number of runs is incremented by 1. In this paper, for each learning policy considered, 10 sets of 40 runs were conducted.
V. RESULTS AND DISCUSSION
The efficiency of the proposed FIFO-list learning policy
and other learning policies was evaluated by measuring their
performances in searching and retrieving the pucks, collision
avoidance and the amount of area explored. Data from 10 sets of 40 runs for each of the learning policies are presented
graphically in this section. The simulation was carried out
with a swarm of 3 Khepera II mobile robots and a total of
10 pucks to be searched and retrieved. Each run lasted for
30000 time-steps. The number of time-steps and runs was set empirically, as the majority of the learning trials stopped progressing under these conditions.
Table I. RESULTS SUMMARY

Learning Policy              Pucks   Collisions   Exploration
Greedy                        5.92      64.88       72.70%
ε-greedy                      6.06      68.47       80.50%
Boltzmann Distribution (BD)   2.80      80.03       87.60%
Simulated Annealing (SA)      3.97      84.85       89.76%
Probability Matching (PM)     6.25      55.40       92.30%
Adaptive Pursuit (AP)         6.75      45.25       95.11%
R-Max                         6.40     132.50       98.20%
FIFO-List                     7.68      43.13       98.03%
All the learning policies performed well compared against random search. The greedy policy performed very well in overcoming collisions in the environment. Although the collisions encountered average 64.88 per run, it can be observed that as learning progresses, the greedy learning policy improves to almost 0 collisions in the environment. The number of pucks collected averages 5.92 per run.
The greedy learning policy allowed 72.70% exploration
of the states in the environment. This can be expected as
the greedy learning policy does not attempt actions with
low Q(s, a) values. Therefore, once there is a collision, the state remains undesirable to the robot throughout the remainder of the simulation period and the upcoming runs. This obvious weakness contributed to the lack of exploration in the environment.
The parameter to be set for the ε-greedy learning policy is ε, which is the probability for the learning robots to explore.
The best ε setting was found to be 0.2 out of the range of
values tested. With these parameters the ε-greedy learning
policy gathered an average of 6.06 pucks. Collisions averaged 68.47 per simulation run. By setting ε to 0.2, the exploration amount, about 80.50%, surpassed that of the greedy learning policy. The robots adopting the ε-greedy learning policy have no control over when or how to explore. Therefore, the robots may choose to take a random action when an obvious action might lead to a better reward.
For the BD learning policy, the parameter Tmax is adjusted to 500 and the decay rate d is set to 0.009. These parameter settings allow the robots to perform marginally better in comparison with other tested values. High temperatures cause the robots to select available actions equally, while as T approaches 0, the robot selects its actions greedily. Selecting the temperature T is not as straightforward as selecting ε in the ε-greedy method, as we must take into consideration the actions' expected Q(s, a) values. However, these parameters can be fine-tuned to behave desirably according to the problem setup. The BD learning policy achieved 87.60% exploration, an average of 80.03 collisions and an average of 2.80 pucks gathered. The convergence of the BD learning policy was very poor for the time constraint fixed in this paper.
The SA learning policy [9] performed poorly, though better than the BD learning policy. This is because of the poor separability of the Q(s, a) values. The initial temperature T is set to 500. The SA learning policy enforces an additional amount of exploration ξ at the beginning of the learning process. This ratio is much higher for the Boltzmann distribution learning policy, which decreases exploration as the temperature T decays over time. However, with the same temperature setting, the learning policy gathered an average of 3.97 pucks, 84.85 collisions and 89.76% exploration of the environment.
The PM learning policy [10] performed well compared to the SA and BD learning policies. Collisions averaged 55.40 per simulation run and the pucks gathered averaged 6.25 per simulation run. 92.30% of the modeled environment was explored. The PM learning policy maintains a minimum probability Pmin which allows all the states to be considered even when they have low Q(s, a) values. This characteristic allows a decent amount of exploration in the environment. The main drawback of the PM learning policy is that regardless of how low the Q(s, a) values of the observed states are, the corresponding actions are still considered because of Pmin. This factor leads to more collisions and diverts the robots from the actual goal of the experiment. This learning policy cannot perform well when the differences between the available actions' Q(s, a) values are relatively small.
Pmin is set to 0.1 for both PM and AP learning policies.
The AP learning policy [11] performed desirably well in
reducing the amount of collisions in the environment with an
average of 45.25 collisions per simulation run. An average
of 6.75 pucks were gathered and 95.11% of the states in
the environment were explored. AP learning policy is an
improvement over PM learning policy. The AP learning
policy maintains Pmin similarly to PM, but the probabilities of states with lower Q(s, a) values are biased towards Pmin while the state with the highest Q(s, a) value is biased towards Pmax. This characteristic creates a distinct gap in probability between the states, allowing the state with the highest Q(s, a) value to have a higher probability of being selected. Hence clear improvements can be observed in comparison with the PM learning policy in all the objectives set in this experiment. However, actions leading to states with low Q(s, a) values are still possible, with lower probability compared to the probability matching learning policy.
Although the R-Max learning policy [12] performed badly
with an average of 132.50 collisions per simulation run,
as the learning process converges the amount of collisions
reduces significantly. The R-Max learning policy maintains
an approximate model of the environment at the beginning
of the learning stage by initializing all the available states
with maximum reward. This characteristic forces the robots
to visit the available states in the environment regardless of
the Q(s, a) value of the states until the states are visited Cnumber of times. This leads to a high amount of collision
in the environment at the beginning of the learning process.
After visiting the available states C amount of times, Rmax
applies ε-greedy learning policy to select between the avail-
able actions. The amount of pucks collected averages to 6.40
pucks per simulation run and 98.20% of the available states
were explored which is highest in comparison with the other
learning policies reported in this paper.
The FIFO-list learning policy outperformed all the learn-
ing policies for collision with an average of 43.13 collisions
a simulation run and amount of pucks gathered averaging
to 7.68 pucks per simulation run. An average of 98.03%of the states in the environment was explored which is
lower compared to R-max learning policy. The learning
robots adopting the proposed method maintains a FIFO list
which prevents the robots from revisiting the recently visited
states for N number of time-steps. This attribute builds a
perimeter that covers a region of recently visited states in the
environment forcing the learning robots to venture further
into the environment. This also allows the robots to avoid
being stuck in a region with locally optimal rewards while
promoting exploration. Higher N value promotes higher
exploration but in return decreases the average amount of
pucks collected and also increases the amount of collision
in the environment. Once the learning process starts to
converge, the list size N maintains its range within 3 to
7.
VI. CONCLUSION
This paper examined the effects of various learning poli-
cies on multi-objective foraging task using swarm robots
and RL techniques. The adopted learning policies from the
open literature are greedy, ε-greedy, Boltzmann distribution,
simulated annealing, probability matching, adaptive pursuit
strategy and R-Max. A comparative study was carried out on each of the learning policies. An improved learning policy, FIFO-list, was introduced in the same problem domain. The performance of the learning policies was measured based on the number of pucks collected, collisions encountered and exploration achieved within the simulation period. The proposed FIFO-list learning policy outperformed all the other learning policies in the collection of pucks and the collisions encountered in the environment. However, the exploration rate of the proposed method ranks second to the R-Max learning policy, although the difference is minimal. Based on the observed results, it can be concluded that the proposed FIFO-list learning policy performed very well in comparison with the learning policies in the literature for the multi-objective foraging task using swarm robots.
REFERENCES
[1] M. Brambilla, E. Ferrante, M. Birattari, and M. Dorigo, "Swarm robotics: A review from the swarm engineering perspective," IRIDIA, CoDE, Université Libre de Bruxelles, Brussels, Belgium, Tech. Rep. TR2012-014, 2012.
[2] Y. Cao, A. Fukunaga, and A. Kahng, "Cooperative mobile robotics: Antecedents and directions," Autonomous Robots, vol. 4, no. 1, pp. 7–27, 1997.
[3] S. Sang-Wook, Y. Hyun-Chang, and S. Kwee-Bo, "Behavior learning and evolution of swarm robot system for cooperative behavior," in Advanced Intelligent Mechatronics, 2009. AIM 2009. IEEE/ASME International Conference on. IEEE, 2009, pp. 673–678.
[4] R. Sutton and A. Barto, Reinforcement Learning: An Introduction. Cambridge Univ Press, 1998, vol. 1, no. 1.
[5] C. Gaskett, D. Wettergreen, and A. Zelinsky, "Q-learning in continuous state and action spaces," Advanced Topics in Artificial Intelligence, pp. 417–428, 1999.
[6] O. Kroemer and J. Peters, "Active exploration for robot parameter selection in episodic reinforcement learning," in Adaptive Dynamic Programming and Reinforcement Learning (ADPRL), 2011 IEEE Symposium on. IEEE, 2011, pp. 25–31.
[7] Y. Wang, H. Lang, and C. de Silva, "A hybrid visual servo controller for robust grasping by wheeled mobile robots," Mechatronics, IEEE/ASME Transactions on, vol. 15, no. 5, pp. 757–769, 2010.
[8] D. Carmel and S. Markovitch, "Exploration strategies for model-based learning in multi-agent systems: Exploration strategies," Autonomous Agents and Multi-Agent Systems, vol. 2, no. 2, pp. 141–172, 1999.
[9] M. Guo, Y. Liu, and J. Malec, "A new Q-learning algorithm based on the metropolis criterion," Systems, Man, and Cybernetics, Part B: Cybernetics, IEEE Transactions on, vol. 34, no. 5, pp. 2140–2143, 2004.
[10] D. Koulouriotis and A. Xanthopoulos, "Reinforcement learning and evolutionary algorithms for non-stationary multi-armed bandit problems," Applied Mathematics and Computation, vol. 196, no. 2, pp. 913–922, 2008.
[11] D. Thierens, "An adaptive pursuit strategy for allocating operator probabilities," in Proceedings of the 2005 Conference on Genetic and Evolutionary Computation. ACM, 2005, pp. 1539–1546.
[12] R. Brafman and M. Tennenholtz, "R-max - a general polynomial time algorithm for near-optimal reinforcement learning," The Journal of Machine Learning Research, vol. 3, pp. 213–231, 2003.
[13] T. Balch, "The impact of diversity on performance in multi-robot foraging," in Proceedings of the Third Annual Conference on Autonomous Agents. ACM, 1999, pp. 92–99.
[14] C. Watkins and P. Dayan, "Q-learning," Machine Learning, vol. 8, no. 3, pp. 279–292, 1992.
[15] M. Kareem Jaradat, M. Al-Rousan, and L. Quadan, "Reinforcement based mobile robot navigation in dynamic environment," Robotics and Computer-Integrated Manufacturing, vol. 27, no. 1, pp. 135–149, 2011.