Solving the DRP Challenge using Deep Reinforcement Learning
To solve the drone routing problem, I decided to train decentralized but homogeneous drone agent policies and a centralized action-value function using the WQMIX (Weighted QMIX) algorithm. WQMIX extends the QMIX algorithm by introducing a weighting scheme so that the action-value function can prioritize better joint actions. Although a centralized policy would probably have offered better benchmark performance, the rationale behind adopting decentralized policies was to emulate real-world drone routing dynamics, where drones navigate autonomously based on local information.
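As a rough illustration of the weighting idea, here is a minimal sketch of an optimistically-weighted QMIX loss. It is illustrative only and not the exact loss used in my training code; the tensor names and `alpha` value are assumptions.

```python
import torch

def weighted_qmix_loss(q_tot, q_star, targets, alpha=0.1):
    """Sketch of an optimistically-weighted QMIX loss (illustrative only).

    q_tot   : Q-values from the monotonic mixing network for the taken joint actions
    q_star  : Q-values from the unrestricted centralised critic for the same actions
    targets : bootstrapped TD targets y
    alpha   : down-weighting factor (< 1) applied when the mixer already
              over-estimates the target, so better joint actions get priority
    """
    td_error = q_tot - targets
    # Full weight when the monotonic mixer under-estimates the target,
    # reduced weight (alpha) otherwise.
    w = torch.where(td_error < 0,
                    torch.ones_like(td_error),
                    torch.full_like(td_error, alpha))
    mixer_loss = (w.detach() * td_error ** 2).mean()
    critic_loss = ((q_star - targets) ** 2).mean()   # ordinary TD loss for Q*
    return mixer_loss + critic_loss
```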
Additionally, I leveraged RNNs to capture temporal dependencies and recurrent experience replay (from the R2D2 paper) to prioritize sequences with higher TD errors.
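For context, the per-agent network in such recurrent value-decomposition setups typically looks something like the sketch below. This is illustrative rather than my exact architecture, and the replay prioritization itself is not shown; the layer sizes are assumptions.

```python
import torch
import torch.nn as nn

class RecurrentAgentQNet(nn.Module):
    """Illustrative per-drone recurrent Q-network: an MLP encoder, a GRU cell
    that carries temporal context across steps, and a head producing one
    Q-value per node/action."""

    def __init__(self, obs_dim, n_actions, hidden_dim=64):
        super().__init__()
        self.encoder = nn.Linear(obs_dim, hidden_dim)
        self.rnn = nn.GRUCell(hidden_dim, hidden_dim)
        self.q_head = nn.Linear(hidden_dim, n_actions)

    def forward(self, obs, hidden):
        x = torch.relu(self.encoder(obs))
        hidden = self.rnn(x, hidden)           # memory carried between timesteps
        return self.q_head(hidden), hidden
```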
I'm grateful to the organizers for putting together this challenge; it has been a fantastic learning experience. As a junior researcher, experimenting with different MARL algorithms has deepened my understanding of the field.
Despite testing several SOTA algorithms like WQMIX and optimizing performance with experience replay and hyperparameter tuning, my algorithm was still unable to solve many of the benchmark environments. In hindsight, relying solely on fully decentralized policies may not have been the optimal strategy. I noticed there were instances of coordination failure due to agents being unaware of each other's goals. For example, in this recording, two of the agents rushed to their own goals and blocked the third agent from reaching its goal.
After further research, I realised that this problem is well suited to Message Passing Neural Networks (MPNNs), which enable agents to communicate and coordinate in graph-based environments like this one. I'm looking forward to trying out MPNNs for this problem after the competition.
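For reference, a toy message-passing layer capturing that idea might look like the following. It is purely illustrative and not tied to the DRP graph format or any particular MPNN library; the layer structure and dimensions are assumptions.

```python
import torch
import torch.nn as nn

class ToyMessagePassingLayer(nn.Module):
    """Toy MPNN layer: every node builds messages from its incident edges,
    aggregates them, and updates its embedding with a GRU cell."""

    def __init__(self, dim):
        super().__init__()
        self.msg_fn = nn.Linear(2 * dim, dim)   # message from (sender, receiver) features
        self.update_fn = nn.GRUCell(dim, dim)   # embedding update from aggregated messages

    def forward(self, node_feats, edges):
        # node_feats: (num_nodes, dim); edges: iterable of (src, dst) index pairs
        aggregated = torch.zeros_like(node_feats)
        for src, dst in edges:
            msg = torch.relu(self.msg_fn(torch.cat([node_feats[src], node_feats[dst]])))
            aggregated = aggregated.index_add(0, torch.tensor([dst]), msg.unsqueeze(0))
        return self.update_fn(aggregated, node_feats)
```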
The configurations I used for training on each map are listed under `configs`. Simply run the following commands to train a model for a specific environment:
git clone https://github.com/JaydenTeoh/Drone-Routing-WQMIX.git
pip3 install -e ./main
pip3 install -r ./main/requirements.txt
python train.py --map_name=<INCLUDE MAP NAME> --drone_num=<INCLUDE DRONE NUM>
Pretrained models are also available under `models`. To test the pretrained models, adjust the `drone_num` and `map_name` variables in the main function of `policy_tester.py` accordingly and execute it. The benchmark scores for the pretrained models can be found in the `benchmark_scores.json` file.
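For example, evaluating a pretrained 3-drone model on `map_3x3` would only require changing those two variables; the values below are just an example, and the exact layout of the main function may differ slightly.

```python
# Inside the main function of policy_tester.py, set the evaluation target:
drone_num = 3             # example value; match the pretrained model you load
map_name = "map_3x3"      # one of: map_3x3, map_aoba01, map_shibuya
```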
In this competition, participants are expected to develop `policy/policy.py`, which is essentially a mapping from input (`observation`) to output (`joint action`) at each step.
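Conceptually, that mapping boils down to something like the sketch below. The class and method names here are placeholders of my own, since the actual interface is fixed by the provided `policy/policy.py` template.

```python
import numpy as np

class ExamplePolicy:
    """Hypothetical observation -> joint-action mapping (placeholder names;
    the real interface is defined by the competition's policy/policy.py)."""

    def __init__(self, n_drones, n_nodes):
        self.n_drones = n_drones
        self.n_nodes = n_nodes

    def act(self, observations):
        # observations: one soft-hot vector of length 2*|V| per drone
        # (see the observation description below)
        joint_action = []
        for obs in observations:
            goal_half = obs[self.n_nodes:]                  # second half encodes the goal
            joint_action.append(int(np.argmax(goal_half)))  # naive: head for the goal node
        return joint_action
```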
- observation (obs): The observation $s^i$ for each drone consists of two parts: current location and goal position. They are in a soft-hot representation: the length of the vector $s^i=\left[s_1^i, \ldots, s_j^i, \ldots, s_{|V|}^i, s_{|V|+1}^i, \ldots, s_{|V|+j}^i, \ldots, s_{|V| \times 2}^i\right]$ is twice the number $|V|$ of nodes on the map (see the sketch after this list).
  - It marks a node $s_j^i$ with 1 if the drone occupies it, while the rest remain zero.
  - For drones located on edges, the vector values are defined by $s_j^i = 1 - \frac{len(loc^i, v_j)}{len(v_j, v_k)}$ and $s_k^i = 1 - s_j^i$ when drone $i$ traverses edge $(v_j, v_k)$, and 0 otherwise. Here, $loc^i = (l_x^i, l_y^i)$ represents drone $i$'s current coordinates and $len(\cdot,\cdot)$ represents the distance. As drone $i$ approaches node $v_j$, the value of $s_j^i$ increases.
  - It also includes field-of-view information, which marks a node $s_j^i$ in one-hot fashion with -1 if another drone occupies it.
- joint action: At each step, each drone chooses a node to move to. Consequently, we represent the action set $A$ using the node set $V$. A drone will wait at its current node if it chooses a non-adjacent node. The joint action comprises the individual actions of all drones.
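To make the soft-hot location encoding concrete, here is a small illustrative helper. It is my own example (`encode_location` is not a function from the DRP environment); the goal half of the vector and the field-of-view marking would be built analogously.

```python
import numpy as np

def encode_location(num_nodes, node_j, node_k=None, frac=0.0):
    """Illustrative soft-hot location encoding for one drone (not DRP code).

    node_j : node the drone occupies, or one endpoint of the edge it is on
    node_k : other endpoint when the drone is on edge (v_j, v_k), else None
    frac   : len(loc, v_j) / len(v_j, v_k), the fraction travelled away from v_j
    """
    s = np.zeros(num_nodes)
    if node_k is None:
        s[node_j] = 1.0                 # drone sits exactly on node v_j
    else:
        s[node_j] = 1.0 - frac          # s_j grows as the drone nears v_j
        s[node_k] = 1.0 - s[node_j]     # s_k = frac, so s_j + s_k = 1
    return s

# Drone halfway along edge (2, 5) on a 9-node map -> s_2 = s_5 = 0.5
print(encode_location(9, node_j=2, node_k=5, frac=0.5))
```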
Every time the drones take an action, the step count increases. In other words, every time the `step` function is executed, the number of steps increases by one.
The episode ends upon a conflict, upon exceeding 100 steps, or when all drones reach their goals; it then restarts with a new environment (unless otherwise specified, only the start and goal positions change).
The goal in this competition is to minimize the cost without any collisions occurring.
You can test your developed policy (`policy/policy.py`) by loading it in `policy_tester.py`.
Note
Since DRP is a gym-standard environment, you can develop your policy as you would for any usual gym-standard environment without relying on the `policy_tester.py` we provide. There is an example using pfrl.
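For illustration, a bare-bones interaction loop in the classic gym style could look like this. It assumes the 4-tuple `step` return and that `env` is a DRP environment you have already constructed as shown in the repository; the random joint action is just a placeholder policy.

```python
def random_rollout(env, max_steps=100):
    """Roll out one episode with random joint actions (illustrative only)."""
    obs = env.reset()
    done, steps = False, 0
    while not done and steps < max_steps:            # episodes are capped at 100 steps
        joint_action = env.action_space.sample()     # placeholder: random joint action
        obs, reward, done, info = env.step(joint_action)
        steps += 1                                   # each step() call adds one step
    return steps
```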
We use three maps for evaluation: `map_3x3`, `map_aoba01`, and `map_shibuya`.
Each map will be evaluated on various drone numbers and various start-goal pairs.
We call one pattern (a fixed map, number of drones, and start-goal pair) a problem; there are 30 problems in total, defined in `problem/problems.py`. (Participants are forbidden to alter this file.)
There are three classic patterns for calculating costs, as follows. The final cost is the sum of the costs of the 30 problems (more details). The objective is to minimize this final cost.
Once your policy (`policy/policy.py`) has been deployed, you can run `calculate_cost.py`, which will output a JSON file (`your_team_name.json`) including the cost (named `final cost`).
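For example, you could inspect the generated file as follows; the exact field names inside the JSON may differ, so check the file contents.

```python
import json

# Inspect the cost file produced by calculate_cost.py
with open("your_team_name.json") as f:
    result = json.load(f)
print(result)   # includes the final cost summed over the 30 problems
```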
Please refer to this page for more detailed information about the DRP environment.