Non-linear policy graph improvement algorithm for finding policies for decentralized partially observable Markov decision processes, or Dec-POMDPs.
A Dec-POMDP is a decentralized decision-making problem under uncertainty, where a team of agents acts jointly to maximize the expected sum of shared rewards. The objective is to find for each agent a policy that tells how that agent should act to maximize the shared reward.
The algorithm implemented here is suitable for standard Dec-POMDPs and for Dec-POMDPs with an information-theoretic reward function, that is, a reward that depends on the joint information state of the agents. Such Dec-POMDPs can model decentralized information gathering tasks, such as monitoring or surveillance tasks.
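In symbols, with notation chosen here for illustration rather than taken verbatim from the papers, the team seeks a joint policy $\pi = (\pi_1, \ldots, \pi_n)$, one local policy per agent, that maximizes the expected sum of shared rewards over the planning horizon $T$:

$$
V(\pi) = \mathbb{E}\left[\sum_{t=0}^{T-1} R_t \,\middle|\, \pi\right],
$$

where, in the information-theoretic variant, the reward on the final step is a function of the agents' joint belief (for example its negative Shannon entropy, see the `-e` option below) rather than of the state and joint action.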
The algorithm represents policies as finite state controllers, also known as policy graphs. The policy graphs are improved iteratively by finding the action labels of the nodes and the configuration of the out-edges that correspond to observations.
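To make the representation concrete, here is a minimal conceptual sketch of a policy graph node; the type and member names are illustrative and do not correspond to the repository's actual data structures.

```cpp
#include <cstddef>
#include <unordered_map>
#include <vector>

// Conceptual sketch only (names are illustrative, not the repository's API):
// a policy graph node carries an action label and, for each observation,
// an out-edge to a successor node. Policy graph improvement alternates
// between re-labelling node actions and re-wiring these out-edges.
struct PolicyGraphNode {
  std::size_t action;                                // index of the action label
  std::unordered_map<std::size_t, std::size_t> next; // observation index -> successor node index
};

// One agent's local policy is a collection of such nodes: execution starts at
// an initial node, takes its action, and follows the out-edge that matches
// the received observation.
using LocalPolicyGraph = std::vector<PolicyGraphNode>;
```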
You can find a high-level summary of this work in this blog post.
The follow-up to this work is Multi-agent active perception with prediction rewards, which you may be interested in checking out as well.
For more background information on the algorithm and Dec-POMDPs with information-theoretic rewards, you can refer to the publications below. If you find this repository useful, please consider citing them.
- Mikko Lauri, Joni Pajarinen, Jan Peters. Multi-agent active information gathering in discrete and continuous-state decentralized POMDPs by policy graph improvement, Autonomous Agents and Multi-Agent Systems, 34(42), 2020.
BibTeX:
```
@article{Lauri2020multiagent,
  author  = {Lauri, Mikko and Pajarinen, Joni and Peters, Jan},
  title   = {Multi-agent active information gathering in discrete and continuous-state decentralized POMDPs by policy graph improvement},
  journal = {Autonomous Agents and Multi-Agent Systems},
  volume  = {34},
  number  = {42},
  year    = {2020},
  doi     = {10.1007/s10458-020-09467-6}
}
```
- Mikko Lauri, Joni Pajarinen, Jan Peters. Information gathering in decentralized POMDPs by policy graph improvement, in 18th International Conference on Autonomous Agents and Multiagent Systems (AAMAS 2019).
Link to conference proceedings. You can find a preprint on arXiv.
BibTeX:
```
@inproceedings{Lauri2019information,
  author    = {Mikko Lauri and Joni Pajarinen and Jan Peters},
  title     = {Information gathering in decentralized POMDPs by policy graph improvement},
  booktitle = {Autonomous Agents and Multiagent Systems (AAMAS)},
  year      = {2019}
}
```
Your compiler must support C++14.
Required system libraries:
- eigen3
- Boost libraries: program_options regex graph
Required external libraries:
- MADP toolbox: installed automatically, or you can provide a path to an existing installation via the CMake option `-DMADPPATH=/path/to/madp`. If you need a different version than the auto-installed one, modify the `GIT_TAG` in `CMakeLists.txt` or use the `-DMADPPATH` option to point to your preferred installation.
Compile with:
```
mkdir build && cd build
cmake ..
make
```
You can use the `xdot` viewer to visualize the policy graphs.
The MAV and Information gathering rovers domains from the AAMAS paper are included in the subfolder `problems`.
After you have compiled the software, you can try to find a policy for the classical Tiger Dec-POMDP.
We set the random seed to 100 with the option `-s` and run 100 improvement steps on the policy graphs with the `-i` option:
```
./NPGI -s 100 -i 100 ../problems/dectiger.dpomdp
```
You will see some output with policy values and timings. The software will create the following files in the current directory:
- `best_policy_agentI.dot`: the best policy for agent `I` in GraphViz DOT format
- `best_value.txt`: the value of the best joint policy, found by applying the best policy graph of each agent
- `stepXYZ_agentI.dot`: the policy graph of agent `I` after improvement step `XYZ`, in GraphViz DOT format
- `policy_values.txt`: the values of the policies found at each improvement step
- `duration_microseconds.txt`: how long each improvement step took, in microseconds
The software supports using negative Shannon information entropy as the final step reward.
This can be useful for information gathering tasks such as those discussed in the paper.
If you want to solve a Dec-POMDP that uses negative entropy as the reward on the final step, just run `NPGI` with the option `-e`.
For example, we can solve the MAV problem from the AAMAS paper for a horizon of 3 steps, with a policy graph width of 2.
First, remember to uncompress the problem file in the `problems` folder!
Then run:
```
./NPGI -e -h 3 -w 2 -i 30 ../problems/mav.dpomdp
```
The outputs are as in the usual Dec-POMDP case.
The software only supports using the negative entropy of the belief state as a final reward.
If you want to modify the information reward, you can do so by editing the two functions named `final_reward` in the file `RewardMatrix.hpp`.
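For orientation, a minimal sketch of the negative-entropy computation is shown below; it is illustrative only, and the actual `final_reward` functions in `RewardMatrix.hpp` have their own signatures, which you should follow when editing them.

```cpp
#include <Eigen/Dense>
#include <cmath>

// Illustrative sketch only (not the repository's actual interface):
// negative Shannon entropy of a belief, i.e. a probability vector over
// states, R(b) = sum_s b(s) * log b(s), with 0 * log 0 taken as 0.
// Values closer to 0 indicate a more certain belief.
double negative_entropy(const Eigen::VectorXd& belief) {
  double value = 0.0;
  for (Eigen::Index s = 0; s < belief.size(); ++s) {
    const double p = belief(s);
    if (p > 0.0) {
      value += p * std::log(p);
    }
  }
  return value;
}
```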
You can look at the files `solve_graphsensing.cpp`, `GraphSensingProblem.h`, and `GraphSensingProblem.cpp` to see how you can define your own problem with a continuous state.
More details on this problem and the algorithm variant that solves it are provided in the journal article linked above.
- Mikko Lauri - University of Hamburg, Hamburg, Germany
Licensed under the Apache 2.0 license - see LICENSE for details.
- Joni Pajarinen for providing a reference implementation of the original policy graph improvement algorithm
- We use the parser from the excellent MADP toolbox to read .dpomdp files.
- We use combinations_cpp by Artem Amirkhanov to enumerate combinations.