Multi-Agent Planning under Uncertainty with Monte Carlo Q-Value Function
By: Jian Zhang, Yaozong Pan, Ruili Wang, Yuqiang Fang, Haitao Yang
| Format: | Article |
|---|---|
| Published: | MDPI AG 2019-04-01 |
Description
Decentralized partially observable Markov decision processes (Dec-POMDPs) are general multi-agent models for planning under uncertainty, but they are intractable to solve. The doubly exponential growth of the search space as the horizon increases makes a brute-force search impossible. Heuristic methods can quickly guide the search in the right direction and have been successful in different domains. In this paper, we propose a new Q-value function representation—the Monte Carlo Q-value function <inline-formula> <math display="inline"> <semantics> <mrow> <msub> <mi mathvariant="normal">Q</mi> <mrow> <mi>MC</mi> </mrow> </msub> </mrow> </semantics> </math> </inline-formula>—which is proved to be an upper bound of the optimal Q-value function <inline-formula> <math display="inline"> <semantics> <mrow> <msup> <mi mathvariant="normal">Q</mi> <mo>*</mo> </msup> </mrow> </semantics> </math> </inline-formula>. We introduce two Monte Carlo tree search enhancements—heavy playout for the simulation policy and adaptive sampling—to speed up the computation of <inline-formula> <math display="inline"> <semantics> <mrow> <msub> <mi mathvariant="normal">Q</mi> <mrow> <mi>MC</mi> </mrow> </msub> </mrow> </semantics> </math> </inline-formula>. Then, we present the clustering and expansion with Monte-Carlo algorithm (CEMC)—an offline planning algorithm that uses <inline-formula> <math display="inline"> <semantics> <mrow> <msub> <mi mathvariant="normal">Q</mi> <mrow> <mi>MC</mi> </mrow> </msub> </mrow> </semantics> </math> </inline-formula> as its Q-value function and is based on generalized multi-agent A* with incremental clustering and expansion (GMAA*-ICE, or ICE). CEMC calculates Q-value functions on demand, without computing and storing all Q-value functions, and employs an extended policy pruning strategy.
Finally, we present empirical results demonstrating that CEMC outperforms the best heuristic algorithm with a compact Q-value representation in terms of runtime for the same horizon, and uses less memory for larger problems.
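The paper's <inline-formula> <math display="inline"> <semantics> <mrow> <msub> <mi mathvariant="normal">Q</mi> <mrow> <mi>MC</mi> </mrow> </msub> </mrow> </semantics> </math> </inline-formula> is defined over Dec-POMDP joint policies and histories; as a rough single-agent illustration of the underlying idea only—estimating a Q-value by averaging the returns of sampled rollouts that take a given action first and then follow a rollout policy—a minimal sketch (the two-state toy model, function names, and parameters are all hypothetical, not from the paper):

```python
import random

def q_mc(step, reward, state, action, policy, horizon, n_samples=100, seed=0):
    """Monte Carlo Q-value estimate: average undiscounted return over
    n_samples rollouts that take `action` first, then follow `policy`."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_samples):
        s, a, ret = state, action, 0.0
        for _ in range(horizon):
            ret += reward(s, a)        # collect reward for this step
            s = step(s, a, rng)        # sample the next state
            a = policy(s)              # rollout (simulation) policy
        total += ret
    return total / n_samples

# Hypothetical two-state chain: action "go" moves state 0 to the
# absorbing state 1; reward 1 per step spent in state 1.
step = lambda s, a, rng: 1 if (s == 1 or a == "go") else 0
reward = lambda s, a: 1.0 if s == 1 else 0.0
policy = lambda s: "go"

estimate = q_mc(step, reward, state=0, action="go", policy=policy, horizon=3)
```

With a heavier (more informed) rollout policy and an adaptive number of samples—the two MCTS enhancements the abstract names—the same averaging scheme yields tighter estimates at lower cost.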