We develop a new algorithm for online planning in large scale sequential decision problems that improves upon the worst case efficiency of UCT. The idea is to augment Monte-Carlo Tree Search (MCTS) with maximum entropy policy optimization, evaluating each search node by softmax values back-propagated from simulation. To establish the effectiveness of this approach, we first investigate the single-step decision problem, stochastic softmax bandits, and show that softmax values can be estimated at an optimal convergence rate in terms of mean squared error. We then extend this approach to general sequential decision making by developing a general MCTS algorithm, Maximum Entropy for Tree Search (MENTS). We prove that the probability of MENTS failing to identify the best decision at the root decays exponentially, which fundamentally improves the polynomial convergence rate of UCT. Our experimental results also demonstrate that MENTS is more sample efficient than UCT in both synthetic problems and Atari 2600 games.
Bibtex
@inproceedings{xiao2019maximum,
title={Maximum Entropy Monte-Carlo Planning},
author={Xiao, Chenjun and Huang, Ruitong and Mei, Jincheng and Schuurmans, Dale and M{\”u}ller, Martin},
booktitle={Advances in Neural Information Processing Systems},
pages={9516–9524},
year={2019}
}
Related Research
-
Our NeurIPS 2021 Reading List
Our NeurIPS 2021 Reading List
Y. Cao, K. Y. C. Lui, T. Durand, J. He, P. Xu, N. Mehrasa, A. Radovic, A. Lehrmann, R. Deng, A. Abdi, M. Schlegel, and S. Liu.
Computer Vision; Data Visualization; Graph Representation Learning; Learning And Generalization; Natural Language Processing; Optimization; Reinforcement Learning; Time series Modelling; Unsupervised Learning
Research
-
Heterogeneous Multi-task Learning with Expert Diversity
Heterogeneous Multi-task Learning with Expert Diversity
G. Oliveira, and F. Tung.
Computer Vision; Natural Language Processing; Reinforcement Learning
Research
-
Desired characteristics for real-world RL agents
Desired characteristics for real-world RL agents
P. Hernandez-Leal, and Y. Gao.
Research