Target-directed agents use self-generated targets to guide their behaviors for better generalization. These agents are prone to blindly chasing problematic targets, resulting in worse generalization and safety catastrophes. We show that these behaviors can result from delusions, which stem from improper designs around training: the agent may naturally come to hold false beliefs about certain targets. We identify different types of delusions via intuitive examples in controlled environments, and investigate their causes and mitigations. With these insights, we demonstrate how agents can be made to address delusions preemptively and autonomously. We empirically validate the effectiveness of the proposed strategies in correcting delusional behaviors and improving out-of-distribution generalization.
BibTeX
@misc{zhao2024identifyingaddressingdelusionstargetdirected,
title={Identifying and Addressing Delusions for Target-Directed Decision-Making},
author={Mingde Zhao and Tristan Sylvain and Doina Precup and Yoshua Bengio},
year={2024},
eprint={2410.07096},
archivePrefix={arXiv},
primaryClass={cs.AI},
url={https://arxiv.org/abs/2410.07096},
}
Related Research
- Leveraging Environment Interaction for Automated PDDL Generation and Planning with Large Language Models
  S. Mahdavi, R. Aoki, K. Tang, and Y. Cao. Conference on Neural Information Processing Systems (NeurIPS)
- Jump Starting Bandits with LLM-Generated Prior Knowledge
  P. A. Alamdari, Y. Cao, and K. Wilson. Conference on Empirical Methods in Natural Language Processing
- AdaFlood: Adaptive Flood Regularization
  W. Bae, Y. Ren, M. O. Ahmed, F. Tung, D. J. Sutherland, and G. Oliveira. Transactions on Machine Learning Research (TMLR)