Our AI needs to drive responsible actions and decision-making in rapidly changing environments. With the Causmos program at RBC Borealis, we aim to build machine intelligence beyond predictive ML for financial services by conducting research in areas such as causality, out-of-distribution (OOD) generalization, reasoning and planning in large language models (LLMs), and reinforcement learning.

The Tale of Three Actors

(Illustrations: a middle-aged woman in a cozy motel office in Banff, Canada, examining a large wall chart of historical data; a friendly cartoon robot whose chest screen displays rows of recommended items such as books and electronics; autonomous humanoid robots on the surface of Mars.)

What do a motel owner in a small town near Banff, a recommendation system and future autonomous agents (perhaps powered by LLMs) for Mars colonization have in common? The answer: they all need to make decisions and take actions, directly or through mediation, that affect the real world. These actions require them to have a model of their respective domains that goes beyond correlation.

The motel owner would know that, despite the correlation of high prices with low vacancy, the relationship is confounded by higher demand in peak seasons: raising prices will not generally help fill rooms. Training a recommendation system must account for the selection bias introduced into the data by any previous system’s policy, and the system must also learn from the outcomes of its own actions. Finally, an LLM-powered autonomous agent acting in a new world must be able to plan its actions beyond scenarios similar to existing observed or simulated data.

Likewise, in the financial services sector, ensuring our AI systems drive the right decisions for our clients amidst complex, evolving landscapes is crucial. Even when human oversight is integrated as a safety net in mission-critical tasks, AI influences real-world outcomes through human mediation. The Causmos research program at RBC Borealis is dedicated to advancing machine intelligence beyond predictive ML in this domain. Our focus encompasses different facets and approaches to model actions and their consequences, including reinforcement learning, planning, causal inference, and experimental design. Additionally, we delve into crucial related areas to ensure the safety of actions, such as reasoning, interpretability, adversarial robustness, and out-of-distribution (OOD) generalization. For some of our relevant publications, please see the list at the end.

Reinforcement Learning, Planning and Causal Inference

One way to comparatively understand the different approaches to intelligence for actions is through the lens of how much structured knowledge of the world is used and where that knowledge comes from.

The simplest and most classical framework for learning to act is arguably the bandit. This framework captures the essence of learning about the environment through exploration and using that knowledge to maximize the effectiveness of exploitation. The hallmark of bandits is their minimal reliance on prior or learned knowledge. Indeed, bandits assume a stateless environment, and in the multi-armed bandit setting, the only information learned is some (partial) knowledge of the reward distribution of the different levers. Extending the same core ideas to sequential decision-making yields model-free reinforcement learning (RL). Through essentially the same trial and error, model-free RL algorithms decipher the “goodness” of states without explicitly modelling the mechanisms of the environment. Bandits and model-free RL are particularly suitable when the feedback loop is quick and the cost of exploration is relatively low. This makes them apt for scenarios where the task at hand can be learned through direct interaction with the environment, without the need for an elaborate world model. Most Internet platforms use some kind of bandit or model-free RL for their advertising or recommendation systems (for example, Netflix’s use of multi-armed bandits and Google’s use of RL in YouTube recommendations). In the world of financial services, RBC Borealis and RBC Capital Markets built a highly successful RL-based trading agent for market-making called Aiden.
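To make the trial-and-error idea concrete, here is a minimal sketch of an epsilon-greedy agent for a multi-armed bandit. It is not a production recommender: the arm reward probabilities are invented for illustration, and the agent learns nothing about the environment beyond a running estimate of each arm’s mean reward.

```python
import random

def epsilon_greedy_bandit(true_means, n_steps=10_000, epsilon=0.1, seed=0):
    """Minimal epsilon-greedy agent for a stateless multi-armed bandit.

    true_means: per-arm expected reward (hidden from the agent).
    The agent keeps only a running estimate of each arm's reward --
    no model of the environment beyond that.
    """
    rng = random.Random(seed)
    n_arms = len(true_means)
    counts = [0] * n_arms          # how many times each arm was pulled
    estimates = [0.0] * n_arms     # running mean reward per arm
    total_reward = 0.0

    for _ in range(n_steps):
        # Explore with probability epsilon, otherwise exploit the best estimate.
        if rng.random() < epsilon:
            arm = rng.randrange(n_arms)
        else:
            arm = max(range(n_arms), key=lambda a: estimates[a])

        # Bernoulli reward drawn from the (hidden) true mean of the chosen arm.
        reward = 1.0 if rng.random() < true_means[arm] else 0.0
        counts[arm] += 1
        estimates[arm] += (reward - estimates[arm]) / counts[arm]
        total_reward += reward

    return estimates, total_reward

# Illustrative arm probabilities, e.g. click-through rates of three recommendations.
estimates, total = epsilon_greedy_bandit([0.05, 0.12, 0.08])
print(estimates, total)
```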

The downside of this model-free approach is that solving hard problems without structured knowledge of the world requires a lot of data gathered through trial-and-error exploration. This high sample complexity quickly becomes infeasible in settings where feedback from the environment is sparse, delayed or costly, or where certain exploration can be unsafe. In financial services, an example is the pricing of products, where outcomes require human negotiation to play out. This is where planning shines, if a good domain model is available. In problems where the states are discrete and fully observed and the transition dynamics are deterministic, planning is equivalent to finding a path in a graph from the initial state node to the goal state node. In continuous problems, this is akin to trajectory optimization and control. There is little need for trial-and-error exploration, and dangerous or undesirable states can be avoided by construction. There are variants of the problem with relaxed requirements and corresponding planning algorithms, but generally, a complete and accurate domain model is still needed, even if it is probabilistic. However, engineering robust domain models or simulators is challenging and hard to scale with the complexity of the domain. As we transition to open-domain problems, the feasibility of such an engineered domain model wanes, accentuating the importance of learning a competent world model. There are proposals to learn the domain model as part of the RL agent, giving rise to model-based RL. Despite its appealing premise and its introduction quite some time ago (for instance, the renowned Dyna architecture (Sutton, 1991)), model-based RL has yet to achieve widespread success.
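As a minimal sketch of planning in the simplest setting (discrete, fully observed, deterministic), here is breadth-first search over a state graph. The toy “approval workflow” domain and its transitions are invented purely for illustration; the point is that unsafe states can simply be excluded from the search, with no exploration required.

```python
from collections import deque

def plan(transitions, start, goal, unsafe=frozenset()):
    """Breadth-first search over a deterministic, fully observed state graph.

    transitions: dict mapping state -> {action: next_state}.
    Returns a list of actions from start to goal, or None if no plan exists.
    Unsafe states are never expanded, so no trial and error is needed to avoid them.
    """
    frontier = deque([start])
    parent = {start: None}  # state -> (previous state, action taken)

    while frontier:
        state = frontier.popleft()
        if state == goal:
            # Reconstruct the action sequence by walking back to the start.
            actions = []
            while parent[state] is not None:
                state, action = parent[state]
                actions.append(action)
            return list(reversed(actions))
        for action, nxt in transitions.get(state, {}).items():
            if nxt not in parent and nxt not in unsafe:
                parent[nxt] = (state, action)
                frontier.append(nxt)
    return None

# Toy domain: move a document through an approval workflow.
transitions = {
    "draft":  {"submit": "review"},
    "review": {"approve": "approved", "reject": "draft", "escalate": "risk"},
    "risk":   {"approve": "approved"},
}
print(plan(transitions, "draft", "approved", unsafe={"risk"}))  # ['submit', 'approve']
```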

Outside the RL and planning literature, causal inference offers an alternative way to study the effect of actions, i.e., treatments in this context. Tools of causal inference leverage domain knowledge and limited data to separate causation from mere correlation. Through meticulously designed experiments or observational studies, it endeavours to infer how the world might react to various treatments. Causal inference relies much more heavily on domain knowledge, usually imparted by humans, although some of it can be learned from data via causal discovery, given human-supplied assumptions. Unlike model-free RL, causal inference hinges on a more structured understanding of the domain. And unlike planning, causal inference does not require a complete model of the domain, only what is necessary to complement the data in answering specific causal queries. On the one hand, this allows for more informed actions with less data; on the other hand, causal inference (CI) requires more human design effort and is typically less autonomous. The formulation in CI is not centred around the notion of reward, but it can be combined with a decision-making framework to optimize actions. For example, in causal bandit learning, inferring some structural information about the bandit’s arms makes bandit learning provably more efficient. More generally, there is an emergent field of causal reinforcement learning. Finally, there is an interesting new class of model called Generative Flow Networks (GFlowNets) (Bengio et al., 2021), which allows neural networks to learn from rewards, as in RL, to generate graph structures such as causal graphs.
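Returning to the motel example, here is a minimal sketch of why this matters. The data-generating process and all its numbers are invented: peak season (the confounder) drives both higher prices and higher occupancy, while the true causal effect of a higher price on occupancy is slightly negative. The naive comparison gets the sign wrong; stratifying on season (a backdoor adjustment) recovers the true effect.

```python
import random

random.seed(0)

def simulate(n=100_000):
    """Toy data generator: peak season drives both high prices and high occupancy,
    while (all else equal) a higher price slightly *reduces* occupancy."""
    rows = []
    for _ in range(n):
        peak = random.random() < 0.5                    # confounder: season
        high_price = random.random() < (0.8 if peak else 0.2)
        p_occupied = 0.7 if peak else 0.3
        p_occupied -= 0.1 if high_price else 0.0        # true causal effect of price
        occupied = random.random() < p_occupied
        rows.append((peak, high_price, occupied))
    return rows

def mean_occupancy(rows, price):
    sel = [occ for _, hp, occ in rows if hp == price]
    return sum(sel) / len(sel)

def adjusted_occupancy(rows, price):
    """Backdoor adjustment: average occupancy given price within each season,
    weighted by the overall frequency of that season."""
    total = 0.0
    for peak in (True, False):
        stratum = [occ for pk, hp, occ in rows if pk == peak and hp == price]
        weight = sum(1 for pk, _, _ in rows if pk == peak) / len(rows)
        total += weight * (sum(stratum) / len(stratum))
    return total

rows = simulate()
naive = mean_occupancy(rows, True) - mean_occupancy(rows, False)
adjusted = adjusted_occupancy(rows, True) - adjusted_occupancy(rows, False)
print(f"naive effect of high price on occupancy: {naive:+.2f}")     # misleadingly positive
print(f"season-adjusted (backdoor) effect:       {adjusted:+.2f}")  # close to the true -0.10
```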

In summary, bandits and model-free RL perform well in domains with a high volume of rapid feedback and a low cost of exploration. However, in certain areas of financial services, regulatory constraints or other limitations often result in actions that yield delayed or sparse feedback, making exploration costly or restricted. In such scenarios, it is unsurprising that a deeper understanding of the domain reduces the need for trial-and-error exploration. We have achieved some success in building systems capable of making intelligent decisions despite delayed or sparse feedback, which we hope to reveal and delve deeper into in the future. Figure 2 summarizes the comparison of the different types of technologies for intelligent actions.

Figure 2: Comparing the different frameworks for intelligent actions on four dimensions: (1) the amount of domain/world knowledge used (x-axis); (2) the degree of autonomy versus human design required for building the system (y-axis); (3) the maturity of the technology (colour) and (4) the generality or potential generality of the technology (size). Before LLMs, among methods that have found some commercial successes, unsurprisingly, there is a negative correlation between the amount of domain knowledge the method requires and the degree of autonomy. These technologies are all limited in their generality. Bandit and model-free RL are not sample efficient enough to be applicable in scenarios with slow or sparse feedback. Traditionally, both causal inference and planning require too much human design effort in complex problems. So, the solution would not be commercially viable unless the problem is extremely valuable. Model-based RL has promised to break this pattern, but the technology is not yet working well.

LLMs and Autonomous Agents

With recent advancements, Large Language Models (LLMs) can now seamlessly leverage an extensive reservoir of knowledge and skills. This has ignited a new-found hope that LLM-based autonomous agents can formulate plans to tackle intricate problems, utilizing their pre-existing knowledge acquired through pre-training or acquiring new skills on the fly through in-context learning in novel situations. Furthermore, using text as the unified communication interface, AI agents fulfilling human requests can formulate their own subgoals, plan accordingly, and perform various tasks toward achieving the human-specified goal, all by composing calls to LLMs and external tools. Or so it seems.

Underneath the excitement generated by ambitious demos and rapid progress from academia, industry and hobbyists alike, there lurk fundamental challenges. We see at least three core issues as of today:

  1. restrictions from LLMs’ communication bandwidth;
  2. inherent limitations in LLMs’ reasoning and planning ability;
  3. misalignment between LLMs’ representations of knowledge and states and those of humans or reality.

For the first issue, LLM agents must retrieve external data, read and write to external memory, and use tools. All of this requires them to communicate via a text interface with a finite context window, so both the length of the messages and the information density per unit of length are limited. Additionally, many advanced approaches for enhancing reasoning and planning involve layers of LLM calls, such as Tree-of-Thought (Yao et al., 2023a), Graph-of-Thought (Besta et al., 2023), ReAct (Yao et al., 2023b) and Reflexion (Shinn et al., 2023). The limited communication bandwidth therefore makes it hard for the agent to maintain the consistency of its reasoning and actions without losing local nuances or global context when performing complex tasks. With the rapid progress in scaling the context window size, this is likely the easiest problem to resolve. At the time of writing, GPT-4 Turbo supports a 128K-token context window. However, it remains to be seen whether a long but finite communication bandwidth is sufficient when LLM agents need to tackle complex real-world challenges that could require a variable number of steps.
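To make the bandwidth constraint concrete, here is a bare-bones sketch of the kind of agent loop that ReAct-style approaches build on. The `call_llm` and `run_tool` functions, the prompt format, and the crude character-based truncation are placeholders rather than any particular framework’s API; the point is that everything the agent knows must be re-serialized through a finite text window at every step.

```python
MAX_CONTEXT_CHARS = 8_000  # stand-in for the model's finite context window

def call_llm(prompt: str) -> str:
    """Placeholder for a call to some LLM; expected to return text such as
    'Thought: ...\nAction: search[...]' or 'Final: ...'."""
    raise NotImplementedError

def run_tool(action: str) -> str:
    """Placeholder for dispatching to a search engine, calculator, database, etc."""
    raise NotImplementedError

def react_agent(task: str, max_steps: int = 10) -> str:
    scratchpad = ""  # the agent's entire memory of its own reasoning and observations
    for _ in range(max_steps):
        prompt = f"Task: {task}\n{scratchpad}\nThought:"
        # Everything the agent knows must fit back into the prompt each step,
        # so older details are dropped once the scratchpad outgrows the window.
        if len(prompt) > MAX_CONTEXT_CHARS:
            prompt = prompt[-MAX_CONTEXT_CHARS:]
        reply = call_llm(prompt)
        if reply.startswith("Final:"):
            return reply.removeprefix("Final:").strip()
        scratchpad += f"\n{reply}"
        if "Action:" in reply:
            action = reply.split("Action:", 1)[1].strip()
            observation = run_tool(action)
            scratchpad += f"\nObservation: {observation}"
    return "gave up: step budget exhausted"
```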

For the second issue, several studies have shown that even the most powerful LLMs can fail to extrapolate to longer tasks that involve composing or repeating the same simple steps, for example, in arithmetic (Dziri et al., 2023). While arithmetic can be solved by giving LLMs access to a calculator (Schick et al., 2023), in the context of planning complex actions, the same underlying issue means LLMs can fail to track state dynamics properly. Understanding the root cause of this limitation, whether it is the Transformer architecture, the auto-regressive nature of training and inference, the data, or something else, is an open problem and active area of research.

Finally, despite efforts to align LLMs for safety and usefulness in chat, such alignment only guard-rails LLMs’ interactions with humans. It may not correct fundamental mistakes in how their representations deviate from the true causality of the world or from the representations of humans. For example, Zečević et al. (2023) showed theoretically, and Jin et al. (2023) empirically, that today’s LLMs do not understand causality. The deviation from causal reality means that LLMs learn shortcuts and spurious correlations that fail in out-of-distribution settings, as shown for mathematical reasoning by Stolfo et al. (2023). The deviation from human representations means they are prone to adversarial attacks (Zou et al., 2023), a failure mode that humans do not fall prey to.

Closing Remarks

From the motel owner’s intuitive grasp of their trade to the recommendation systems that permeate the digital realm to the awe-inspiring prospects of autonomous Martian colonization, one simple truth remains evident: intelligent actions need world knowledge. On the other hand, it is fair to ask: does knowledge need action? In other words, can robust world knowledge be acquired through passive observations alone? Or would it be fundamentally flawed without training on the outcomes of one’s own actions?

In restricted domains, fitting curves with the correct hypothesis class could potentially yield perfect OOD generalization. In the open domain, with the success of LLMs pre-trained on massive corpora, many people believe that scaling, perhaps with multi-modal data, is “all you need” for AI, while others think there are fundamental flaws in the current approach. Time and more research will tell, but one thing is sure: answers to questions like this have profound implications for both the long-term roadmap of the AI community and immediate applications that demand robustness, such as in financial services.

Appendix: Selected Relevant Borealis Work

RL & Planning

Interpretability and disentanglement

Reasoning and OOD generalization

Adversarial learning and robustness