In-context reinforcement learning through bayesian fusion of context and value prior
Episode description
This paper introduces how we can adapt quickly to new tasks without updating model parameters using a framework called SPICE (Shaping Policies In-Context with Ensemble prior), a novel Bayesian In-Context Reinforcement Learning method.Unlike existing models that rely on optimal data, SPICE utilizes a deep ensemble to learn a value prior from suboptimal trajectories and refines this prior at test-time through Bayesian updates. This approach effectively addresses the behavior-policy bias found in traditional supervised learning by using an Upper-Confidence Bound (UCB) rule to encourage principled exploration. Theoretical analysis proves that SPICE achieves optimal regret bounds in both stochastic bandits and finite-horizon environments. Empirical results across various benchmarks confirm that the method is robust under distribution shifts and significantly outperforms prior meta-reinforcement learning approaches. Ultimately, the research offers a scalable framework for deploying reinforcement learning in real-world domains like robotics and autonomous driving where data may be limited or biased.
