TL;DR: Residual RL is a powerful strategy for adapting a pretrained base policy to new environments, but standard Residual RL struggles with uncontrolled exploration and stochastic base policies. We propose two improvements to Residual RL: we leverage uncertainty estimation to contain exploration, and we introduce an asymmetric actor-critic algorithm for off-policy learning.
Residual Reinforcement Learning (RL) is a popular approach for adapting pretrained policies by learning a lightweight residual policy that provides corrective actions. While Residual RL is more sample-efficient than finetuning the entire base policy, existing methods struggle with sparse rewards and are designed for deterministic base policies. We propose two improvements to Residual RL that further enhance its sample efficiency and make it suitable for stochastic base policies. First, we leverage uncertainty estimates of the base policy to focus exploration on regions in which the base policy is not confident. Second, we propose a simple modification to off-policy residual learning that allows it to observe base actions and better handle stochastic base policies. We evaluate our method with both Gaussian-based and Diffusion-based stochastic base policies on tasks from Robosuite and D4RL, and compare against state-of-the-art finetuning methods, demo-augmented RL methods, and other Residual RL methods. Our algorithm significantly outperforms existing baselines in a variety of simulation benchmark environments. We also deploy our learned policies in the real world to demonstrate their robustness with zero-shot sim-to-real transfer.
Our first idea is to control exploration around the base policy using uncertainty estimation:
✅ Let the base policy act autonomously when it is confident
⚡ Activate the residual policy only when needed
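The gating above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: `uncertainty` stands in for whichever scalar estimate is used (e.g. ensemble variance or distance-to-data), and `threshold` is a hypothetical tuning parameter.

```python
import numpy as np

def gated_action(base_action, residual_action, uncertainty, threshold=0.1):
    """Apply the residual correction only where the base policy is uncertain.

    When the uncertainty estimate is below `threshold`, the base policy
    acts autonomously; otherwise the residual is added and the result is
    clipped to the action bounds (here assumed to be [-1, 1]).
    """
    if uncertainty > threshold:
        return np.clip(base_action + residual_action, -1.0, 1.0)
    return base_action
```

Gating this way keeps exploration local to states the base policy handles poorly, instead of perturbing every action.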
Our second idea is to adapt the Residual RL algorithm to stochastic base policies using an asymmetric actor-critic approach: we learn the critic on the combined action (i.e., base action + residual action) and the actor on only the residual action, while letting the actor observe the sampled base action.
Our findings reveal that this asymmetric architecture yields significant performance improvements for stochastic base policies, while both architectures perform comparably for deterministic base policies.
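The asymmetry can be made concrete with a toy sketch. The linear maps below are hypothetical stand-ins for neural networks; the point is only the input/output structure: the residual actor conditions on the sampled base action, and the critic scores the combined action that is actually executed.

```python
import numpy as np

rng = np.random.default_rng(0)
state_dim, act_dim = 4, 2

# Hypothetical linear parameterisations standing in for actor/critic networks.
W_actor = rng.normal(size=(act_dim, state_dim + act_dim)) * 0.01
W_critic = rng.normal(size=(1, state_dim + act_dim)) * 0.01

def residual_actor(state, base_action):
    # Asymmetric input: the residual actor sees the sampled base action,
    # so it can correct each draw from a stochastic base policy.
    return W_actor @ np.concatenate([state, base_action])

def critic(state, combined_action):
    # The critic evaluates the combined action sent to the environment.
    return float(W_critic @ np.concatenate([state, combined_action]))

state = rng.normal(size=state_dim)
base_action = rng.normal(size=act_dim)  # sample from the stochastic base policy
combined = base_action + residual_actor(state, base_action)
q_value = critic(state, combined)
```

Because the critic is defined over the combined action, the standard off-policy Bellman backup applies unchanged, while the actor only ever outputs the (typically small) residual.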
We conduct experiments with two kinds of uncertainty metrics (distance-to-data and ensemble variance) and with both Gaussian-mixture-model (GMM) and diffusion-based base policies across multiple tasks and environments. Our method consistently outperforms direct finetuning methods and other Residual RL methods.
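The two uncertainty metrics admit simple proxies. The functions below are illustrative sketches under our own simplifying assumptions (Euclidean nearest-neighbour distance, variance averaged over action dimensions), not the exact estimators from the paper.

```python
import numpy as np

def distance_to_data(state, dataset_states):
    """Nearest-neighbour distance from the current state to the training
    dataset: far-from-data states are treated as uncertain."""
    return float(np.min(np.linalg.norm(dataset_states - state, axis=1)))

def ensemble_variance(actions):
    """Disagreement across actions proposed by an ensemble of base
    policies for the same state, averaged over action dimensions."""
    return float(np.mean(np.var(actions, axis=0)))
```

Either scalar can drive the residual-activation gate described above; distance-to-data needs no extra training, while ensemble variance requires training several base policies.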
Diffusion base policy
GMM base policy
Sim-to-Real Transfer
We also deploy the learned policies in the real world via zero-shot sim-to-real transfer.
@ARTICLE{11267054,
author={Dodeja, Lakshita and Schmeckpeper, Karl and Vats, Shivam and Weng, Thomas and Jia, Mingxi and Konidaris, George and Tellex, Stefanie},
journal={IEEE Robotics and Automation Letters},
title={Accelerating Residual Reinforcement Learning With Uncertainty Estimation},
year={2026},
volume={11},
number={1},
pages={970-977},
keywords={Uncertainty;Stochastic processes;Reinforcement learning;Imitation learning;Training;Robustness;Tuning;Transforms;Training data;Robot control;Reinforcement learning (RL);deep learning methods;machine learning for robot control},
doi={10.1109/LRA.2025.3636808}}