DISSERTATION DEFENSE
Author: Jianhai Su
Advisor: Dr. Qi Zhang
Date: March 19, 2026
Time: 11:40 am - 1:40 pm (ET)
Place: Online/Room 2267, Storey Innovation Center
Remote join (Microsoft Teams):
Link: https://teams.microsoft.com/meet/22389270607188?p=XPFrAyxA5Qo0IIh3tV
Meeting ID: 223 892 706 071 88
Passcode: YC7bg7zH
Abstract
Improving the learning efficiency of reinforcement learning (RL) agents remains a fundamental challenge, particularly in environments characterized by sparse rewards, long horizons, or partial observability. This dissertation investigates how RL agents can learn more efficiently through two complementary forms of guidance: mechanisms derived purely from an agent’s own experience and mechanisms that leverage reasoning priors from pretrained large language models (LLMs).
On the experience-driven side, the first study develops a general framework for incorporating offline RL algorithms as subroutines within an online RL process. In this framework, an agent periodically repurposes its replay buffer as an offline dataset and applies offline optimization methods such as Implicit Q-Learning (IQL) or Calibrated Q-Learning (Cal-QL). Through systematic empirical analysis across diverse benchmark environments, this study characterizes when such experience-driven guidance improves policy quality under fixed interaction budgets and identifies several practical factors that influence its effectiveness.
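The framework described above can be sketched in a few lines. This is a minimal illustration, not the dissertation's implementation: `env_step`, `policy`, and `offline_update` are hypothetical placeholders, and `offline_update` stands in for an offline optimizer such as IQL or Cal-QL applied to the buffer-as-dataset.

```python
def online_with_offline_subroutine(num_steps, offline_every,
                                   env_step, policy, offline_update):
    """Online RL loop that periodically repurposes its replay buffer
    as an offline dataset and hands it to an offline RL subroutine.

    All callables here are hypothetical placeholders for illustration:
    - env_step(state, action) -> (next_state, reward)
    - policy(state) -> action
    - offline_update(dataset)  # e.g., one pass of IQL or Cal-QL
    """
    replay_buffer = []
    state = 0  # toy initial state for the sketch
    for t in range(1, num_steps + 1):
        action = policy(state)
        next_state, reward = env_step(state, action)
        replay_buffer.append((state, action, reward, next_state))
        state = next_state
        if t % offline_every == 0:
            # Treat the current replay buffer as a fixed offline dataset.
            offline_update(list(replay_buffer))
    return replay_buffer
```

The key design point is the fixed interaction budget: the offline subroutine consumes no additional environment steps, only the transitions already collected.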
On the LLM-based side, the dissertation presents two complementary grounding approaches. The second study investigates implicit grounding, where a Flamingo-style vision–language model with an embedded pretrained language model acts as the high-level policy in a hierarchical RL agent. The agent processes multimodal interaction histories and proposes subgoals for a library of pretrained low-level skills, grounding pretrained language priors through policy learning.
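The hierarchical control pattern in this study can be sketched abstractly. In this hypothetical sketch, `high_level_policy` stands in for the Flamingo-style vision-language model, `skills` for the library of pretrained low-level skills, and `history` for the multimodal interaction history; none of these names come from the dissertation itself.

```python
def hierarchical_episode(history, high_level_policy, skills, max_subgoals):
    """One episode of the hierarchical loop: the high-level policy reads
    the interaction history and proposes a subgoal; the corresponding
    pretrained low-level skill executes it and the history grows.

    `high_level_policy(history)` returns a skill index, or None to stop.
    """
    for _ in range(max_subgoals):
        subgoal = high_level_policy(history)      # VLM proposes a subgoal
        if subgoal is None:                       # policy signals termination
            break
        outcome = skills[subgoal](history)        # pretrained skill acts
        history = history + [(subgoal, outcome)]  # extend the history
    return history
```

Only the high-level policy is trained here; grounding happens because its subgoal proposals are evaluated through the fixed low-level skills acting in the environment.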
The third study introduces an explicit grounding framework in which reasoning traces produced by an external LLM are distilled into a latent reasoning module within a value-based RL agent. A potential function defined over this latent space is then learned from the agent’s trajectories and used for potential-based reward shaping. This dual-track framework combines reasoning transfer with interaction-driven learning to improve both learning efficiency and final policy performance.
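Potential-based reward shaping has a standard form that the framework above builds on: the shaped reward is r' = r + γΦ(s') - Φ(s), which is known to preserve the optimal policy. In the dissertation, Φ is learned over the latent reasoning space; the sketch below shows only the standard shaping arithmetic with a generic potential function.

```python
def shaped_reward(r, phi_s, phi_next, gamma):
    """Standard potential-based shaping term:
        r' = r + gamma * Phi(s') - Phi(s)
    Here phi_s and phi_next are the potential values Phi(s) and Phi(s');
    in the dissertation's framework Phi is defined over a latent
    reasoning space learned from LLM reasoning traces.
    """
    return r + gamma * phi_next - phi_s
```

Summed along a trajectory, the shaping terms telescope, so the shaped return differs from the original only by a term depending on the start (and, with discounting, end) states, which is why optimal behavior is unchanged.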
Together, these studies provide a structured investigation of how experience-driven learning and LLM-based grounding—both implicit and explicit—can guide reinforcement learning under realistic interaction constraints and offer practical insights for designing more sample-efficient RL agents.