in context
reinforcement learning
for language models
In-context reinforcement learning improves language models in real time by putting the agent's most useful past actions into context for the next task.
What is ICRL
Traditional reinforcement learning requires retraining the model on every new experience, an expensive process that takes weeks to complete and is impossible for closed-source frontier models. In-Context Reinforcement Learning (ICRL) lets LLM agents improve continuously without any post-training work at all. When an agent successfully completes a task, ICRL stores that trajectory; the next time a similar task comes up, the agent retrieves the most relevant past steps and uses them as in-context examples to guide its decisions.
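The store-then-retrieve loop described above can be sketched in a few lines. This is a minimal, illustrative sketch with word-overlap retrieval; the names here (`TrajectoryStore`, `build_prompt`, etc.) are hypothetical and are not the package's actual API:

```python
import json
from dataclasses import dataclass

@dataclass
class Trajectory:
    task: str
    steps: list
    success: bool

class TrajectoryStore:
    """Keep successful trajectories; retrieve those whose task text
    overlaps most with the new task (a stand-in for embedding search)."""
    def __init__(self):
        self.trajectories = []

    def add(self, traj):
        if traj.success:  # only successful runs become examples
            self.trajectories.append(traj)

    def retrieve(self, task, k=2):
        def overlap(t):
            a = set(task.lower().split())
            b = set(t.task.lower().split())
            return len(a & b) / max(1, len(a | b))  # Jaccard similarity
        return sorted(self.trajectories, key=overlap, reverse=True)[:k]

def build_prompt(task, store):
    """Prepend the retrieved trajectories as few-shot examples."""
    shots = "\n\n".join(
        f"Task: {t.task}\nSteps: {json.dumps(t.steps)}"
        for t in store.retrieve(task)
    )
    return f"{shots}\n\nTask: {task}\nSteps:"
```

A real system would use embedding similarity for retrieval and an LLM call on the resulting prompt, but the shape of the loop is the same: store on success, retrieve on similarity, condition the next attempt.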
When the attention mechanism attends to repeated in-context examples, it behaves like an implicit LoRA on top of the base model, functionally equivalent to a small fine-tune on the in-context data [1]. We see improvements across a wide range of tasks, from coding to support triage to RLHF-style tasks [2].
ICRL works with any LLM provider, even closed-source ones. It ships as an npm and pip package, so you can apply our research right away.
ICRL vs traditional RL
ICRL delivers the same improvements as traditional RL at a fraction of the cost, works with closed-source models, and takes effect immediately rather than after a retraining cycle.
| | ICRL | Traditional RL |
|---|---|---|
| Computational cost | Minimal — only storage + retrieval, no training | High — full training runs, GPUs, evals |
| Model capability | All models, including closed-source ones | Open-weight models only (requires weight access) |
| When it improves | Immediately, on the next similar task | After a full retraining cycle |
| What changes | In-context examples (retrieval memory) | Model weights / policy parameters |
| Infrastructure | Lightweight trajectory DB + retrieval | GPU training pipelines + eval + rollouts |
| Feedback latency | Instant — same session | Batch-delayed — hours to days |
| Interpretability | Explicit — retrievable examples show what worked | Implicit — behavior encoded in weights |
Common use cases
Coding agents
Coding workflows and Terminal-Bench-style tasks improve as successful trajectories accumulate.
examples/harbor_coding_agent.py
Filesystem / task agents
Shell-style environments become more reliable as the agent accumulates experience across repeated goals.
examples/file_api_env.py
Support triage
Routing and reply quality improve as successful triage outputs are retained and reused.
icrl-ts/examples/support-triage-demo.ts
Human-in-the-loop workflows
Human choices between candidate answers become durable training signals for future retrieval.
icrl-ts/web-example
Codebase specific agents
ICRL enables you to specialize an agent to a specific codebase. Using the ICRL CLI, you can launch a terminal UI, similar to Claude Code or Codex, that runs an interactive coding assistant in your shell. Unlike static assistants, it gets better over time as successful trajectories are stored and retrieved for future tasks.
uv run icrl chat
The more you use it, the better it gets at understanding your project's patterns, conventions, and architecture, like a custom model fine-tuned to your codebase.
ICRLHF — learning from any feedback signal
ICRL can be used with any feedback signal, not just task success. With ICRLHF, agents learn from human preferences in real time, adapting from the very first thumbs-up or thumbs-down they receive.
Human preference: the user picks the best of N candidate outputs
Task success / failure: tests pass, build succeeds, goal reached
Code review signals: PR approved, changes requested, review comments
Conversational corrections: the user edits, rephrases, or overrides the output
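As a rough illustration of how these heterogeneous signals can feed a single retrieval memory, here is a minimal sketch. All names (`PreferenceMemory`, the signal strings, etc.) are hypothetical, not the package's actual API:

```python
from dataclasses import dataclass

@dataclass
class Candidate:
    prompt: str
    output: str

# Any signal that indicates a good outcome, regardless of its source.
POSITIVE_SIGNALS = {"thumbs_up", "picked_best", "tests_pass", "pr_approved"}

class PreferenceMemory:
    """Convert arbitrary feedback into retrievable in-context examples:
    positive signals are stored, negative ones are discarded (a real
    system might instead keep them as counter-examples)."""
    def __init__(self):
        self.examples = []

    def record(self, candidate, signal):
        if signal in POSITIVE_SIGNALS:
            self.examples.append(candidate)

    def few_shots(self, k=2):
        # Naive recency-based selection; a real system would rank
        # stored examples by similarity to the incoming prompt.
        return self.examples[-k:][::-1]
```

The key property is that the memory does not care where the signal came from: a thumbs-up, a passing test suite, and an approved PR all produce the same stored artifact.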
Built on academic research
This package was built by Stanford Graphics Lab researchers who have been developing ICRL since 2024. You can read more about the research in the following papers:
Self-Generated In-Context Examples Improve LLM Agents for Sequential Decision-Making Tasks (NeurIPS 2025)
Converting successful trajectories into retrieval-time reinforcement improves agent performance; the gains exceed those of upgrading from gpt-4o-mini to gpt-4o.
arxiv.org/abs/2505.00234
In-Context Distillation with Self-Consistency Cascades: A Simple, Training-Free Way to Reduce LLM Agent Costs
Training-free cost reduction by using a larger model to generate in-context examples for a smaller model. GPT-4.1-mini with distillation exceeds Claude 4.5 Sonnet performance.
arxiv.org/abs/2512.02543