in context
reinforcement learning
for language models
In-context reinforcement learning improves language models in real time by putting the agent's most useful past actions into context for the next task.
What is ICRL
Traditional reinforcement learning requires retraining the model on every new experience, an expensive process that takes weeks to complete and is impossible for closed-source frontier models. In-Context Reinforcement Learning (ICRL) lets LLM agents improve continuously without any post-training work at all. When an agent successfully completes a task, ICRL stores that trajectory; the next time a similar task comes up, the agent retrieves the most relevant past steps and uses them as in-context examples to guide its decisions.
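The store-then-retrieve loop described above can be sketched in a few lines. This is a minimal, illustrative sketch with word-overlap retrieval; the names here (`TrajectoryStore`, `build_prompt`, etc.) are hypothetical and are not the package's actual API:

```python
import json
from dataclasses import dataclass

@dataclass
class Trajectory:
    task: str
    steps: list
    success: bool

class TrajectoryStore:
    """Keep successful trajectories; retrieve those whose task text
    overlaps most with the new task (a stand-in for embedding search)."""
    def __init__(self):
        self.trajectories = []

    def add(self, traj):
        if traj.success:  # only successful runs become examples
            self.trajectories.append(traj)

    def retrieve(self, task, k=2):
        def overlap(t):
            a = set(task.lower().split())
            b = set(t.task.lower().split())
            return len(a & b) / max(1, len(a | b))  # Jaccard similarity
        return sorted(self.trajectories, key=overlap, reverse=True)[:k]

def build_prompt(task, store):
    """Prepend the retrieved trajectories as few-shot examples."""
    shots = "\n\n".join(
        f"Task: {t.task}\nSteps: {json.dumps(t.steps)}"
        for t in store.retrieve(task)
    )
    return f"{shots}\n\nTask: {task}\nSteps:"
```

A real system would use embedding similarity for retrieval and an LLM call on the resulting prompt, but the shape of the loop is the same: store on success, retrieve on similarity, condition the next attempt.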
When the attention mechanism attends to repeated in-context examples, it behaves like an implicit LoRA on top of the base model, functionally equivalent to a small fine-tune on the in-context data [1]. We see improvements across a wide range of tasks, from coding to support triage to RLHF-style tasks [2].
ICRL works with any LLM provider, even closed-source ones. It ships as an npm and pip package, so you can apply our research right away.
ICRL vs traditional RL
ICRL delivers the same improvements as traditional RL at a fraction of the cost, works with closed-source models, and takes effect immediately rather than after a retraining cycle.
| | ICRL | Traditional RL |
|---|---|---|
| Computational cost | Minimal — only storage + retrieval, no training | High — full training runs, GPUs, evals |
| Model capability | All models, including closed-source ones | Open-weight models only (requires weight access) |
| When it improves | Immediately, on the next similar task | After a full retraining cycle |
| What changes | In-context examples (retrieval memory) | Model weights / policy parameters |
| Infrastructure | Lightweight trajectory DB + retrieval | GPU training pipelines + eval + rollouts |
| Feedback latency | Instant — same session | Batch-delayed — hours to days |
| Interpretability | Explicit — retrievable examples show what worked | Implicit — behavior encoded in weights |
Common use cases
Coding agents
Coding workflows and Terminal-Bench-style tasks improve as successful trajectories accumulate.
examples/harbor_coding_agent.py
Filesystem / task agents
Shell-style environments become more reliable as the agent accumulates experience across repeated goals.
examples/file_api_env.py
Support triage
Routing and reply quality improve as successful triage outputs are retained and reused.
icrl-ts/examples/support-triage-demo.ts
Human-in-the-loop workflows
Human choices between candidate answers become durable training signals for future retrieval.
icrl-ts/web-example
Codebase specific agents
ICRL enables you to specialize an agent to a specific codebase. Using the ICRL CLI, you can launch a terminal UI, similar to Claude Code or Codex, that runs an interactive coding assistant in your shell. Unlike static assistants, it gets better over time as successful trajectories are stored and retrieved for future tasks.
uv run icrl chat
The more you use it, the better it gets at understanding your project's patterns, conventions, and architecture, like a custom model fine-tuned to your codebase.
ICRLHF — learning from any feedback signal
ICRL can be used with any feedback signal, not just task success. With ICRLHF, agents learn from human preferences in real time, adapting from the very first thumbs-up or thumbs-down they receive.
Human preference: the user picks the best of N candidate outputs
Task success / failure: tests pass, build succeeds, goal reached
Code review signals: PR approved, changes requested, review comments
Conversational corrections: the user edits, rephrases, or overrides the output
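As a rough illustration of how these heterogeneous signals can feed a single retrieval memory, here is a minimal sketch. All names (`PreferenceMemory`, the signal strings, etc.) are hypothetical, not the package's actual API:

```python
from dataclasses import dataclass

@dataclass
class Candidate:
    prompt: str
    output: str

# Any signal that indicates a good outcome, regardless of its source.
POSITIVE_SIGNALS = {"thumbs_up", "picked_best", "tests_pass", "pr_approved"}

class PreferenceMemory:
    """Convert arbitrary feedback into retrievable in-context examples:
    positive signals are stored, negative ones are discarded (a real
    system might instead keep them as counter-examples)."""
    def __init__(self):
        self.examples = []

    def record(self, candidate, signal):
        if signal in POSITIVE_SIGNALS:
            self.examples.append(candidate)

    def few_shots(self, k=2):
        # Naive recency-based selection; a real system would rank
        # stored examples by similarity to the incoming prompt.
        return self.examples[-k:][::-1]
```

The key property is that the memory does not care where the signal came from: a thumbs-up, a passing test suite, and an approved PR all produce the same stored artifact.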
Built on academic research
This package was built by Stanford Graphics Lab researchers who have been developing ICRL since 2024. You can read more about the research in the following papers:
Self-Generated In-Context Examples Improve LLM Agents for Sequential Decision-Making Tasks (NeurIPS 2025)
Converting successful trajectories into retrieval-time reinforcement improves agent performance; the gains exceed those of upgrading from gpt-4o-mini to gpt-4o.
arxiv.org/abs/2505.00234
In-Context Distillation with Self-Consistency Cascades: A Simple, Training-Free Way to Reduce LLM Agent Costs
Training-free cost reduction by using a larger model to generate in-context examples for a smaller model. GPT-4.1-mini with distillation exceeds Claude 4.5 Sonnet performance.
arxiv.org/abs/2512.02543