Overview

The Harbor coding agent demonstrates ICRL’s performance improvement on realistic software engineering tasks. It uses a simulated coding workspace (/workspace/src, /workspace/tests) with shell-like commands for editing, navigation, and testing. The demo shows baseline vs post-training behavior on verifiable coding tasks.
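The simulated workspace can be pictured as an in-memory file map with a small command dispatcher. This is a hypothetical sketch of that idea only; the real CodingEnvironment in examples/harbor_coding_agent.py is richer, and the file contents and `run_command` helper here are illustrative stand-ins.

```python
# Hypothetical sketch of a simulated coding workspace: files live in a dict,
# and shell-like commands are dispatched over it. Not the demo's actual API.
workspace = {
    "/workspace/src/app.py": "def add(a, b):\n    return a + b\n",
    "/workspace/tests/test_app.py": "from src.app import add\n",
}

def run_command(cmd: str) -> str:
    """Dispatch a tiny subset of shell-like commands over the in-memory files."""
    parts = cmd.split()
    if parts[0] == "ls":
        prefix = parts[1].rstrip("/") + "/"
        return "\n".join(sorted(p for p in workspace if p.startswith(prefix)))
    if parts[0] == "cat":
        return workspace.get(parts[1], f"cat: {parts[1]}: No such file")
    if parts[0] == "grep":
        pattern, path = parts[1], parts[2]
        return "\n".join(
            line for line in workspace.get(path, "").splitlines() if pattern in line
        )
    return f"{parts[0]}: command not simulated"

print(run_command("ls /workspace/src"))
print(run_command("grep return /workspace/src/app.py"))
```

Keeping the workspace as pure data makes every command side-effect-free and easy to snapshot, which is what lets a `verify(workspace_state)` predicate check task success deterministically.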

Source Files

  • examples/harbor_coding_agent.py — Main demo script with CodingEnvironment
  • tests/test_harbor_coding.py — Pytest tests for the demo

Run Demo

OPENAI_API_KEY=... uv run python examples/harbor_coding_agent.py
Or with Anthropic:
ANTHROPIC_API_KEY=... uv run python examples/harbor_coding_agent.py
Optional model override:
MODEL=gpt-4o-mini uv run python examples/harbor_coding_agent.py

Run Tests

uv run pytest tests/test_harbor_coding.py -v

What It Demonstrates

  • Coding workspace simulation — Sandboxed /workspace with src/ and tests/
  • Shell-like commands — ls, cd, cat, grep, find, sed, echo, pytest, etc.
  • Baseline vs post-training — Phase 1 runs evaluation tasks with no examples; Phase 2 trains on coding tasks; Phase 3 re-evaluates with learned examples
  • Verifiable tasks — Each task has a verify(workspace_state) -> bool function
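The three-phase flow above can be sketched as a loop where successful training trajectories become in-context examples for the final evaluation. Everything below is illustrative: `run_task`, the task strings, and the example format are stand-ins, not the demo's real API.

```python
# Hypothetical sketch of the baseline / train / re-evaluate phases.
def run_task(task_goal: str, examples: list[str]) -> bool:
    """Stand-in for one agent rollout: here it 'succeeds' only when some
    in-context example mentions the goal, mimicking learning from examples."""
    return any(task_goal in ex for ex in examples)

eval_tasks = ["fix failing test", "rename helper"]
train_tasks = ["fix failing test", "rename helper", "add docstring"]

# Phase 1: baseline — run evaluation tasks with no in-context examples.
baseline = sum(run_task(t, []) for t in eval_tasks)

# Phase 2: training — solve training tasks; keep successes as examples.
examples = [f"solved: {t}" for t in train_tasks]

# Phase 3: re-evaluate with the learned examples in context.
post = sum(run_task(t, examples) for t in eval_tasks)

print(f"baseline {baseline}/{len(eval_tasks)}, post-training {post}/{len(eval_tasks)}")
# → baseline 0/2, post-training 2/2
```

Because each task's success is a boolean from its verifier, the phase-1 and phase-3 scores are directly comparable, which is what makes the baseline-vs-post-training improvement measurable.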

Task Structure

Tasks are CodingTask objects with:
  • goal — Natural language description
  • verify(workspace_state) -> bool — Success predicate
  • setup — Optional hook to modify initial state
  • difficulty — "easy", "medium", "hard"
  • category — "code-analysis", "debugging", "navigation", "testing", "refactoring"
This structure is suitable for repeatable benchmarking workflows and aligns with Harbor/Terminal-Bench 2.0 style evaluations.
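A minimal sketch of such a task object, assuming a dataclass shape: the field names follow the list above, but this class and the example workspace are hypothetical, not copied from examples/harbor_coding_agent.py.

```python
from dataclasses import dataclass
from typing import Callable, Optional

# Hypothetical CodingTask sketch; fields mirror the documented structure.
@dataclass
class CodingTask:
    goal: str                                        # natural-language description
    verify: Callable[[dict], bool]                   # success predicate over workspace state
    difficulty: str = "easy"                         # "easy" | "medium" | "hard"
    category: str = "code-analysis"                  # e.g. "debugging", "testing"
    setup: Optional[Callable[[dict], None]] = None   # optional initial-state hook

# Illustrative task: succeeds once src/utils.py defines a slugify() helper.
task = CodingTask(
    goal="Add a slugify() helper to src/utils.py",
    verify=lambda ws: "def slugify" in ws.get("src/utils.py", ""),
    difficulty="easy",
    category="refactoring",
)

workspace = {"src/utils.py": "def slugify(text):\n    return text.lower()\n"}
print(task.verify(workspace))  # → True
```

Encoding success as a pure predicate over workspace state is what makes these tasks verifiable: the same task can be re-run and re-scored deterministically, in the Harbor/Terminal-Bench style the section describes.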