RALPHBench (Recurrent Agents Learning on Long Programming Horizons) evaluates autonomous coding agents on extremely long-horizon software engineering tasks requiring sustained reasoning over 1,000–10,000+ steps on real production codebases.
Our goal is to build the most rigorous and comprehensive benchmark for measuring agent capabilities on the kind of multi-file, multi-day engineering challenges that real developers face — not just single-step bug fixes or isolated code generation.
RALPHBench evaluates:
- Whether agents can sustain coherent reasoning over thousands of steps
- How effectively agents handle real production-scale codebases
- Whether agents can complete full-system engineering tasks end-to-end
How to Get Involved
Getting Access
- Reach out via email: contribute@ralphbench.org
- Check the GitHub repo for open issues and contribution guidelines
Getting Started
- Read through the CONTRIBUTING.md on GitHub for basic context and orientation
- Review existing tasks to understand the format and difficulty expectations
What Tasks We Want
Task Categories
Greenfield Projects (build from scratch):
- Compilers, interpreters, and language tools
- Web browsers and rendering engines
- Databases and storage systems
- Full-stack web applications
Performance & ML Research (optimize and innovate):
- Performance profiling and optimization
- Numerical methods and scientific computing
- ML training pipelines and model engineering
Migration & Refactoring (transform existing code):
- Framework migrations (e.g., React class → hooks)
- API redesigns and versioning
- Large-scale multi-file refactors
Task Requirements
- Tasks must require 1,000+ steps for a competent agent to complete
- Real production codebases preferred over synthetic exercises
- Tasks must follow the Harbor format and include an oracle solution that passes 100% of tests
- All tests must be deterministic and reproducible
- 4-hour wall-clock timeout per task
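Determinism usually comes down to pinning every source of randomness a test touches. A minimal illustration (the helper name is ours, not part of Harbor):

```python
import random

def seeded_shuffle(items, seed=1234):
    """Shuffle with a fixed seed so every run produces the same order."""
    rng = random.Random(seed)  # a local RNG avoids global-state surprises
    out = list(items)
    rng.shuffle(out)
    return out
```

Pinning seeds (and avoiding wall-clock or network dependence) keeps the verifier's pass/fail signal reproducible across runs.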
Task Format
Tasks follow the Harbor format:
task-name/
├── instruction.md        # REQUIRED - Task description
├── task.toml             # REQUIRED - Metadata and timeouts
├── environment/
│   └── Dockerfile        # REQUIRED - Container with dependencies
├── solution/
│   └── solve.sh          # REQUIRED - Oracle solution (must pass 100%)
└── tests/
    ├── test.sh           # REQUIRED - Runs pytest
    └── test_outputs.py   # REQUIRED - Writes reward to /logs/verifier/reward.txt
Evaluation Methodology
- Pass@1 Scoring: No partial credit — all tests must pass
- Step Counting: Measures agent efficiency (fewer steps = better)
- Time Limits: 4-hour wall-clock timeout per task
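As the format section notes, tests/test_outputs.py is responsible for writing the reward to /logs/verifier/reward.txt. A hypothetical sketch of that step under the no-partial-credit rule (the helper name and exact file convention are our assumptions; check existing tasks for the real one):

```python
from pathlib import Path

REWARD_PATH = Path("/logs/verifier/reward.txt")

def write_reward(passed: int, total: int, path: Path = REWARD_PATH) -> float:
    """Pass@1: reward is 1.0 only when every test passed, else 0.0."""
    reward = 1.0 if total > 0 and passed == total else 0.0
    path.parent.mkdir(parents=True, exist_ok=True)  # /logs/verifier/ may not exist yet
    path.write_text(str(reward))
    return reward
```

The binary reward mirrors the scoring rule above: a task that passes 4 of 5 tests scores 0.0, not 0.8.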
FAQ
Q: How do I qualify for authorship?
Three high-quality tasks merged to main earn automatic authorship.
Q: What if I contribute fewer tasks but help with other work?
We consider all contributions: infrastructure, experiments, paper writing, and more. Reach out to discuss!
Resources
Harbor Framework
harbor run --dataset <path> --agent <agent-name> # run tasks
harbor tasks check # validate task format