RALPHBench (Recurrent Agents Learning on Long Programming Horizons) evaluates autonomous coding agents on extremely long-horizon software engineering tasks requiring sustained reasoning over 1,000–10,000+ steps on real production codebases.
Our goal is to build the most rigorous and comprehensive benchmark for measuring agent capabilities on the kind of multi-file, multi-day engineering challenges that real developers face — not just single-step bug fixes or isolated code generation.
RALPHBench evaluates:
- Whether agents can sustain coherent reasoning over thousands of steps
- How effectively agents handle real production-scale codebases
- Whether agents can complete full-system engineering tasks end-to-end
How to Get Involved
Getting Access
- Reach out via email: contribute@ralphbench.org
- Check the GitHub repo for open issues and contribution guidelines
Getting Started
- Read through the CONTRIBUTING.md on GitHub for basic context and orientation
- Review existing tasks to understand the format and difficulty expectations
What Tasks We Want
Task Categories
Greenfield Projects (build from scratch):
- Compilers, interpreters, and language tools
- Web browsers and rendering engines
- Databases and storage systems
- Full-stack web applications
Performance & ML Research (optimize and innovate):
- Performance profiling and optimization
- Numerical methods and scientific computing
- ML training pipelines and model engineering
Migration & Refactoring (transform existing code):
- Framework migrations (e.g., React class → hooks)
- API redesigns and versioning
- Large-scale multi-file refactors
Task Requirements
- Tasks must require 1,000+ steps for a competent agent to complete
- Real production codebases preferred over synthetic exercises
- Tasks must follow the Harbor format and include an oracle solution that passes 100% of tests
- All tests must be deterministic and reproducible
- 4-hour wall-clock timeout per task
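Determinism usually comes down to pinning every source of randomness a test touches. A minimal illustration (the helper name is ours, not part of Harbor):

```python
import random

def seeded_shuffle(items, seed=1234):
    """Shuffle with a fixed seed so every run produces the same order."""
    rng = random.Random(seed)  # a local RNG avoids global-state surprises
    out = list(items)
    rng.shuffle(out)
    return out
```

Pinning seeds (and avoiding wall-clock or network dependence) keeps the verifier's pass/fail signal reproducible across runs.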
Task Format
Tasks follow the Harbor format:
task-name/
├── instruction.md        # REQUIRED - Task description
├── task.toml             # REQUIRED - Metadata and timeouts
├── environment/
│   └── Dockerfile        # REQUIRED - Container with dependencies
├── solution/
│   └── solve.sh          # REQUIRED - Oracle solution (must pass 100%)
└── tests/
    ├── test.sh           # REQUIRED - Runs pytest
    └── test_outputs.py   # REQUIRED - Writes reward to /logs/verifier/reward.txt
Evaluation Methodology
- Pass@1 Scoring: No partial credit — all tests must pass
- Step Counting: Measures agent efficiency (fewer steps = better)
- Time Limits: 4-hour wall-clock timeout per task
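As the format section notes, tests/test_outputs.py is responsible for writing the reward to /logs/verifier/reward.txt. A hypothetical sketch of that step under the no-partial-credit rule (the helper name and exact file convention are our assumptions; check existing tasks for the real one):

```python
from pathlib import Path

REWARD_PATH = Path("/logs/verifier/reward.txt")

def write_reward(passed: int, total: int, path: Path = REWARD_PATH) -> float:
    """Pass@1: reward is 1.0 only when every test passed, else 0.0."""
    reward = 1.0 if total > 0 and passed == total else 0.0
    path.parent.mkdir(parents=True, exist_ok=True)  # /logs/verifier/ may not exist yet
    path.write_text(str(reward))
    return reward
```

The binary reward mirrors the scoring rule above: a task that passes 4 of 5 tests scores 0.0, not 0.8.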
FAQ
Q: How do I qualify for authorship?
Three high-quality tasks merged to main earn automatic authorship.
Q: What if I contribute fewer tasks but help with other work?
We consider all contributions: infrastructure, experiments, paper writing, and more. Reach out to discuss!
Resources
Harbor Framework
harbor run --dataset <path> --agent <agent-name> # run tasks
harbor tasks check # validate task format