Contributing

How to contribute tasks to RALPHBench, a rigorous benchmark that evaluates autonomous coding agents on long-horizon software engineering work.

RALPHBench (Recurrent Agents Learning on Long Programming Horizons) evaluates autonomous coding agents on extremely long-horizon software engineering tasks requiring sustained reasoning over 1,000–10,000+ steps on real production codebases.

Our goal is to build the most rigorous and comprehensive benchmark for measuring agent capabilities on the kind of multi-file, multi-day engineering challenges that real developers face — not just single-step bug fixes or isolated code generation.

RALPHBench evaluates:

  1. Whether agents can sustain coherent reasoning over thousands of steps
  2. How effectively agents handle real production-scale codebases
  3. Whether agents can complete full-system engineering tasks end-to-end

How to Get Involved

Getting Access

  1. Reach out via email: contribute@ralphbench.org
  2. Check the GitHub repo for open issues and contribution guidelines

Getting Started

  1. Read through the CONTRIBUTING.md on GitHub for basic context and orientation
  2. Review existing tasks to understand the format and difficulty expectations

What Tasks We Want

Task Categories

Greenfield Projects (build from scratch):

  • Compilers, interpreters, and language tools
  • Web browsers and rendering engines
  • Databases and storage systems
  • Full-stack web applications

Performance & ML Research (optimize and innovate):

  • Performance profiling and optimization
  • Numerical methods and scientific computing
  • ML training pipelines and model engineering

Migration & Refactoring (transform existing code):

  • Framework migrations (e.g., React class → hooks)
  • API redesigns and versioning
  • Large-scale multi-file refactors

Task Requirements

  • Tasks must require 1,000+ steps for a competent agent to complete
  • Real production codebases preferred over synthetic exercises
  • Tasks must use the Harbor format and include an oracle solution that passes 100% of tests
  • All tests must be deterministic and reproducible
  • 4-hour wall-clock timeout per task
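The metadata and timeout requirements above live in each task's task.toml. A minimal sketch follows; the field names are illustrative assumptions, not the authoritative Harbor schema, so validate your actual file with `harbor tasks check`:

```toml
# Hypothetical task.toml sketch -- field names are illustrative, not the
# official Harbor schema; run `harbor tasks check` against the real format.
[task]
name = "example-task"

[timeouts]
# 4-hour wall-clock limit per task, per the requirements above (4 * 3600 s).
agent_timeout_sec = 14400
```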

Task Format

Tasks follow the Harbor format:

task-name/
├── instruction.md          # REQUIRED - Task description
├── task.toml               # REQUIRED - Metadata and timeouts
├── environment/
│   └── Dockerfile          # REQUIRED - Container with dependencies
├── solution/
│   └── solve.sh            # REQUIRED - Oracle solution (must pass 100%)
└── tests/
    ├── test.sh             # REQUIRED - Runs pytest
    └── test_outputs.py     # REQUIRED - Writes reward to /logs/verifier/reward.txt
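As a sketch of the verifier piece above: only the reward path `/logs/verifier/reward.txt` comes from the tree; the helper name and structure below are illustrative, not Harbor API.

```python
# Hypothetical sketch of tests/test_outputs.py. Only the reward path is
# taken from the format tree; the helper is an illustrative assumption.
from pathlib import Path

REWARD_FILE = Path("/logs/verifier/reward.txt")

def write_reward(passed: bool, reward_file: Path = REWARD_FILE) -> None:
    # Pass@1, no partial credit: write 1.0 only if every check passed.
    reward_file.parent.mkdir(parents=True, exist_ok=True)
    reward_file.write_text("1.0" if passed else "0.0")
```

A task's pytest functions would run their real assertions, then call `write_reward(...)` with the combined result so the verifier can read the reward file.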

Evaluation Methodology

  • Pass@1 Scoring: No partial credit — all tests must pass
  • Step Counting: Measures agent efficiency (fewer steps = better)
  • Time Limits: 4-hour wall-clock timeout per task
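Under these rules, each task yields an all-or-nothing reward. A minimal sketch, assuming the benchmark score is the fraction of tasks solved (that aggregation is my assumption, not stated above):

```python
def task_reward(test_results: list[bool]) -> float:
    # Pass@1 with no partial credit: every test must pass to score.
    return 1.0 if test_results and all(test_results) else 0.0

def benchmark_score(per_task: dict[str, list[bool]]) -> float:
    # Assumed aggregation: mean task reward, i.e. fraction of tasks solved.
    if not per_task:
        return 0.0
    return sum(task_reward(r) for r in per_task.values()) / len(per_task)
```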

FAQ

Q: How do I qualify for authorship?

Three high-quality tasks merged to main earn automatic authorship.

Q: What if I contribute fewer tasks but help with other work?

We consider all contributions: infrastructure, experiments, paper writing, and more. Reach out to discuss!

Resources

Harbor Framework

harbor run --dataset <path> --agent <agent-name>    # run tasks
harbor tasks check                                  # validate task format