An ultra-long-horizon benchmark designed to evaluate coding agents on realistic, high-complexity software engineering tasks.
Join the RALPHBench GitHub repo and review the task guidelines.
Tasks follow the Harbor format, with deterministic unit tests and a full reference solution.
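A minimal sketch of what a deterministic unit test for a task might look like, written in Python with pytest-style assertions. The function name `merge_intervals` and the test itself are hypothetical illustrations, not prescribed by the Harbor format:

```python
# Hypothetical task solution and test; names are illustrative only.

def merge_intervals(intervals):
    """Reference solution: merge overlapping [start, end] intervals."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return merged

def test_merge_intervals_is_deterministic():
    # Fixed inputs, fixed expected outputs: no randomness and no time or
    # network dependence, so the test always yields the same verdict.
    assert merge_intervals([[1, 3], [2, 6], [8, 10]]) == [[1, 6], [8, 10]]
    assert merge_intervals([]) == []
```

Determinism matters because an agent's submission must be gradable by rerunning the same tests and getting the same result every time.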
Open a PR. One approved task earns co-authorship on the NeurIPS 2026 paper.