RE-Bench: Evaluating frontier AI R&D capabilities of language model agents against human experts
cs.AI
Release Date: November 22, 2024
Authors: Hjalmar Wijk1, Tao Lin1, Joel Becker1, Sami Jawhar1, Neev Parikh1, Thomas Broadley1, Lawrence Chan1, Michael Chen1, Josh Clymer1, Jai Dhyani1, Elena Ericheva1, Katharyn Garcia1, Brian Goodrich1, Nikola Jurkovic1, Megan Kinniment1, Aron Lajko1, Seraphina Nix1, Lucas Sato1, William Saunders1, Maksym Taran1, Ben West1, Elizabeth Barnes1
Affiliation: 1 Model Evaluation and Threat Research (METR)
![[Uncaptioned image]](https://arxiv.org/html/2411.15114v1/images/image3.png)
| Task | Issues with environment (severe) | Issues with environment (mild) |
| --- | --- | --- |
| LLM foundry | 0/5 | 0/5 |
| Scaling law | 0/5 | 5/5 (limitation of infrastructure: mean score logs cannot be both recovered and hidden from the agent, so we can only calculate the score of the final guess) |
| Rust scaffolding | 0/5 | 0/5 |
| Fix embedding | 0/5 | 0/5 |
| Optimize kernel | 0/5 | 5/5 (some timed-out scores were registered as 0 in the score log and had to be manually removed) |
| GPT-2 RL | 0/5 | 0/5 |
| Restricted architecture MLM | 0/5 | 3/5 (some agents submitted solutions that clearly broke the rules but were not identified as such by the scoring function) |
| Suite as a whole | 0/35 | 13/35 |