RE-Bench: Evaluating frontier AI R&D capabilities of language model agents against human experts
cs.AI
Release Date: November 22, 2024
Authors: Hjalmar Wijk1, Tao Lin1, Joel Becker1, Sami Jawhar1, Neev Parikh1, Thomas Broadley1, Lawrence Chan1, Michael Chen1, Josh Clymer1, Jai Dhyani1, Elena Ericheva1, Katharyn Garcia1, Brian Goodrich1, Nikola Jurkovic1, Megan Kinniment1, Aron Lajko1, Seraphina Nix1, Lucas Sato1, William Saunders1, Maksym Taran1, Ben West1, Elizabeth Barnes1
Affiliation: 1 Model Evaluation and Threat Research (METR)
![[Uncaptioned image]](https://arxiv.org/html/2411.15114v1/images/image3.png)
| Task | Issues with environment (severe) | Issues with environment (mild) |
| --- | --- | --- |
| LLM foundry | 0/5 | 0/5 |
| Scaling law | 0/5 | 5/5 (limitation of infrastructure: mean score logs cannot be both recovered and hidden from the agent, so we can only calculate the score of the final guess) |
| Rust scaffolding | 0/5 | 0/5 |
| Fix embedding | 0/5 | 0/5 |
| Optimize kernel | 0/5 | 5/5 (some timed-out scores were registered as 0 in the score log and had to be manually removed) |
| GPT-2 RL | 0/5 | 0/5 |
| Restricted architecture MLM | 0/5 | 3/5 (some agents submitted solutions that clearly broke the rules but were not identified as such by the scoring function) |
| Suite as a whole | 0/35 | 13/35 |