notesum.ai

Published on November 22

RE-Bench: Evaluating frontier AI R&D capabilities of language model agents against human experts

cs.AI

Release Date: November 22, 2024

Authors: Hjalmar Wijk¹, Tao Lin¹, Joel Becker¹, Sami Jawhar¹, Neev Parikh¹, Thomas Broadley¹, Lawrence Chan¹, Michael Chen¹, Josh Clymer¹, Jai Dhyani¹, Elena Ericheva¹, Katharyn Garcia¹, Brian Goodrich¹, Nikola Jurkovic¹, Megan Kinniment¹, Aron Lajko¹, Seraphina Nix¹, Lucas Sato¹, William Saunders¹, Maksym Taran¹, Ben West¹, Elizabeth Barnes¹

Affiliation: ¹Model Evaluation and Threat Research (METR)

arXiv: http://arxiv.org/abs/2411.15114v1