Published on November 6

Self-Consistency Preference Optimization

Categories: cs.CL, cs.AI, cs.LG

Released Date: November 6, 2024

Authors: Archiki Prasad¹, Weizhe Yuan¹, Richard Yuanzhe Pang¹, Jing Xu¹, Maryam Fazel-Zarandi¹, Mohit Bansal², Sainbayar Sukhbaatar¹, Jason Weston¹, Jane Yu¹

Affiliations: ¹Meta FAIR; ²UNC Chapel Hill

arXiv: http://arxiv.org/abs/2411.04109v1

Method	# Seed Train Data (K)	# Gen. Train Data (K)	Puzzle Acc., Overall (%)	Puzzle Acc., Easy (%)	Puzzle Acc., Hard (%)	Cell Acc. (%)
Llama-3 Instruct 70B	-	-	17.2	52.1	3.6	42.9
Gemma-2 27B IT*	-	-	16.3	50.7	2.9	41.2
Claude-3 Haiku*	-	-	14.3	47.9	1.2	37.9
$M_{0}$ (Llama-3 Instruct 8B)	-	-	11.6	40.0	0.4	39.1
$M_{1}$ w/ IRPO$_{\text{RM}}$	1.0	-	11.3	37.9	1.0	42.1
$M_{1}$ w/ ScPO$_{\text{Unsup.}}$	0.4	1.0	17.0	54.3	2.5	47.6
$M_{2}$ w/ ScPO$_{\text{Unsup.}}$	0.4	2.2	18.1	58.2	2.5	45.2