Published on November 1, 2024

Mitigating Tail Narrowing in LLM Self-Improvement via Socratic-Guided Sampling

Categories: cs.CL, cs.AI, cs.LG

Released Date: November 1, 2024

Authors: Yiwen Ding¹, Zhiheng Xi¹, Wei He¹, Zhuoyuan Li², Yitao Zhai³, Xiaowei Shi³, Xunliang Cai³, Tao Gui¹, Qi Zhang¹, Xuanjing Huang¹

Affiliations: ¹Fudan University; ²Macau University of Science and Technology; ³Meituan

arXiv: http://arxiv.org/abs/2411.00750v1

| Models | Methods | Sample Budget | Coverage | Held-in Avg. | Held-in AQuA | Held-in GSM8K | Held-in MATH | Held-out Avg. | Held-out MathQA | Held-out SVAMP | Held-out Thm.QA |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Llama2-7B | SFT | - | - | $21.00$ | $27.17$ | $31.31$ | $4.52$ | $21.08$ | $22.08$ | $36.80$ | $4.38$ |
| | Self-Improve ($k=8$) | $0.24\text{M}$ | $52.6\%$ | $21.53$ | $25.59$ | $33.13$ | $5.86$ | $24.46$ | $27.37$ | $39.50$ | $6.50$ |
| | Self-Improve ($k=64$) | $1.11\text{M}$ | $77.1\%$ | $23.73$ | $29.53$ | $35.86$ | $5.80$ | $23.42$ | $25.76$ | $39.50$ | $5.00$ |
| | Self-Improve ($k=128$) | $1.67\text{M}$ | $80.7\%$ | $23.84$ | $27.17$ | $37.98$ | $6.36$ | $24.22$ | $26.57$ | $40.60$ | $5.50$ |
| | Guided Self-Improve ($k=8$) | | | | | | | | | | |
| | + Answer-driven | $0.37\text{M}$ | $99.0\%$ | $23.44$ | $28.34$ | $35.94$ | $6.04$ | $25.27$ | $25.83$ | $\mathbf{43.60}$ | $6.38$ |
| | + Rationale-driven | $0.34\text{M}$ | $99.9\%$ | $24.32$ | $30.32$ | $36.16$ | $\mathbf{6.48}$ | $25.68$ | $28.11$ | $41.30$ | $\mathbf{7.63}$ |
| | + Interactive Sampling | $0.36\text{M}$ | $96.3\%$ | $25.00$ | $30.71$ | $37.83$ | $\underline{6.46}$ | $26.25$ | $27.84$ | $43.40$ | $7.50$ |
| | + State Reset | $0.38\text{M}$ | $82.0\%$ | $\mathbf{25.91}$ | $\mathbf{31.10}$ | $\mathbf{40.18}$ | $6.44$ | $\mathbf{26.79}$ | $\mathbf{29.45}$ | $43.30$ | $\mathbf{7.63}$ |
| Llama3-8B | SFT | - | - | $37.27$ | $39.37$ | $57.47$ | $14.96$ | $38.68$ | $44.29$ | $63.00$ | $8.75$ |
| | Self-Improve ($k=8$) | $0.24\text{M}$ | $68.6\%$ | $36.87$ | $39.76$ | $59.14$ | $11.70$ | $39.09$ | $45.63$ | $62.40$ | $9.25$ |
| | Self-Improve ($k=64$) | $0.85\text{M}$ | $86.4\%$ | $38.16$ | $38.98$ | $61.11$ | $14.40$ | $38.30$ | $45.86$ | $60.90$ | $8.13$ |
| | Self-Improve ($k=128$) | $1.26\text{M}$ | $88.7\%$ | $37.55$ | $41.34$ | $61.64$ | $9.68$ | $39.32$ | $45.90$ | $62.80$ | $9.25$ |
| | Guided Self-Improve ($k=8$) | | | | | | | | | | |
| | + Answer-driven | $0.31\text{M}$ | $98.5\%$ | $38.71$ | $42.52$ | $59.82$ | $13.80$ | $40.15$ | $46.85$ | $63.60$ | $10.00$ |
| | + Rationale-driven | $0.29\text{M}$ | $99.8\%$ | $39.14$ | $42.91$ | $60.27$ | $14.24$ | $41.08$ | $47.94$ | $\mathbf{65.30}$ | $10.00$ |
| | + Interactive Sampling | $0.31\text{M}$ | $97.3\%$ | $39.34$ | $42.52$ | $60.05$ | $15.46$ | $41.12$ | $47.24$ | $64.50$ | $\mathbf{11.63}$ |
| | + State Reset | $0.32\text{M}$ | $90.2\%$ | $\mathbf{41.64}$ | $\mathbf{46.46}$ | $\mathbf{62.62}$ | $\mathbf{15.54}$ | $\mathbf{41.33}$ | $\mathbf{49.25}$ | $65.00$ | $9.75$ |
| DeepSeek-Math-7B | SFT | - | - | $50.01$ | $60.63$ | $60.73$ | $28.66$ | $21.15$ | $21.61$ | $37.10$ | $4.75$ |
| | Self-Improve ($k=8$) | $0.24\text{M}$ | $79.8\%$ | $52.22$ | $56.69$ | $68.92$ | $31.06$ | $49.63$ | $64.26$ | $67.50$ | $17.13$ |
| | Self-Improve ($k=64$) | $0.64\text{M}$ | $91.5\%$ | $53.95$ | $57.87$ | $70.89$ | $\mathbf{33.08}$ | $50.92$ | $65.36$ | $70.40$ | $17.00$ |
| | Self-Improve ($k=128$) | $0.93\text{M}$ | $92.2\%$ | $52.43$ | $59.45$ | $72.02$ | $25.82$ | $48.76$ | $64.69$ | $68.20$ | $13.38$ |
| | Guided Self-Improve ($k=8$) | | | | | | | | | | |
| | + Answer-driven | $0.30\text{M}$ | $99.5\%$ | $53.83$ | $61.02$ | $70.05$ | $30.42$ | $51.80$ | $64.32$ | $\mathbf{72.70}$ | $\mathbf{18.38}$ |
| | + Rationale-driven | $0.29\text{M}$ | $99.7\%$ | $53.71$ | $58.66$ | $71.65$ | $30.82$ | $51.20$ | $64.02$ | $72.20$ | $17.38$ |
| | + Interactive Sampling | $0.30\text{M}$ | $96.9\%$ | $\mathbf{55.67}$ | $\mathbf{61.81}$ | $72.63$ | $32.56$ | $51.69$ | $\mathbf{66.83}$ | $71.10$ | $17.13$ |
| | + State Reset | $0.31\text{M}$ | $93.6\%$ | $55.04$ | $59.45$ | $\mathbf{72.78}$ | $32.90$ | $\mathbf{51.85}$ | $64.59$ | $\mathbf{72.70}$ | $18.25$ |
| Mistral-7B | SFT | - | - | $28.27$ | $31.10$ | $44.96$ | $8.74$ | $21.08$ | $22.08$ | $36.80$ | $4.38$ |
| | Self-Improve ($k=8$) | $0.24\text{M}$ | $61.9\%$ | $25.22$ | $27.95$ | $40.56$ | $7.14$ | $26.96$ | $33.84$ | $43.30$ | $3.75$ |
| | Self-Improve ($k=64$) | $0.87\text{M}$ | $82.5\%$ | $28.23$ | $28.74$ | $47.16$ | $8.80$ | $30.16$ | $34.94$ | $49.90$ | $5.36$ |
| | Self-Improve ($k=128$) | $1.27\text{M}$ | $85.1\%$ | $28.32$ | $30.32$ | $45.79$ | $8.84$ | $28.81$ | $33.03$ | $49.40$ | $4.00$ |
| | Guided Self-Improve ($k=8$) | | | | | | | | | | |
| | + Answer-driven | $0.33\text{M}$ | $98.3\%$ | $28.09$ | $\mathbf{34.65}$ | $42.00$ | $7.62$ | $29.70$ | $34.34$ | $49.90$ | $4.88$ |
| | + Rationale-driven | $0.31\text{M}$ | $99.7\%$ | $29.13$ | $32.68$ | $46.17$ | $8.54$ | $\mathbf{32.14}$ | $\mathbf{35.24}$ | $\mathbf{53.80}$ | $\mathbf{7.38}$ |
| | + Interactive Sampling | $0.31\text{M}$ | $96.4\%$ | $29.04$ | $32.28$ | $45.87$ | $8.98$ | $30.22$ | $35.04$ | $50.00$ | $5.63$ |
| | + State Reset | $0.34\text{M}$ | $86.7\%$ | $\mathbf{31.23}$ | $33.07$ | $\mathbf{50.95}$ | $\mathbf{9.68}$ | $30.21$ | $34.94$ | $50.20$ | $5.50$ |
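
To make the arithmetic behind the table easier to follow, here is a minimal Python sketch using three Llama2-7B rows copied from the table above. It assumes the held-in Avg. column is the unweighted mean of the AQuA, GSM8K, and MATH scores; the source does not state this, but the transcribed rows are consistent with it up to rounding. The names `ROWS`, `mean`, `avg_matches`, `gain_over`, and `budget_ratio` are illustrative only and do not come from the paper.

```python
# Values transcribed from the Llama2-7B rows of the table above.
# budget_m: sample budget in millions (None where the table shows "-").
# scores:   held-in scores in the order [AQuA, GSM8K, MATH].
# avg:      the reported held-in Avg. for that row.
ROWS = {
    "SFT": {"budget_m": None, "scores": [27.17, 31.31, 4.52], "avg": 21.00},
    "Self-Improve (k=128)": {"budget_m": 1.67, "scores": [27.17, 37.98, 6.36], "avg": 23.84},
    "Guided (+ State Reset, k=8)": {"budget_m": 0.38, "scores": [31.10, 40.18, 6.44], "avg": 25.91},
}


def mean(scores):
    """Unweighted arithmetic mean of a list of scores."""
    return sum(scores) / len(scores)


def avg_matches(row, tol=0.01):
    """True if the reported Avg. equals the mean of the scores up to rounding.

    Assumption: Avg. is an unweighted mean over the three datasets.
    """
    return abs(mean(row["scores"]) - row["avg"]) <= tol


def gain_over(row, baseline):
    """Absolute difference in reported Avg. between a row and a baseline row."""
    return row["avg"] - baseline["avg"]


def budget_ratio(expensive, cheap):
    """How many times larger one row's sample budget is than another's."""
    return expensive["budget_m"] / cheap["budget_m"]


if __name__ == "__main__":
    for name, row in ROWS.items():
        print(f"{name}: mean={mean(row['scores']):.2f}, "
              f"reported={row['avg']:.2f}, consistent={avg_matches(row)}")

    guided = ROWS["Guided (+ State Reset, k=8)"]
    brute = ROWS["Self-Improve (k=128)"]
    print(f"Gain over SFT (guided): {gain_over(guided, ROWS['SFT']):.2f}")
    print(f"Gain over SFT (k=128):  {gain_over(brute, ROWS['SFT']):.2f}")
    print(f"Budget ratio k=128 / guided: {budget_ratio(brute, guided):.2f}x")
```

The same check can be repeated for any other row by adding its values to `ROWS`. It only compares numbers already in the table and makes no claim about how the paper computed them.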