CLR-Bench: Evaluating Large Language Models in College-level Reasoning
Categories: cs.LG; cs.AI; I.2.4
Released: October 23, 2024
Authors: Junnan Dong¹, Zijin Hong¹, Yuanchen Bei¹, Feiran Huang², Xinrun Wang³, Xiao Huang¹
Affiliations: ¹The Hong Kong Polytechnic University; ²Jinan University; ³Singapore Management University

QA reports the direct performance of answer prediction; QAR reports the reasoning performance of the rationale. The per-type columns MC, MS, TF, FB, and OE denote the question types (multiple-choice, multiple-select, true/false, fill-in-the-blank, and open-ended, respectively).

| # | Models | QA: MC | QA: MS | QA: TF | QA: FB | QA: OE | QA (overall) | QAR: MC | QAR: MS | QAR: TF | QAR: FB | QAR: OE | QAR (overall) |
|---|--------|-------:|-------:|-------:|-------:|-------:|-------------:|--------:|--------:|--------:|--------:|--------:|--------------:|
| 1 | qwen2.5-7b | 69.12% | 54.95% | 54.11% | 40.00% | 49.07% | 54.62% | 43.55% | 34.46% | 43.43% | 45.00% | 8.36% | 33.37% |
| 2 | gemma2-9b-it | 72.81% | 50.45% | 55.70% | 43.81% | 40.52% | 53.54% | 45.62% | 32.88% | 47.15% | 44.52% | 7.90% | 34.63% |
| 3 | qwen2.5-72b | 80.18% | 73.42% | 60.44% | 47.62% | 52.04% | 62.52% | 50.00% | 43.02% | 45.97% | 49.52% | 11.99% | 37.89% |
| 4 | gemma2-9b | 16.13% | 47.75% | 11.71% | 9.52% | 11.34% | 16.26% | 11.29% | 8.33% | 10.60% | 11.67% | 3.62% | 8.77% |
| 5 | mixtral-8x7b-instruct-v0.1 | 59.91% | 45.05% | 34.18% | 16.19% | 43.12% | 41.36% | 42.97% | 31.31% | 39.08% | 34.52% | 8.36% | 30.48% |
| 6 | phi-3-medium-4k-instruct | 74.65% | 63.96% | 56.01% | 38.10% | 48.88% | 57.12% | 51.50% | 33.56% | 49.68% | 44.05% | 11.34% | 37.60% |
| 7 | qwen2.5-72b-instruct | 81.11% | 77.93% | 63.61% | 45.71% | 52.23% | 64.05% | 50.58% | 45.50% | 53.56% | 48.10% | 13.10% | 40.79% |
| 8 | llama-2-7b | 54.38% | 15.77% | 44.62% | 19.05% | 36.43% | 38.75% | 26.61% | 18.47% | 27.29% | 24.05% | 6.78% | 20.43% |
| 9 | llama-3.1-8b-instruct | 63.13% | 51.80% | 56.96% | 25.71% | 41.82% | 50.49% | 39.06% | 22.97% | 41.85% | 33.57% | 8.55% | 29.54% |
| 10 | phi-3-mini-4k-instruct | 69.12% | 59.46% | 55.38% | 34.29% | 44.80% | 53.78% | 49.42% | 37.84% | 45.41% | 41.43% | 9.57% | 35.56% |
| 11 | yi-1.5-6b-chat | 37.33% | 53.15% | 48.10% | 25.71% | 38.85% | 41.60% | 30.30% | 32.43% | 32.99% | 34.05% | 6.88% | 25.56% |
| 12 | deepseek-7b-chat | 49.31% | 27.48% | 45.57% | 20.00% | 31.41% | 38.02% | 30.76% | 18.92% | 34.97% | 27.62% | 5.11% | 23.67% |
| 13 | yi-1.5-34b-chat | 68.66% | 53.60% | 56.96% | 40.95% | 39.96% | 52.95% | 40.90% | 33.33% | 48.34% | 44.76% | 7.16% | 33.87% |
| 14 | deepseek-7b-base | 49.77% | 36.49% | 44.94% | 18.10% | 36.62% | 40.08% | 27.88% | 18.47% | 28.16% | 25.24% | 5.39% | 20.73% |
| 15 | llama-3.1-70b-instruct | 80.18% | 74.77% | 61.39% | 40.00% | 44.61% | 60.22% | 44.24% | 38.06% | 45.09% | 38.57% | 7.99% | 33.67% |
| 16 | llama-3-8b | 59.91% | 35.14% | 50.95% | 26.67% | 43.31% | 46.61% | 32.37% | 24.10% | 31.09% | 29.29% | 7.53% | 24.19% |
| 17 | mistral-7b-instruct-v0.1 | 57.60% | 45.05% | 40.82% | 21.90% | 42.19% | 43.27% | 35.48% | 27.48% | 34.41% | 33.33% | 6.04% | 26.28% |
| 18 | gemma-7b-it | 15.21% | 37.84% | 45.25% | 8.57% | 13.38% | 25.83% | 13.02% | 1.80% | 35.52% | 20.24% | 10.04% | 18.74% |
| 19 | qwen1.5-7b-chat | 32.26% | 50.00% | 11.71% | 13.33% | 35.32% | 26.67% | 27.42% | 26.58% | 27.22% | 33.57% | 6.41% | 22.35% |
| 20 | yi-1.5-6b | 53.00% | 59.01% | 51.27% | 30.48% | 42.75% | 48.08% | 33.53% | 29.95% | 39.16% | 35.00% | 6.13% | 27.80% |
| 21 | qwen2.5-32b-instruct | 80.18% | 75.23% | 63.29% | 44.76% | 52.04% | 63.31% | 55.41% | 45.50% | 56.09% | 48.33% | 11.80% | 42.29% |
| 22 | qwen2.5-7b-instruct | 70.97% | 63.06% | 58.23% | 37.14% | 49.26% | 56.93% | 52.65% | 34.23% | 50.32% | 41.90% | 8.92% | 37.25% |
| 23 | llama-3.1-8b | 62.67% | 40.54% | 51.58% | 30.48% | 44.98% | 48.82% | 34.22% | 28.83% | 32.52% | 33.57% | 7.99% | 26.11% |
| 24 | llama-3.2-3b-instruct | 57.14% | 39.19% | 49.05% | 23.81% | 34.94% | 43.37% | 40.55% | 21.40% | 38.29% | 32.62% | 8.09% | 28.36% |
| 25 | llama-3-8b-instruct | 60.83% | 49.55% | 52.22% | 29.52% | 47.40% | 50.15% | 43.78% | 26.35% | 41.30% | 36.19% | 9.76% | 31.34% |
| 26 | openchat-3.5 | 60.83% | 39.64% | 49.68% | 25.71% | 36.25% | 44.94% | 35.94% | 22.75% | 33.86% | 30.24% | 8.46% | 26.01% |
| 27 | qwen1.5-7b | 58.99% | 54.50% | 47.15% | 37.14% | 38.29% | 47.10% | 36.64% | 33.11% | 32.99% | 35.95% | 6.78% | 27.16% |
| 28 | llama-2-7b-chat | 38.25% | 36.49% | 39.24% | 13.33% | 35.50% | 35.07% | 26.04% | 16.89% | 31.49% | 26.67% | 5.30% | 21.32% |
| 29 | mistral-7b-v0.1 | 58.06% | 42.79% | 50.63% | 37.14% | 43.87% | 48.18% | 33.06% | 29.05% | 34.10% | 35.24% | 7.16% | 26.33% |
| 30 | yi-1.5-34b | 70.97% | 58.11% | 57.59% | 41.90% | 48.33% | 56.43% | 40.55% | 35.59% | 42.96% | 43.10% | 7.25% | 32.22% |
| 31 | mixtral-8x7b-v0.1 | 67.28% | 39.64% | 53.48% | 39.05% | 45.72% | 51.38% | 39.06% | 31.53% | 36.00% | 36.19% | 8.83% | 29.00% |
| 32 | llama-3.1-70b | 75.58% | 51.80% | 57.59% | 45.71% | 47.77% | 56.97% | 43.89% | 35.81% | 41.46% | 45.48% | 10.50% | 33.60% |
| 33 | llama-3.2-1b-instruct | 41.47% | 40.99% | 37.66% | 14.29% | 25.09% | 33.10% | 28.00% | 13.06% | 24.68% | 24.76% | 6.69% | 19.38% |
| 34 | gpt-3.5-turbo | 63.13% | 66.67% | 58.54% | 23.81% | 47.77% | 53.98% | 42.51% | 41.22% | 49.60% | 39.29% | 6.41% | 34.70% |
| 35 | claude-3-sonnet | 76.96% | 76.58% | 59.81% | 42.86% | 48.33% | 60.51% | 50.69% | 44.59% | 48.26% | 45.95% | 11.80% | 38.51% |
| 36 | gemini-1.5-pro | 78.34% | 71.62% | 55.38% | 46.67% | 52.42% | 60.36% | 50.46% | 40.54% | 43.35% | 50.24% | 12.83% | 37.21% |
| 37 | deepseek-chat | 78.80% | 77.03% | 62.66% | 40.95% | 51.30% | 62.43% | 55.41% | 44.14% | 55.14% | 46.43% | 13.48% | 42.09% |
| 38 | gpt-4o | 84.33% | 76.58% | 63.92% | 54.29% | 34.01% | 60.76% | 53.11% | 44.82% | 57.52% | 51.19% | 8.55% | 41.60% |
| 39 | gpt-4-turbo | 82.49% | 78.38% | 63.92% | 50.48% | 45.91% | 63.31% | 51.73% | 45.95% | 49.45% | 51.43% | 8.74% | 39.00% |
| 40 | claude-3-opus | 80.18% | 81.53% | 62.66% | 52.38% | 44.42% | 62.57% | 50.35% | 43.02% | 50.08% | 50.00% | 9.20% | 38.56% |
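
The gap between a model's overall QA and QAR scores indicates how often it predicts the correct answer without producing a sound rationale. Below is a minimal sketch (not the authors' code) that re-ranks a few of the strongest models from the table by QAR and reports this gap; the scores are copied verbatim from the rows above.

```python
# A minimal sketch: compare overall QA and QAR scores of a few top models
# from the leaderboard above and report the drop when a rationale is required.

# (model, overall QA %, overall QAR %) copied from the table above
results = [
    ("qwen2.5-72b-instruct", 64.05, 40.79),
    ("gpt-4-turbo",          63.31, 39.00),
    ("qwen2.5-32b-instruct", 63.31, 42.29),
    ("claude-3-opus",        62.57, 38.56),
    ("deepseek-chat",        62.43, 42.09),
    ("gpt-4o",               60.76, 41.60),
]

# Rank by reasoning (QAR) performance rather than direct answer accuracy.
for model, qa, qar in sorted(results, key=lambda r: r[2], reverse=True):
    gap = qa - qar  # accuracy drop when the rationale must also be correct
    print(f"{model:<24} QA={qa:5.2f}%  QAR={qar:5.2f}%  gap={gap:5.2f}%")
```

Re-ranking by QAR changes the ordering: models such as qwen2.5-32b-instruct and deepseek-chat move ahead of gpt-4-turbo and claude-3-opus despite comparable or lower QA scores, which is why the table reports both metrics.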