notesum.ai
Published on October 22
Evaluating AI-Generated Essays with GRE Analytical Writing Assessment
cs.LG
cs.AI
Release Date: October 22, 2024
Authors: Yang Zhong¹, Jiangang Hao², Michael Fauss, Chen Li², Yuan Wang²
Affiliations: ¹University of Pittsburgh; ²ETS Research Institute

| Type | Score | GPT-4o | Gemini | Llama3-8b | GPT-4 | GPT-3.5 | Bard | Mistral | Koala | Vicuna | Llama |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| | Generation | mid-2024 | mid-2024 | mid-2024 | mid-2023 | mid-2023 | mid-2023 | mid-2023 | mid-2023 | mid-2023 | mid-2023 |
| | Access | closed | closed | open | closed | closed | closed | open | open | open | open |
| Human Rater | 6 | 6.5% | 6.5% | 0.5% | 2.5% | 4.0% | 0.5% | 0.5% | 0.5% | 0.0% | 0.0% |
| Human Rater | 5 | 54.5% | 64.5% | 16.5% | 48.0% | 36.5% | 12.5% | 5.0% | 2.5% | 8.0% | 0.0% |
| Human Rater | 4 | 38.5% | 29.0% | 82.5% | 49.5% | 57.0% | 75.5% | 73.5% | 72.5% | 71.5% | 1.5% |
| Human Rater | 3 | 0.5% | 0.0% | 0.5% | 0.0% | 2.5% | 11.5% | 21.0% | 24.0% | 20.5% | 20.5% |
| Human Rater | 2 | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 0.5% | 0.0% | 41.5% |
| Human Rater | 1 | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 32.0% |
| Human Rater | 0 | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 4.5% |
| Human Rater | Mean | 4.67 | 4.78 | 4.17 | 4.53 | 4.42 | 4.02 | 3.85 | 3.79 | 3.88 | 1.82 |
| e-rater® | 6 | 31.5% | 7.5% | 1.5% | 0.0% | 3.5% | 1.5% | 1.0% | 1.0% | 0.5% | 1.0% |
| e-rater® | 5 | 68.5% | 92.5% | 93.0% | 98.0% | 93.5% | 71.5% | 52.5% | 43.0% | 54.0% | 33.5% |
| e-rater® | 4 | 0.0% | 0.0% | 5.5% | 0.5% | 3.0% | 28.5% | 44.5% | 55.5% | 45.5% | 44.0% |
| e-rater® | 3 | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 2.0% | 0.5% | 0.0% | 14.5% |
| e-rater® | 2 | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 5.0% |
| e-rater® | 1 | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 1.0% |
| e-rater® | 0 | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 1.0% |
| e-rater® | Mean | 5.32 | 5.08 | 4.96 | 5.01 | 5.00 | 4.71 | 4.53 | 4.45 | 4.55 | 4.04 |
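
The Mean rows in the table appear to be the score values weighted by the percentage of essays at each score. The following is a minimal sketch of that arithmetic, not the authors' code: the helper name `mean_score` and the variable names are hypothetical, and the percentages are copied from the GPT-4o and Llama columns of the table. Because the table reports percentages to one decimal place, recomputed means can differ from the reported ones in the last digit.

```python
# Scores on the 0-6 scale, listed in the same order as the table rows.
SCORES = [6, 5, 4, 3, 2, 1, 0]

# Percent of essays at each score, copied from the table (illustrative variable names).
human_gpt4o = [6.5, 54.5, 38.5, 0.5, 0.0, 0.0, 0.0]
human_llama = [0.0, 0.0, 1.5, 20.5, 41.5, 32.0, 4.5]
erater_gpt4o = [31.5, 68.5, 0.0, 0.0, 0.0, 0.0, 0.0]


def mean_score(percentages, scores=SCORES):
    """Weighted mean of the scores, using the percentages as weights.

    Dividing by the sum of the weights (rather than a fixed 100) keeps the
    result well defined even if a column does not sum to exactly 100%.
    """
    total = sum(percentages)
    return sum(s * p for s, p in zip(scores, percentages)) / total


if __name__ == "__main__":
    print(round(mean_score(human_gpt4o), 2))  # 4.67, matching the Human Rater mean for GPT-4o
    print(mean_score(human_llama))            # close to the reported 1.82
    print(mean_score(erater_gpt4o))           # close to the reported 5.32
```

The same helper can be applied to any other column to check a reported mean against its distribution.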