DetectRL: Benchmarking LLM-Generated Text Detection in Real-World Scenarios
cs.CL
cs.AI
Released Date: October 31, 2024
Authors: Junchao Wu¹, Runzhe Zhan¹, Derek F. Wong¹, Shu Yang¹, Xinyi Yang¹, Yulin Yuan², Lidia S. Chao¹
Affiliations: ¹NLP2CT Lab, Department of Computer and Information Science, University of Macau; ²Department of Chinese Language and Literature, University of Macau

Leaderboard: LLM-Generated Text Detector in Real-World Scenarios

| Task setting | Multi-Domain | | Multi-LLM | | Multi-Attack | | Generalization | | | Time | | Human Writing | | Avg. |
| Sub-setting | | | | | | | Domain | LLM | Attack | Train | Test | | | |
| Detector | AUROC | | AUROC | | AUROC | | | | | | | AUROC | | |
| Rob-Base | 99.98 | 99.75 | 99.93 | 99.58 | 99.56 | 97.66 | 83.00 | 91.81 | 92.37 | 79.99 | 74.00 | 97.34 | 94.31 | 93.02 |
| Rob-Large | 99.78 | 98.87 | 95.16 | 90.03 | 99.87 | 99.03 | 77.20 | 82.85 | 83.96 | 86.08 | 85.23 | 96.68 | 94.63 | 91.49 |
| X-Rob-Base | 99.92 | 99.34 | 99.14 | 98.17 | 98.49 | 96.07 | 75.97 | 92.73 | 90.58 | 84.25 | 73.83 | 93.43 | 90.29 | 91.71 |
| X-Rob-Large | 99.01 | 97.44 | 97.40 | 93.47 | 99.31 | 97.75 | 76.14 | 85.89 | 73.42 | 86.35 | 79.83 | 97.21 | 94.43 | 90.59 |
| Binoculars | 83.95 | 78.25 | 83.30 | 74.83 | 85.05 | 78.53 | 77.47 | 74.10 | 74.70 | 73.82 | 74.34 | 90.68 | 85.98 | 79.61 |
| Revise-Detect. | 67.24 | 60.82 | 66.36 | 53.72 | 70.89 | 57.24 | 54.50 | 53.28 | 50.63 | 65.71 | 67.96 | 83.29 | 82.16 | 64.13 |
| Log-Rank | 64.43 | 57.53 | 63.75 | 54.18 | 68.52 | 55.15 | 55.10 | 52.78 | 51.28 | 57.44 | 59.74 | 88.46 | 83.85 | 62.48 |
| LRR | 65.47 | 55.45 | 64.93 | 53.01 | 68.53 | 57.99 | 54.61 | 52.73 | 57.41 | 57.09 | 58.15 | 85.99 | 80.56 | 62.46 |
| Log-Likelihood | 63.71 | 56.36 | 62.97 | 53.13 | 67.97 | 54.38 | 53.37 | 51.77 | 50.73 | 57.92 | 59.28 | 88.48 | 83.75 | 61.83 |
| DNA-GPT | 64.92 | 55.83 | 64.36 | 51.09 | 68.36 | 53.36 | 51.51 | 47.09 | 41.98 | 57.63 | 62.43 | 87.80 | 82.77 | 60.70 |
| Fast-DetectGPT | 58.52 | 48.07 | 59.58 | 46.55 | 60.70 | 50.63 | 48.35 | 36.56 | 49.47 | 61.31 | 55.08 | 76.03 | 68.47 | 55.33 |
| Rank | 51.34 | 44.97 | 50.33 | 42.06 | 57.08 | 48.83 | 42.61 | 41.49 | 38.84 | 41.67 | 46.65 | 83.86 | 80.00 | 51.52 |
| NPR | 48.37 | 41.41 | 47.27 | 40.04 | 53.49 | 45.22 | 38.58 | 38.83 | 36.10 | 37.60 | 42.17 | 80.03 | 75.98 | 48.08 |
| DetectGPT | 34.43 | 21.52 | 34.93 | 14.80 | 36.19 | 19.15 | 11.54 | 13.11 | 11.84 | 35.78 | 34.69 | 60.86 | 48.76 | 29.05 |
| Entropy | 46.02 | 27.40 | 46.97 | 34.25 | 43.75 | 24.69 | 25.06 | 31.07 | 16.53 | 13.38 | 15.99 | 22.39 | 16.60 | 28.01 |
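
The Avg. column appears to be the plain arithmetic mean of the 13 score columns before it. That reading is an assumption, since the page does not state how Avg. is computed, but it can be checked against the table. The sketch below copies a few rows verbatim and recomputes the mean. The names `ROWS`, `recomputed_avg`, and `avg_gaps` are hypothetical helpers introduced only for this illustration.

```python
from statistics import fmean

# Rows copied from the leaderboard above: 13 scores followed by the reported Avg.
ROWS = {
    "Rob-Base": [99.98, 99.75, 99.93, 99.58, 99.56, 97.66, 83.00,
                 91.81, 92.37, 79.99, 74.00, 97.34, 94.31, 93.02],
    "Rob-Large": [99.78, 98.87, 95.16, 90.03, 99.87, 99.03, 77.20,
                  82.85, 83.96, 86.08, 85.23, 96.68, 94.63, 91.49],
    "Binoculars": [83.95, 78.25, 83.30, 74.83, 85.05, 78.53, 77.47,
                   74.10, 74.70, 73.82, 74.34, 90.68, 85.98, 79.61],
    "DetectGPT": [34.43, 21.52, 34.93, 14.80, 36.19, 19.15, 11.54,
                  13.11, 11.84, 35.78, 34.69, 60.86, 48.76, 29.05],
    "Entropy": [46.02, 27.40, 46.97, 34.25, 43.75, 24.69, 25.06,
                31.07, 16.53, 13.38, 15.99, 22.39, 16.60, 28.01],
}


def recomputed_avg(values):
    """Mean of the 13 score columns (everything except the trailing Avg.)."""
    return fmean(values[:-1])


def avg_gaps(rows):
    """Map each detector to |recomputed mean - reported Avg.|."""
    return {name: abs(recomputed_avg(vals) - vals[-1]) for name, vals in rows.items()}


if __name__ == "__main__":
    for name, gap in avg_gaps(ROWS).items():
        reported = ROWS[name][-1]
        print(f"{name}: reported {reported:.2f}, "
              f"recomputed {recomputed_avg(ROWS[name]):.2f}, gap {gap:.4f}")
```

For the rows included here, the recomputed mean stays within 0.01 of the reported Avg. The small gaps are consistent with the published scores having been rounded to two decimals before display.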