Published on December 4
Surveying the Effects of Quality, Diversity, and Complexity in Synthetic Data From Large Language Models
cs.LG
cs.AI
cs.CL
Released Date: December 4, 2024
Authors: Alex Havrilla¹, Andrew Dai, Laura O'Mahony, Koen Oostermeijer, Vera Zisler, Alon Albalak, Fabrizio Milo, Sharath Chandra Raparthy, Kanishk Gandhi, Baber Abbasi, Duy Phung, Maia Iyer, Dakota Mahan, Chase Blagden, Srishti Gureja, Mohammed Hamdy, Wen-Ding Li, Giovanni Paolini, Pawan Sasanka Ammanamanchi, Elliot Meyerson
Aff.: ¹Georgia Tech

| Category | Subcategory | Works |
| --- | --- | --- |
| Lexical | Inter-Sample N-gram Frequency | (Yu et al., 2023b) |
| Attribute-Based | Number of Unique Task Types | (Wang et al., 2022) |
| | Number of Unique Languages | (Wang et al., 2022) |
| | Number of Unique Domains or Topics | (Wang et al., 2022) |
| | Number of Unique Root Verbs and Nouns | (Li et al., 2024d) |
| | Skills Used to Arrive at a Solution | (Pourcel et al., 2024) |
| | Order of Mathematical Operations | (Tian et al., 2024), (Havrilla et al., 2024a) |
| | Attribute Labels | (Zhou et al., 2023a), (Wang et al., 2022) |
| | MAP-Elites Coverage | (Samvelyan et al., 2024b), (Bradley et al., 2023) |
| | Automatically Discovered Attributes | (Yu et al., 2023b) |
| Embedding | Average Pairwise Cosine Similarity | (Yu et al., 2023b), (Kirk et al., 2024), (Yu et al., 2024) |
| | i-th Nearest Neighbor | (Cao et al., 2024) |
| | SemDeDup | (Abbas et al., 2023) |
| | Facility Location Function | (Reimers & Gurevych, 2019), (Bukharin & Zhao, 2024) |
| Other | Diversity Coefficient for Natural Language | (Lee et al., 2023) |
| | NLI Diversity | (Kirk et al., 2024) |
| | MTLD | (McCarthy & Jarvis, 2009) |
| | N-gram Frequency | (Li et al., 2016a) |
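
To make two of the listed measures concrete, here is a minimal Python sketch of an n-gram frequency score and of average pairwise cosine similarity. It is an illustration under stated assumptions, not the implementation used in any of the cited works: the function names `distinct_n` and `average_pairwise_cosine` are hypothetical, tokenization is plain whitespace splitting, the n-gram score is normalized by the total n-gram count, and the embedding vectors are assumed to come from some sentence encoder that is not shown here.

```python
from collections import Counter
from itertools import combinations
import math


def distinct_n(texts, n=2):
    """Distinct n-grams divided by total n-grams across all samples.

    Higher values suggest more lexical diversity. Whitespace tokenization
    and this normalization are assumptions made for illustration.
    """
    counts = Counter()
    for text in texts:
        tokens = text.lower().split()
        # zip over n shifted copies of the token list yields the n-grams
        counts.update(zip(*(tokens[i:] for i in range(n))))
    total = sum(counts.values())
    return len(counts) / total if total else 0.0


def average_pairwise_cosine(vectors):
    """Mean cosine similarity over all unordered pairs of vectors.

    Lower values suggest more diversity in embedding space. The vectors
    are assumed to be precomputed embeddings, one per sample.
    """
    def cosine(u, v):
        dot = sum(a * b for a, b in zip(u, v))
        norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
        return dot / norm if norm else 0.0

    pairs = list(combinations(vectors, 2))
    return sum(cosine(u, v) for u, v in pairs) / len(pairs) if pairs else 0.0


if __name__ == "__main__":
    samples = ["solve the equation", "solve the puzzle", "write a poem"]
    print(distinct_n(samples, n=2))
    # Toy 2-d vectors standing in for real sentence embeddings
    print(average_pairwise_cosine([[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]))
```

The two scores point in opposite directions: a larger distinct n-gram ratio and a smaller average cosine similarity both indicate a more varied dataset, so they should not be averaged together without flipping one of them.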