LP Data Pipeline: Lightweight, Purpose-driven Data Pipeline for Large Language Models
cs.CL
cs.AI
Released Date: November 18, 2024
Authors: Yungi Kim, Hyunsoo Ha, Seonghoon Yang, Sukyung Lee, Jihoo Kim, Chanjun Park
Affiliation (all authors): Upstage AI

| Processing Phase | Processing Time | Estimated Cost |
|---|---|---|
| Raw Text Extraction | 1h 30m | $128.69 |
| Language Identification | 19m | $20.70 |
| Line-Level Deduplication | 1h | $85.79 |
| Heuristic Filtering | 25m | $27.24 |
| Global Deduplication | 13m | $14.16 |
| Model-based Quality Filtering | 48m | $68.63 |
| Domain Classification | 7m | $7.62 |
| Total | 4h 22m | $352.83 |
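
The Total row can be cross-checked against the per-phase rows. The following is a minimal sketch of that arithmetic, assuming only the figures in the table; the names `PHASES` and `format_duration` are illustrative and do not come from the paper.

```python
from decimal import Decimal

# (phase, processing time in minutes, estimated cost in USD), copied from the table.
PHASES = [
    ("Raw Text Extraction", 90, Decimal("128.69")),
    ("Language Identification", 19, Decimal("20.70")),
    ("Line-Level Deduplication", 60, Decimal("85.79")),
    ("Heuristic Filtering", 25, Decimal("27.24")),
    ("Global Deduplication", 13, Decimal("14.16")),
    ("Model-based Quality Filtering", 48, Decimal("68.63")),
    ("Domain Classification", 7, Decimal("7.62")),
]


def format_duration(minutes: int) -> str:
    """Render a minute count in the table's 'Xh Ym' style."""
    hours, rest = divmod(minutes, 60)
    return f"{hours}h {rest}m"


# Sum the per-phase values; Decimal avoids floating-point drift in the cents.
total_minutes = sum(minutes for _, minutes, _ in PHASES)
total_cost = sum((cost for _, _, cost in PHASES), Decimal("0"))

print(format_duration(total_minutes), f"${total_cost}")
```

Under these inputs the sums come to 4h 22m and $352.83, which agrees with the Total row.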