# MAGATAMA Training Corpus Dedupe

Date: 2026-05-06
Author: Codex

## Summary

After fixing MAGATAMA so verified fixes write to `fixes.jsonl` and failures go to `errors.jsonl`, the historical MAGATAMA training corpus was scrubbed and deduplicated.

## Actions

- Added cleanup helper:
  - `magatama/scripts/scrub_training_corpus.mjs`
- Ran corpus cleanup locally against:
  - `training-data/gitea-learning-pool/magatamallm/fixes.jsonl`
- Kept a backup of the previous large corpus:
  - `training-data/gitea-learning-pool/magatamallm/fixes-pre-dedupe-20260506.jsonl`
- Synced cleaned corpus back to Erik.
- Pulled live `errors.jsonl` back from Erik so laptop and server match.
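
The cleanup helper itself is not reproduced in this note. As a rough illustration of the kind of pass described above, here is a minimal sketch, not the actual `scrub_training_corpus.mjs`. It assumes each JSONL row is a JSON object with a boolean `verified` field, and that two rows are duplicates when their content is identical regardless of key order. `canonicalize`, `scrubCorpus`, and the result field names are hypothetical.

```javascript
// Hypothetical sketch; the real scrub_training_corpus.mjs may work differently.
// Assumption: each JSONL row is an object with a boolean `verified` field.

// Serialize with sorted keys so key order does not affect the dedupe key.
function canonicalize(value) {
  if (Array.isArray(value)) {
    return `[${value.map(canonicalize).join(",")}]`;
  }
  if (value !== null && typeof value === "object") {
    const body = Object.keys(value)
      .sort()
      .map((key) => `${JSON.stringify(key)}:${canonicalize(value[key])}`)
      .join(",");
    return `{${body}}`;
  }
  return JSON.stringify(value);
}

function scrubCorpus(jsonlText) {
  const seen = new Set();
  const result = { unique: [], failed: [], duplicates: 0, parseErrors: 0 };
  for (const line of jsonlText.split("\n")) {
    if (line.trim() === "") continue; // ignore blank lines
    let row;
    try {
      row = JSON.parse(line);
    } catch {
      result.parseErrors += 1; // unparseable line, counted but not kept
      continue;
    }
    if (row === null || typeof row !== "object" || row.verified !== true) {
      result.failed.push(row); // not a verified fix: candidate for errors.jsonl
      continue;
    }
    const key = canonicalize(row);
    if (seen.has(key)) {
      result.duplicates += 1; // same content already kept once
      continue;
    }
    seen.add(key);
    result.unique.push(row);
  }
  return result;
}
```

In this sketch, rows in `unique` would be written back to `fixes.jsonl` and rows in `failed` routed to `errors.jsonl`. How the real helper defines a duplicate may differ.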

## Final corpus state

- `fixes.jsonl`
  - `1,368` unique verified rows
- `errors.jsonl`
  - `4` failed/escalated rows
- `corpus-integrity-report.json`
  - `scanned: 1368`
  - `verified: 1368`
  - `movedToErrors: 4`
  - `parseErrors: 0`
  - `invalidVerifiedFlag: 0`
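
A report like this can be sanity-checked mechanically. The sketch below is one possible check using the field names shown above. The pass criteria (zero parse errors, zero invalid flags, and `verified` equal to both `scanned` and the row count of `fixes.jsonl`) are assumptions inferred from the numbers in this note, and `reportProblems` is a hypothetical helper.

```javascript
// Hypothetical check. Field names come from the report above;
// the pass criteria are assumptions, not documented behavior.
function reportProblems(report, fixesRowCount) {
  const problems = [];
  if (report.parseErrors !== 0) {
    problems.push(`parseErrors is ${report.parseErrors}, expected 0`);
  }
  if (report.invalidVerifiedFlag !== 0) {
    problems.push(`invalidVerifiedFlag is ${report.invalidVerifiedFlag}, expected 0`);
  }
  if (report.verified !== report.scanned) {
    problems.push(`verified (${report.verified}) differs from scanned (${report.scanned})`);
  }
  if (report.verified !== fixesRowCount) {
    problems.push(`verified (${report.verified}) differs from fixes.jsonl rows (${fixesRowCount})`);
  }
  return problems; // an empty array means no problems were found
}

// Values copied from the report in this note.
const report = {
  scanned: 1368,
  verified: 1368,
  movedToErrors: 4,
  parseErrors: 0,
  invalidVerifiedFlag: 0,
};
console.log(reportProblems(report, 1368).length === 0 ? "report looks consistent" : "report has problems");
```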

## Important interpretation

The earlier corpus looked much larger because it contained many duplicate training rows. The cleaned corpus is smaller but substantially more honest and trainable.

Do not treat the raw historical line count as a measure of real training value.

## Live sync status

Confirmed on Erik:

- `/opt/magatama/.../fixes.jsonl` has `1368` rows
- `/opt/magatama/.../errors.jsonl` has `4` rows
- `/opt/magatama/.../corpus-integrity-report.json` matches the latest local report
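
Row counts like these can be reproduced on either machine by counting non-blank lines. Here is a minimal sketch, assuming one JSON object per line. `countJsonlRows` is a hypothetical helper, and reading the file is left to the caller.

```javascript
// Hypothetical helper: counts non-blank lines in JSONL text.
// Assumption: one JSON object per line, with no multi-line records.
function countJsonlRows(text) {
  return text.split("\n").filter((line) => line.trim() !== "").length;
}
```

Feeding it the contents of `fixes.jsonl` on both the laptop and Erik should give the same number if the sync succeeded.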

## Recommendation

Use the cleaned verified corpus for the next MagatamaLLM full run, and keep `errors.jsonl` as a separate failure-analysis lane rather than mixing it back into the main SFT corpus.