MAGATAMA Training Corpus Dedupe
Date: 2026-05-06
Author: Codex
Summary
After fixing MAGATAMA so verified fixes write to fixes.jsonl and failures go to errors.jsonl, the historical MAGATAMA training corpus was scrubbed and deduplicated.
Actions
- Added cleanup helper:
magatama/scripts/scrub_training_corpus.mjs
- Ran corpus cleanup locally against:
training-data/gitea-learning-pool/magatamallm/fixes.jsonl
- Kept a backup of the previous large corpus:
training-data/gitea-learning-pool/magatamallm/fixes-pre-dedupe-20260506.jsonl
- Synced cleaned corpus back to Erik.
- Pulled live errors.jsonl back from Erik so laptop and server match.
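The scrub-and-dedupe step could look roughly like the sketch below. This is not the contents of scrub_training_corpus.mjs; it is a minimal illustration that assumes each JSONL row is an object carrying a boolean `verified` field and that two rows count as duplicates when their key-sorted JSON is identical. The names `canonicalize`, `scrubCorpus`, and the `duplicatesDropped` counter are hypothetical.

```javascript
// Hypothetical sketch, not the real helper. Assumes rows are JSON objects
// with a boolean `verified` field.

// Serialize a parsed JSON value with object keys sorted, so rows that differ
// only in key order produce the same dedupe key.
function canonicalize(value) {
  if (Array.isArray(value)) {
    return "[" + value.map(canonicalize).join(",") + "]";
  }
  if (value !== null && typeof value === "object") {
    const keys = Object.keys(value).sort();
    const parts = keys.map((k) => JSON.stringify(k) + ":" + canonicalize(value[k]));
    return "{" + parts.join(",") + "}";
  }
  return JSON.stringify(value);
}

// Split JSONL text into unique verified rows, everything else that parsed,
// and a small report of what happened to each scanned line.
function scrubCorpus(jsonlText) {
  const seen = new Set();
  const fixes = [];
  const errors = [];
  const report = {
    scanned: 0,
    verified: 0,
    duplicatesDropped: 0, // hypothetical counter, not in the real report
    movedToErrors: 0,
    parseErrors: 0,
  };

  for (const line of jsonlText.split("\n")) {
    const trimmed = line.trim();
    if (trimmed === "") continue; // blank lines are not rows
    report.scanned += 1;

    let row;
    try {
      row = JSON.parse(trimmed);
    } catch {
      report.parseErrors += 1;
      continue;
    }

    // Anything that is not an object with verified === true goes to errors.
    if (row === null || typeof row !== "object" || row.verified !== true) {
      errors.push(row);
      report.movedToErrors += 1;
      continue;
    }

    const key = canonicalize(row);
    if (seen.has(key)) {
      report.duplicatesDropped += 1;
      continue;
    }
    seen.add(key);
    fixes.push(row);
    report.verified += 1;
  }

  return { fixes, errors, report };
}
```

Sorting keys before comparing is a design choice: it treats rows that were re-serialized in a different key order as the same training example. A stricter or looser key (for example, only the prompt and fix fields) may fit the real schema better.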
Final corpus state
- fixes.jsonl: 1,368 unique verified rows
- errors.jsonl: 4 failed/escalated rows
- corpus-integrity-report.json:
  scanned: 1368
  verified: 1368
  movedToErrors: 4
  parseErrors: 0
  invalidVerifiedFlag: 0
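A quick consistency check between the report and the two files might be sketched as follows. It assumes that `verified` should equal the row count of fixes.jsonl and `movedToErrors` the row count of errors.jsonl; that mapping is consistent with the numbers above but is an assumption, and `countRows` and `checkReport` are hypothetical names.

```javascript
// Hypothetical sketch. Assumes report.verified tracks fixes.jsonl rows and
// report.movedToErrors tracks errors.jsonl rows.

// Count non-empty lines in JSONL text; each non-empty line is one row.
function countRows(jsonlText) {
  return jsonlText.split("\n").filter((line) => line.trim() !== "").length;
}

// Compare a parsed integrity report against the file contents.
// Returns a list of mismatch descriptions; an empty list means consistent.
function checkReport(report, fixesText, errorsText) {
  const problems = [];
  const fixesRows = countRows(fixesText);
  const errorsRows = countRows(errorsText);

  if (report.verified !== fixesRows) {
    problems.push(`verified=${report.verified} but fixes.jsonl has ${fixesRows} rows`);
  }
  if (report.movedToErrors !== errorsRows) {
    problems.push(`movedToErrors=${report.movedToErrors} but errors.jsonl has ${errorsRows} rows`);
  }
  if (report.parseErrors !== 0) {
    problems.push(`parseErrors=${report.parseErrors}, expected 0`);
  }
  if (report.invalidVerifiedFlag !== 0) {
    problems.push(`invalidVerifiedFlag=${report.invalidVerifiedFlag}, expected 0`);
  }
  return problems;
}
```

In use, the three inputs would come from reading corpus-integrity-report.json (parsed with `JSON.parse`), fixes.jsonl, and errors.jsonl as text; an empty result would indicate the report and files agree under the assumptions above.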
Important interpretation
The earlier corpus looked much larger only because many training rows were duplicated. The cleaned corpus is smaller but substantially more honest and better suited to training.
Do not treat raw historical line count as real training value.
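To make the gap between raw line count and unique content concrete, a rough measurement could be sketched like this. It treats two rows as duplicates only when their trimmed lines are byte-identical, which is a simplifying assumption; `duplicationStats` is a hypothetical name and not part of the cleanup helper.

```javascript
// Hypothetical sketch: measure how much of a JSONL file is exact repetition.
// Only exact (trimmed) line matches count as duplicates here.
function duplicationStats(jsonlText) {
  const lines = jsonlText
    .split("\n")
    .map((line) => line.trim())
    .filter((line) => line !== "");
  const uniqueRows = new Set(lines).size;
  const rawLines = lines.length;
  // Share of raw lines that add no new content; 0 for an empty file.
  const duplicateShare = rawLines === 0 ? 0 : (rawLines - uniqueRows) / rawLines;
  return { rawLines, uniqueRows, duplicateShare };
}
```

Running this against the pre-dedupe backup and the cleaned fixes.jsonl would show how much of the historical line count was repetition; the cleaned file should report a duplicate share of 0 under this exact-match definition.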
Live sync status
Confirmed on Erik:
- /opt/magatama/.../fixes.jsonl has 1368 rows
- /opt/magatama/.../errors.jsonl has 4 rows
- /opt/magatama/.../corpus-integrity-report.json matches the latest local report
Recommendation
Use the cleaned verified corpus for the next MagatamaLLM full run, and keep errors.jsonl as a separate failure-analysis lane rather than mixing it back into the main SFT corpus.