transceiver-db/sync/history/2026-05-06-magatama-training-corpus-dedupe.md
2026-05-06 10:46:56 +02:00

# MAGATAMA Training Corpus Dedupe
Date: 2026-05-06
Author: Codex
## Summary
After MAGATAMA was fixed so that verified fixes are written to `fixes.jsonl` and failures go to `errors.jsonl`, the historical MAGATAMA training corpus was scrubbed and deduplicated.
## Actions
- Added cleanup helper:
  - `magatama/scripts/scrub_training_corpus.mjs`
- Ran corpus cleanup locally against:
  - `training-data/gitea-learning-pool/magatamallm/fixes.jsonl`
- Kept a backup of the previous large corpus:
  - `training-data/gitea-learning-pool/magatamallm/fixes-pre-dedupe-20260506.jsonl`
- Synced cleaned corpus back to Erik.
- Pulled live `errors.jsonl` back from Erik so laptop and server match.
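
The scrub-and-dedupe step can be pictured with a minimal sketch. This is not the actual logic of `scrub_training_corpus.mjs`, which is not shown here. It assumes each JSONL row is an object with a boolean `verified` field and that two rows are duplicates when their contents match regardless of key order. The function names `canonicalize` and `scrubCorpus` and the `duplicates` counter are hypothetical; the other counter names are borrowed from the integrity report, and their meaning here is an assumption.

```javascript
// Hypothetical sketch, not the real scrub_training_corpus.mjs.
// Assumption: every row is a JSON object with a boolean `verified` field.

// Serialize a parsed JSON value with object keys sorted, so that key order
// does not affect whether two rows count as duplicates.
function canonicalize(value) {
  if (Array.isArray(value)) {
    return "[" + value.map(canonicalize).join(",") + "]";
  }
  if (value !== null && typeof value === "object") {
    const keys = Object.keys(value).sort();
    return (
      "{" +
      keys.map((k) => JSON.stringify(k) + ":" + canonicalize(value[k])).join(",") +
      "}"
    );
  }
  return JSON.stringify(value);
}

// Split JSONL text into unique verified rows and failed rows, with counters.
function scrubCorpus(jsonlText) {
  const seen = new Set();
  const fixes = [];
  const errors = [];
  const report = {
    scanned: 0,
    verified: 0,
    duplicates: 0, // hypothetical counter, not in the original report
    movedToErrors: 0,
    parseErrors: 0,
    invalidVerifiedFlag: 0,
  };

  for (const line of jsonlText.split("\n")) {
    if (line.trim() === "") continue; // ignore blank lines
    report.scanned += 1;

    let row;
    try {
      row = JSON.parse(line);
    } catch {
      report.parseErrors += 1;
      continue;
    }

    // Rows without a boolean `verified` flag are counted and dropped.
    if (row === null || typeof row !== "object" || typeof row.verified !== "boolean") {
      report.invalidVerifiedFlag += 1;
      continue;
    }

    // Unverified rows go to the failure lane instead of the training corpus.
    if (!row.verified) {
      errors.push(row);
      report.movedToErrors += 1;
      continue;
    }

    // Keep only the first occurrence of each distinct verified row.
    const key = canonicalize(row);
    if (seen.has(key)) {
      report.duplicates += 1;
      continue;
    }
    seen.add(key);
    fixes.push(row);
    report.verified += 1;
  }

  return { fixes, errors, report };
}
```

Calling `scrubCorpus` on the text of a corpus file returns the rows to write back to `fixes.jsonl` and `errors.jsonl`, plus the counters. Deduplicating on a key-sorted serialization is one possible choice; a real helper might instead key on a specific field or a content hash.
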
## Final corpus state
- `fixes.jsonl`
  - `1,368` unique verified rows
- `errors.jsonl`
  - `4` failed/escalated rows
- `corpus-integrity-report.json`
  - `scanned: 1368`
  - `verified: 1368`
  - `movedToErrors: 4`
  - `parseErrors: 0`
  - `invalidVerifiedFlag: 0`
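
A report like this can be cross-checked against the files it describes. The sketch below assumes that `verified` should equal the row count of `fixes.jsonl`, that `movedToErrors` should equal the row count of `errors.jsonl`, and that a clean corpus has zero `parseErrors` and `invalidVerifiedFlag`. Those readings of the fields are inferred from the numbers above rather than from a documented schema, and `countRows` and `checkReport` are hypothetical names.

```javascript
// Hypothetical consistency check between an integrity report and the corpus files.

// Count non-empty lines in JSONL text (trim also tolerates \r\n line endings).
function countRows(jsonlText) {
  return jsonlText.split("\n").filter((line) => line.trim() !== "").length;
}

// Return a list of human-readable mismatches; an empty list means consistent.
function checkReport(report, fixesText, errorsText) {
  const problems = [];
  const fixesRows = countRows(fixesText);
  const errorsRows = countRows(errorsText);

  if (report.verified !== fixesRows) {
    problems.push(`verified=${report.verified} but fixes.jsonl has ${fixesRows} rows`);
  }
  if (report.movedToErrors !== errorsRows) {
    problems.push(`movedToErrors=${report.movedToErrors} but errors.jsonl has ${errorsRows} rows`);
  }
  if (report.parseErrors !== 0) {
    problems.push(`parseErrors=${report.parseErrors}, expected 0`);
  }
  if (report.invalidVerifiedFlag !== 0) {
    problems.push(`invalidVerifiedFlag=${report.invalidVerifiedFlag}, expected 0`);
  }
  return problems;
}
```

The same check can be run against the laptop copy and the server copy; if both return an empty list for the same report, the row counts on the two sides agree.
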
## Important interpretation
The earlier corpus looked much larger because it contained many duplicate training rows. The cleaned corpus is smaller, but it reflects the actual amount of unique verified data and is better suited to training.
Do not treat raw historical line count as real training value.
## Live sync status
Confirmed on Erik:
- `/opt/magatama/.../fixes.jsonl` has `1368` rows
- `/opt/magatama/.../errors.jsonl` has `4` rows
- `/opt/magatama/.../corpus-integrity-report.json` matches the latest local report
## Recommendation
Use the cleaned verified corpus for the next MagatamaLLM full run, and keep `errors.jsonl` as a separate failure-analysis lane rather than mixing it back into the main SFT corpus.