transceiver-db/sync/history/2026-05-06-magatama-training-corpus-dedupe.md
2026-05-06 10:46:56 +02:00

MAGATAMA Training Corpus Dedupe

Date: 2026-05-06
Author: Codex

Summary

After MAGATAMA was fixed so that verified fixes are written to fixes.jsonl and failures are written to errors.jsonl, the historical MAGATAMA training corpus was scrubbed and deduplicated.

Actions

  • Added cleanup helper:
    • magatama/scripts/scrub_training_corpus.mjs
  • Ran corpus cleanup locally against:
    • training-data/gitea-learning-pool/magatamallm/fixes.jsonl
  • Kept a backup of the previous large corpus:
    • training-data/gitea-learning-pool/magatamallm/fixes-pre-dedupe-20260506.jsonl
  • Synced cleaned corpus back to Erik.
  • Pulled live errors.jsonl back from Erik so laptop and server match.
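
The scrub-and-dedupe step can be sketched roughly as follows. This is a minimal illustration, not the contents of scrub_training_corpus.mjs: the `verified` field name, the "identical canonical JSON" duplicate rule, and the function names `canonicalize` and `scrubCorpus` are all assumptions made for the example.

```javascript
// Hypothetical sketch; the real scrub_training_corpus.mjs may work differently.
// Assumption: each JSONL row is an object with a boolean `verified` field.
// Assumption: two rows are duplicates when their canonical JSON is identical.

// Serialize a parsed JSON value with object keys sorted, so that key order
// does not affect duplicate detection.
function canonicalize(value) {
  if (Array.isArray(value)) {
    return `[${value.map(canonicalize).join(",")}]`;
  }
  if (value !== null && typeof value === "object") {
    const keys = Object.keys(value).sort();
    const parts = keys.map((k) => `${JSON.stringify(k)}:${canonicalize(value[k])}`);
    return `{${parts.join(",")}}`;
  }
  return JSON.stringify(value);
}

// Split JSONL text into unique verified rows, unique non-verified rows,
// and a count of lines that could not be parsed.
function scrubCorpus(jsonlText) {
  const seen = new Set();
  const fixes = [];
  const errors = [];
  let parseErrors = 0;

  for (const line of jsonlText.split("\n")) {
    const trimmed = line.trim();
    if (trimmed === "") continue; // ignore blank lines

    let row;
    try {
      row = JSON.parse(trimmed);
    } catch {
      parseErrors += 1;
      continue;
    }

    const key = canonicalize(row);
    if (seen.has(key)) continue; // drop exact duplicates
    seen.add(key);

    const isVerified =
      row !== null && typeof row === "object" && row.verified === true;
    if (isVerified) {
      fixes.push(row);
    } else {
      errors.push(row);
    }
  }

  return { fixes, errors, parseErrors };
}
```

Keeping the function pure (text in, arrays out) leaves file reading, the backup copy, and the sync to the server as separate steps that can be checked independently.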

Final corpus state

  • fixes.jsonl
    • 1,368 unique verified rows
  • errors.jsonl
    • 4 failed/escalated rows
  • corpus-integrity-report.json
    • scanned: 1368
    • verified: 1368
    • movedToErrors: 4
    • parseErrors: 0
    • invalidVerifiedFlag: 0
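
One way to read this report programmatically is to treat the corpus as clean when every scanned row is verified and both error counters are zero. The sketch below assumes the report is a flat JSON object with exactly the keys listed above; the function name `checkIntegrityReport` is hypothetical and the check reflects only what the numbers above suggest.

```javascript
// Hypothetical check over a parsed corpus-integrity-report.json object.
// Assumption: the report uses the five integer keys listed in this note.
function checkIntegrityReport(report) {
  const problems = [];
  const keys = [
    "scanned",
    "verified",
    "movedToErrors",
    "parseErrors",
    "invalidVerifiedFlag",
  ];

  // Every counter must be present and a non-negative integer.
  for (const key of keys) {
    if (!Number.isInteger(report[key]) || report[key] < 0) {
      problems.push(`${key} is missing or not a non-negative integer`);
    }
  }
  if (problems.length > 0) return problems;

  // A clean corpus has only verified rows and no parse or flag errors.
  if (report.verified !== report.scanned) {
    problems.push("verified does not equal scanned");
  }
  if (report.parseErrors !== 0) {
    problems.push("parseErrors is not zero");
  }
  if (report.invalidVerifiedFlag !== 0) {
    problems.push("invalidVerifiedFlag is not zero");
  }
  return problems;
}

// Values copied from the report in this note.
const reportFromNote = {
  scanned: 1368,
  verified: 1368,
  movedToErrors: 4,
  parseErrors: 0,
  invalidVerifiedFlag: 0,
};
console.log(checkIntegrityReport(reportFromNote).length === 0 ? "clean" : "problems found");
```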

Important interpretation

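The earlier corpus looked much larger only because the same training rows were repeated many times. The cleaned corpus is smaller, but each of its 1,368 rows is unique and verified, which makes it a more honest measure of the available training data and a better basis for training.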

Do not treat raw historical line count as real training value.

Live sync status

Confirmed on Erik:

  • /opt/magatama/.../fixes.jsonl has 1368 rows
  • /opt/magatama/.../errors.jsonl has 4 rows
  • /opt/magatama/.../corpus-integrity-report.json matches the latest local report
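
A confirmation like this can be reduced to two small checks: counting the non-empty lines of each JSONL file, and comparing the local and remote reports key by key. The helper names below are hypothetical, and the sketch assumes the file contents have already been read into strings and the reports are flat JSON objects with primitive values.

```javascript
// Hypothetical helpers for a sync check; reading the files is left out.

// The row count of a JSONL file is its number of non-empty lines.
function countJsonlRows(text) {
  return text.split("\n").filter((line) => line.trim() !== "").length;
}

// Two flat reports match when they have the same keys with the same values.
function reportsMatch(local, remote) {
  const keys = new Set([...Object.keys(local), ...Object.keys(remote)]);
  for (const key of keys) {
    if (local[key] !== remote[key]) return false;
  }
  return true;
}
```

For example, `countJsonlRows` applied to the server copy of fixes.jsonl should return the same number as the local copy before the sync is considered confirmed.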

Recommendation

Use the cleaned verified corpus for the next MagatamaLLM full run, and keep errors.jsonl as a separate failure-analysis lane rather than mixing it back into the main SFT corpus.