From ce37d4155a7132035a4aada8002be50c8804b8ae Mon Sep 17 00:00:00 2001
From: Rene Fichtmueller
Date: Wed, 6 May 2026 10:46:56 +0200
Subject: [PATCH] sync: record magatama training corpus dedupe

---
 sync/CURRENT.md                               | 20 ++++++++
 ...6-05-06-magatama-training-corpus-dedupe.md | 50 +++++++++++++++++++
 2 files changed, 70 insertions(+)
 create mode 100644 sync/history/2026-05-06-magatama-training-corpus-dedupe.md

diff --git a/sync/CURRENT.md b/sync/CURRENT.md
index f82ec32..c3789e1 100644
--- a/sync/CURRENT.md
+++ b/sync/CURRENT.md
@@ -85,6 +85,21 @@ When work touches TIP, Magatama, LLM Gateway, bridges, auth, or shared Erik infr
   - two explicit Codex-written training entries were appended to the MAGATAMA Gitea-backed fixes corpus:
     - atlas coverage scope hardening
     - training path integrity fix
+  - corpus cleanup + dedupe was executed afterward:
+    - pre-dedupe backup kept locally as:
+      - `magatama/training-data/gitea-learning-pool/magatamallm/fixes-pre-dedupe-20260506.jsonl`
+    - resulting verified corpus:
+      - `fixes.jsonl = 1,368` unique verified training rows
+    - resulting failure corpus:
+      - `errors.jsonl = 4` tracked failed/escalated rows
+    - integrity report now exists at:
+      - `magatama/training-data/gitea-learning-pool/magatamallm/corpus-integrity-report.json`
+    - latest integrity totals:
+      - `scanned: 1368`
+      - `verified: 1368`
+      - `movedToErrors: 4`
+      - `parseErrors: 0`
+      - `invalidVerifiedFlag: 0`
 - Complete Codex chat sync was added:
   - `sync/history/2026-04-29-codex-complete-chat-sync.md`
   - captures Ghost/blog updates, LinkedIn voice preferences, LPO/AI-fabric blog edits, Rest-Is-Not-Laziness scheduling replacement, and security notes.
@@ -146,6 +161,11 @@ When work touches TIP, Magatama, LLM Gateway, bridges, auth, or shared Erik infr
 - Meaning:
   - the old `0` was incorrect.
   - the currently visible trainable MAGATAMA corpus is based on verified and deduplicated examples only.
+- Latest corpus integrity state after cleanup:
+  - operational Gitea-backed MAGATAMA training corpus is now much smaller but cleaner:
+    - `1368` unique verified rows
+    - `4` live failure/escalation rows in `errors.jsonl`
+  - do not confuse raw historical volume with real trainable signal.
 - Important training integrity rule:
   - report-only or failed/escalated records must not be treated as verified training fixes.
   - keep them separated from the main verified training corpus.
diff --git a/sync/history/2026-05-06-magatama-training-corpus-dedupe.md b/sync/history/2026-05-06-magatama-training-corpus-dedupe.md
new file mode 100644
index 0000000..d863e2b
--- /dev/null
+++ b/sync/history/2026-05-06-magatama-training-corpus-dedupe.md
@@ -0,0 +1,50 @@
+# MAGATAMA Training Corpus Dedupe
+
+Date: 2026-05-06
+Author: Codex
+
+## Summary
+
+After fixing MAGATAMA so verified fixes write to `fixes.jsonl` and failures go to `errors.jsonl`, the historical MAGATAMA training corpus was scrubbed and deduplicated.
+
+## Actions
+
+- Added cleanup helper:
+  - `magatama/scripts/scrub_training_corpus.mjs`
+- Ran corpus cleanup locally against:
+  - `training-data/gitea-learning-pool/magatamallm/fixes.jsonl`
+- Kept a backup of the previous large corpus:
+  - `training-data/gitea-learning-pool/magatamallm/fixes-pre-dedupe-20260506.jsonl`
+- Synced cleaned corpus back to Erik.
+- Pulled live `errors.jsonl` back from Erik so laptop and server match.
+
+## Final corpus state
+
+- `fixes.jsonl`
+  - `1,368` unique verified rows
+- `errors.jsonl`
+  - `4` failed/escalated rows
+- `corpus-integrity-report.json`
+  - `scanned: 1368`
+  - `verified: 1368`
+  - `movedToErrors: 4`
+  - `parseErrors: 0`
+  - `invalidVerifiedFlag: 0`
+
+## Important interpretation
+
+The earlier corpus looked much larger because of repeated duplicate training rows. The cleaned corpus is smaller but substantially more honest and trainable.
+
+Do not treat raw historical line count as real training value.
+
+## Live sync status
+
+Confirmed on Erik:
+
+- `/opt/magatama/.../fixes.jsonl` has `1368` rows
+- `/opt/magatama/.../errors.jsonl` has `4` rows
+- `/opt/magatama/.../corpus-integrity-report.json` matches the latest local report
+
+## Recommendation
+
+Use the cleaned verified corpus for the next MagatamaLLM full run, and keep `errors.jsonl` as a separate failure-analysis lane rather than mixing it back into the main SFT corpus.
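
The dedupe and integrity counting described in the sync note could be sketched roughly as below. This is a minimal illustration under stated assumptions, not the contents of `magatama/scripts/scrub_training_corpus.mjs`, which this patch does not include. The row schema (a boolean `verified` field), the exact-match dedupe key, the function name `scrubCorpus`, and the `duplicatesDropped` counter are all hypothetical. Only the other report field names are taken from the note, and the real script may count them differently.

```javascript
// Hypothetical sketch of a JSONL corpus scrub; schema and counting rules are assumptions.
function scrubCorpus(jsonlText) {
  const report = {
    scanned: 0,
    verified: 0,
    movedToErrors: 0,
    parseErrors: 0,
    invalidVerifiedFlag: 0,
    duplicatesDropped: 0, // extra counter, not part of the report in the note
  };
  const seen = new Set();
  const fixes = [];  // would be written to fixes.jsonl
  const errors = []; // would be written to errors.jsonl

  for (const line of jsonlText.split("\n")) {
    if (line.trim() === "") continue; // skip blank lines
    report.scanned += 1;

    let row;
    try {
      row = JSON.parse(line);
    } catch {
      report.parseErrors += 1;
      continue;
    }

    // Assumed schema: every row carries a boolean `verified` flag.
    if (row === null || typeof row !== "object" || typeof row.verified !== "boolean") {
      report.invalidVerifiedFlag += 1;
      continue;
    }

    // Failed or escalated rows stay out of the verified corpus.
    if (!row.verified) {
      errors.push(row);
      report.movedToErrors += 1;
      continue;
    }

    // Exact-match dedupe on the re-serialized row (sensitive to key order).
    const key = JSON.stringify(row);
    if (seen.has(key)) {
      report.duplicatesDropped += 1;
      continue;
    }
    seen.add(key);
    fixes.push(row);
    report.verified += 1;
  }

  return { fixes, errors, report };
}
```

A caller would read the existing `fixes.jsonl` into a string, pass it to `scrubCorpus`, and write `fixes` and `errors` back out one JSON object per line. Keeping a pre-dedupe backup first, as the note does, keeps the operation reversible.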