sync: record magatama training corpus dedupe

Rene Fichtmueller 2026-05-06 10:46:56 +02:00
parent bb75a5526b
commit ce37d4155a
2 changed files with 70 additions and 0 deletions

@@ -85,6 +85,21 @@ When work touches TIP, Magatama, LLM Gateway, bridges, auth, or shared Erik infr
- two explicit Codex-written training entries were appended to the MAGATAMA Gitea-backed fixes corpus:
- atlas coverage scope hardening
- training path integrity fix
- corpus cleanup + dedupe was executed afterward:
- pre-dedupe backup kept locally as:
- `magatama/training-data/gitea-learning-pool/magatamallm/fixes-pre-dedupe-20260506.jsonl`
- resulting verified corpus:
- `fixes.jsonl = 1,368` unique verified training rows
- resulting failure corpus:
- `errors.jsonl = 4` tracked failed/escalated rows
- integrity report now exists at:
- `magatama/training-data/gitea-learning-pool/magatamallm/corpus-integrity-report.json`
- latest integrity totals:
- `scanned: 1368`
- `verified: 1368`
- `movedToErrors: 4`
- `parseErrors: 0`
- `invalidVerifiedFlag: 0`
- Complete Codex chat sync was added:
- `sync/history/2026-04-29-codex-complete-chat-sync.md`
- captures Ghost/blog updates, LinkedIn voice preferences, LPO/AI-fabric blog edits, Rest-Is-Not-Laziness scheduling replacement, and security notes.
@@ -146,6 +161,11 @@ When work touches TIP, Magatama, LLM Gateway, bridges, auth, or shared Erik infr
- Meaning:
- the old `0` was incorrect.
- the currently visible trainable MAGATAMA corpus is based on verified and deduplicated examples only.
- Latest corpus integrity state after cleanup:
- operational Gitea-backed MAGATAMA training corpus is now much smaller but cleaner:
- `1368` unique verified rows
- `4` live failure/escalation rows in `errors.jsonl`
- do not confuse raw historical volume with real trainable signal.
- Important training integrity rule:
- report-only or failed/escalated records must not be treated as verified training fixes.
- keep them separated from the main verified training corpus.

@@ -0,0 +1,50 @@
# MAGATAMA Training Corpus Dedupe
Date: 2026-05-06
Author: Codex
## Summary
After MAGATAMA was fixed so that verified fixes are written to `fixes.jsonl` and failures go to `errors.jsonl`, the historical MAGATAMA training corpus was scrubbed and deduplicated.
## Actions
- Added cleanup helper:
- `magatama/scripts/scrub_training_corpus.mjs`
- Ran corpus cleanup locally against:
- `training-data/gitea-learning-pool/magatamallm/fixes.jsonl`
- Kept a backup of the previous large corpus:
- `training-data/gitea-learning-pool/magatamallm/fixes-pre-dedupe-20260506.jsonl`
- Synced cleaned corpus back to Erik.
- Pulled live `errors.jsonl` back from Erik so laptop and server match.
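The scrub step listed above can be sketched roughly as follows. This is a minimal Python illustration, not the contents of `scrub_training_corpus.mjs`. The boolean `verified` field, the choice of the whole canonicalized row as the duplicate key, and the name `scrub_rows` are all assumptions made for the example.

```python
import hashlib
import json


def scrub_rows(lines):
    """Split raw JSONL lines into unique verified rows and failure rows.

    Assumptions (not taken from the real script): each row is a JSON
    object with a boolean `verified` field, and two rows are duplicates
    when their canonical JSON (sorted keys) is identical.
    """
    seen = set()
    fixes, errors, parse_errors = [], [], 0
    for line in lines:
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        try:
            row = json.loads(line)
        except json.JSONDecodeError:
            parse_errors += 1
            continue
        # Sort keys so that key order alone does not defeat the dedupe.
        canonical = json.dumps(row, sort_keys=True, ensure_ascii=False)
        key = hashlib.sha256(canonical.encode("utf-8")).hexdigest()
        if key in seen:
            continue  # exact duplicate of an earlier row
        seen.add(key)
        if isinstance(row, dict) and row.get("verified") is True:
            fixes.append(row)
        else:
            # Failed, escalated, or unflagged rows stay out of the fixes lane.
            errors.append(row)
    return fixes, errors, parse_errors
```

A stricter or looser duplicate key (for example, only the prompt and fix fields) would change the resulting counts, so the key should be chosen deliberately.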
## Final corpus state
- `fixes.jsonl`
- `1,368` unique verified rows
- `errors.jsonl`
- `4` failed/escalated rows
- `corpus-integrity-report.json`
- `scanned: 1368`
- `verified: 1368`
- `movedToErrors: 4`
- `parseErrors: 0`
- `invalidVerifiedFlag: 0`
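As a sanity check on the numbers above, a report like this can be compared against the actual row count of `fixes.jsonl`. The sketch below uses the field names from the report, but the rules it enforces (that `verified` equals the `fixes.jsonl` row count, and that `parseErrors` and `invalidVerifiedFlag` must be zero) are assumptions about what a clean state means, not documented behavior of the cleanup helper. `check_report` is a hypothetical name.

```python
def check_report(report, fixes_row_count):
    """Return a list of problems found in an integrity report (empty if none)."""
    problems = []
    # Assumed rule: every row left in fixes.jsonl is counted as verified.
    if report.get("verified") != fixes_row_count:
        problems.append("verified count does not match fixes.jsonl row count")
    if report.get("verified", 0) > report.get("scanned", 0):
        problems.append("verified exceeds scanned")
    # Assumed rule: a clean corpus has no parse errors or bad flags.
    for field in ("parseErrors", "invalidVerifiedFlag"):
        if report.get(field) != 0:
            problems.append(f"{field} is not zero")
    return problems


# Values copied from the report totals listed above.
report = {
    "scanned": 1368,
    "verified": 1368,
    "movedToErrors": 4,
    "parseErrors": 0,
    "invalidVerifiedFlag": 0,
}
print(check_report(report, 1368))  # prints []
```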
## Important interpretation
The earlier corpus looked much larger because it contained many duplicate training rows. The cleaned corpus is smaller, but every remaining row is unique and verified, so its size reflects real trainable signal.
Do not treat raw historical line count as real training value.
## Live sync status
Confirmed on Erik:
- `/opt/magatama/.../fixes.jsonl` has `1368` rows
- `/opt/magatama/.../errors.jsonl` has `4` rows
- `/opt/magatama/.../corpus-integrity-report.json` matches the latest local report
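One way to confirm that laptop and server copies match is to compare row counts and content digests. The following is a hedged sketch that works on raw file bytes, so it makes no assumption about how the server copy is fetched. The helper names `summarize` and `in_sync` are hypothetical.

```python
import hashlib


def summarize(data: bytes):
    """Return (non-blank row count, SHA-256 hex digest) for a JSONL file's bytes."""
    rows = sum(1 for line in data.splitlines() if line.strip())
    return rows, hashlib.sha256(data).hexdigest()


def in_sync(local: bytes, remote: bytes) -> bool:
    """True when both copies have the same row count and identical content."""
    return summarize(local) == summarize(remote)
```

In practice the local bytes could come from `pathlib.Path(...).read_bytes()` and the remote bytes from whatever transport is already used to reach Erik. The digest comparison is byte-exact, so a trailing-newline difference would be reported as a mismatch even when the row counts agree.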
## Recommendation
Use the cleaned verified corpus for the next MagatamaLLM full run, and keep `errors.jsonl` as a separate failure-analysis lane rather than mixing it back into the main SFT corpus.