sync: record magatama training corpus dedupe
parent bb75a5526b
commit ce37d4155a
@@ -85,6 +85,21 @@ When work touches TIP, Magatama, LLM Gateway, bridges, auth, or shared Erik infr
   - two explicit Codex-written training entries were appended to the MAGATAMA Gitea-backed fixes corpus:
     - atlas coverage scope hardening
     - training path integrity fix
+  - corpus cleanup + dedupe was executed afterward:
+    - pre-dedupe backup kept locally as:
+      - `magatama/training-data/gitea-learning-pool/magatamallm/fixes-pre-dedupe-20260506.jsonl`
+    - resulting verified corpus:
+      - `fixes.jsonl = 1,368` unique verified training rows
+    - resulting failure corpus:
+      - `errors.jsonl = 4` tracked failed/escalated rows
+    - integrity report now exists at:
+      - `magatama/training-data/gitea-learning-pool/magatamallm/corpus-integrity-report.json`
+    - latest integrity totals:
+      - `scanned: 1368`
+      - `verified: 1368`
+      - `movedToErrors: 4`
+      - `parseErrors: 0`
+      - `invalidVerifiedFlag: 0`
 - Complete Codex chat sync was added:
   - `sync/history/2026-04-29-codex-complete-chat-sync.md`
   - captures Ghost/blog updates, LinkedIn voice preferences, LPO/AI-fabric blog edits, Rest-Is-Not-Laziness scheduling replacement, and security notes.
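
Reviewer note: the cleanup + dedupe step recorded in this hunk could look roughly like the sketch below. It is an illustration under stated assumptions, not the contents of `magatama/scripts/scrub_training_corpus.mjs`. The row fields (`id`, `fix`), the function name `dedupe_jsonl_lines`, and the counter names (`scanned`, `kept`, `duplicates`, `parse_errors`) are hypothetical and are not the fields of `corpus-integrity-report.json`. Rows count as duplicates when their canonical JSON form is identical.

```python
import hashlib
import json


def dedupe_jsonl_lines(lines):
    """Return (unique_rows, stats) for an iterable of JSONL text lines."""
    seen = set()
    unique_rows = []
    # Hypothetical counters, chosen for this sketch only.
    stats = {"scanned": 0, "kept": 0, "duplicates": 0, "parse_errors": 0}
    for line in lines:
        line = line.strip()
        if not line:
            continue  # blank lines are not rows
        stats["scanned"] += 1
        try:
            row = json.loads(line)
        except json.JSONDecodeError:
            stats["parse_errors"] += 1
            continue
        # Canonical form: sorted keys and fixed separators, so key order
        # alone does not make two rows look different.
        canonical = json.dumps(row, sort_keys=True, separators=(",", ":"))
        digest = hashlib.sha256(canonical.encode("utf-8")).hexdigest()
        if digest in seen:
            stats["duplicates"] += 1
            continue
        seen.add(digest)
        unique_rows.append(row)
        stats["kept"] += 1
    return unique_rows, stats


# Hypothetical sample rows; the real corpus schema is not shown in this commit.
sample = [
    '{"id": 1, "fix": "a"}',
    '{"fix": "a", "id": 1}',  # same row, different key order
    '{"id": 2, "fix": "b"}',
    'not json',
]
rows, stats = dedupe_jsonl_lines(sample)
print(stats)
```

Hashing the canonical form keeps memory use proportional to the number of unique rows rather than to their size, which matters when the pre-dedupe file is much larger than the cleaned one.
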
@@ -146,6 +161,11 @@ When work touches TIP, Magatama, LLM Gateway, bridges, auth, or shared Erik infr
 - Meaning:
   - the old `0` was incorrect.
   - the currently visible trainable MAGATAMA corpus is based on verified and deduplicated examples only.
+- Latest corpus integrity state after cleanup:
+  - operational Gitea-backed MAGATAMA training corpus is now much smaller but cleaner:
+    - `1368` unique verified rows
+    - `4` live failure/escalation rows in `errors.jsonl`
+  - do not confuse raw historical volume with real trainable signal.
 - Important training integrity rule:
   - report-only or failed/escalated records must not be treated as verified training fixes.
   - keep them separated from the main verified training corpus.
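
Reviewer note: the training integrity rule in this hunk can be expressed as a small routing function. The sketch below assumes each record carries a boolean `verified` field and a string `status` field; both names, the status values, and the three lane names are hypothetical, since the real record schema does not appear in this commit.

```python
# Hypothetical status values that mark a record as a failure.
FAILURE_STATUSES = {"failed", "escalated"}


def route_record(record):
    """Return the lane a record belongs to: 'fixes', 'errors', or 'skip'."""
    status = record.get("status")
    if status in FAILURE_STATUSES:
        return "errors"  # failures never enter the verified corpus
    if status == "report-only":
        return "skip"  # report-only records are not training fixes
    # Require a literal boolean True; values such as "true" or 1 do not count.
    if record.get("verified") is True:
        return "fixes"
    return "skip"


print(route_record({"verified": True, "status": "fixed"}))
print(route_record({"verified": True, "status": "escalated"}))
```

Checking the failure status before the verified flag means a record that is both flagged verified and marked failed still lands in the failure lane, which matches the rule that failed/escalated records must stay out of the main corpus.
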
sync/history/2026-05-06-magatama-training-corpus-dedupe.md (normal file, 50 lines added)
@@ -0,0 +1,50 @@
+# MAGATAMA Training Corpus Dedupe
+
+Date: 2026-05-06
+Author: Codex
+
+## Summary
+
+After fixing MAGATAMA so verified fixes write to `fixes.jsonl` and failures go to `errors.jsonl`, the historical MAGATAMA training corpus was scrubbed and deduplicated.
+
+## Actions
+
+- Added cleanup helper:
+  - `magatama/scripts/scrub_training_corpus.mjs`
+- Ran corpus cleanup locally against:
+  - `training-data/gitea-learning-pool/magatamallm/fixes.jsonl`
+- Kept a backup of the previous large corpus:
+  - `training-data/gitea-learning-pool/magatamallm/fixes-pre-dedupe-20260506.jsonl`
+- Synced cleaned corpus back to Erik.
+- Pulled live `errors.jsonl` back from Erik so laptop and server match.
+
+## Final corpus state
+
+- `fixes.jsonl`
+  - `1,368` unique verified rows
+- `errors.jsonl`
+  - `4` failed/escalated rows
+- `corpus-integrity-report.json`
+  - `scanned: 1368`
+  - `verified: 1368`
+  - `movedToErrors: 4`
+  - `parseErrors: 0`
+  - `invalidVerifiedFlag: 0`
+
+## Important interpretation
+
+The earlier corpus looked much larger because of repeated duplicate training rows. The cleaned corpus is smaller but substantially more honest and trainable.
+
+Do not treat raw historical line count as real training value.
+
+## Live sync status
+
+Confirmed on Erik:
+
+- `/opt/magatama/.../fixes.jsonl` has `1368` rows
+- `/opt/magatama/.../errors.jsonl` has `4` rows
+- `/opt/magatama/.../corpus-integrity-report.json` matches the latest local report
+
+## Recommendation
+
+Use the cleaned verified corpus for the next MagatamaLLM full run, and keep `errors.jsonl` as a separate failure-analysis lane rather than mixing it back into the main SFT corpus.
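
Reviewer note: the live sync status in this file (row counts on Erik matching the laptop) could be checked with a fingerprint comparison along the lines of the sketch below. It assumes both copies are readable as local paths, for example after pulling the server file down. The function name `jsonl_fingerprint` and the temporary file names are hypothetical, and the two small files stand in for the real corpus.

```python
import hashlib
import tempfile
from pathlib import Path


def jsonl_fingerprint(path):
    """Return (non-empty line count, SHA-256 hex digest) for a file."""
    data = Path(path).read_bytes()
    rows = sum(1 for line in data.splitlines() if line.strip())
    return rows, hashlib.sha256(data).hexdigest()


with tempfile.TemporaryDirectory() as tmp:
    # Stand-ins for the laptop copy and the copy pulled from the server.
    local = Path(tmp) / "local-fixes.jsonl"
    remote = Path(tmp) / "remote-fixes.jsonl"
    local.write_text('{"id": 1}\n{"id": 2}\n', encoding="utf-8")
    remote.write_text('{"id": 1}\n{"id": 2}\n', encoding="utf-8")

    local_fp = jsonl_fingerprint(local)
    remote_fp = jsonl_fingerprint(remote)
    # Equal row counts alone can hide differing content; the digest catches that.
    in_sync = local_fp == remote_fp
    print(local_fp[0], in_sync)
```

Comparing the digest as well as the row count guards against the case where both sides hold the same number of rows but different content.
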