sync: record tip lane detangling and disk-safe refresh

parent e6f98c89bd
commit d01039734a

@@ -1,6 +1,6 @@

# Current TIP Sync State

Updated: 2026-05-06 20:52 UTC (previously 2026-05-06 15:48 UTC)

## Active Policy

@@ -27,6 +27,56 @@ When work touches TIP, Magatama, LLM Gateway, bridges, auth, or shared Erik infr

## Latest Work

- TIP/Blog lane separation was materially corrected on 2026-05-06:
  - root cause:
    - `TIP_LLM` was still ingesting blog-/writer-shaped rows from the canonical lane pool and shared transceiver corpora.
    - local inspection showed the old TIP export had `6250` train rows, of which `6087` still matched blog/writer patterns.
  - dataset builder and Gitea sync were hardened:
    - `scripts/runpod_dataset_builder.ts`
      - added a strict `tipDatasetAllowed(...)` gate
      - `TIP_LLM` now rejects blog-shaped source rows at dataset-build time
      - `TIP_LLM` now rejects blog-like `system` and `user` prompts and markdown-article `assistant` outputs
      - the registry fallback for `TIP_LLM` now only uses lane-compatible datasets
    - `scripts/sync_gitea_training_pool.ts`
      - the canonical TIP pool refresh now uses the stricter lane-alignment rules
      - redundant `merged.jsonl` copies for `fo_blogllm` and `tip_llm` are no longer rewritten, which avoids exhausting local disk with duplicate lane artifacts
  - local disk issue encountered and fixed:
    - a full refresh failed with `ENOSPC` while writing `training-data/gitea-learning-pool/tip_llm/merged.jsonl`
    - the redundant lane `merged` artifacts for `fo_blogllm` and `tip_llm` were truncated, and the sync script was changed to stop recreating them
    - free disk space recovered from `377Mi` to `17Gi`
  - locally verified after rebuild:
    - `TIP_LLM` RunPod export:
      - `train = 233`
      - `eval = 26`
      - `total = 259`
      - `blog/writer matches = 0`
    - the first TIP rows now use the correct TIP system prompt:
      - `You are TIP_LLM, a research and market-intelligence analyst for transceivers, switches, and vendor ecosystems...`
  - the corrected artifacts and scripts were synced to Erik, and `pnpm training:refresh-all` was rerun there.
  - live verified on Erik/public API:
    - `magatamallm`
      - `datasetSource = url`
      - `collectedExamples = 15679`
      - `evalExamples = 1743`
      - `totalExamples = 17422`
      - `newSinceLastTraining = 15679`
    - `fo_blogllm`
      - `datasetSource = url`
      - `collectedExamples = 17322`
      - `evalExamples = 1926`
      - `totalExamples = 19254`
      - `neverTrained = true`
    - `tip_llm`
      - `datasetSource = url`
      - `collectedExamples = 231`
      - `evalExamples = 26`
      - `totalExamples = 257`
      - `neverTrained = true`
  - operational conclusion:
    - Erik now serves genuinely lane-specific datasets.
    - `TIP_LLM` is no longer silently borrowing the FO_Blog behavior lane.
    - the remaining hard problem is now RunPod artifact adoption/validation, not lane contamination.

- MAGATAMA frontend/runtime consistency was repaired again on 2026-05-06:
  - dashboard and core were rebuilt locally and redeployed to Erik.
  - live processes restarted successfully:

@@ -407,6 +457,13 @@ Confirmed on `2026-05-06`:

- active alias switch
- smoke-test proof
has not yet been re-verified after the new adoption pipeline was wired in.
- Latest live proof run on `2026-05-06`:
  - job id: `2112a7ab-68c2-4411-a44f-6edb7ad377df-e1`
  - materialized correctly
  - reached `IN_PROGRESS`
  - then `COMPLETED`
  - but RunPod `status/{job}` returned no `output` object, no model artifact reference, and no Hugging Face repo result
  - current MAGATAMA handling now correctly classifies this as `completed_without_model_artifact`, not as success
- `tip_llm-v1` is still not installed locally in Ollama.
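
The classification in the proof-run bullets can be sketched in TypeScript, the language of the repo's scripts. This is a minimal sketch, not MAGATAMA's actual handler: it assumes the status payload carries a `status` string and an optional `output`, and the `modelArtifact`/`hfRepo` keys, the `classifyRunPodJob` name, and every outcome label other than `completed_without_model_artifact` are hypothetical.

```typescript
// Minimal view of a RunPod serverless job-status payload. `id`, `status` and
// `output` are top-level fields of that payload; the keys inside `output`
// are hypothetical, since they depend on what the training handler returns.
interface RunPodJobStatus {
  id: string;
  status: string; // e.g. "IN_QUEUE", "IN_PROGRESS", "COMPLETED", "FAILED"
  output?: { modelArtifact?: string; hfRepo?: string } | null;
}

// Only `completed_without_model_artifact` comes from this note; the other labels are illustrative.
type JobOutcome =
  | "pending"
  | "failed"
  | "completed_with_model_artifact"
  | "completed_without_model_artifact";

function classifyRunPodJob(job: RunPodJobStatus): JobOutcome {
  if (job.status === "IN_QUEUE" || job.status === "IN_PROGRESS") return "pending";
  if (job.status !== "COMPLETED") return "failed"; // FAILED, CANCELLED, TIMED_OUT, ...
  const out = job.output;
  // COMPLETED only counts as success when the output names an artifact.
  const hasArtifact = !!out && (!!out.modelArtifact || !!out.hfRepo);
  return hasArtifact ? "completed_with_model_artifact" : "completed_without_model_artifact";
}

// Usage: the live proof run reported COMPLETED with no `output` object.
console.log(
  classifyRunPodJob({ id: "2112a7ab-68c2-4411-a44f-6edb7ad377df-e1", status: "COMPLETED" })
); // completed_without_model_artifact
```

Treating any status other than the two in-flight ones and `COMPLETED` as a failure keeps the classifier conservative: an unexpected status never gets promoted to success.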

### Pulso AI Recommendation

@@ -0,0 +1,136 @@

# TIP Lane Detangling And Disk-Safe Refresh

Date: 2026-05-06 UTC

## Summary

`TIP_LLM` was still contaminated by blog/writer behavior even though lane-specific counts were already separated in MAGATAMA. The problem was not limited to UI-level status; it reached the actual lane corpus feeding the RunPod export.

The lane was rebuilt and revalidated locally, then synced to Erik and refreshed there. As a result, `TIP_LLM` now uses a much smaller but correctly aligned research/network corpus instead of silently inheriting FO_Blog-like behavior.

## Root Cause

- The canonical `training-data/gitea-learning-pool/tip_llm/*.jsonl` pool still contained many blog-shaped rows from shared transceiver corpora.
- The old TIP export sampled thousands of rows whose prompts/messages still looked like:
  - `You are an expert technical writer...`
  - publication-ready/blog instructions
- A direct local check on the pre-fix TIP export showed:
  - `6250` train rows
  - `6087` of those rows matched blog/writer patterns (about 97%)
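
The pre-fix check (`6087` of `6250` rows matching) amounts to scanning each exported row for writer-style markers. A minimal TypeScript sketch of such a count follows. It assumes each JSONL line is a chat record with a `messages` array of `{ role, content }` objects; the pattern list and the function names are illustrative, not the check that produced the numbers above.

```typescript
// Hypothetical row shape: one chat-style training example per JSONL line.
interface ChatMessage {
  role: string;
  content: string;
}

interface ChatRow {
  messages: ChatMessage[];
}

// Writer/blog markers taken from the phrases quoted in this note (assumed list).
const BLOG_WRITER_PATTERNS: RegExp[] = [
  /you are an expert technical writer/i,
  /publication-ready/i,
  /blog writer/i,
  /technical writer specializing/i,
];

// True if any message in the row contains one of the writer/blog markers.
function rowLooksBlogShaped(row: ChatRow): boolean {
  return row.messages.some((m) =>
    BLOG_WRITER_PATTERNS.some((re) => re.test(m.content))
  );
}

// Counts non-empty JSONL rows and how many of them look blog-shaped.
function countBlogShapedRows(jsonl: string): { total: number; matches: number } {
  let total = 0;
  let matches = 0;
  jsonl.split("\n").forEach((line) => {
    if (line.trim() === "") return;
    const row = JSON.parse(line) as ChatRow;
    total += 1;
    if (rowLooksBlogShaped(row)) matches += 1;
  });
  return { total, matches };
}

// Usage on a tiny inline sample: one writer-style row, one TIP-style row.
const sampleJsonl = [
  JSON.stringify({ messages: [{ role: "system", content: "You are an expert technical writer..." }] }),
  JSON.stringify({ messages: [{ role: "system", content: "You are TIP_LLM, a research and market-intelligence analyst..." }] }),
].join("\n");

const sampleCounts = countBlogShapedRows(sampleJsonl);
console.log(`${sampleCounts.matches} of ${sampleCounts.total} rows match blog/writer patterns`);
```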

## Changes Applied

### `scripts/runpod_dataset_builder.ts`

- Added a stricter `tipDatasetAllowed(...)` gate.
- Tightened `laneRecordIsCompatible(...)` for `tip_llm`.
- Tightened `lanePoolMessagesAlign(...)` for `tip_llm`:
  - reject:
    - `blog writer`
    - `publication-ready`
    - `technical writer specializing`
    - article-outline/founder/blog prompts
    - markdown-article assistant outputs
- TIP registry fallback now only considers lane-compatible datasets.
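
As a rough illustration of the per-role rule above (prompts screened against a reject list, assistant outputs screened for article shape), here is a hypothetical `tipMessagesAlign` helper. The real `lanePoolMessagesAlign(...)` internals are not shown in this note. The `founder`/`blog`/`article outline` regexes and the "H1 plus at least two H2 sections" article heuristic are assumptions.

```typescript
// Hypothetical message shape for a single training row.
interface LaneMessage {
  role: "system" | "user" | "assistant";
  content: string;
}

// Assumed reject patterns for system/user prompts, modeled on the list above.
const TIP_PROMPT_REJECTS: RegExp[] = [
  /blog writer/i,
  /publication-ready/i,
  /technical writer specializing/i,
  /article[- ]outline/i,
  /\bfounder\b/i,
  /\bblog\b/i,
];

// Heuristic (assumption): an assistant reply reads like a markdown article
// when it opens with an H1 heading and contains at least two H2 sections.
function looksLikeMarkdownArticle(text: string): boolean {
  const trimmed = text.replace(/^\s+/, "");
  const opensWithH1 = /^# \S/.test(trimmed);
  const h2Count = (trimmed.match(/^## \S/gm) || []).length;
  return opensWithH1 && h2Count >= 2;
}

// Hypothetical stand-in for the tip_llm branch of lanePoolMessagesAlign(...):
// every prompt must avoid the reject patterns and no assistant reply may be an article.
function tipMessagesAlign(messages: LaneMessage[]): boolean {
  return messages.every((m) =>
    m.role === "assistant"
      ? !looksLikeMarkdownArticle(m.content)
      : !TIP_PROMPT_REJECTS.some((re) => re.test(m.content))
  );
}

const researchRow: LaneMessage[] = [
  { role: "system", content: "You are TIP_LLM, a research and market-intelligence analyst..." },
  { role: "user", content: "Summarize 800G transceiver vendor positioning." },
  { role: "assistant", content: "Three vendors lead shipments; details follow." },
];

const articleRow: LaneMessage[] = [
  { role: "system", content: "You are TIP_LLM, a research and market-intelligence analyst..." },
  { role: "user", content: "Explain 800G optics." },
  { role: "assistant", content: "# 800G Optics\n\n## Why it matters\n...\n\n## Outlook\n..." },
];

console.log(tipMessagesAlign(researchRow), tipMessagesAlign(articleRow)); // true false
```

A broad pattern such as `/\bblog\b/i` trades recall for precision. It suits a lane that should never discuss blog writing, but it would need narrowing if TIP research prompts legitimately mention blogs.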

### `scripts/sync_gitea_training_pool.ts`

- Applied the same stricter TIP lane-alignment logic.
- Stopped rewriting redundant `merged.jsonl` copies for:
  - `fo_blogllm`
  - `tip_llm`
- This was necessary because the duplicated merged artifacts caused local disk exhaustion during refresh.
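
The sync-script change comes down to never planning a `merged.jsonl` write for the two lanes, rather than writing it and cleaning up afterwards. A minimal sketch follows. `planLaneWrites`, the `REDUNDANT_MERGED_LANES` list, and the per-source file names are hypothetical; only the pool path comes from this note.

```typescript
// Assumption: lanes whose merged.jsonl only duplicates their per-source files.
const REDUNDANT_MERGED_LANES: string[] = ["fo_blogllm", "tip_llm"];

// Hypothetical planner for one lane refresh: returns the files to write.
// merged.jsonl is skipped for the lanes above; other lanes are unchanged here.
function planLaneWrites(lane: string, sourceFiles: string[]): string[] {
  const base = `training-data/gitea-learning-pool/${lane}`;
  const paths = sourceFiles.map((file) => `${base}/${file}`);
  if (REDUNDANT_MERGED_LANES.indexOf(lane) === -1) {
    paths.push(`${base}/merged.jsonl`);
  }
  return paths;
}

// Usage with a hypothetical per-source file name.
console.log(planLaneWrites("tip_llm", ["transceivers.jsonl"]));
console.log(planLaneWrites("magatamallm", ["transceivers.jsonl"]));
```

Skipping the write up front means the duplicate never reaches the disk, which is what mattered during the `ENOSPC` failure.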

## Disk Incident

During the first rebuild after the lane hardening, the refresh failed with:

- `ENOSPC: no space left on device`

The immediate cause was writing:

- `training-data/gitea-learning-pool/tip_llm/merged.jsonl`

Fix:

- truncated the redundant `merged` artifacts for `fo_blogllm` and `tip_llm`
- changed the sync logic so those duplicates are no longer recreated

Result:

- free disk space recovered from roughly `377Mi` to `17Gi`
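
The recorded fix removed the duplicate writes. A complementary pre-write headroom check could compare the expected write size with free space before a refresh starts; this guard is an assumption here and is not part of the recorded change. The sketch below parses the `Mi`/`Gi` figures used above. In practice the free-byte value would come from the OS (for example `df`). The `1Gi` reserve, the `2Gi` write size, and the function names are made up for illustration.

```typescript
// Binary (IEC) units matching the figures above, e.g. "377Mi" and "17Gi".
const IEC_UNITS: { [suffix: string]: number } = {
  Ki: 1024,
  Mi: 1024 * 1024,
  Gi: 1024 * 1024 * 1024,
  Ti: 1024 * 1024 * 1024 * 1024,
};

// Parses "377Mi"-style sizes into bytes; a bare number is taken as bytes.
function parseIecSize(text: string): number {
  const match = /^(\d+(?:\.\d+)?)(Ki|Mi|Gi|Ti)?$/.exec(text.trim());
  if (!match) throw new Error(`unrecognized size: ${text}`);
  const multiplier = match[2] ? IEC_UNITS[match[2]] : 1;
  return Math.round(parseFloat(match[1]) * multiplier);
}

// Hypothetical pre-write guard: allow a write only if at least `reserveBytes`
// would remain free afterwards, instead of failing partway with ENOSPC.
function hasHeadroom(freeBytes: number, bytesToWrite: number, reserveBytes: number): boolean {
  return freeBytes - bytesToWrite >= reserveBytes;
}

// Usage with the before/after free-space figures and a hypothetical 2Gi write.
const minFreeBytes = parseIecSize("1Gi");
const plannedWriteBytes = parseIecSize("2Gi");
console.log(hasHeadroom(parseIecSize("377Mi"), plannedWriteBytes, minFreeBytes)); // false
console.log(hasHeadroom(parseIecSize("17Gi"), plannedWriteBytes, minFreeBytes)); // true
```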

## Verified Local Result

After rebuild:

- `TIP_LLM`
  - `train = 233`
  - `eval = 26`
  - `total = 259`
  - `blog/writer matches = 0`

First rows now use the intended TIP instruction style:

- `You are TIP_LLM, a research and market-intelligence analyst for transceivers, switches, and vendor ecosystems...`

This confirms that the lane is no longer silently shaped like FO_Blog.
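
The local verification combines three checks: train and eval add up to the total, no blog/writer matches remain, and the first rows carry the TIP system prompt. Below is a hypothetical TypeScript helper that bundles them, run against the local `TIP_LLM` figures. The `ExportSummary` shape and the `verifyTipExport` name are assumptions.

```typescript
// Hypothetical summary of one lane export, filled from the figures above.
interface ExportSummary {
  train: number;
  evalCount: number;
  total: number;
  blogWriterMatches: number;
  firstSystemPrompts: string[];
}

const TIP_PROMPT_PREFIX = /^You are TIP_LLM\b/;

// Returns the list of failed checks; an empty list means the export passes.
function verifyTipExport(s: ExportSummary): string[] {
  const failures: string[] = [];
  if (s.train + s.evalCount !== s.total) failures.push("train + eval != total");
  if (s.blogWriterMatches !== 0) failures.push("blog/writer matches present");
  s.firstSystemPrompts.forEach((prompt, i) => {
    if (!TIP_PROMPT_PREFIX.test(prompt)) failures.push(`row ${i} lacks the TIP system prompt`);
  });
  return failures;
}

const localTipExport: ExportSummary = {
  train: 233,
  evalCount: 26,
  total: 259,
  blogWriterMatches: 0,
  firstSystemPrompts: [
    "You are TIP_LLM, a research and market-intelligence analyst for transceivers, switches, and vendor ecosystems...",
  ],
};

console.log(verifyTipExport(localTipExport)); // []
```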

## Synced To Erik

Synced:

- updated scripts:
  - `runpod_dataset_builder.ts`
  - `sync_gitea_training_pool.ts`
  - `submit_runpod_training.ts`
- rebuilt lane exports:
  - `training-data/runpod/magatamallm/*`
  - `training-data/runpod/fo_blogllm/*`
  - `training-data/runpod/tip_llm/*`

Then reran on Erik:

- `pnpm training:refresh-all`

## Live Erik / Public API Result

### `magatamallm`

- `datasetSource = url`
- `collectedExamples = 15679`
- `evalExamples = 1743`
- `totalExamples = 17422`
- `newSinceLastTraining = 15679`

### `fo_blogllm`

- `datasetSource = url`
- `collectedExamples = 17322`
- `evalExamples = 1926`
- `totalExamples = 19254`
- `neverTrained = true`

### `tip_llm`

- `datasetSource = url`
- `collectedExamples = 231`
- `evalExamples = 26`
- `totalExamples = 257`
- `neverTrained = true`
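
A quick consistency check over the live figures can catch lanes whose counts do not add up. This sketch assumes `totalExamples` should equal `collectedExamples + evalExamples`. The API does not state that assumption, and the `countGaps` helper is hypothetical.

```typescript
interface LaneStatus {
  lane: string;
  collectedExamples: number;
  evalExamples: number;
  totalExamples: number;
}

// Figures copied from the live Erik / public API results above.
const liveLaneStatus: LaneStatus[] = [
  { lane: "magatamallm", collectedExamples: 15679, evalExamples: 1743, totalExamples: 17422 },
  { lane: "fo_blogllm", collectedExamples: 17322, evalExamples: 1926, totalExamples: 19254 },
  { lane: "tip_llm", collectedExamples: 231, evalExamples: 26, totalExamples: 257 },
];

// Assumption: totalExamples should equal collectedExamples + evalExamples.
// Returns only the lanes where it does not, with the size of the gap.
function countGaps(rows: LaneStatus[]): { lane: string; gap: number }[] {
  return rows
    .map((r) => ({ lane: r.lane, gap: r.totalExamples - (r.collectedExamples + r.evalExamples) }))
    .filter((g) => g.gap !== 0);
}

console.log(countGaps(liveLaneStatus));
```

Under that assumption the check flags only `fo_blogllm`, whose reported `totalExamples` is 6 higher than `17322 + 1926 = 19248`. The `magatamallm` and `tip_llm` counts add up exactly. Whether the API counts anything beyond these two fields is not recorded here.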

## Remaining Work

The remaining hard blocker is no longer lane contamination.

It is now:

- RunPod artifact validation/adoption

Desired next steps:

1. only accept RunPod `COMPLETED` as success if a real artifact exists
2. verify artifact importability
3. update/adopt the local Ollama tag automatically
4. switch MAGATAMA only after successful adoption
5. run pre/post smoke prompts
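
The desired next steps form a gated pipeline in which each step runs only if the previous one succeeded. Below is a minimal synchronous sketch, assuming each step can be reduced to a pass/fail check. The step names and the `runAdoption` helper are hypothetical. The real steps (RunPod polling, Ollama import, the MAGATAMA alias switch) would be asynchronous.

```typescript
// Hypothetical names for the five desired next steps, in the order listed.
type AdoptionStep =
  | "artifact_exists"
  | "artifact_importable"
  | "ollama_tag_adopted"
  | "magatama_switched"
  | "smoke_prompts_passed";

interface StepResult {
  step: AdoptionStep;
  ok: boolean;
}

interface PlannedStep {
  step: AdoptionStep;
  run: () => boolean;
}

// Runs the steps in order and stops at the first failure, so a later step
// (such as switching MAGATAMA) never runs after an earlier gate has failed.
function runAdoption(plan: PlannedStep[]): StepResult[] {
  const results: StepResult[] = [];
  for (const item of plan) {
    const ok = item.run();
    results.push({ step: item.step, ok });
    if (!ok) break;
  }
  return results;
}

// Usage: simulate a COMPLETED job that produced no artifact.
const adoptionState = { switched: false };
const adoptionResults = runAdoption([
  { step: "artifact_exists", run: () => false },
  { step: "artifact_importable", run: () => true },
  { step: "ollama_tag_adopted", run: () => true },
  {
    step: "magatama_switched",
    run: () => {
      adoptionState.switched = true;
      return true;
    },
  },
  { step: "smoke_prompts_passed", run: () => true },
]);
console.log(adoptionResults, adoptionState.switched);
```

With the first gate failing, as it would for a `COMPLETED` job without an artifact, the MAGATAMA switch never runs. The sketch follows the list order literally; in practice the "pre" half of the smoke prompts would run before the switch.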