sync: record tip lane detangling and disk-safe refresh

2026-05-06 22:53:41 +02:00 · 2026-05-06 22:53:41 +02:00 · d01039734a
commit d01039734a
parent e6f98c89bd
2 changed files with 194 additions and 1 deletion
--- a/sync/CURRENT.md
+++ b/sync/CURRENT.md
@@ -1,6 +1,6 @@
 # Current TIP Sync State

-Updated: 2026-05-06 15:48 UTC
+Updated: 2026-05-06 20:52 UTC

 ## Active Policy

@@ -27,6 +27,56 @@ When work touches TIP, Magatama, LLM Gateway, bridges, auth, or shared Erik infr

 ## Latest Work

+- TIP/Blog lane separation was materially corrected on 2026-05-06:
+  - root cause:
+    - `TIP_LLM` was still ingesting blog-/writer-shaped rows from the canonical lane pool and shared transceiver corpora.
+    - local inspection showed the old TIP export had `6250` train rows, of which `6087` still matched blog/writer patterns.
+  - dataset builder and Gitea sync were hardened:
+    - `scripts/runpod_dataset_builder.ts`
+      - added strict `tipDatasetAllowed(...)`
+      - `TIP_LLM` now rejects blog-shaped source rows at dataset-build time
+      - `TIP_LLM` now rejects blog-like `system`, `user`, and markdown-article `assistant` patterns
+      - registry fallback for `TIP_LLM` now only uses lane-compatible datasets
+    - `scripts/sync_gitea_training_pool.ts`
+      - canonical TIP pool refresh now uses the stricter lane-alignment rules
+      - redundant `merged.jsonl` copies for `fo_blogllm` and `tip_llm` are no longer rewritten, to avoid local disk exhaustion from duplicate lane artifacts
+  - local disk issue encountered and fixed:
+    - full refresh failed with `ENOSPC` while writing `training-data/gitea-learning-pool/tip_llm/merged.jsonl`
+    - redundant lane `merged` artifacts for `fo_blogllm` and `tip_llm` were truncated and the sync script was changed to stop recreating them
+    - free disk space recovered from roughly `377Mi` to `17Gi`
+  - locally verified after rebuild:
+    - `TIP_LLM` RunPod export:
+      - `train = 233`
+      - `eval = 26`
+      - `total = 259`
+      - `blog/writer matches = 0`
+    - first TIP rows now use the correct TIP system prompt:
+      - `You are TIP_LLM, a research and market-intelligence analyst for transceivers, switches, and vendor ecosystems...`
+  - corrected artifacts and scripts were synced to Erik and `pnpm training:refresh-all` was rerun there.
+  - live verified on Erik/public API:
+    - `magatamallm`
+      - `datasetSource = url`
+      - `collectedExamples = 15679`
+      - `evalExamples = 1743`
+      - `totalExamples = 17422`
+      - `newSinceLastTraining = 15679`
+    - `fo_blogllm`
+      - `datasetSource = url`
+      - `collectedExamples = 17322`
+      - `evalExamples = 1926`
+      - `totalExamples = 19254`
+      - `neverTrained = true`
+    - `tip_llm`
+      - `datasetSource = url`
+      - `collectedExamples = 231`
+      - `evalExamples = 26`
+      - `totalExamples = 257`
+      - `neverTrained = true`
+  - operational conclusion:
+    - lane-specific datasets are now genuinely separated on Erik.
+    - `TIP_LLM` is no longer silently borrowing the FO_Blog behavior lane.
+    - the remaining hard problem is RunPod artifact adoption/validation, not lane contamination.
+
 - MAGATAMA frontend/runtime consistency was repaired again on 2026-05-06:
  - dashboard and core were rebuilt locally and redeployed to Erik.
  - live processes restarted successfully:
@@ -407,6 +457,13 @@ Confirmed on `2026-05-06`:
  - active alias switch
  - smoke-test proof
  has not yet been re-verified after the new adoption pipeline was wired in.
+- Latest live proof run on `2026-05-06`:
+  - job id: `2112a7ab-68c2-4411-a44f-6edb7ad377df-e1`
+  - materialized correctly
+  - reached `IN_PROGRESS`
+  - then `COMPLETED`
+  - but RunPod `status/{job}` returned no `output` object, no model artifact reference, and no Hugging Face repo result
+  - MAGATAMA now correctly classifies this as `completed_without_model_artifact`, not as success
 - `tip_llm-v1` is still not installed locally in Ollama.

 ### Pulso AI Recommendation
--- a/sync/history/2026-05-06-tip-lane-detangling-and-disk-safe-refresh.md
+++ b/sync/history/2026-05-06-tip-lane-detangling-and-disk-safe-refresh.md
@@ -0,0 +1,136 @@
+# TIP Lane Detangling And Disk-Safe Refresh
+
+Date: 2026-05-06 UTC
+
+## Summary
+
+`TIP_LLM` was still contaminated by blog/writer behavior even though lane-specific counts were already separated in MAGATAMA. The problem was not only the UI-level status but the actual lane corpus feeding the RunPod export.
+
+The lane was rebuilt and revalidated locally, then synced to Erik and refreshed there. The result is that `TIP_LLM` now uses a much smaller but correctly aligned research/network corpus instead of silently inheriting FO_Blog-like behavior.
+
+## Root Cause
+
+- The canonical `training-data/gitea-learning-pool/tip_llm/*.jsonl` pool still contained many blog-shaped rows from shared transceiver corpora.
+- The old TIP export sampled thousands of rows whose prompts/messages still looked like:
+  - `You are an expert technical writer...`
+  - publication-ready/blog instructions
+- A direct local check on the pre-fix TIP export showed:
+  - `6250` train rows
+  - `6087` matched blog/writer patterns
+
+## Changes Applied
+
+### `scripts/runpod_dataset_builder.ts`
+
+- Added a stricter `tipDatasetAllowed(...)` gate.
+- Tightened `laneRecordIsCompatible(...)` for `tip_llm`.
+- Tightened `lanePoolMessagesAlign(...)` for `tip_llm`:
+  - reject:
+    - `blog writer`
+    - `publication-ready`
+    - `technical writer specializing`
+    - article-outline/founder/blog prompts
+    - markdown-article assistant outputs
+- TIP registry fallback now only considers lane-compatible datasets.
+
+### `scripts/sync_gitea_training_pool.ts`
+
+- Applied the same stricter TIP lane-alignment logic.
+- Stopped rewriting redundant `merged.jsonl` copies for:
+  - `fo_blogllm`
+  - `tip_llm`
+- This was necessary because the duplicated merged artifacts caused local disk exhaustion during refresh.
+
+## Disk Incident
+
+During the first rebuild after the lane hardening, refresh failed with:
+
+- `ENOSPC: no space left on device`
+
+The immediate cause was writing:
+
+- `training-data/gitea-learning-pool/tip_llm/merged.jsonl`
+
+Fix:
+
+- truncated redundant `merged` artifacts for `fo_blogllm` and `tip_llm`
+- changed sync logic so those duplicates are no longer recreated
+
+Result:
+
+- free disk space recovered from roughly `377Mi` to `17Gi`
+
+## Verified Local Result
+
+After rebuild:
+
+- `TIP_LLM`
+  - `train = 233`
+  - `eval = 26`
+  - `total = 259`
+  - `blog/writer matches = 0`
+
+First rows now use the intended TIP instruction style:
+
+- `You are TIP_LLM, a research and market-intelligence analyst for transceivers, switches, and vendor ecosystems...`
+
+This confirms the lane is no longer silently shaped like FO_Blog.
+
+## Synced To Erik
+
+Synced:
+
+- updated scripts:
+  - `runpod_dataset_builder.ts`
+  - `sync_gitea_training_pool.ts`
+  - `submit_runpod_training.ts`
+- rebuilt lane exports:
+  - `training-data/runpod/magatamallm/*`
+  - `training-data/runpod/fo_blogllm/*`
+  - `training-data/runpod/tip_llm/*`
+
+Then reran on Erik:
+
+- `pnpm training:refresh-all`
+
+## Live Erik / Public API Result
+
+### `magatamallm`
+
+- `datasetSource = url`
+- `collectedExamples = 15679`
+- `evalExamples = 1743`
+- `totalExamples = 17422`
+- `newSinceLastTraining = 15679`
+
+### `fo_blogllm`
+
+- `datasetSource = url`
+- `collectedExamples = 17322`
+- `evalExamples = 1926`
+- `totalExamples = 19254`
+- `neverTrained = true`
+
+### `tip_llm`
+
+- `datasetSource = url`
+- `collectedExamples = 231`
+- `evalExamples = 26`
+- `totalExamples = 257`
+- `neverTrained = true`
+
+## Remaining Work
+
+The remaining hard blocker is no longer lane contamination.
+
+It is now:
+
+- RunPod artifact validation/adoption
+
+Desired next step:
+
+1. only accept RunPod `COMPLETED` as success if a real artifact exists
+2. verify artifact importability
+3. update/adopt local Ollama tag automatically
+4. switch MAGATAMA only after successful adoption
+5. run pre/post smoke prompts
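
Step 1 of the list above (treat RunPod `COMPLETED` as success only when a real artifact exists) could be sketched roughly as follows. This is a minimal illustration, not the code in `submit_runpod_training.ts`. The function name `classifyRunpodJob`, the `RunpodStatusResponse` type, the `modelArtifactUrl` and `huggingFaceRepo` field names, the list of failure statuses, and every state other than `completed_without_model_artifact` are assumptions made for the example.

```typescript
// Hypothetical shape of a RunPod `status/{job}` payload. Only `status` and
// `output` are mentioned in the notes; the inner field names are assumed.
interface RunpodStatusResponse {
  id: string;
  status: string; // e.g. "IN_PROGRESS", "COMPLETED"
  output?: {
    modelArtifactUrl?: string | null; // assumed field name
    huggingFaceRepo?: string | null; // assumed field name
  } | null;
}

// `completed_without_model_artifact` comes from the notes; the other state
// names are placeholders for this sketch.
type AdoptionState =
  | "pending"
  | "failed"
  | "completed_without_model_artifact"
  | "completed_with_model_artifact";

// Assumed set of terminal failure statuses.
const FAILURE_STATUSES = ["FAILED", "CANCELLED", "TIMED_OUT"];

// True only for non-empty, non-whitespace strings.
function hasText(value?: string | null): boolean {
  return typeof value === "string" && value.trim().length > 0;
}

function classifyRunpodJob(res: RunpodStatusResponse): AdoptionState {
  if (FAILURE_STATUSES.indexOf(res.status) !== -1) return "failed";
  if (res.status !== "COMPLETED") return "pending";

  // COMPLETED alone is not success: require at least one artifact reference.
  const hasArtifact =
    hasText(res.output?.modelArtifactUrl) || hasText(res.output?.huggingFaceRepo);

  return hasArtifact
    ? "completed_with_model_artifact"
    : "completed_without_model_artifact";
}

// Usage: a COMPLETED job with no `output` object, like the proof run above.
console.log(classifyRunpodJob({ id: "example-job", status: "COMPLETED" }));
```

Keeping the classification as a pure function makes it easy to gate the later steps (import check, Ollama tag update, alias switch) on a single state value.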