From d01039734a61f19adc8ef401427188a31d094e40 Mon Sep 17 00:00:00 2001
From: Rene Fichtmueller
Date: Wed, 6 May 2026 22:53:41 +0200
Subject: [PATCH] sync: record tip lane detangling and disk-safe refresh

---
 sync/CURRENT.md                               |  59 +++++++-
 ...p-lane-detangling-and-disk-safe-refresh.md | 136 ++++++++++++++++++
 2 files changed, 194 insertions(+), 1 deletion(-)
 create mode 100644 sync/history/2026-05-06-tip-lane-detangling-and-disk-safe-refresh.md

diff --git a/sync/CURRENT.md b/sync/CURRENT.md
index 012fb51..b3f6677 100644
--- a/sync/CURRENT.md
+++ b/sync/CURRENT.md
@@ -1,6 +1,6 @@
 # Current TIP Sync State
 
-Updated: 2026-05-06 15:48 UTC
+Updated: 2026-05-06 20:52 UTC
 
 ## Active Policy
 
@@ -27,6 +27,56 @@ When work touches TIP, Magatama, LLM Gateway, bridges, auth, or shared Erik infr
 
 ## Latest Work
 
+- TIP/Blog lane separation was materially corrected on 2026-05-06:
+  - root cause:
+    - `TIP_LLM` was still ingesting blog-/writer-shaped rows from the canonical lane pool and shared transceiver corpora.
+    - local inspection showed the old TIP export had `6250` train rows, of which `6087` still matched blog/writer patterns.
+  - dataset builder and Gitea sync were hardened:
+    - `scripts/runpod_dataset_builder.ts`
+      - added strict `tipDatasetAllowed(...)`
+      - `TIP_LLM` now rejects blog-shaped source rows at dataset-build time
+      - `TIP_LLM` now rejects blog-like `system`, `user`, and markdown-article `assistant` patterns
+      - registry fallback for `TIP_LLM` now only uses lane-compatible datasets
+    - `scripts/sync_gitea_training_pool.ts`
+      - canonical TIP pool refresh now uses the stricter lane-alignment rules
+      - redundant `merged.jsonl` copies for `fo_blogllm` and `tip_llm` are no longer rewritten, to avoid local disk exhaustion from duplicate lane artifacts
+  - local disk issue encountered and fixed:
+    - full refresh failed with `ENOSPC` while writing `training-data/gitea-learning-pool/tip_llm/merged.jsonl`
+    - redundant lane `merged` artifacts for `fo_blogllm` and `tip_llm` were truncated and the sync script was changed to stop recreating them
+    - free disk space returned from `377Mi` to `17Gi`
+  - locally verified after rebuild:
+    - `TIP_LLM` RunPod export:
+      - `train = 233`
+      - `eval = 26`
+      - `total = 259`
+      - `blog/writer matches = 0`
+    - first TIP rows now use the correct TIP system prompt:
+      - `You are TIP_LLM, a research and market-intelligence analyst for transceivers, switches, and vendor ecosystems...`
+  - corrected artifacts and scripts were synced to Erik and `pnpm training:refresh-all` was rerun there.
+  - live verified on Erik/public API:
+    - `magatamallm`
+      - `datasetSource = url`
+      - `collectedExamples = 15679`
+      - `evalExamples = 1743`
+      - `totalExamples = 17422`
+      - `newSinceLastTraining = 15679`
+    - `fo_blogllm`
+      - `datasetSource = url`
+      - `collectedExamples = 17322`
+      - `evalExamples = 1926`
+      - `totalExamples = 19254`
+      - `neverTrained = true`
+    - `tip_llm`
+      - `datasetSource = url`
+      - `collectedExamples = 231`
+      - `evalExamples = 26`
+      - `totalExamples = 257`
+      - `neverTrained = true`
+  - operational conclusion:
+    - lane-specific dataset truth is now real on Erik.
+    - `TIP_LLM` is no longer silently borrowing the FO_Blog behavior lane.
+    - the next remaining hard problem is now RunPod artifact adoption/validation, not lane contamination.
+
 - MAGATAMA frontend/runtime consistency was repaired again on 2026-05-06:
   - dashboard and core were rebuilt locally and redeployed to Erik.
   - live processes restarted successfully:
@@ -407,6 +457,13 @@ Confirmed on `2026-05-06`:
 - active alias switch
 - smoke-test proof has not yet been re-verified after the new adoption pipeline was wired in.
+- Latest live proof run on `2026-05-06`:
+  - job id: `2112a7ab-68c2-4411-a44f-6edb7ad377df-e1`
+  - materialized correctly
+  - reached `IN_PROGRESS`
+  - then `COMPLETED`
+  - but RunPod `status/{job}` returned no `output` object, no model artifact reference, and no Hugging Face repo result
+  - current MAGATAMA handling now correctly classifies this as `completed_without_model_artifact`, not as success
 - `tip_llm-v1` is still not installed locally in Ollama.
 ### Pulso AI Recommendation
diff --git a/sync/history/2026-05-06-tip-lane-detangling-and-disk-safe-refresh.md b/sync/history/2026-05-06-tip-lane-detangling-and-disk-safe-refresh.md
new file mode 100644
index 0000000..2c73616
--- /dev/null
+++ b/sync/history/2026-05-06-tip-lane-detangling-and-disk-safe-refresh.md
@@ -0,0 +1,136 @@
+# TIP Lane Detangling And Disk-Safe Refresh
+
+Date: 2026-05-06 UTC
+
+## Summary
+
+`TIP_LLM` was still contaminated by blog/writer behavior even though lane-specific counts were already separated in MAGATAMA. The problem was not only UI-level status, but the actual lane corpus feeding the RunPod export.
+
+The lane was rebuilt and revalidated locally, then synced to Erik and refreshed there. The result is that `TIP_LLM` now uses a much smaller but correctly aligned research/network corpus instead of silently inheriting FO_Blog-like behavior.
+
+## Root Cause
+
+- The canonical `training-data/gitea-learning-pool/tip_llm/*.jsonl` pool still contained many blog-shaped rows from shared transceiver corpora.
+- The old TIP export sampled thousands of rows whose prompts/messages still looked like:
+  - `You are an expert technical writer...`
+  - publication-ready/blog instructions
+- A direct local check on the pre-fix TIP export showed:
+  - `6250` train rows
+  - `6087` matched blog/writer patterns
+
+## Changes Applied
+
+### `scripts/runpod_dataset_builder.ts`
+
+- Added a stricter `tipDatasetAllowed(...)` gate.
+- Tightened `laneRecordIsCompatible(...)` for `tip_llm`.
+- Tightened `lanePoolMessagesAlign(...)` for `tip_llm`:
+  - reject:
+    - `blog writer`
+    - `publication-ready`
+    - `technical writer specializing`
+    - article-outline/founder/blog prompts
+    - markdown-article assistant outputs
+- TIP registry fallback now only considers lane-compatible datasets.
+
+### `scripts/sync_gitea_training_pool.ts`
+
+- Applied the same stricter TIP lane-alignment logic.
+- Stopped rewriting redundant `merged.jsonl` copies for:
+  - `fo_blogllm`
+  - `tip_llm`
+- This was necessary because the duplicated merged artifacts caused local disk exhaustion during refresh.
+
+## Disk Incident
+
+During the first rebuild after the lane hardening, refresh failed with:
+
+- `ENOSPC: no space left on device`
+
+The immediate cause was writing:
+
+- `training-data/gitea-learning-pool/tip_llm/merged.jsonl`
+
+Fix:
+
+- truncated redundant `merged` artifacts for `fo_blogllm` and `tip_llm`
+- changed sync logic so those duplicates are no longer recreated
+
+Result:
+
+- free disk space recovered from roughly `377Mi` to `17Gi`
+
+## Verified Local Result
+
+After rebuild:
+
+- `TIP_LLM`
+  - `train = 233`
+  - `eval = 26`
+  - `total = 259`
+  - `blog/writer matches = 0`
+
+First rows now use the intended TIP instruction style:
+
+- `You are TIP_LLM, a research and market-intelligence analyst for transceivers, switches, and vendor ecosystems...`
+
+This confirms the lane is no longer silently shaped like FO_Blog.
+
+## Synced To Erik
+
+Synced:
+
+- updated scripts:
+  - `runpod_dataset_builder.ts`
+  - `sync_gitea_training_pool.ts`
+  - `submit_runpod_training.ts`
+- rebuilt lane exports:
+  - `training-data/runpod/magatamallm/*`
+  - `training-data/runpod/fo_blogllm/*`
+  - `training-data/runpod/tip_llm/*`
+
+Then reran on Erik:
+
+- `pnpm training:refresh-all`
+
+## Live Erik / Public API Result
+
+### `magatamallm`
+
+- `datasetSource = url`
+- `collectedExamples = 15679`
+- `evalExamples = 1743`
+- `totalExamples = 17422`
+- `newSinceLastTraining = 15679`
+
+### `fo_blogllm`
+
+- `datasetSource = url`
+- `collectedExamples = 17322`
+- `evalExamples = 1926`
+- `totalExamples = 19254`
+- `neverTrained = true`
+
+### `tip_llm`
+
+- `datasetSource = url`
+- `collectedExamples = 231`
+- `evalExamples = 26`
+- `totalExamples = 257`
+- `neverTrained = true`
+
+## Remaining Work
+
+The next remaining hard blocker is no longer lane contamination.
+
+It is now:
+
+- RunPod artifact validation/adoption
+
+Desired next step:
+
+1. only accept RunPod `COMPLETED` as success if a real artifact exists
+2. verify artifact importability
+3. update/adopt local Ollama tag automatically
+4. switch MAGATAMA only after successful adoption
+5. run pre/post smoke prompts
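
Review note (not part of the patch): item 1 of "Desired next step" could be sketched roughly as below. This is only an illustration, not the MAGATAMA implementation. The only names taken from the patch are the `COMPLETED` and `IN_PROGRESS` statuses, the `output` object, and the `completed_without_model_artifact` classification. The other outcome names, the failure statuses, and the `modelArtifactUrl` / `huggingFaceRepo` keys are hypothetical placeholders.

```typescript
// Sketch only: everything except `status`, `output`, and the
// `completed_without_model_artifact` label is an assumption.
type JobOutcome =
  | "pending"
  | "failed"
  | "completed_with_model_artifact"
  | "completed_without_model_artifact";

interface RunpodStatusResponse {
  status: string; // e.g. "IN_PROGRESS", "COMPLETED"
  output?: unknown; // absent in the 2026-05-06 proof run
}

// Hypothetical keys under which a training worker might report its result.
const ARTIFACT_KEYS = ["modelArtifactUrl", "huggingFaceRepo"] as const;

// Returns the first non-empty string found under a known artifact key, else null.
function findArtifactRef(output: unknown): string | null {
  if (typeof output !== "object" || output === null) return null;
  const record = output as Record<string, unknown>;
  for (const key of ARTIFACT_KEYS) {
    const value = record[key];
    if (typeof value === "string" && value.trim() !== "") return value.trim();
  }
  return null;
}

// COMPLETED only counts as success when an artifact reference is present.
function classifyRunpodJob(res: RunpodStatusResponse): JobOutcome {
  if (res.status === "COMPLETED") {
    return findArtifactRef(res.output) !== null
      ? "completed_with_model_artifact"
      : "completed_without_model_artifact";
  }
  // Assumed terminal failure states; adjust to what the API actually returns.
  if (
    res.status === "FAILED" ||
    res.status === "CANCELLED" ||
    res.status === "TIMED_OUT"
  ) {
    return "failed";
  }
  return "pending";
}
```

Under these assumptions, a response like the one from job `2112a7ab-68c2-4411-a44f-6edb7ad377df-e1` (`COMPLETED`, no `output`) maps to `completed_without_model_artifact`, so the later adoption steps would not run for it.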