sync: record lane-specific training pools and url runpod mode

2026-05-06 17:55:20 +02:00 · 2026-05-06 17:55:20 +02:00 · b9a45f9f23
commit b9a45f9f23
parent 830ab57c3c
2 changed files with 150 additions and 1 deletion
--- a/sync/CURRENT.md
+++ b/sync/CURRENT.md
@@ -1,6 +1,6 @@
 # Current TIP Sync State
-Updated: 2026-05-06 15:24 UTC
+Updated: 2026-05-06 15:48 UTC
 ## Active Policy
@@ -65,6 +65,40 @@ When work touches TIP, Magatama, LLM Gateway, bridges, auth, or shared Erik infr
      - `generatedAt = 2026-05-06T15:18:42.708Z`
      - latest visible entries include `2026-04-30` items again instead of appearing frozen at `30.05`
 - MAGATAMA lane-specific training pools and RunPod dataset automation were finished on 2026-05-06:
  - root cause:
    - the training modal always fetched `/api/llm/status` without a lane, so `FO_BlogLLM` and `TIP_LLM` still showed the `magatamallm` pool.
  - the dashboard and server were updated so that `/api/llm/status?lane=...` returns lane-specific data.
  - the training modal now refreshes per selected lane and rewrites:
    - title
    - runtime label
    - pool path
    - counts
    - dataset source
  - MAGATAMA dashboard env on Erik was switched to URL dataset mode for all lanes via `ecosystem.config.cjs`:
    - `RUNPOD_DATASET_SOURCE=url`
    - `RUNPOD_DATASET_SOURCE_MAGATAMALLM=url`
    - `RUNPOD_DATASET_SOURCE_FO_BLOGLLM=url`
    - `RUNPOD_DATASET_SOURCE_TIP_LLM=url`
  - live verified on Erik after restart:
    - `fo_blogllm`
      - `datasetSource = url`
      - `collectionsPath = /opt/magatama/training-data/runpod/fo_blogllm/manifest.json`
      - `train = 28`
      - `eval = 4`
      - `total = 32`
    - `tip_llm`
      - `datasetSource = url`
      - `collectionsPath = /opt/magatama/training-data/runpod/tip_llm/manifest.json`
      - `train = 36`
      - `eval = 4`
      - `total = 40`
    - `magatamallm`
      - remains on lane-export counts (train / eval / total = `15620 / 1736 / 17356`)
  - operator impact:
    - publishing a dataset to Hugging Face is no longer required for MAGATAMA RunPod launches.
    - every supported LLM lane now points to its own local/Gitea-backed lane export instead of reusing `magatamallm`.
 - MAGATAMA training + Attack Paths + Atlas exposure were corrected again on 2026-05-06:
  - the RunPod serverless training start failure was not a RunPod outage.
  - root cause was missing training scripts on Erik (`training_full_refresh.ts` and related helpers were absent under `/opt/magatama/scripts`).
--- a/sync/history/2026-05-06-magatama-lane-specific-training-pools-and-url-runpod.md
+++ b/sync/history/2026-05-06-magatama-lane-specific-training-pools-and-url-runpod.md
@@ -0,0 +1,115 @@
 # MAGATAMA Lane-Specific Training Pools + URL RunPod Dataset Mode
 Date: 2026-05-06
 Author: Codex
 ## Problem
 The MAGATAMA training modal still showed the `magatamallm` pool even when the operator selected:
 - `FO_BlogLLM`
 - `TIP_LLM`
 As a result, the UI implied that all training lanes reused the same pool and counts.
 At the same time, RunPod launches still depended on Hugging Face dataset publication unless explicitly changed.
 ## Root Cause
 1. The training modal fetched `/api/llm/status` without a lane parameter.
 2. The backend status route therefore always returned the default `magatamallm` training corpus/lane.
 3. The dashboard env on Erik was still, in effect, using the Hugging Face dataset path as the RunPod dataset source.
 ## Fix
 ### Lane-aware status
 `/api/llm/status` now accepts the selected lane and returns lane-specific training metadata.
 The training modal was updated to:
 - fetch `/api/llm/status?lane=<selected lane>`
 - update title and runtime text per lane
 - show lane-specific:
  - manifest path
  - train/eval/total counts
  - dataset source
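The lane-aware request and the server-side lane fallback can be sketched as follows. This is a minimal illustration, not the dashboard's actual code: `KNOWN_LANES`, `buildStatusPath`, and `resolveLane` are hypothetical names, and the fallback to `magatamallm` for a missing or unknown lane is an assumption based on the default behavior described under Root Cause.

```typescript
// Hypothetical lane ids, taken from the lanes named in this note.
const KNOWN_LANES = ["magatamallm", "fo_blogllm", "tip_llm"] as const;
type Lane = (typeof KNOWN_LANES)[number];

// Assumption: the backend falls back to `magatamallm` when no lane is given.
const DEFAULT_LANE: Lane = "magatamallm";

// Client side: build the status path for the lane selected in the modal.
function buildStatusPath(lane: Lane): string {
  return `/api/llm/status?lane=${encodeURIComponent(lane)}`;
}

// Server side: map a raw `lane` query value onto a known lane.
// Assumption: unknown values fall back to the default instead of erroring.
function resolveLane(rawLane: string | undefined): Lane {
  const normalized = (rawLane ?? "").trim().toLowerCase();
  const match = KNOWN_LANES.find((lane) => lane === normalized);
  return match ?? DEFAULT_LANE;
}
```

With this shape, a modal that omits the lane reproduces the original symptom: `resolveLane(undefined)` yields the default lane, so every selection would show the `magatamallm` pool.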
 ### URL dataset mode
 The live dashboard environment on Erik was updated through `ecosystem.config.cjs`:
 - `RUNPOD_DATASET_SOURCE=url`
 - `RUNPOD_DATASET_SOURCE_MAGATAMALLM=url`
 - `RUNPOD_DATASET_SOURCE_FO_BLOGLLM=url`
 - `RUNPOD_DATASET_SOURCE_TIP_LLM=url`
 - `MAGATAMA_PUBLIC_BASE_URL=https://magatama.fichtmueller.org`
 Then `magatama-dashboard` was restarted with `--update-env`.
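One way such variables could be resolved per lane is sketched below. The lookup order (lane-specific variable first, then the global `RUNPOD_DATASET_SOURCE`) and the upper-cased lane suffix are assumptions inferred from the variable names above; `datasetSourceFor` and `EnvMap` are hypothetical and not taken from the dashboard code.

```typescript
type EnvMap = Record<string, string | undefined>;

// Assumption: a lane-specific variable wins over the global one, and the
// suffix is the upper-cased lane id (fo_blogllm -> FO_BLOGLLM).
function datasetSourceFor(lane: string, env: EnvMap): string | undefined {
  const laneKey = `RUNPOD_DATASET_SOURCE_${lane.toUpperCase()}`;
  return env[laneKey] ?? env["RUNPOD_DATASET_SOURCE"];
}

// The values set on Erik, as listed in this note.
const exampleEnv: EnvMap = {
  RUNPOD_DATASET_SOURCE: "url",
  RUNPOD_DATASET_SOURCE_MAGATAMALLM: "url",
  RUNPOD_DATASET_SOURCE_FO_BLOGLLM: "url",
  RUNPOD_DATASET_SOURCE_TIP_LLM: "url",
};
```

In a Node process the second argument would typically be `process.env`. Passing it in explicitly keeps the helper easy to check without touching the live environment.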
 ## Live Verification
 Verified directly on Erik through:
 - `http://127.0.0.1:3211/api/llm/status?lane=fo_blogllm`
 - `http://127.0.0.1:3211/api/llm/status?lane=tip_llm`
 ### `fo_blogllm`
 - `datasetSource = url`
 - `collectionsPath = /opt/magatama/training-data/runpod/fo_blogllm/manifest.json`
 - `trainFile = /opt/magatama/training-data/runpod/fo_blogllm/fo_blogllm-sft-train.jsonl`
 - `validFile = /opt/magatama/training-data/runpod/fo_blogllm/fo_blogllm-sft-eval.jsonl`
 - `collectedExamples = 28`
 - `evalExamples = 4`
 - `totalExamples = 32`
 ### `tip_llm`
 - `datasetSource = url`
 - `collectionsPath = /opt/magatama/training-data/runpod/tip_llm/manifest.json`
 - `trainFile = /opt/magatama/training-data/runpod/tip_llm/tip_llm-sft-train.jsonl`
 - `validFile = /opt/magatama/training-data/runpod/tip_llm/tip_llm-sft-eval.jsonl`
 - `collectedExamples = 36`
 - `evalExamples = 4`
 - `totalExamples = 40`
 ### `magatamallm`
 Still correctly shows the larger lane export:
 - `collectedExamples = 15620`
 - `evalExamples = 1736`
 - `totalExamples = 17356`
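As a quick sanity check on the numbers above, the sketch below confirms that train plus eval equals the reported total for each lane. It assumes `collectedExamples` is the training count, which matches the `train` figures recorded in `sync/CURRENT.md`. `LaneCounts` and `inconsistentLanes` are hypothetical helper names.

```typescript
interface LaneCounts {
  collectedExamples: number; // assumed to be the training-set size
  evalExamples: number;
  totalExamples: number;
}

// Values copied from the live verification in this note.
const verifiedCounts: Record<string, LaneCounts> = {
  fo_blogllm: { collectedExamples: 28, evalExamples: 4, totalExamples: 32 },
  tip_llm: { collectedExamples: 36, evalExamples: 4, totalExamples: 40 },
  magatamallm: { collectedExamples: 15620, evalExamples: 1736, totalExamples: 17356 },
};

// Returns the lanes whose train + eval counts do not add up to the total.
function inconsistentLanes(counts: Record<string, LaneCounts>): string[] {
  return Object.keys(counts).filter((lane) => {
    const c = counts[lane];
    return c.collectedExamples + c.evalExamples !== c.totalExamples;
  });
}
```

For the recorded values `inconsistentLanes(verifiedCounts)` returns an empty array, since 28 + 4 = 32, 36 + 4 = 40, and 15620 + 1736 = 17356.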
 ## Operational Meaning
 MAGATAMA training is now materially closer to the intended fully automated flow:
 - each LLM lane shows and uses its own pool
 - RunPod dataset preparation no longer requires Hugging Face dataset publication
 - datasets are fetched from the MAGATAMA URL bundle / lane export
 This removes one major manual/external blocker from the RunPod training path.
 ## Remaining Truth
 This fix does **not** prove that any given RunPod worker run succeeds end-to-end.
 What is fixed:
 - lane-specific training pool selection
 - lane-specific UI/status
 - URL dataset source activation
 What still depends on RunPod worker behavior:
 - real successful training execution
 - durable model artifact production
 - artifact adoption after completion