sync: record lane-specific training pools and url runpod mode

2026-05-06 17:55:20 +02:00 · 2026-05-06 17:55:20 +02:00 · b9a45f9f23
commit b9a45f9f23
parent 830ab57c3c
2 changed files with 150 additions and 1 deletion
--- a/sync/CURRENT.md
+++ b/sync/CURRENT.md
@@ -1,6 +1,6 @@
 # Current TIP Sync State

-Updated: 2026-05-06 15:24 UTC
+Updated: 2026-05-06 15:48 UTC

 ## Active Policy

@@ -65,6 +65,40 @@ When work touches TIP, Magatama, LLM Gateway, bridges, auth, or shared Erik infr
      - `generatedAt = 2026-05-06T15:18:42.708Z`
      - latest visible entries include `2026-04-30` items again instead of appearing frozen at `30.05`

+- MAGATAMA lane-specific training pools and RunPod dataset automation were finished on 2026-05-06:
+  - root cause:
+    - the training modal always fetched `/api/llm/status` without a lane, so `FO_BlogLLM` and `TIP_LLM` still showed the `magatamallm` pool.
+  - dashboard/server were updated so `/api/llm/status?lane=...` is now truly lane-aware.
+  - the training modal now refreshes per selected lane and rewrites:
+    - title
+    - runtime label
+    - pool path
+    - counts
+    - dataset source
+  - MAGATAMA dashboard env on Erik was switched to URL dataset mode for all lanes via `ecosystem.config.cjs`:
+    - `RUNPOD_DATASET_SOURCE=url`
+    - `RUNPOD_DATASET_SOURCE_MAGATAMALLM=url`
+    - `RUNPOD_DATASET_SOURCE_FO_BLOGLLM=url`
+    - `RUNPOD_DATASET_SOURCE_TIP_LLM=url`
+  - live verified on Erik after restart:
+    - `fo_blogllm`
+      - `datasetSource = url`
+      - `collectionsPath = /opt/magatama/training-data/runpod/fo_blogllm/manifest.json`
+      - `train = 28`
+      - `eval = 4`
+      - `total = 32`
+    - `tip_llm`
+      - `datasetSource = url`
+      - `collectionsPath = /opt/magatama/training-data/runpod/tip_llm/manifest.json`
+      - `train = 36`
+      - `eval = 4`
+      - `total = 40`
+    - `magatamallm`
+      - remains on lane-export counts (`15620 / 1736 / 17356`)
+  - operator impact:
+    - Hugging Face dataset publication is no longer required for MAGATAMA RunPod launches.
+    - every supported LLM lane now points to its own local/Gitea-backed lane export instead of reusing `magatamallm`.
+
 - MAGATAMA training + Attack Paths + Atlas exposure were corrected again on 2026-05-06:
  - the RunPod serverless training start failure was not a RunPod outage.
  - root cause was missing training scripts on Erik (`training_full_refresh.ts` and related helpers were absent under `/opt/magatama/scripts`).
--- a/sync/history/2026-05-06-magatama-lane-specific-training-pools-and-url-runpod.md
+++ b/sync/history/2026-05-06-magatama-lane-specific-training-pools-and-url-runpod.md
@@ -0,0 +1,115 @@
+# MAGATAMA Lane-Specific Training Pools + URL RunPod Dataset Mode
+
+Date: 2026-05-06
+Author: Codex
+
+## Problem
+
+The MAGATAMA training modal still showed the `magatamallm` pool even when the operator selected:
+
+- `FO_BlogLLM`
+- `TIP_LLM`
+
+As a result, the UI implied that all training lanes reused the same pool and counts.
+
+At the same time, RunPod launches still depended on Hugging Face dataset publication unless explicitly changed.
+
+## Root Cause
+
+1. The training modal fetched:
+
+- `/api/llm/status`
+
+without a lane parameter.
+
+2. The backend status route therefore always returned the default `magatamallm` training corpus/lane.
+
+3. The dashboard env on Erik still effectively used the Hugging Face dataset path as the RunPod dataset source.
+
+## Fix
+
+### Lane-aware status
+
+`/api/llm/status` now accepts the selected lane and returns lane-specific training metadata.
+
+The training modal was updated to:
+
+- fetch `/api/llm/status?lane=<selected lane>`
+- update title and runtime text per lane
+- show lane-specific:
+  - manifest path
+  - train/eval/total counts
+  - dataset source
+
+### URL dataset mode
+
+The live dashboard environment on Erik was updated through `ecosystem.config.cjs`:
+
+- `RUNPOD_DATASET_SOURCE=url`
+- `RUNPOD_DATASET_SOURCE_MAGATAMALLM=url`
+- `RUNPOD_DATASET_SOURCE_FO_BLOGLLM=url`
+- `RUNPOD_DATASET_SOURCE_TIP_LLM=url`
+- `MAGATAMA_PUBLIC_BASE_URL=https://magatama.fichtmueller.org`
+
+Then `magatama-dashboard` was restarted with `--update-env`.
+
+## Live Verification
+
+Verified directly on Erik through:
+
+- `http://127.0.0.1:3211/api/llm/status?lane=fo_blogllm`
+- `http://127.0.0.1:3211/api/llm/status?lane=tip_llm`
+
+### `fo_blogllm`
+
+- `datasetSource = url`
+- `collectionsPath = /opt/magatama/training-data/runpod/fo_blogllm/manifest.json`
+- `trainFile = /opt/magatama/training-data/runpod/fo_blogllm/fo_blogllm-sft-train.jsonl`
+- `validFile = /opt/magatama/training-data/runpod/fo_blogllm/fo_blogllm-sft-eval.jsonl`
+- `collectedExamples = 28`
+- `evalExamples = 4`
+- `totalExamples = 32`
+
+### `tip_llm`
+
+- `datasetSource = url`
+- `collectionsPath = /opt/magatama/training-data/runpod/tip_llm/manifest.json`
+- `trainFile = /opt/magatama/training-data/runpod/tip_llm/tip_llm-sft-train.jsonl`
+- `validFile = /opt/magatama/training-data/runpod/tip_llm/tip_llm-sft-eval.jsonl`
+- `collectedExamples = 36`
+- `evalExamples = 4`
+- `totalExamples = 40`
+
+### `magatamallm`
+
+Still correctly shows the larger lane export:
+
+- `collectedExamples = 15620`
+- `evalExamples = 1736`
+- `totalExamples = 17356`
+
+## Operational Meaning
+
+MAGATAMA training is now materially closer to the intended fully automated flow:
+
+- each LLM lane shows and uses its own pool
+- RunPod dataset preparation no longer requires Hugging Face dataset publication
+- datasets are fetched from the MAGATAMA URL bundle / lane export
+
+This removes one major manual/external blocker from the RunPod training path.
+
+## Remaining Truth
+
+This fix does **not** by itself prove that every RunPod worker run succeeds end-to-end.
+
+What is fixed:
+
+- lane-specific training pool selection
+- lane-specific UI/status
+- URL dataset source activation
+
+What still depends on RunPod worker behavior:
+
+- real successful training execution
+- durable model artifact production
+- artifact adoption after completion
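
The live verification recorded in the history note above could be repeated with a small TypeScript check. This is a sketch and not part of the commit: the flat `LaneStatus` shape, the helper names `statusUrl` and `checkLaneStatus`, and the rule that the manifest path contains the lane name are all assumptions, taken only from the fields and values quoted in the note.

```typescript
// Sketch only: field names come from the history note; the flat JSON shape is an assumption.
interface LaneStatus {
  datasetSource: string;
  collectionsPath: string;
  collectedExamples: number;
  evalExamples: number;
  totalExamples: number;
}

// Hypothetical helper: builds the lane-aware status URL used in the live verification.
function statusUrl(baseUrl: string, lane: string): string {
  return `${baseUrl}/api/llm/status?lane=${encodeURIComponent(lane)}`;
}

// Hypothetical helper: returns a list of problems; an empty list means the lane looks consistent.
function checkLaneStatus(lane: string, status: LaneStatus): string[] {
  const problems: string[] = [];
  if (status.datasetSource !== "url") {
    problems.push(`${lane}: datasetSource is ${status.datasetSource}, expected url`);
  }
  if (!status.collectionsPath.includes(`/${lane}/`)) {
    problems.push(`${lane}: collectionsPath does not point at a ${lane} export`);
  }
  if (status.collectedExamples + status.evalExamples !== status.totalExamples) {
    problems.push(`${lane}: train + eval does not equal total`);
  }
  return problems;
}

// Values copied from the note for fo_blogllm.
const foBlog: LaneStatus = {
  datasetSource: "url",
  collectionsPath: "/opt/magatama/training-data/runpod/fo_blogllm/manifest.json",
  collectedExamples: 28,
  evalExamples: 4,
  totalExamples: 32,
};

console.log(statusUrl("http://127.0.0.1:3211", "fo_blogllm"));
console.log(checkLaneStatus("fo_blogllm", foBlog));
```

An empty problem list means only that the status payload is internally consistent for that lane. As the note says, it does not show that a RunPod worker run succeeds.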