sync: record lane-specific training pools and url runpod mode

Rene Fichtmueller 2026-05-06 17:55:20 +02:00
parent 830ab57c3c
commit b9a45f9f23
2 changed files with 150 additions and 1 deletions

@ -1,6 +1,6 @@
# Current TIP Sync State
Updated: 2026-05-06 15:24 UTC
Updated: 2026-05-06 15:48 UTC
## Active Policy
@ -65,6 +65,40 @@ When work touches TIP, Magatama, LLM Gateway, bridges, auth, or shared Erik infr
- `generatedAt = 2026-05-06T15:18:42.708Z`
- latest visible entries include `2026-04-30` items again instead of appearing frozen at `30.05`
- MAGATAMA lane-specific training pools and RunPod dataset automation were finished on 2026-05-06:
- root cause:
- the training modal always fetched `/api/llm/status` without a lane, so `FO_BlogLLM` and `TIP_LLM` still showed the `magatamallm` pool.
- dashboard/server were updated so `/api/llm/status?lane=...` is now truly lane-aware.
- the training modal now refreshes per selected lane and rewrites:
- title
- runtime label
- pool path
- counts
- dataset source
- MAGATAMA dashboard env on Erik was switched to URL dataset mode for all lanes via `ecosystem.config.cjs`:
- `RUNPOD_DATASET_SOURCE=url`
- `RUNPOD_DATASET_SOURCE_MAGATAMALLM=url`
- `RUNPOD_DATASET_SOURCE_FO_BLOGLLM=url`
- `RUNPOD_DATASET_SOURCE_TIP_LLM=url`
- live verified on Erik after restart:
- `fo_blogllm`
- `datasetSource = url`
- `collectionsPath = /opt/magatama/training-data/runpod/fo_blogllm/manifest.json`
- `train = 28`
- `eval = 4`
- `total = 32`
- `tip_llm`
- `datasetSource = url`
- `collectionsPath = /opt/magatama/training-data/runpod/tip_llm/manifest.json`
- `train = 36`
- `eval = 4`
- `total = 40`
- `magatamallm`
- remains on its own lane-export counts (train / eval / total = `15620 / 1736 / 17356`)
- operator impact:
- MAGATAMA RunPod launches no longer require publishing a dataset to Hugging Face.
- every supported LLM lane now points to its own local/Gitea-backed lane export instead of reusing `magatamallm`.
- MAGATAMA training + Attack Paths + Atlas exposure were corrected again on 2026-05-06:
- the RunPod serverless training start failure was not a RunPod outage.
- root cause was missing training scripts on Erik (`training_full_refresh.ts` and related helpers were absent under `/opt/magatama/scripts`).

@ -0,0 +1,115 @@
# MAGATAMA Lane-Specific Training Pools + URL RunPod Dataset Mode
Date: 2026-05-06
Author: Codex
## Problem
The MAGATAMA training modal still showed the `magatamallm` pool even when the operator selected:
- `FO_BlogLLM`
- `TIP_LLM`
As a result, the UI implied that all training lanes reused the same pool and counts.
At the same time, RunPod launches still depended on Hugging Face dataset publication unless the dataset source was explicitly overridden.
## Root Cause
1. The training modal fetched:
- `/api/llm/status`
without a lane parameter.
2. The backend status route therefore always returned the default `magatamallm` training corpus/lane.
3. The dashboard env on Erik still effectively used the Hugging Face dataset path as the RunPod dataset source.
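The failure mode in points 1 and 2 can be illustrated with a minimal sketch of lane resolution on the server side. This is not the actual MAGATAMA route code: `resolveLane`, `KNOWN_LANES`, and `DEFAULT_LANE` are hypothetical names, and the case-insensitive matching is an assumption. It only shows why a request without a lane ends up on `magatamallm`.

```typescript
// Hypothetical sketch, not the real backend: lane ids are taken from the
// verified status output in this note.
const KNOWN_LANES = ["magatamallm", "fo_blogllm", "tip_llm"] as const;
type Lane = (typeof KNOWN_LANES)[number];

// Assumption: the default lane is `magatamallm`, as described in point 2.
const DEFAULT_LANE: Lane = "magatamallm";

// Maps a raw `lane` query value to a known lane.
// A missing or unknown value falls back to the default lane, which is
// exactly what happened when the modal sent no lane at all.
function resolveLane(raw: string | undefined): Lane {
  const normalized = (raw ?? "").trim().toLowerCase();
  const match = KNOWN_LANES.find((lane) => lane === normalized);
  return match ?? DEFAULT_LANE;
}
```

With this shape, `resolveLane(undefined)` yields `magatamallm`, while `resolveLane("FO_BlogLLM")` yields `fo_blogllm` once the modal actually passes the selected lane.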
## Fix
### Lane-aware status
`/api/llm/status` now accepts the selected lane and returns lane-specific training metadata.
The training modal was updated to:
- fetch `/api/llm/status?lane=<selected lane>`
- update title and runtime text per lane
- show lane-specific:
- manifest path
- train/eval/total counts
- dataset source
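The client-side change can be sketched as follows. This is a minimal illustration, not the modal's actual code: `statusPath`, `LaneStatus`, and `summaryLine` are hypothetical names, and the flat response shape is an assumption based on the field names listed under Live Verification.

```typescript
type Lane = "magatamallm" | "fo_blogllm" | "tip_llm";

// Assumed (flat) shape of the fields the modal rewrites per lane.
interface LaneStatus {
  datasetSource: string;
  collectionsPath: string;
  collectedExamples: number;
  evalExamples: number;
  totalExamples: number;
}

// Builds the status path. Without a lane this is the old, lane-less request.
function statusPath(lane?: Lane): string {
  const base = "/api/llm/status";
  return lane ? `${base}?lane=${encodeURIComponent(lane)}` : base;
}

// Formats the lane-specific values the modal would display.
function summaryLine(lane: Lane, status: LaneStatus): string {
  return (
    `${lane}: source=${status.datasetSource}, ` +
    `train/eval/total=${status.collectedExamples}/` +
    `${status.evalExamples}/${status.totalExamples}`
  );
}
```

The modal would request `statusPath(selectedLane)` each time the operator switches lanes and re-render from the returned values, instead of fetching once without a lane.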
### URL dataset mode
The live dashboard environment on Erik was updated through `ecosystem.config.cjs`:
- `RUNPOD_DATASET_SOURCE=url`
- `RUNPOD_DATASET_SOURCE_MAGATAMALLM=url`
- `RUNPOD_DATASET_SOURCE_FO_BLOGLLM=url`
- `RUNPOD_DATASET_SOURCE_TIP_LLM=url`
- `MAGATAMA_PUBLIC_BASE_URL=https://magatama.fichtmueller.org`
Then `magatama-dashboard` was restarted with `--update-env`.
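How the per-lane variables might be resolved can be sketched like this. The lookup order (lane-specific variable first, then the global `RUNPOD_DATASET_SOURCE`) is an assumption and is not confirmed by this note; `datasetSourceFor` is a hypothetical helper, and the environment is passed in as a plain object so the sketch stays self-contained.

```typescript
type Env = Record<string, string | undefined>;

// Assumption: a lane-specific variable overrides the global one.
// The lane id is upper-cased to form the suffix, e.g.
// `fo_blogllm` -> `RUNPOD_DATASET_SOURCE_FO_BLOGLLM`.
function datasetSourceFor(lane: string, env: Env): string | undefined {
  const laneKey = `RUNPOD_DATASET_SOURCE_${lane.toUpperCase()}`;
  return env[laneKey] ?? env["RUNPOD_DATASET_SOURCE"];
}

// Values as set through `ecosystem.config.cjs` on Erik.
const erikEnv: Env = {
  RUNPOD_DATASET_SOURCE: "url",
  RUNPOD_DATASET_SOURCE_MAGATAMALLM: "url",
  RUNPOD_DATASET_SOURCE_FO_BLOGLLM: "url",
  RUNPOD_DATASET_SOURCE_TIP_LLM: "url",
};

const sources = ["magatamallm", "fo_blogllm", "tip_llm"].map(
  (lane) => `${lane}=${datasetSourceFor(lane, erikEnv) ?? "unset"}`
);
```

Setting both the global and the per-lane variables, as done here, makes the result independent of which lookup order the dashboard actually uses.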
## Live Verification
Verified directly on Erik through:
- `http://127.0.0.1:3211/api/llm/status?lane=fo_blogllm`
- `http://127.0.0.1:3211/api/llm/status?lane=tip_llm`
### `fo_blogllm`
- `datasetSource = url`
- `collectionsPath = /opt/magatama/training-data/runpod/fo_blogllm/manifest.json`
- `trainFile = /opt/magatama/training-data/runpod/fo_blogllm/fo_blogllm-sft-train.jsonl`
- `validFile = /opt/magatama/training-data/runpod/fo_blogllm/fo_blogllm-sft-eval.jsonl`
- `collectedExamples = 28`
- `evalExamples = 4`
- `totalExamples = 32`
### `tip_llm`
- `datasetSource = url`
- `collectionsPath = /opt/magatama/training-data/runpod/tip_llm/manifest.json`
- `trainFile = /opt/magatama/training-data/runpod/tip_llm/tip_llm-sft-train.jsonl`
- `validFile = /opt/magatama/training-data/runpod/tip_llm/tip_llm-sft-eval.jsonl`
- `collectedExamples = 36`
- `evalExamples = 4`
- `totalExamples = 40`
### `magatamallm`
Still correctly shows the larger lane export:
- `collectedExamples = 15620`
- `evalExamples = 1736`
- `totalExamples = 17356`
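As a quick sanity check on the values above, the reported counts can be tested for internal consistency. This is an illustrative sketch, not part of the dashboard: it assumes `collectedExamples` is the train split (it matches the `train` counts recorded in the sync state) and that `totalExamples` should equal train plus eval.

```typescript
interface LaneCounts {
  lane: string;
  collectedExamples: number; // train split
  evalExamples: number;
  totalExamples: number;
}

// Counts copied from the live verification in this note.
const reported: LaneCounts[] = [
  { lane: "fo_blogllm", collectedExamples: 28, evalExamples: 4, totalExamples: 32 },
  { lane: "tip_llm", collectedExamples: 36, evalExamples: 4, totalExamples: 40 },
  { lane: "magatamallm", collectedExamples: 15620, evalExamples: 1736, totalExamples: 17356 },
];

// True when train + eval adds up to the reported total.
function countsAreConsistent(c: LaneCounts): boolean {
  return c.collectedExamples + c.evalExamples === c.totalExamples;
}

// Lanes whose totals do not add up; empty for the values above.
const inconsistentLanes = reported
  .filter((c) => !countsAreConsistent(c))
  .map((c) => c.lane);
```

All three lanes pass this check, and the three lanes report distinct totals, which is consistent with each lane now reading its own pool.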
## Operational Meaning
MAGATAMA training is now materially closer to the intended fully automated flow:
- each LLM lane shows and uses its own pool
- RunPod dataset preparation no longer requires Hugging Face dataset publication
- dataset fetch comes from MAGATAMA URL-bundle / lane export
This removes one major manual/external blocker from the RunPod training path.
## Remaining Truth
This fix does **not** prove that every RunPod worker run succeeds end-to-end.
What is fixed:
- lane-specific training pool selection
- lane-specific UI/status
- URL dataset source activation
What still depends on RunPod worker behavior:
- real successful training execution
- durable model artifact production
- artifact adoption after completion