transceiver-db/sync/history/2026-05-06-magatama-lane-specific-training-pools-and-url-runpod.md

# MAGATAMA Lane-Specific Training Pools + URL RunPod Dataset Mode
Date: 2026-05-06
Author: Codex
## Problem
The MAGATAMA training modal still showed the `magatamallm` pool even when the operator selected:
- `FO_BlogLLM`
- `TIP_LLM`
As a result, the UI implied that all training lanes reused the same pool and counts.
Separately, RunPod launches still depended on Hugging Face dataset publication unless the dataset source was explicitly changed.
## Root Cause
1. The training modal fetched `/api/llm/status` without a lane parameter.
2. The backend status route therefore always returned the default `magatamallm` training corpus/lane.
3. The dashboard environment on Erik was still effectively using Hugging Face as the RunPod dataset source.
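The fallback described in points 1 and 2 can be illustrated with a minimal sketch. The names `DEFAULT_LANE` and `laneFromQuery` are hypothetical and are not taken from the MAGATAMA backend; the sketch only shows why a request without a lane parameter always resolves to `magatamallm`.

```typescript
// Hypothetical sketch: not the actual MAGATAMA route handler.
// Assumption: the backend treats `magatamallm` as the default lane.
const DEFAULT_LANE = "magatamallm";

// Resolve the lane from parsed query parameters.
// A missing or blank `lane` value falls back to the default lane,
// which is the behavior the modal triggered by omitting the parameter.
function laneFromQuery(query: Record<string, string | undefined>): string {
  const lane = query.lane?.trim().toLowerCase();
  return lane ? lane : DEFAULT_LANE;
}
```

Under this assumption, `laneFromQuery({})` yields `magatamallm` regardless of what the operator selected in the UI, while `laneFromQuery({ lane: "tip_llm" })` yields `tip_llm`.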
## Fix
### Lane-aware status
`/api/llm/status` now accepts the selected lane and returns lane-specific training metadata.
The training modal was updated to:
- fetch `/api/llm/status?lane=<selected lane>`
- update title and runtime text per lane
- show lane-specific:
  - manifest path
  - train/eval/total counts
  - dataset source
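A rough sketch of the client-side change, assuming the lane identifier is simply the lowercased UI label (this matches `FO_BlogLLM` / `fo_blogllm` and `TIP_LLM` / `tip_llm` in this note, but the real mapping may differ). `laneIdFromLabel` and `buildStatusPath` are hypothetical helper names, not functions from the dashboard code.

```typescript
// Hypothetical helpers: illustrative only, not the dashboard's actual code.

// Assumption: the lane id used by the API is the lowercased UI label.
function laneIdFromLabel(label: string): string {
  return label.trim().toLowerCase();
}

// Build the lane-aware status path requested by the training modal.
function buildStatusPath(lane: string): string {
  return `/api/llm/status?lane=${encodeURIComponent(lane)}`;
}
```

For example, `buildStatusPath(laneIdFromLabel("FO_BlogLLM"))` produces `/api/llm/status?lane=fo_blogllm`, which the modal can pass to `fetch` whenever the operator changes the selected lane.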
### URL dataset mode
The live dashboard environment on Erik was updated through `ecosystem.config.cjs`:
- `RUNPOD_DATASET_SOURCE=url`
- `RUNPOD_DATASET_SOURCE_MAGATAMALLM=url`
- `RUNPOD_DATASET_SOURCE_FO_BLOGLLM=url`
- `RUNPOD_DATASET_SOURCE_TIP_LLM=url`
- `MAGATAMA_PUBLIC_BASE_URL=https://magatama.fichtmueller.org`
Then `magatama-dashboard` was restarted with `--update-env`.
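The variable names above suggest a per-lane override on top of a global setting. The following is a minimal sketch of how such a lookup could work; the lookup order (lane-specific first, then global) and the helper names `datasetSourceEnvKey` and `resolveDatasetSource` are assumptions, not confirmed dashboard behavior.

```typescript
// Hypothetical sketch: the real resolution logic in the dashboard may differ.

// Derive the per-lane variable name, e.g. fo_blogllm -> RUNPOD_DATASET_SOURCE_FO_BLOGLLM.
function datasetSourceEnvKey(lane: string): string {
  return `RUNPOD_DATASET_SOURCE_${lane.toUpperCase()}`;
}

// Assumed order: lane-specific variable first, then the global RUNPOD_DATASET_SOURCE.
// Returns undefined when neither variable is set.
function resolveDatasetSource(
  lane: string,
  env: Record<string, string | undefined>,
): string | undefined {
  return env[datasetSourceEnvKey(lane)] ?? env["RUNPOD_DATASET_SOURCE"];
}
```

With the values listed above, every lane resolves to `url` either way; the per-lane variables would only matter if one lane had to diverge from the global setting.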
## Live Verification
Verified directly on Erik through:
- `http://127.0.0.1:3211/api/llm/status?lane=fo_blogllm`
- `http://127.0.0.1:3211/api/llm/status?lane=tip_llm`
### `fo_blogllm`
- `datasetSource = url`
- `collectionsPath = /opt/magatama/training-data/runpod/fo_blogllm/manifest.json`
- `trainFile = /opt/magatama/training-data/runpod/fo_blogllm/fo_blogllm-sft-train.jsonl`
- `validFile = /opt/magatama/training-data/runpod/fo_blogllm/fo_blogllm-sft-eval.jsonl`
- `collectedExamples = 28`
- `evalExamples = 4`
- `totalExamples = 32`
### `tip_llm`
- `datasetSource = url`
- `collectionsPath = /opt/magatama/training-data/runpod/tip_llm/manifest.json`
- `trainFile = /opt/magatama/training-data/runpod/tip_llm/tip_llm-sft-train.jsonl`
- `validFile = /opt/magatama/training-data/runpod/tip_llm/tip_llm-sft-eval.jsonl`
- `collectedExamples = 36`
- `evalExamples = 4`
- `totalExamples = 40`
### `magatamallm`
Still correctly shows the larger lane export:
- `collectedExamples = 15620`
- `evalExamples = 1736`
- `totalExamples = 17356`
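The reported counts can be sanity-checked with a small sketch. It assumes that `totalExamples` is the sum of `collectedExamples` and `evalExamples`; that relation is inferred from the numbers above, not stated by the API. `LaneCounts` and `countsAreConsistent` are hypothetical names.

```typescript
// Hypothetical check: field names come from the status output above,
// the sum relation is an inference from the reported values.
interface LaneCounts {
  collectedExamples: number;
  evalExamples: number;
  totalExamples: number;
}

function countsAreConsistent(counts: LaneCounts): boolean {
  return counts.collectedExamples + counts.evalExamples === counts.totalExamples;
}

// Values as reported in the live verification.
const reportedCounts: Record<string, LaneCounts> = {
  fo_blogllm: { collectedExamples: 28, evalExamples: 4, totalExamples: 32 },
  tip_llm: { collectedExamples: 36, evalExamples: 4, totalExamples: 40 },
  magatamallm: { collectedExamples: 15620, evalExamples: 1736, totalExamples: 17356 },
};
```

All three reported lanes satisfy the check, so a failing result on a later status response would point to a manifest or export mismatch worth investigating.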
## Operational Meaning
MAGATAMA training is now materially closer to the intended fully automated flow:
- each LLM lane shows and uses its own pool
- RunPod dataset preparation no longer requires Hugging Face dataset publication
- the dataset is fetched from the MAGATAMA URL bundle / lane export
This removes one major manual/external blocker from the RunPod training path.
## Remaining Truth
This fix does **not** by itself prove that every RunPod worker run succeeds end-to-end.
What is fixed:
- lane-specific training pool selection
- lane-specific UI/status
- URL dataset source activation
What still depends on RunPod worker behavior:
- real successful training execution
- durable model artifact production
- artifact adoption after completion