# MAGATAMA Lane-Specific Training Pools + URL RunPod Dataset Mode

Date: 2026-05-06
Author: Codex

## Problem

The MAGATAMA training modal still showed the `magatamallm` pool even when the operator selected one of the other lanes:

- `FO_BlogLLM`
- `TIP_LLM`

As a result, the UI implied that all training lanes reused the same pool and counts.

At the same time, RunPod launches still depended on Hugging Face dataset publication unless the dataset source was explicitly changed.

## Root Cause

1. The training modal fetched `/api/llm/status` without a lane parameter.
2. The backend status route therefore always returned the default `magatamallm` training corpus/lane.
3. The dashboard environment on Erik was still effectively using the Hugging Face dataset path as the RunPod dataset source.

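The first two points can be illustrated with a minimal sketch of how a status route might resolve its lane. This is not the actual MAGATAMA backend code: `DEFAULT_LANE`, `KNOWN_LANES`, and `resolveLane` are hypothetical names, and the lowercase normalization is an assumption. It only shows why a request without a `lane` parameter ends up on `magatamallm`.

```javascript
// Hypothetical sketch: none of these names come from the MAGATAMA codebase.
const DEFAULT_LANE = 'magatamallm';
const KNOWN_LANES = ['magatamallm', 'fo_blogllm', 'tip_llm'];

// Resolve the lane for a status request from its query string.
// A missing or unrecognized `lane` value falls back to the default lane,
// which is the behavior the modal triggered by sending no parameter at all.
function resolveLane(queryString) {
  const requested = new URLSearchParams(queryString).get('lane');
  if (requested === null) {
    return DEFAULT_LANE;
  }
  const normalized = requested.trim().toLowerCase();
  return KNOWN_LANES.includes(normalized) ? normalized : DEFAULT_LANE;
}
```

Whether an unrecognized lane should fall back silently or be rejected with an error is a design choice; the sketch picks the silent fallback only because it mirrors the observed behavior.
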
## Fix

### Lane-aware status

`/api/llm/status` now accepts the selected lane and returns lane-specific training metadata.

The training modal was updated to:

- fetch `/api/llm/status?lane=<selected lane>`
- update title and runtime text per lane
- show lane-specific:
  - manifest path
  - train/eval/total counts
  - dataset source

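On the client side, the lane-aware request could look roughly like the following. This is a sketch under assumptions rather than the modal's real code: `buildStatusUrl` and `loadLaneStatus` are hypothetical helper names, and the response is assumed to be JSON. Only the route and the `lane` query parameter come from the notes above.

```javascript
// Hypothetical helpers; only the route and the `lane` parameter are from the notes.

// Build the status URL for one lane, with the lane value URL-encoded.
function buildStatusUrl(lane) {
  const params = new URLSearchParams({ lane });
  return `/api/llm/status?${params.toString()}`;
}

// Fetch lane-specific status. `fetchImpl` defaults to the global fetch
// (browsers, Node 18+) and can be swapped out in tests.
async function loadLaneStatus(lane, fetchImpl = fetch) {
  const response = await fetchImpl(buildStatusUrl(lane));
  if (!response.ok) {
    throw new Error(`status request for lane "${lane}" failed: ${response.status}`);
  }
  return response.json(); // assumed to be a JSON body
}
```

The modal would call `loadLaneStatus` again whenever the operator changes the selected lane, then re-render the title, counts, and dataset source from the result.
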
### URL dataset mode

The live dashboard environment on Erik was updated through `ecosystem.config.cjs`:

- `RUNPOD_DATASET_SOURCE=url`
- `RUNPOD_DATASET_SOURCE_MAGATAMALLM=url`
- `RUNPOD_DATASET_SOURCE_FO_BLOGLLM=url`
- `RUNPOD_DATASET_SOURCE_TIP_LLM=url`
- `MAGATAMA_PUBLIC_BASE_URL=https://magatama.fichtmueller.org`

Then `magatama-dashboard` was restarted with `--update-env`.

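The variable names suggest a per-lane override (`RUNPOD_DATASET_SOURCE_<LANE>`) on top of a global default (`RUNPOD_DATASET_SOURCE`). These notes do not state the actual precedence, so the lookup below is an assumption, and `resolveDatasetSource` is a hypothetical name. It is meant only as a sketch of how such settings are commonly read.

```javascript
// Values exactly as listed above; in production they would come from process.env.
const dashboardEnv = {
  RUNPOD_DATASET_SOURCE: 'url',
  RUNPOD_DATASET_SOURCE_MAGATAMALLM: 'url',
  RUNPOD_DATASET_SOURCE_FO_BLOGLLM: 'url',
  RUNPOD_DATASET_SOURCE_TIP_LLM: 'url',
  MAGATAMA_PUBLIC_BASE_URL: 'https://magatama.fichtmueller.org',
};

// Assumed precedence: lane-specific variable first, then the global default.
// Returns undefined when neither variable is set.
function resolveDatasetSource(lane, environment) {
  const laneKey = `RUNPOD_DATASET_SOURCE_${lane.toUpperCase()}`;
  return environment[laneKey] ?? environment.RUNPOD_DATASET_SOURCE;
}
```

With the values above, every lane resolves to `url`. Setting all three lane-specific variables in addition to the global one keeps the result unambiguous whichever precedence the real code applies.
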
## Live Verification

Verified directly on Erik through:

- `http://127.0.0.1:3211/api/llm/status?lane=fo_blogllm`
- `http://127.0.0.1:3211/api/llm/status?lane=tip_llm`

### `fo_blogllm`

- `datasetSource = url`
- `collectionsPath = /opt/magatama/training-data/runpod/fo_blogllm/manifest.json`
- `trainFile = /opt/magatama/training-data/runpod/fo_blogllm/fo_blogllm-sft-train.jsonl`
- `validFile = /opt/magatama/training-data/runpod/fo_blogllm/fo_blogllm-sft-eval.jsonl`
- `collectedExamples = 28`
- `evalExamples = 4`
- `totalExamples = 32`

### `tip_llm`

- `datasetSource = url`
- `collectionsPath = /opt/magatama/training-data/runpod/tip_llm/manifest.json`
- `trainFile = /opt/magatama/training-data/runpod/tip_llm/tip_llm-sft-train.jsonl`
- `validFile = /opt/magatama/training-data/runpod/tip_llm/tip_llm-sft-eval.jsonl`
- `collectedExamples = 36`
- `evalExamples = 4`
- `totalExamples = 40`

### `magatamallm`

The default lane still correctly shows the larger lane export:

- `collectedExamples = 15620`
- `evalExamples = 1736`
- `totalExamples = 17356`

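In all three lanes the recorded values satisfy `collectedExamples + evalExamples = totalExamples`. The sketch below re-checks that relationship so the verification can be repeated after future exports. It assumes the relationship is intended, and that the status response exposes these three fields at the top level of one object. `findCountProblems` is a hypothetical helper name.

```javascript
// Counts as recorded in this note, keyed by lane.
const recordedCounts = {
  fo_blogllm: { collectedExamples: 28, evalExamples: 4, totalExamples: 32 },
  tip_llm: { collectedExamples: 36, evalExamples: 4, totalExamples: 40 },
  magatamallm: { collectedExamples: 15620, evalExamples: 1736, totalExamples: 17356 },
};

// Return a list of human-readable problems; an empty list means the counts are consistent.
function findCountProblems(lane, status) {
  const problems = [];
  const { collectedExamples, evalExamples, totalExamples } = status;
  const fields = { collectedExamples, evalExamples, totalExamples };
  for (const [name, value] of Object.entries(fields)) {
    if (!Number.isInteger(value) || value < 0) {
      problems.push(`${lane}: ${name} is not a non-negative integer`);
    }
  }
  // Only compare the sum once all three fields are known to be valid numbers.
  if (problems.length === 0 && collectedExamples + evalExamples !== totalExamples) {
    problems.push(`${lane}: ${collectedExamples} + ${evalExamples} != ${totalExamples}`);
  }
  return problems;
}
```

Running `findCountProblems` over each entry of `recordedCounts` returns an empty list for every lane. Against a live dashboard, the same function could be applied to the parsed response of each `/api/llm/status?lane=...` request.
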
## Operational Meaning

MAGATAMA training is now materially closer to the intended fully automated flow:

- each LLM lane shows and uses its own pool
- RunPod dataset preparation no longer requires Hugging Face dataset publication
- datasets are fetched from the MAGATAMA URL bundle / lane export

This removes one major manual/external blocker from the RunPod training path.

## Remaining Truth

This fix does **not** prove that a RunPod worker run succeeds end-to-end.

What is fixed:

- lane-specific training pool selection
- lane-specific UI/status
- URL dataset source activation

What still depends on RunPod worker behavior:

- real successful training execution
- durable model artifact production
- artifact adoption after completion