rene/transceiver-db

Rene Fichtmueller b9a45f9f23 sync: record lane-specific training pools and url runpod mode

2026-05-06 17:55:20 +02:00

3.2 KiB

MAGATAMA Lane-Specific Training Pools + URL RunPod Dataset Mode

Date: 2026-05-06

Author: Codex

Problem

The MAGATAMA training modal still showed the magatamallm pool even when the operator selected:

FO_BlogLLM
TIP_LLM

As a result, the UI implied that all training lanes reused the same pool and counts.

At the same time, RunPod launches still depended on Hugging Face dataset publication unless the dataset source was explicitly overridden.

Root Cause

The training modal fetched:

/api/llm/status

without a lane parameter.

The backend status route therefore always returned the default magatamallm training corpus/lane.

Separately, the dashboard environment on Erik was still effectively using the Hugging Face dataset path as the RunPod dataset source.
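
The first cause can be illustrated with a minimal sketch of a status route that falls back to a default lane when the query string carries no lane parameter. This is not the actual backend code: the function name `lane_from_request` and the constant `DEFAULT_LANE` are hypothetical, and only the route path and the `magatamallm` default come from this note.

```python
from urllib.parse import parse_qs, urlsplit

# Default lane as described in this note; the constant name is hypothetical.
DEFAULT_LANE = "magatamallm"


def lane_from_request(url: str) -> str:
    """Return the lane named in the query string, or the default lane."""
    query = parse_qs(urlsplit(url).query)
    values = query.get("lane", [])
    # A request without ?lane=... silently resolves to the default lane,
    # which is the behavior the training modal ran into.
    return values[0] if values else DEFAULT_LANE


print(lane_from_request("/api/llm/status"))
print(lane_from_request("/api/llm/status?lane=tip_llm"))
```

With a fallback of this kind, a client that omits the parameter gets a valid-looking response for the wrong lane rather than an error, which is why the mismatch only showed up in the UI.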

Fix

Lane-aware status

/api/llm/status now accepts the selected lane and returns lane-specific training metadata.

The training modal was updated to:

fetch /api/llm/status?lane=<selected lane>
update title and runtime text per lane
show lane-specific:
- manifest path
- train/eval/total counts
- dataset source
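
As a rough sketch of the client-side change, the lane-aware request URL could be built as follows. The helper name `status_url` is hypothetical, and the rule that the API lane id is the lowercased UI label (FO_BlogLLM to fo_blogllm) is an assumption inferred from the lane names in this note, not confirmed by the dashboard code.

```python
from urllib.parse import urlencode

STATUS_PATH = "/api/llm/status"


def status_url(selected_lane: str) -> str:
    """Build the status URL for the lane selected in the training modal."""
    # Assumption: the API lane id is the lowercased UI label.
    lane_id = selected_lane.strip().lower()
    return f"{STATUS_PATH}?{urlencode({'lane': lane_id})}"


for label in ("FO_BlogLLM", "TIP_LLM"):
    print(status_url(label))
```

Using `urlencode` rather than string concatenation keeps the query valid if a lane id ever contains characters that need escaping.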

URL dataset mode

The live dashboard environment on Erik was updated through ecosystem.config.cjs:

RUNPOD_DATASET_SOURCE=url
RUNPOD_DATASET_SOURCE_MAGATAMALLM=url
RUNPOD_DATASET_SOURCE_FO_BLOGLLM=url
RUNPOD_DATASET_SOURCE_TIP_LLM=url
MAGATAMA_PUBLIC_BASE_URL=https://magatama.fichtmueller.org

Then magatama-dashboard was restarted with --update-env.
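
One plausible way these variables interact is a per-lane override that falls back to the global setting. The sketch below assumes that precedence; the function name `dataset_source` and the fallback value `"huggingface"` are hypothetical, and the real dashboard may resolve them differently.

```python
import os
from typing import Mapping, Optional


def dataset_source(
    lane: str,
    env: Optional[Mapping[str, str]] = None,
    fallback: str = "huggingface",  # hypothetical value for the old default
) -> str:
    """Resolve the RunPod dataset source for a lane.

    Assumed precedence: lane-specific variable, then the global
    RUNPOD_DATASET_SOURCE, then the fallback.
    """
    env = os.environ if env is None else env
    lane_key = f"RUNPOD_DATASET_SOURCE_{lane.upper()}"
    return env.get(lane_key) or env.get("RUNPOD_DATASET_SOURCE") or fallback


live_env = {
    "RUNPOD_DATASET_SOURCE": "url",
    "RUNPOD_DATASET_SOURCE_FO_BLOGLLM": "url",
}
print(dataset_source("fo_blogllm", live_env))
print(dataset_source("tip_llm", {}))
```

Setting both the global and the per-lane variables, as done above on Erik, makes the result independent of which precedence the dashboard actually applies.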

Live Verification

Verified directly on Erik through:

http://127.0.0.1:3211/api/llm/status?lane=fo_blogllm
http://127.0.0.1:3211/api/llm/status?lane=tip_llm

`fo_blogllm`

datasetSource = url
collectionsPath = /opt/magatama/training-data/runpod/fo_blogllm/manifest.json
trainFile = /opt/magatama/training-data/runpod/fo_blogllm/fo_blogllm-sft-train.jsonl
validFile = /opt/magatama/training-data/runpod/fo_blogllm/fo_blogllm-sft-eval.jsonl
collectedExamples = 28
evalExamples = 4
totalExamples = 32

`tip_llm`

datasetSource = url
collectionsPath = /opt/magatama/training-data/runpod/tip_llm/manifest.json
trainFile = /opt/magatama/training-data/runpod/tip_llm/tip_llm-sft-train.jsonl
validFile = /opt/magatama/training-data/runpod/tip_llm/tip_llm-sft-eval.jsonl
collectedExamples = 36
evalExamples = 4
totalExamples = 40

`magatamallm`

Still correctly shows the larger lane export:

collectedExamples = 15620
evalExamples = 1736
totalExamples = 17356
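
The values above can be sanity-checked mechanically. The sketch below assumes the status response is a flat JSON object with exactly the field names listed in this section, which is not confirmed here; the helper name `status_problems` is hypothetical. It checks that the dataset source matches and that train plus eval counts add up to the total.

```python
from typing import Any, Dict, List, Optional


def status_problems(
    payload: Dict[str, Any], expected_source: Optional[str] = "url"
) -> List[str]:
    """Return a list of problems found in a lane status payload."""
    problems: List[str] = []
    # Skip the source check when no expected source is given.
    if expected_source is not None and payload.get("datasetSource") != expected_source:
        problems.append(f"datasetSource is {payload.get('datasetSource')!r}")
    counts = [payload.get(k) for k in ("collectedExamples", "evalExamples", "totalExamples")]
    if any(c is None for c in counts):
        problems.append("missing example counts")
    elif counts[0] + counts[1] != counts[2]:
        problems.append("collectedExamples + evalExamples != totalExamples")
    return problems


fo_blogllm = {"datasetSource": "url", "collectedExamples": 28,
              "evalExamples": 4, "totalExamples": 32}
magatamallm = {"collectedExamples": 15620, "evalExamples": 1736,
               "totalExamples": 17356}
print(status_problems(fo_blogllm))
print(status_problems(magatamallm, expected_source=None))
```

In practice the payload would come from the two localhost URLs listed above, for example via `json.load(urllib.request.urlopen(url))`.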

Operational Meaning

MAGATAMA training is now materially closer to the intended fully automated flow:

each LLM lane shows and uses its own pool
RunPod dataset preparation no longer requires Hugging Face dataset publication
datasets are fetched from the MAGATAMA URL bundle / lane export

This removes one major manual/external blocker from the RunPod training path.

Remaining Truth

This fix does not by itself prove that every RunPod worker run succeeds end-to-end.

What is fixed:

lane-specific training pool selection
lane-specific UI/status
URL dataset source activation

What still depends on RunPod worker behavior:

real successful training execution
durable model artifact production
artifact adoption after completion
