sync: record lane-specific training pools and url runpod mode
parent 830ab57c3c
commit b9a45f9f23
@@ -1,6 +1,6 @@
 # Current TIP Sync State
 
-Updated: 2026-05-06 15:24 UTC
+Updated: 2026-05-06 15:48 UTC
 
 ## Active Policy
 
@@ -65,6 +65,40 @@ When work touches TIP, Magatama, LLM Gateway, bridges, auth, or shared Erik infr
 - `generatedAt = 2026-05-06T15:18:42.708Z`
 - latest visible entries include `2026-04-30` items again instead of appearing frozen at `30.05`
 
+- MAGATAMA lane-specific training pools and RunPod dataset automation were finished on 2026-05-06:
+  - root cause:
+    - the training modal always fetched `/api/llm/status` without a lane, so `FO_BlogLLM` and `TIP_LLM` still showed the `magatamallm` pool.
+  - dashboard/server were updated so `/api/llm/status?lane=...` is now truly lane-aware.
+  - the training modal now refreshes per selected lane and rewrites:
+    - title
+    - runtime label
+    - pool path
+    - counts
+    - dataset source
+  - MAGATAMA dashboard env on Erik was switched to URL dataset mode for all lanes via `ecosystem.config.cjs`:
+    - `RUNPOD_DATASET_SOURCE=url`
+    - `RUNPOD_DATASET_SOURCE_MAGATAMALLM=url`
+    - `RUNPOD_DATASET_SOURCE_FO_BLOGLLM=url`
+    - `RUNPOD_DATASET_SOURCE_TIP_LLM=url`
+  - live verified on Erik after restart:
+    - `fo_blogllm`
+      - `datasetSource = url`
+      - `collectionsPath = /opt/magatama/training-data/runpod/fo_blogllm/manifest.json`
+      - `train = 28`
+      - `eval = 4`
+      - `total = 32`
+    - `tip_llm`
+      - `datasetSource = url`
+      - `collectionsPath = /opt/magatama/training-data/runpod/tip_llm/manifest.json`
+      - `train = 36`
+      - `eval = 4`
+      - `total = 40`
+    - `magatamallm`
+      - remains on lane-export counts (`15620 / 1736 / 17356`)
+  - operator impact:
+    - no Hugging Face dataset publish is required anymore for MAGATAMA RunPod launches.
+    - every supported LLM lane now points to its own local/Gitea-backed lane export instead of reusing `magatamallm`.
+
 - MAGATAMA training + Attack Paths + Atlas exposure were corrected again on 2026-05-06:
   - the RunPod serverless training start failure was not a RunPod outage.
   - root cause was missing training scripts on Erik (`training_full_refresh.ts` and related helpers were absent under `/opt/magatama/scripts`).

@@ -0,0 +1,115 @@
# MAGATAMA Lane-Specific Training Pools + URL RunPod Dataset Mode

Date: 2026-05-06
Author: Codex

## Problem

The MAGATAMA training modal still showed the `magatamallm` pool even when the operator selected:

- `FO_BlogLLM`
- `TIP_LLM`

As a result, the UI implied that all training lanes reused the same pool and counts.

At the same time, RunPod launches still depended on Hugging Face dataset publication unless explicitly changed.

## Root Cause

1. The training modal fetched:

   - `/api/llm/status`

   without a lane parameter.

2. The backend status route therefore always returned the default `magatamallm` training corpus/lane.

3. Dashboard env on Erik was still effectively using the Hugging Face dataset path for RunPod dataset source.

## Fix

### Lane-aware status

`/api/llm/status` now accepts the selected lane and returns lane-specific training metadata.

The training modal was updated to:

- fetch `/api/llm/status?lane=<selected lane>`
- update title and runtime text per lane
- show lane-specific:
  - manifest path
  - train/eval/total counts
  - dataset source
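
The lane-aware request described above can be sketched as follows. This is a minimal illustration, not the dashboard's actual code: the helper names `statusUrl` and `fetchLaneStatus`, the `Lane` type, and the base-URL handling are assumptions; only the `/api/llm/status` path and the `lane` query parameter come from this note. The response shape is not specified here, so it is left as `unknown`.

```typescript
// Lane identifiers as they appear in the verified status URLs of this note.
// The union type itself is a hypothetical convenience.
type Lane = "magatamallm" | "fo_blogllm" | "tip_llm";

// Build the status URL for one lane (hypothetical helper).
// Omitting the `lane` parameter is what caused the modal to always
// show the default `magatamallm` pool.
function statusUrl(baseUrl: string, lane: Lane): string {
  const url = new URL("/api/llm/status", baseUrl);
  url.searchParams.set("lane", lane);
  return url.toString();
}

// Fetch the lane-specific status (hypothetical helper).
// Relies on the global `fetch` available in browsers and recent Node.js.
async function fetchLaneStatus(baseUrl: string, lane: Lane): Promise<unknown> {
  const res = await fetch(statusUrl(baseUrl, lane));
  if (!res.ok) {
    throw new Error(`status request failed with HTTP ${res.status}`);
  }
  return res.json();
}
```

A modal would call `fetchLaneStatus` again whenever the selected lane changes and then rewrite its title, runtime text, manifest path, counts, and dataset source from the result.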

### URL dataset mode

The live dashboard environment on Erik was updated through `ecosystem.config.cjs`:

- `RUNPOD_DATASET_SOURCE=url`
- `RUNPOD_DATASET_SOURCE_MAGATAMALLM=url`
- `RUNPOD_DATASET_SOURCE_FO_BLOGLLM=url`
- `RUNPOD_DATASET_SOURCE_TIP_LLM=url`
- `MAGATAMA_PUBLIC_BASE_URL=https://magatama.fichtmueller.org`

Then `magatama-dashboard` was restarted with `--update-env`.
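
One way to read the variables above is as a global default plus per-lane overrides. The sketch below illustrates that reading only; the function name `resolveDatasetSource` and the precedence rule (lane-specific variable first, then `RUNPOD_DATASET_SOURCE`) are assumptions and are not confirmed by this note. Only the variable names themselves are taken from the list above.

```typescript
// Hypothetical resolver: which dataset source applies to a given lane?
// ASSUMPTION: a lane-specific variable, if set, wins over the global one.
function resolveDatasetSource(
  lane: string,
  env: Record<string, string | undefined>,
): string | undefined {
  // e.g. "tip_llm" -> "RUNPOD_DATASET_SOURCE_TIP_LLM"
  const laneKey = `RUNPOD_DATASET_SOURCE_${lane.toUpperCase()}`;
  return env[laneKey] ?? env["RUNPOD_DATASET_SOURCE"];
}

// The environment as listed in this note.
const erikEnv: Record<string, string | undefined> = {
  RUNPOD_DATASET_SOURCE: "url",
  RUNPOD_DATASET_SOURCE_MAGATAMALLM: "url",
  RUNPOD_DATASET_SOURCE_FO_BLOGLLM: "url",
  RUNPOD_DATASET_SOURCE_TIP_LLM: "url",
};

const sources = ["magatamallm", "fo_blogllm", "tip_llm"].map(
  (lane) => [lane, resolveDatasetSource(lane, erikEnv)] as const,
);
```

Under this reading, setting all four variables is redundant but explicit: every lane resolves to `url` whether or not the per-lane override is consulted.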

## Live Verification

Verified directly on Erik through:

- `http://127.0.0.1:3211/api/llm/status?lane=fo_blogllm`
- `http://127.0.0.1:3211/api/llm/status?lane=tip_llm`

### `fo_blogllm`

- `datasetSource = url`
- `collectionsPath = /opt/magatama/training-data/runpod/fo_blogllm/manifest.json`
- `trainFile = /opt/magatama/training-data/runpod/fo_blogllm/fo_blogllm-sft-train.jsonl`
- `validFile = /opt/magatama/training-data/runpod/fo_blogllm/fo_blogllm-sft-eval.jsonl`
- `collectedExamples = 28`
- `evalExamples = 4`
- `totalExamples = 32`

### `tip_llm`

- `datasetSource = url`
- `collectionsPath = /opt/magatama/training-data/runpod/tip_llm/manifest.json`
- `trainFile = /opt/magatama/training-data/runpod/tip_llm/tip_llm-sft-train.jsonl`
- `validFile = /opt/magatama/training-data/runpod/tip_llm/tip_llm-sft-eval.jsonl`
- `collectedExamples = 36`
- `evalExamples = 4`
- `totalExamples = 40`

### `magatamallm`

Still correctly shows the larger lane export:

- `collectedExamples = 15620`
- `evalExamples = 1736`
- `totalExamples = 17356`
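
The reported counts can be sanity-checked with a short script. This is a sketch under stated assumptions: the field names mirror the values listed above, but the `LaneCounts` interface, the helper `countsAreConsistent`, and the rule that `collectedExamples + evalExamples` should equal `totalExamples` are inferred from the numbers in this note rather than from a documented contract.

```typescript
// Shape inferred from the fields listed in this note (hypothetical type).
interface LaneCounts {
  collectedExamples: number;
  evalExamples: number;
  totalExamples: number;
}

// Values copied from the live verification above.
const reported: Record<string, LaneCounts> = {
  fo_blogllm: { collectedExamples: 28, evalExamples: 4, totalExamples: 32 },
  tip_llm: { collectedExamples: 36, evalExamples: 4, totalExamples: 40 },
  magatamallm: { collectedExamples: 15620, evalExamples: 1736, totalExamples: 17356 },
};

// ASSUMPTION: total is the sum of the train (collected) and eval splits.
function countsAreConsistent(c: LaneCounts): boolean {
  return c.collectedExamples + c.evalExamples === c.totalExamples;
}

// Lanes whose totals do not add up; expected to be empty for the data above.
const inconsistentLanes = Object.entries(reported)
  .filter(([, counts]) => !countsAreConsistent(counts))
  .map(([lane]) => lane);

// Distinct totals per lane indicate the lanes no longer share one pool.
const distinctTotals = new Set(
  Object.values(reported).map((c) => c.totalExamples),
).size;
```

A check of this kind could run after each dashboard restart to catch a regression where every lane silently falls back to the `magatamallm` counts.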

## Operational Meaning

MAGATAMA training is now materially closer to the intended fully automated flow:

- each LLM lane shows and uses its own pool
- RunPod dataset preparation no longer requires Hugging Face dataset publication
- dataset fetch comes from MAGATAMA URL-bundle / lane export

This removes one major manual/external blocker from the RunPod training path.

## Remaining Truth

This fix does **not** automatically prove that every RunPod worker run itself succeeds end-to-end.

What is fixed:

- lane-specific training pool selection
- lane-specific UI/status
- URL dataset source activation

What still depends on RunPod worker behavior:

- real successful training execution
- durable model artifact production
- artifact adoption after completion