diff --git a/sync/CURRENT.md b/sync/CURRENT.md
index 6a6379f..df38802 100644
--- a/sync/CURRENT.md
+++ b/sync/CURRENT.md
@@ -1,6 +1,6 @@
 # Current TIP Sync State
 
-Updated: 2026-05-06 15:24 UTC
+Updated: 2026-05-06 15:48 UTC
 
 ## Active Policy
 
@@ -65,6 +65,40 @@ When work touches TIP, Magatama, LLM Gateway, bridges, auth, or shared Erik infr
     - `generatedAt = 2026-05-06T15:18:42.708Z`
     - latest visible entries include `2026-04-30` items again instead of appearing frozen at `30.05`
 
+- MAGATAMA lane-specific training pools and RunPod dataset automation were finished on 2026-05-06:
+  - root cause:
+    - the training modal always fetched `/api/llm/status` without a lane, so `FO_BlogLLM` and `TIP_LLM` still showed the `magatamallm` pool.
+  - dashboard/server were updated so `/api/llm/status?lane=...` is now truly lane-aware.
+  - the training modal now refreshes per selected lane and rewrites:
+    - title
+    - runtime label
+    - pool path
+    - counts
+    - dataset source
+  - MAGATAMA dashboard env on Erik was switched to URL dataset mode for all lanes via `ecosystem.config.cjs`:
+    - `RUNPOD_DATASET_SOURCE=url`
+    - `RUNPOD_DATASET_SOURCE_MAGATAMALLM=url`
+    - `RUNPOD_DATASET_SOURCE_FO_BLOGLLM=url`
+    - `RUNPOD_DATASET_SOURCE_TIP_LLM=url`
+  - live verified on Erik after restart:
+    - `fo_blogllm`
+      - `datasetSource = url`
+      - `collectionsPath = /opt/magatama/training-data/runpod/fo_blogllm/manifest.json`
+      - `train = 28`
+      - `eval = 4`
+      - `total = 32`
+    - `tip_llm`
+      - `datasetSource = url`
+      - `collectionsPath = /opt/magatama/training-data/runpod/tip_llm/manifest.json`
+      - `train = 36`
+      - `eval = 4`
+      - `total = 40`
+    - `magatamallm`
+      - remains on lane-export counts (`15620 / 1736 / 17356`)
+  - operator impact:
+    - no Hugging Face dataset publish is required anymore for MAGATAMA RunPod launches.
+    - every supported LLM lane now points to its own local/Gitea-backed lane export instead of reusing `magatamallm`.
+
 - MAGATAMA training + Attack Paths + Atlas exposure were corrected again on 2026-05-06:
   - the RunPod serverless training start failure was not a RunPod outage.
   - root cause was missing training scripts on Erik (`training_full_refresh.ts` and related helpers were absent under `/opt/magatama/scripts`).
diff --git a/sync/history/2026-05-06-magatama-lane-specific-training-pools-and-url-runpod.md b/sync/history/2026-05-06-magatama-lane-specific-training-pools-and-url-runpod.md
new file mode 100644
index 0000000..4b0575a
--- /dev/null
+++ b/sync/history/2026-05-06-magatama-lane-specific-training-pools-and-url-runpod.md
@@ -0,0 +1,115 @@
+# MAGATAMA Lane-Specific Training Pools + URL RunPod Dataset Mode
+
+Date: 2026-05-06
+Author: Codex
+
+## Problem
+
+The MAGATAMA training modal still showed the `magatamallm` pool even when the operator selected:
+
+- `FO_BlogLLM`
+- `TIP_LLM`
+
+As a result, the UI implied that all training lanes reused the same pool and counts.
+
+At the same time, RunPod launches still depended on Hugging Face dataset publication unless explicitly changed.
+
+## Root Cause
+
+1. The training modal fetched:
+
+- `/api/llm/status`
+
+without a lane parameter.
+
+2. The backend status route therefore always returned the default `magatamallm` training corpus/lane.
+
+3. Dashboard env on Erik was still effectively using the Hugging Face dataset path for RunPod dataset source.
+
+## Fix
+
+### Lane-aware status
+
+`/api/llm/status` now accepts the selected lane and returns lane-specific training metadata.
+
+The training modal was updated to:
+
+- fetch `/api/llm/status?lane=`
+- update title and runtime text per lane
+- show lane-specific:
+  - manifest path
+  - train/eval/total counts
+  - dataset source
+
+### URL dataset mode
+
+The live dashboard environment on Erik was updated through `ecosystem.config.cjs`:
+
+- `RUNPOD_DATASET_SOURCE=url`
+- `RUNPOD_DATASET_SOURCE_MAGATAMALLM=url`
+- `RUNPOD_DATASET_SOURCE_FO_BLOGLLM=url`
+- `RUNPOD_DATASET_SOURCE_TIP_LLM=url`
+- `MAGATAMA_PUBLIC_BASE_URL=https://magatama.fichtmueller.org`
+
+Then `magatama-dashboard` was restarted with `--update-env`.
+
+## Live Verification
+
+Verified directly on Erik through:
+
+- `http://127.0.0.1:3211/api/llm/status?lane=fo_blogllm`
+- `http://127.0.0.1:3211/api/llm/status?lane=tip_llm`
+
+### `fo_blogllm`
+
+- `datasetSource = url`
+- `collectionsPath = /opt/magatama/training-data/runpod/fo_blogllm/manifest.json`
+- `trainFile = /opt/magatama/training-data/runpod/fo_blogllm/fo_blogllm-sft-train.jsonl`
+- `validFile = /opt/magatama/training-data/runpod/fo_blogllm/fo_blogllm-sft-eval.jsonl`
+- `collectedExamples = 28`
+- `evalExamples = 4`
+- `totalExamples = 32`
+
+### `tip_llm`
+
+- `datasetSource = url`
+- `collectionsPath = /opt/magatama/training-data/runpod/tip_llm/manifest.json`
+- `trainFile = /opt/magatama/training-data/runpod/tip_llm/tip_llm-sft-train.jsonl`
+- `validFile = /opt/magatama/training-data/runpod/tip_llm/tip_llm-sft-eval.jsonl`
+- `collectedExamples = 36`
+- `evalExamples = 4`
+- `totalExamples = 40`
+
+### `magatamallm`
+
+Still correctly shows the larger lane export:
+
+- `collectedExamples = 15620`
+- `evalExamples = 1736`
+- `totalExamples = 17356`
+
+## Operational Meaning
+
+MAGATAMA training is now materially closer to the intended fully automated flow:
+
+- each LLM lane shows and uses its own pool
+- RunPod dataset preparation no longer requires Hugging Face dataset publication
+- dataset fetch comes from MAGATAMA URL-bundle / lane export
+
+This removes one major manual/external blocker from the RunPod training path.
+
+## Remaining Truth
+
+This fix does **not** automatically prove that every RunPod worker run itself succeeds end-to-end.
+
+What is fixed:
+
+- lane-specific training pool selection
+- lane-specific UI/status
+- URL dataset source activation
+
+What still depends on RunPod worker behavior:
+
+- real successful training execution
+- durable model artifact production
+- artifact adoption after completion
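
The lane-aware status check and the per-lane dataset source described in this change could be spot-checked with a small helper along the following lines. This is only a sketch, not the dashboard's actual code: the function names (`statusUrl`, `resolveDatasetSource`, `countsAreConsistent`) and the `LaneStatus` type are hypothetical, the flat response shape is assumed from the field names listed under Live Verification, and the "per-lane variable first, then the global `RUNPOD_DATASET_SOURCE`" fallback order is an assumption that the notes do not confirm.

```typescript
// Sketch only: helper names and the fallback order are assumptions,
// not the MAGATAMA dashboard implementation.

// Field names taken from the "Live Verification" notes; a flat shape is assumed.
type LaneStatus = {
  datasetSource: string;
  collectedExamples: number;
  evalExamples: number;
  totalExamples: number;
};

// Local dashboard address as used in the verification URLs above.
const BASE_URL = "http://127.0.0.1:3211";

// Build the lane-aware status URL instead of the bare /api/llm/status.
function statusUrl(lane: string, baseUrl: string = BASE_URL): string {
  return `${baseUrl}/api/llm/status?lane=${encodeURIComponent(lane)}`;
}

// Assumed resolution order: RUNPOD_DATASET_SOURCE_<LANE> first,
// then the global RUNPOD_DATASET_SOURCE, else undefined.
function resolveDatasetSource(
  env: Record<string, string | undefined>,
  lane: string
): string | undefined {
  const perLaneKey = `RUNPOD_DATASET_SOURCE_${lane.toUpperCase()}`;
  return env[perLaneKey] ?? env["RUNPOD_DATASET_SOURCE"];
}

// Sanity check on a status payload: train + eval should equal total.
function countsAreConsistent(status: LaneStatus): boolean {
  return status.collectedExamples + status.evalExamples === status.totalExamples;
}

// Values copied from the verified `fo_blogllm` lane above.
const foBlogStatus: LaneStatus = {
  datasetSource: "url",
  collectedExamples: 28,
  evalExamples: 4,
  totalExamples: 32,
};

console.log(statusUrl("fo_blogllm"));
console.log(resolveDatasetSource({ RUNPOD_DATASET_SOURCE_FO_BLOGLLM: "url" }, "fo_blogllm"));
console.log(countsAreConsistent(foBlogStatus));
```

In practice the JSON returned by the status route would be passed to `countsAreConsistent` for each lane after a restart; the hard-coded object here only replays the recorded `fo_blogllm` numbers.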