From 61328b060727fe7274b876c590817ed07967c8a5 Mon Sep 17 00:00:00 2001
From: Rene Fichtmueller
Date: Thu, 7 May 2026 01:36:36 +0200
Subject: [PATCH] sync: record lane-specific runpod adoption versioning

---
 sync/CURRENT.md                               |  71 ++++++++
 ...ane-specific-runpod-adoption-versioning.md | 170 ++++++++++++++++++
 2 files changed, 241 insertions(+)
 create mode 100644 sync/history/2026-05-07-magatama-lane-specific-runpod-adoption-versioning.md

diff --git a/sync/CURRENT.md b/sync/CURRENT.md
index f897046..8df72f0 100644
--- a/sync/CURRENT.md
+++ b/sync/CURRENT.md
@@ -27,6 +27,77 @@ When work touches TIP, Magatama, LLM Gateway, bridges, auth, or shared Erik infr
 
 ## Latest Work
 
+- MAGATAMA training automation was hardened locally on 2026-05-07 for all three lanes:
+  - target lanes:
+    - `magatamallm`
+    - `fo_blogllm`
+    - `tip_llm`
+  - core root cause confirmed:
+    - RunPod dataset refresh / lane export already worked
+    - RunPod jobs often reached `COMPLETED`
+    - but model adoption/version truth still depended on a single shared:
+      - `~/magatama-llm/fine-tuning/last_run.json`
+    - this made lane status and successful return/adoption ambiguous across models
+    - the training modal could also collapse late stream/adoption failures into a generic `network error`
+  - local code fixes now in place:
+    - `magatama/packages/fine-tuner/training_api.py`
+      - lane-specific last-run files added:
+        - `~/magatama-llm/fine-tuning/magatamallm-last_run.json`
+        - `~/magatama-llm/fine-tuning/fo_blogllm-last_run.json`
+        - `~/magatama-llm/fine-tuning/tip_llm-last_run.json`
+      - legacy `last_run.json` remains only as backward-compatible mirror for `magatamallm`
+      - successful RunPod adoption now creates:
+        - a release alias per lane, e.g. `-rN`
+      - active alias switching sequence is now:
+        - candidate model imported
+        - smoke-tested
+        - release alias created
+        - stable active alias repointed to that release alias
+      - adoption report now includes:
+        - `version_counter`
+        - `release_alias`
+    - `magatama/packages/fine-tuner/train.py`
+      - local metrics writing now also respects lane-specific last-run files via `TRAINING_LANE`
+    - `magatama/packages/dashboard/src/server.ts`
+      - `/api/llm/status` now reads lane-specific last-run metadata first
+      - `release_alias` is preferred as visible model version when present
+      - RunPod SSE catch now distinguishes:
+        - real generic training failure
+        - `COMPLETED` but no artifact / failed adoption
+      - the latter is now rendered as a truthful return/adoption failure, not a vague dataset/network issue
+    - `magatama/packages/dashboard/public/index-v2.html`
+      - training modal now suppresses misleading late generic `network error` if the server already emitted a terminal training status
+      - if the stream ends without a final terminal server event, the UI now explicitly says the registry/adoption state must be checked
+      - if the backend reports:
+        - completed without artifact
+        - completed without HF model
+        - completed but adoption failed
+        the modal now shows that exact reason
+  - local verification:
+    - `python3 -m py_compile` passed for:
+      - `training_api.py`
+      - `train.py`
+    - dashboard build passed:
+      - `pnpm -C packages/dashboard build`
+  - current operational blocker:
+    - live deployment to Erik was **not yet completed in this step**
+    - direct SSH checks returned:
+      - `Connection refused`
+      - then `Operation timed out`
+    - because of that, the new lane-specific automation logic is locally ready, but not yet confirmed live on Erik for the currently running:
+      - `tip_llm`
+      - `fo_blogllm`
+  - practical consequence:
+    - the code path is now prepared for full automation:
+      - pull from lane-specific training pool
+      - train on RunPod
+      - verify artifact existence
+      - adopt locally
+      - create new release alias/version
+      - repoint stable active alias
+      - show truthful status in UI
+    - but the current live Erik run still needs redeploy + verification once SSH is reachable again
+
 - MAGATAMA local MagatamaLLM training state was re-verified on 2026-05-07:
   - result:
     - the lane export / dataset refresh worked
diff --git a/sync/history/2026-05-07-magatama-lane-specific-runpod-adoption-versioning.md b/sync/history/2026-05-07-magatama-lane-specific-runpod-adoption-versioning.md
new file mode 100644
index 0000000..5f66ea7
--- /dev/null
+++ b/sync/history/2026-05-07-magatama-lane-specific-runpod-adoption-versioning.md
@@ -0,0 +1,170 @@
+# MAGATAMA Lane-Specific RunPod Adoption + Versioning
+
+Date: 2026-05-07
+
+## Scope
+
+Harden MAGATAMA training automation for:
+
+- `magatamallm`
+- `fo_blogllm`
+- `tip_llm`
+
+Goal:
+
+- lane-specific training pools remain isolated
+- RunPod `COMPLETED` counts only when model return/adoption is real
+- active lane model gets a new release/version marker after successful adoption
+- dashboard status and errors remain truthful
+
+## Problem
+
+The data/build side of training already worked:
+
+- lane-specific RunPod datasets were built
+- RunPod jobs were submitted
+- registry often showed `IN_PROGRESS` / `COMPLETED`
+
+But the end of the chain remained weak:
+
+1. adoption/version truth still depended on one shared:
+   - `~/magatama-llm/fine-tuning/last_run.json`
+2. multiple lanes could therefore overwrite the same success marker
+3. the modal could degrade late-stream adoption failures into a generic `network error`
+4. the user requirement was stricter:
+   - training pool -> RunPod -> artifact -> local import -> version bump -> active alias switch
+   - all fully automatic
+
+## Code changes made locally
+
+### 1. Lane-specific last-run metadata
+
+File:
+
+- `magatama/packages/fine-tuner/training_api.py`
+
+Added:
+
+- `lane_last_run_file(lane)`
+
+Resulting files:
+
+- `~/magatama-llm/fine-tuning/magatamallm-last_run.json`
+- `~/magatama-llm/fine-tuning/fo_blogllm-last_run.json`
+- `~/magatama-llm/fine-tuning/tip_llm-last_run.json`
+
+Compatibility:
+
+- `magatamallm` still mirrors to legacy:
+  - `~/magatama-llm/fine-tuning/last_run.json`
+
+### 2. Automatic release alias / version step
+
+File:
+
+- `magatama/packages/fine-tuner/training_api.py`
+
+Added:
+
+- `next_release_metadata(lane, active_model)`
+- release alias creation
+
+New adoption sequence:
+
+1. RunPod artifact imported to candidate model
+2. candidate smoke tests pass
+3. release alias is created:
+   - example shape: `-rN`
+4. stable active alias is repointed to that release alias
+
+This means the lane now receives a concrete new release/version marker after successful adoption.
+
+### 3. Dashboard lane status truth
+
+File:
+
+- `magatama/packages/dashboard/src/server.ts`
+
+Changed:
+
+- `/api/llm/status` now reads lane-specific last-run metadata first
+- `release_alias` is preferred as visible model version
+- this prevents one lane from falsely inheriting another lane's last successful run marker
+
+### 4. Truthful RunPod terminal failure messaging
+
+Files:
+
+- `magatama/packages/dashboard/src/server.ts`
+- `magatama/packages/dashboard/public/index-v2.html`
+
+Changed:
+
+- if RunPod says `COMPLETED` but:
+  - no model artifact exists
+  - no HF repo appears
+  - adoption fails
+
+the UI now reports that exact reason instead of collapsing into a vague generic failure
+
+Frontend hardening:
+
+- avoid showing a misleading late `network error` after the server already emitted a terminal training event
+- if the stream dies without a terminal event, the modal says so explicitly
+
+### 5. Local training metrics future-proofed
+
+File:
+
+- `magatama/packages/fine-tuner/train.py`
+
+Changed:
+
+- metrics now also respect lane-specific last-run files via `TRAINING_LANE`
+
+## Local verification
+
+Passed:
+
+- `python3 -m py_compile .../training_api.py .../train.py`
+- `pnpm -C .../packages/dashboard build`
+
+## Live deployment state
+
+Not yet completed in this step.
+
+Reason:
+
+- direct Erik access failed during this block:
+  - `ssh: connect to host 82.165.222.127 port 22: Connection refused`
+  - later also `Operation timed out`
+
+Therefore:
+
+- the automation fix is locally ready
+- but not yet verified live against the currently running:
+  - `tip_llm`
+  - `fo_blogllm`
+
+## Operational next step
+
+Once Erik SSH is reachable again:
+
+1. deploy updated:
+   - `training_api.py`
+   - `train.py`
+   - dashboard build / server bundle
+2. restart:
+   - `magatama-dashboard`
+   - Mac-side training API if used
+3. verify lane-specific status:
+   - `tip_llm`
+   - `fo_blogllm`
+   - `magatamallm`
+4. verify that a successful RunPod training now results in:
+   - artifact found
+   - adoption report present
+   - lane-specific `*-last_run.json`
+   - release alias incremented
+   - stable alias repointed
+
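
Reviewer sketch (not part of the patch): the lane-specific last-run files and the release counter recorded above could look roughly like this. The names `lane_last_run_file(lane)` and `next_release_metadata(lane, active_model)` come from the notes; everything else is an assumption made for illustration, including the extra `base` parameter, the `write_last_run` helper, the JSON layout, and the choice to prefix the `-rN` suffix with `active_model`. The real `training_api.py` may differ.

```python
import json
from pathlib import Path

# Directory taken from the notes; adjust to the real checkout.
FINE_TUNING_DIR = Path.home() / "magatama-llm" / "fine-tuning"
LANES = ("magatamallm", "fo_blogllm", "tip_llm")
LEGACY_LANE = "magatamallm"  # only this lane mirrors to last_run.json


def lane_last_run_file(lane, base=FINE_TUNING_DIR):
    """Return the per-lane metadata path, e.g. <base>/tip_llm-last_run.json."""
    if lane not in LANES:
        raise ValueError(f"unknown lane: {lane!r}")
    return Path(base) / f"{lane}-last_run.json"


def write_last_run(lane, report, base=FINE_TUNING_DIR):
    """Hypothetical helper: write the lane file, mirror legacy file for magatamallm."""
    base = Path(base)
    targets = [lane_last_run_file(lane, base)]
    if lane == LEGACY_LANE:
        targets.append(base / "last_run.json")
    base.mkdir(parents=True, exist_ok=True)
    for path in targets:
        path.write_text(json.dumps(report, indent=2))
    return targets


def next_release_metadata(lane, active_model, base=FINE_TUNING_DIR):
    """Derive the next version_counter and release_alias from the lane's last run."""
    path = lane_last_run_file(lane, base)
    previous = json.loads(path.read_text()) if path.exists() else {}
    counter = int(previous.get("version_counter", 0)) + 1
    # Assumption: the elided alias prefix is the stable active model name.
    return {"version_counter": counter, "release_alias": f"{active_model}-r{counter}"}
```

Keeping the counter inside each lane's own file means a run for `tip_llm` cannot bump or overwrite the release marker of `fo_blogllm`, which is the ambiguity the shared `last_run.json` caused.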