sync: record lane-specific runpod adoption versioning

2026-05-07 01:36:36 +02:00 · 2026-05-07 01:36:36 +02:00 · 61328b0607
commit 61328b0607
parent a6278a5041
2 changed files with 241 additions and 0 deletions
--- a/sync/CURRENT.md
+++ b/sync/CURRENT.md
@@ -27,6 +27,77 @@ When work touches TIP, Magatama, LLM Gateway, bridges, auth, or shared Erik infr

 ## Latest Work

+- MAGATAMA training automation was hardened locally on 2026-05-07 for all three lanes:
+  - target lanes:
+    - `magatamallm`
+    - `fo_blogllm`
+    - `tip_llm`
+  - core root cause confirmed:
+    - RunPod dataset refresh / lane export already worked
+    - RunPod jobs often reached `COMPLETED`
+    - but model adoption/version truth still depended on a single shared:
+      - `~/magatama-llm/fine-tuning/last_run.json`
+    - this made lane status and successful return/adoption ambiguous across models
+    - the training modal could also collapse late stream/adoption failures into a generic `network error`
+  - local code fixes now in place:
+    - `magatama/packages/fine-tuner/training_api.py`
+      - lane-specific last-run files added:
+        - `~/magatama-llm/fine-tuning/magatamallm-last_run.json`
+        - `~/magatama-llm/fine-tuning/fo_blogllm-last_run.json`
+        - `~/magatama-llm/fine-tuning/tip_llm-last_run.json`
+      - legacy `last_run.json` remains only as a backward-compatible mirror for `magatamallm`
+      - successful RunPod adoption now creates:
+        - a release alias per lane, e.g. `<active-alias>-rN`
+      - active alias switching sequence is now:
+        - candidate model imported
+        - smoke-tested
+        - release alias created
+        - stable active alias repointed to that release alias
+      - adoption report now includes:
+        - `version_counter`
+        - `release_alias`
+    - `magatama/packages/fine-tuner/train.py`
+      - local metrics writing now also respects lane-specific last-run files via `TRAINING_LANE`
+    - `magatama/packages/dashboard/src/server.ts`
+      - `/api/llm/status` now reads lane-specific last-run metadata first
+      - `release_alias` is preferred as the visible model version when present
+      - RunPod SSE catch now distinguishes:
+        - real generic training failure
+        - `COMPLETED` but no artifact / failed adoption
+      - the latter is now rendered as a truthful return/adoption failure, not a vague dataset/network issue
+    - `magatama/packages/dashboard/public/index-v2.html`
+      - training modal now suppresses misleading late generic `network error` if the server already emitted a terminal training status
+      - if the stream ends without a final terminal server event, the UI now explicitly says the registry/adoption state must be checked
+      - if the backend reports:
+        - completed without artifact
+        - completed without HF model
+        - completed but adoption failed
+      - in each of these cases the modal now shows that exact reason
+  - local verification:
+    - `python3 -m py_compile` passed for:
+      - `training_api.py`
+      - `train.py`
+    - dashboard build passed:
+      - `pnpm -C packages/dashboard build`
+  - current operational blocker:
+    - live deployment to Erik was **not yet completed in this step**
+    - direct SSH checks returned:
+      - `Connection refused`
+      - then `Operation timed out`
+    - because of that, the new lane-specific automation logic is locally ready, but not yet confirmed live on Erik for the currently running:
+      - `tip_llm`
+      - `fo_blogllm`
+  - practical consequence:
+    - the code path is now prepared for full automation:
+      - pull from lane-specific training pool
+      - train on RunPod
+      - verify artifact existence
+      - adopt locally
+      - create new release alias/version
+      - repoint stable active alias
+      - show truthful status in UI
+    - but the current live Erik run still needs redeploy + verification once SSH is reachable again
+
 - MAGATAMA local MagatamaLLM training state was re-verified on 2026-05-07:
  - result:
    - the lane export / dataset refresh worked
--- a/sync/history/2026-05-07-magatama-lane-specific-runpod-adoption-versioning.md
+++ b/sync/history/2026-05-07-magatama-lane-specific-runpod-adoption-versioning.md
@@ -0,0 +1,170 @@
+# MAGATAMA Lane-Specific RunPod Adoption + Versioning
+
+Date: 2026-05-07
+
+## Scope
+
+Harden MAGATAMA training automation for:
+
+- `magatamallm`
+- `fo_blogllm`
+- `tip_llm`
+
+Goal:
+
+- lane-specific training pools remain isolated
+- RunPod `COMPLETED` counts as success only when the model was actually returned and adopted
+- active lane model gets a new release/version marker after successful adoption
+- dashboard status and errors remain truthful
+
+## Problem
+
+The data/build side of training already worked:
+
+- lane-specific RunPod datasets were built
+- RunPod jobs were submitted
+- registry often showed `IN_PROGRESS` / `COMPLETED`
+
+But the end of the chain remained weak:
+
+1. adoption/version truth still depended on one shared:
+   - `~/magatama-llm/fine-tuning/last_run.json`
+2. multiple lanes could therefore overwrite the same success marker
+3. the modal could degrade late-stream adoption failures into a generic `network error`
+4. the user requirement was stricter:
+   - training pool -> RunPod -> artifact -> local import -> version bump -> active alias switch
+   - all fully automatic
+
+## Code changes made locally
+
+### 1. Lane-specific last-run metadata
+
+File:
+
+- `magatama/packages/fine-tuner/training_api.py`
+
+Added:
+
+- `lane_last_run_file(lane)`
+
+Resulting files:
+
+- `~/magatama-llm/fine-tuning/magatamallm-last_run.json`
+- `~/magatama-llm/fine-tuning/fo_blogllm-last_run.json`
+- `~/magatama-llm/fine-tuning/tip_llm-last_run.json`
+
+Compatibility:
+
+- `magatamallm` still mirrors to legacy:
+  - `~/magatama-llm/fine-tuning/last_run.json`
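The path scheme above can be sketched as follows. This is a minimal illustration, not the code in `training_api.py`: only the name `lane_last_run_file` and the file names come from the notes; the `base_dir` parameter, the constants, and the helper `last_run_targets` are assumptions made for the example.

```python
from pathlib import Path

# Base directory taken from the paths listed in the notes.
FINE_TUNING_DIR = Path.home() / "magatama-llm" / "fine-tuning"
KNOWN_LANES = ("magatamallm", "fo_blogllm", "tip_llm")
# Only this lane is described as mirroring to the legacy file.
LEGACY_MIRROR_LANE = "magatamallm"


def lane_last_run_file(lane: str, base_dir: Path = FINE_TUNING_DIR) -> Path:
    """Return the lane-specific last-run path, e.g. tip_llm-last_run.json."""
    if lane not in KNOWN_LANES:
        raise ValueError(f"unknown training lane: {lane!r}")
    return base_dir / f"{lane}-last_run.json"


def last_run_targets(lane: str, base_dir: Path = FINE_TUNING_DIR) -> list[Path]:
    """Hypothetical helper: every file a successful run of `lane` should update."""
    targets = [lane_last_run_file(lane, base_dir)]
    if lane == LEGACY_MIRROR_LANE:
        # Backward-compatible mirror for readers of the old shared file.
        targets.append(base_dir / "last_run.json")
    return targets
```

Because only `magatamallm` ever writes the legacy file, a run of `tip_llm` or `fo_blogllm` can no longer overwrite another lane's success marker.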
+
+### 2. Automatic release alias / version step
+
+File:
+
+- `magatama/packages/fine-tuner/training_api.py`
+
+Added:
+
+- `next_release_metadata(lane, active_model)`
+- release alias creation
+
+New adoption sequence:
+
+1. RunPod artifact imported to candidate model
+2. candidate smoke tests pass
+3. release alias is created:
+   - example shape: `<active-alias>-rN`
+4. stable active alias is repointed to that release alias
+
+This means the lane now receives a concrete new release/version marker after successful adoption.
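The four-step sequence above can be sketched like this. It is an illustration under assumptions: the notes name `next_release_metadata(lane, active_model)` and the `<active-alias>-rN` shape, but how `N` is derived is not stated. Here it is assumed to be one more than the highest existing `-rN` suffix, passed in through a hypothetical `existing_aliases` argument. The injected callables and `adopt_candidate` are likewise placeholders, not the real import or alias commands.

```python
import re
from typing import Callable


def next_release_metadata(lane: str, active_model: str,
                          existing_aliases: list[str]) -> dict:
    """Compute the next `<active-alias>-rN` name (assumed counter scheme)."""
    pattern = re.compile(rf"^{re.escape(active_model)}-r(\d+)$")
    counters = [int(m.group(1)) for alias in existing_aliases
                if (m := pattern.match(alias))]
    version_counter = max(counters, default=0) + 1
    return {
        "lane": lane,
        "version_counter": version_counter,
        "release_alias": f"{active_model}-r{version_counter}",
    }


def adopt_candidate(lane: str, active_alias: str, existing_aliases: list[str],
                    import_candidate: Callable[[], str],
                    smoke_test: Callable[[str], bool],
                    create_alias: Callable[[str, str], None],
                    repoint_alias: Callable[[str, str], None]) -> dict:
    """Run import -> smoke test -> release alias -> repoint, in that order."""
    candidate = import_candidate()                      # step 1
    if not smoke_test(candidate):                       # step 2
        # The active alias is left untouched when the candidate fails.
        return {"lane": lane, "adopted": False, "reason": "smoke test failed"}
    meta = next_release_metadata(lane, active_alias, existing_aliases)
    create_alias(meta["release_alias"], candidate)      # step 3
    repoint_alias(active_alias, meta["release_alias"])  # step 4
    return {**meta, "adopted": True}
```

Repointing the stable alias last means a failed import or smoke test never changes which model the lane serves.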
+
+### 3. Dashboard lane status truth
+
+File:
+
+- `magatama/packages/dashboard/src/server.ts`
+
+Changed:
+
+- `/api/llm/status` now reads lane-specific last-run metadata first
+- `release_alias` is preferred as the visible model version
+- this prevents one lane from falsely inheriting another lane's last successful run marker
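The lookup order described here could look roughly like the sketch below. It is written in Python for consistency with the other examples, although the real change lives in `server.ts`. The function name, the `model` fallback field, and the rule that only `magatamallm` may fall back to the legacy file are assumptions for illustration.

```python
from typing import Optional


def visible_model_version(lane: str,
                          lane_meta: Optional[dict],
                          legacy_meta: Optional[dict] = None) -> Optional[str]:
    """Pick the version string to show for `lane`.

    lane_meta:   parsed `<lane>-last_run.json`, or None if the file is missing.
    legacy_meta: parsed legacy `last_run.json`, or None if the file is missing.
    """
    meta = lane_meta
    if meta is None and lane == "magatamallm":
        # Assumed: only the mirrored lane may read the legacy file, so other
        # lanes never inherit its success marker.
        meta = legacy_meta
    if not meta:
        return None
    # Prefer the release alias; `model` is a hypothetical fallback field.
    return meta.get("release_alias") or meta.get("model")
```

With this rule, a lane that has no last-run file of its own reports no version instead of showing another lane's.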
+
+### 4. Truthful RunPod terminal failure messaging
+
+Files:
+
+- `magatama/packages/dashboard/src/server.ts`
+- `magatama/packages/dashboard/public/index-v2.html`
+
+Changed:
+
+- if RunPod says `COMPLETED` but:
+  - no model artifact exists
+  - no HF repo appears
+  - adoption fails
+
+In each of these cases, the UI now reports that exact reason instead of collapsing into a vague generic failure.
+
+Frontend hardening:
+
+- avoid showing a misleading late `network error` after the server already emitted a terminal training event
+- if the stream dies without a terminal event, the modal says so explicitly
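The distinction between these outcomes can be sketched as two small decision functions. This is an assumption-laden Python illustration of logic that actually lives in `server.ts` and `index-v2.html`: the function names, the boolean inputs, the order of the checks, and the message strings are invented for the example.

```python
from typing import Optional


def classify_terminal_outcome(runpod_status: str, has_artifact: bool,
                              has_hf_model: bool, adoption_ok: bool) -> str:
    """Server side: name the exact terminal reason.

    Assumes it is called only once RunPod reports a terminal status.
    """
    if runpod_status != "COMPLETED":
        return f"training failed (RunPod status: {runpod_status})"
    if not has_artifact:
        return "completed without artifact"
    if not has_hf_model:
        return "completed without HF model"
    if not adoption_ok:
        return "completed but adoption failed"
    return "adopted"


def modal_message(terminal_reason: Optional[str], stream_errored: bool) -> str:
    """Client side: decide what the training modal shows."""
    if terminal_reason is not None:
        # A terminal event already arrived, so a late stream error is noise.
        return terminal_reason
    if stream_errored:
        return ("stream ended without a terminal event; "
                "check the registry/adoption state")
    return "training in progress"
```

The key design point is that a terminal event from the server always wins over a transport error observed afterwards.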
+
+### 5. Local training metrics future-proofed
+
+File:
+
+- `magatama/packages/fine-tuner/train.py`
+
+Changed:
+
+- metrics now also respect lane-specific last-run files via `TRAINING_LANE`
+
+## Local verification
+
+Passed:
+
+- `python3 -m py_compile .../training_api.py .../train.py`
+- `pnpm -C .../packages/dashboard build`
+
+## Live deployment state
+
+Not yet completed in this step.
+
+Reason:
+
+- direct Erik access failed during this block:
+  - `ssh: connect to host 82.165.222.127 port 22: Connection refused`
+  - later also `Operation timed out`
+
+Therefore:
+
+- the automation fix is locally ready
+- but not yet verified live against the currently running:
+  - `tip_llm`
+  - `fo_blogllm`
+
+## Operational next step
+
+Once Erik SSH is reachable again:
+
+1. deploy updated:
+   - `training_api.py`
+   - `train.py`
+   - dashboard build / server bundle
+2. restart:
+   - `magatama-dashboard`
+   - Mac-side training API if used
+3. verify lane-specific status:
+   - `tip_llm`
+   - `fo_blogllm`
+   - `magatamallm`
+4. verify that a successful RunPod training now results in:
+   - artifact found
+   - adoption report present
+   - lane-specific `*-last_run.json`
+   - release alias incremented
+   - stable alias repointed
+