transceiver-db/sync/history/2026-05-07-magatama-lane-specific-runpod-adoption-versioning.md

# MAGATAMA Lane-Specific RunPod Adoption + Versioning

Date: 2026-05-07

## Scope

Harden MAGATAMA training automation for:

- `magatamallm`
- `fo_blogllm`
- `tip_llm`

Goal:

- lane-specific training pools remain isolated
- a RunPod `COMPLETED` status counts as success only when the model was actually returned and adopted
- active lane model gets a new release/version marker after successful adoption
- dashboard status and errors remain truthful

## Problem

The data/build side of training already worked:

- lane-specific RunPod datasets were built
- RunPod jobs were submitted
- registry often showed `IN_PROGRESS` / `COMPLETED`

But the end of the chain remained weak:

1. adoption and version state still depended on a single shared file:
   - `~/magatama-llm/fine-tuning/last_run.json`
2. multiple lanes could therefore overwrite the same success marker
3. the modal could degrade late-stream adoption failures into a generic `network error`
4. the user requirement was stricter:
   - training pool -> RunPod -> artifact -> local import -> version bump -> active alias switch
   - all fully automatic

## Code changes made locally

### 1. Lane-specific last-run metadata

File:

- `magatama/packages/fine-tuner/training_api.py`

Added:

- `lane_last_run_file(lane)`

Resulting files:

- `~/magatama-llm/fine-tuning/magatamallm-last_run.json`
- `~/magatama-llm/fine-tuning/fo_blogllm-last_run.json`
- `~/magatama-llm/fine-tuning/tip_llm-last_run.json`

Compatibility:

- `magatamallm` still mirrors to legacy:
  - `~/magatama-llm/fine-tuning/last_run.json`
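
A minimal sketch of how the per-lane path and the legacy mirror could fit together. Only the name `lane_last_run_file` and the file paths come from this note; the `base_dir` parameter, the `write_last_run` helper, and the JSON payload shape are assumptions for illustration, not the actual code in `training_api.py`.

```python
import json
from pathlib import Path

# Base directory taken from the paths listed above.
FINE_TUNING_DIR = Path.home() / "magatama-llm" / "fine-tuning"
LEGACY_LANE = "magatamallm"  # the only lane that mirrors to last_run.json


def lane_last_run_file(lane: str, base_dir: Path = FINE_TUNING_DIR) -> Path:
    """Return the per-lane marker path, e.g. tip_llm-last_run.json."""
    return base_dir / f"{lane}-last_run.json"


def write_last_run(lane: str, payload: dict, base_dir: Path = FINE_TUNING_DIR) -> list[Path]:
    """Hypothetical helper: write the lane marker and, for magatamallm only,
    mirror the same payload to the legacy shared file."""
    targets = [lane_last_run_file(lane, base_dir)]
    if lane == LEGACY_LANE:
        targets.append(base_dir / "last_run.json")
    base_dir.mkdir(parents=True, exist_ok=True)
    for path in targets:
        path.write_text(json.dumps(payload, indent=2))
    return targets
```

With this shape, a `tip_llm` or `fo_blogllm` run never touches `last_run.json`, so it cannot overwrite the `magatamallm` success marker.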

### 2. Automatic release alias / version step

File:

- `magatama/packages/fine-tuner/training_api.py`

Added:

- `next_release_metadata(lane, active_model)`
- release alias creation

New adoption sequence:

1. RunPod artifact imported to candidate model
2. candidate smoke tests pass
3. release alias is created:
   - example shape: `<active-alias>-rN`
4. stable active alias is repointed to that release alias

This means the lane now receives a concrete new release/version marker after successful adoption.
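
The `<active-alias>-rN` shape can be illustrated with a small sketch. `next_release_alias` and its `existing_models` argument are hypothetical and stand in for whatever `next_release_metadata(lane, active_model)` does internally; the assumption here is that N is one above the highest release already present for that alias.

```python
import re


def next_release_alias(active_alias: str, existing_models: list[str]) -> str:
    """Return '<active_alias>-rN', with N one above the highest existing release.

    Hypothetical helper: `existing_models` is assumed to be the list of model
    names currently known locally.
    """
    pattern = re.compile(rf"^{re.escape(active_alias)}-r(\d+)$")
    numbers = [
        int(m.group(1))
        for name in existing_models
        if (m := pattern.match(name))
    ]
    # No earlier release for this alias means the first one is -r1.
    return f"{active_alias}-r{max(numbers, default=0) + 1}"
```

For example, `next_release_alias("tip_llm", ["tip_llm-r1", "tip_llm-r2"])` returns `"tip_llm-r3"`, and releases belonging to other lanes are ignored.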

### 3. Dashboard lane status truth

File:

- `magatama/packages/dashboard/src/server.ts`

Changed:

- `/api/llm/status` now reads lane-specific last-run metadata first
- `release_alias` is preferred as visible model version
- this prevents one lane from falsely inheriting another lane's last successful run marker
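
The lookup order can be sketched as follows. The real logic lives in TypeScript in `server.ts`; this Python version only illustrates the idea. The `model` fallback key, the returned dictionary shape, and the rule that only `magatamallm` may fall back to the legacy file are assumptions.

```python
import json
from pathlib import Path
from typing import Optional


def read_lane_status(lane: str, base_dir: Path) -> Optional[dict]:
    """Resolve the visible model version for one lane.

    The lane-specific file is read first. Only magatamallm may fall back to
    the legacy shared last_run.json (assumption), so other lanes cannot
    inherit its marker.
    """
    candidates = [base_dir / f"{lane}-last_run.json"]
    if lane == "magatamallm":
        candidates.append(base_dir / "last_run.json")
    for path in candidates:
        try:
            data = json.loads(path.read_text())
        except (FileNotFoundError, json.JSONDecodeError):
            continue
        if not isinstance(data, dict):
            continue
        # Prefer release_alias; "model" is a hypothetical fallback key.
        version = data.get("release_alias") or data.get("model")
        return {"lane": lane, "version": version, "source": path.name}
    return None
```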

### 4. Truthful RunPod terminal failure messaging

Files:

- `magatama/packages/dashboard/src/server.ts`
- `magatama/packages/dashboard/public/index-v2.html`

Changed:

- if RunPod says `COMPLETED` but the run did not really succeed, the UI now reports the exact reason instead of collapsing it into a generic failure:
  - no model artifact exists
  - no HF repo appears
  - adoption fails
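
One way to picture the mapping from checks to a specific message is the sketch below. It is written in Python for illustration only, since the real code is in `server.ts`. The function name, the boolean inputs, the message wording, and the choice to treat the artifact and the HF repo as two separate checks are all assumptions.

```python
from typing import Optional


def terminal_failure_reason(
    runpod_status: str,
    artifact_found: bool,
    hf_repo_found: bool,
    adoption_ok: bool,
) -> Optional[str]:
    """Return None on real success, otherwise a specific reason string."""
    if runpod_status != "COMPLETED":
        return f"RunPod job ended with status {runpod_status}"
    # COMPLETED alone is not success: every later link in the chain must hold.
    if not artifact_found:
        return "RunPod reported COMPLETED but no model artifact exists"
    if not hf_repo_found:
        return "RunPod reported COMPLETED but no HF repo appeared"
    if not adoption_ok:
        return "RunPod reported COMPLETED but local adoption failed"
    return None
```

The checks are ordered along the pipeline, so the message names the first link that broke.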

Frontend hardening:

- avoid showing a misleading late `network error` after the server already emitted a terminal training event
- if the stream dies without a terminal event, the modal says so explicitly

### 5. Local training metrics future-proofed

File:

- `magatama/packages/fine-tuner/train.py`

Changed:

- metrics now also respect lane-specific last-run files via `TRAINING_LANE`

## Local verification

Passed:

- `python3 -m py_compile .../training_api.py .../train.py`
- `pnpm -C .../packages/dashboard build`

## Live deployment state

Not yet completed in this step.

Reason:

- direct Erik access failed during this block:
  - `ssh: connect to host 82.165.222.127 port 22: Connection refused`
  - later also `Operation timed out`

Therefore:

- the automation fix is ready locally
- it has not yet been verified live against the currently running lanes:
  - `tip_llm`
  - `fo_blogllm`

## Operational next step

Once Erik SSH is reachable again:

1. deploy updated:
   - `training_api.py`
   - `train.py`
   - dashboard build / server bundle
2. restart:
   - `magatama-dashboard`
   - Mac-side training API if used
3. verify lane-specific status:
   - `tip_llm`
   - `fo_blogllm`
   - `magatamallm`
4. verify that a successful RunPod training run now results in:
   - artifact found
   - adoption report present
   - lane-specific `*-last_run.json`
   - release alias incremented
   - stable alias repointed
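
Part of step 4 could be scripted. The sketch below covers only the two items that can be checked from the lane marker file itself, under the assumption that it is JSON with a `release_alias` field of the form `<alias>-rN`. `check_lane_marker` and `previous_release` are hypothetical names; the artifact, adoption report, and stable alias checks are left out because their locations are not described here.

```python
import json
import re
from pathlib import Path
from typing import Optional


def check_lane_marker(
    lane: str, base_dir: Path, previous_release: Optional[str] = None
) -> list[str]:
    """Return a list of problems found in <lane>-last_run.json (empty if none)."""
    path = base_dir / f"{lane}-last_run.json"
    if not path.is_file():
        return [f"{path.name} is missing"]
    try:
        data = json.loads(path.read_text())
    except json.JSONDecodeError:
        return [f"{path.name} is not valid JSON"]
    alias = data.get("release_alias") if isinstance(data, dict) else None
    if not isinstance(alias, str):
        alias = ""
    problems = []
    match = re.search(r"-r(\d+)$", alias)
    if not match:
        problems.append("release_alias missing or not of the form <alias>-rN")
    elif previous_release:
        prev = re.search(r"-r(\d+)$", previous_release)
        # The new release number must be strictly higher than the old one.
        if prev and int(match.group(1)) <= int(prev.group(1)):
            problems.append(
                f"release_alias {alias} was not incremented past {previous_release}"
            )
    return problems
```

Running it once per lane (`tip_llm`, `fo_blogllm`, `magatamallm`) with the release alias recorded before the training run gives a quick pass/fail list for the marker and increment checks.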