transceiver-db/sync/history/2026-05-07-magatama-lane-specific-runpod-adoption-versioning.md
2026-05-07 01:36:36 +02:00

MAGATAMA Lane-Specific RunPod Adoption + Versioning

Date: 2026-05-07

Scope

Harden MAGATAMA training automation for:

  • magatamallm
  • fo_blogllm
  • tip_llm

Goal:

  • lane-specific training pools remain isolated
  • a RunPod COMPLETED status counts as success only when the model was actually returned and adopted
  • active lane model gets a new release/version marker after successful adoption
  • dashboard status and errors remain truthful

Problem

The data/build side of training already worked:

  • lane-specific RunPod datasets were built
  • RunPod jobs were submitted
  • registry often showed IN_PROGRESS / COMPLETED

But the end of the chain remained weak:

  1. adoption and version state still depended on a single shared file:
    • ~/magatama-llm/fine-tuning/last_run.json
  2. multiple lanes could therefore overwrite the same success marker
  3. the dashboard modal could reduce adoption failures that arrive late in the stream to a generic network error
  4. the user requirement was stricter: the whole chain has to run fully automatically:
    • training pool -> RunPod -> artifact -> local import -> version bump -> active alias switch

Code changes made locally

1. Lane-specific last-run metadata

File:

  • magatama/packages/fine-tuner/training_api.py

Added:

  • lane_last_run_file(lane)

Resulting files:

  • ~/magatama-llm/fine-tuning/magatamallm-last_run.json
  • ~/magatama-llm/fine-tuning/fo_blogllm-last_run.json
  • ~/magatama-llm/fine-tuning/tip_llm-last_run.json

Compatibility:

  • magatamallm still mirrors to legacy:
    • ~/magatama-llm/fine-tuning/last_run.json
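
A minimal sketch of how such a helper might resolve the files listed above. The body of lane_last_run_file is not shown in this note, so everything below is illustrative; FINE_TUNING_DIR, LEGACY_LANE, the base_dir parameter, and write_last_run are hypothetical names.

```python
import json
from pathlib import Path

# Assumption: base directory taken from the paths listed in this note.
FINE_TUNING_DIR = Path.home() / "magatama-llm" / "fine-tuning"
LEGACY_LANE = "magatamallm"  # the only lane mirrored to the shared legacy file


def lane_last_run_file(lane: str, base_dir: Path = FINE_TUNING_DIR) -> Path:
    """Return the per-lane metadata path, e.g. tip_llm-last_run.json."""
    return base_dir / f"{lane}-last_run.json"


def write_last_run(lane: str, payload: dict, base_dir: Path = FINE_TUNING_DIR) -> list[Path]:
    """Hypothetical writer: store lane metadata, mirroring magatamallm to last_run.json."""
    base_dir.mkdir(parents=True, exist_ok=True)
    targets = [lane_last_run_file(lane, base_dir)]
    if lane == LEGACY_LANE:
        targets.append(base_dir / "last_run.json")
    text = json.dumps(payload, indent=2)
    for target in targets:
        target.write_text(text, encoding="utf-8")
    return targets
```

With this shape, a write for tip_llm or fo_blogllm never touches the legacy file, so those lanes cannot overwrite the magatamallm success marker.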

2. Automatic release alias / version step

File:

  • magatama/packages/fine-tuner/training_api.py

Added:

  • next_release_metadata(lane, active_model)
  • release alias creation

New adoption sequence:

  1. RunPod artifact imported to candidate model
  2. candidate smoke tests pass
  3. release alias is created:
    • example shape: <active-alias>-rN
  4. stable active alias is repointed to that release alias

This means the lane now receives a concrete new release/version marker after successful adoption.
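
The <active-alias>-rN naming could be computed roughly as follows. This is a sketch, not the code in training_api.py: the existing_aliases argument and the returned keys are assumptions, and the real function may discover existing releases differently.

```python
import re


def next_release_metadata(lane: str, active_model: str, existing_aliases=()) -> dict:
    """Pick the next <active-alias>-rN name above any release that already exists.

    Assumption: existing_aliases is a hypothetical argument listing model
    names already known locally.
    """
    pattern = re.compile(re.escape(active_model) + r"-r(\d+)")
    versions = [
        int(m.group(1))
        for name in existing_aliases
        if (m := pattern.fullmatch(name))
    ]
    release = max(versions, default=0) + 1
    return {
        "lane": lane,
        "active_alias": active_model,
        "release": release,
        "release_alias": f"{active_model}-r{release}",
    }
```

Taking the maximum existing N rather than counting entries keeps the numbering monotonic even if an older release alias was deleted.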

3. Dashboard lane status truth

File:

  • magatama/packages/dashboard/src/server.ts

Changed:

  • /api/llm/status now reads lane-specific last-run metadata first
  • release_alias is preferred as visible model version
  • this prevents one lane from falsely inheriting another lane's last successful run marker
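
The lookup order can be illustrated like this. The real handler lives in server.ts; the sketch is written in Python only to match the other examples in this note. The "model" fallback key and the rule that only magatamallm may fall back to the legacy file are assumptions.

```python
import json
from pathlib import Path


def resolve_lane_status(lane: str, base_dir: Path) -> dict:
    """Read lane-specific metadata first; never borrow another lane's marker."""
    candidates = [base_dir / f"{lane}-last_run.json"]
    if lane == "magatamallm":
        # Assumption: only the legacy lane may fall back to the shared file.
        candidates.append(base_dir / "last_run.json")
    for path in candidates:
        if path.is_file():
            data = json.loads(path.read_text(encoding="utf-8"))
            # Prefer the release alias as the visible version.
            version = data.get("release_alias") or data.get("model")
            return {"lane": lane, "version": version, "source": path.name}
    return {"lane": lane, "version": None, "source": None}
```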

4. Truthful RunPod terminal failure messaging

Files:

  • magatama/packages/dashboard/src/server.ts
  • magatama/packages/dashboard/public/index-v2.html

Changed:

  • if RunPod says COMPLETED but any of the following holds, the UI now reports that exact reason instead of collapsing into a vague generic failure:
    • no model artifact exists
    • no HF repo appears
    • adoption fails
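
One way to keep the reason specific is to map each check to its own message. The function below is a hypothetical sketch; its name, the boolean inputs, the order of the checks, and the message wording are assumptions, not the strings used in server.ts.

```python
from typing import Optional


def terminal_failure_reason(
    status: str, artifact_found: bool, hf_repo_found: bool, adoption_ok: bool
) -> Optional[str]:
    """Return None only for a real success, else the first specific reason."""
    if status != "COMPLETED":
        return f"RunPod job ended with status {status}"
    if not artifact_found:
        return "RunPod reported COMPLETED but no model artifact exists"
    if not hf_repo_found:
        return "RunPod reported COMPLETED but no HF repo appeared"
    if not adoption_ok:
        return "RunPod reported COMPLETED but local adoption failed"
    return None
```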

Frontend hardening:

  • avoid showing a misleading late network error after the server already emitted a terminal training event
  • if the stream dies without a terminal event, the modal says so explicitly
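
The modal rule amounts to "a terminal event always wins over a later transport error". A sketch of that decision, in Python for consistency with the other examples even though the real logic is in index-v2.html; the event shape and the event type names are hypothetical.

```python
TERMINAL_EVENTS = {"completed", "failed"}  # hypothetical event type names


def final_modal_message(events, stream_error=None) -> str:
    """Choose the message shown when the training event stream closes."""
    terminal = next((e for e in events if e.get("type") in TERMINAL_EVENTS), None)
    if terminal is not None:
        # The server already reported the outcome; ignore any later network error.
        return terminal.get("message") or terminal["type"]
    if stream_error is not None:
        return f"Stream ended without a terminal training event ({stream_error})"
    return "Stream ended without a terminal training event"
```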

5. Local training metrics future-proofed

File:

  • magatama/packages/fine-tuner/train.py

Changed:

  • metrics now also respect lane-specific last-run files via TRAINING_LANE
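
A sketch of the lane selection, assuming TRAINING_LANE is an environment variable and that an unset value falls back to the legacy file. Neither detail is confirmed in this note, and metrics_last_run_file is a hypothetical name.

```python
import os
from pathlib import Path


def metrics_last_run_file(base_dir: Path, environ=os.environ) -> Path:
    """Resolve where train.py metrics go, based on TRAINING_LANE."""
    lane = environ.get("TRAINING_LANE", "").strip()
    if lane:
        return base_dir / f"{lane}-last_run.json"
    # Assumed fallback: no lane set means the legacy shared file.
    return base_dir / "last_run.json"
```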

Local verification

Passed:

  • python3 -m py_compile .../training_api.py .../train.py
  • pnpm -C .../packages/dashboard build

Live deployment state

Not yet completed in this step.

Reason:

  • direct SSH access to Erik failed during this work block:
    • ssh: connect to host 82.165.222.127 port 22: Connection refused
    • later attempts also failed with: Operation timed out

Therefore:

  • the automation fix is locally ready
  • but not yet verified live against the currently running:
    • tip_llm
    • fo_blogllm

Operational next step

Once Erik SSH is reachable again:

  1. deploy updated:
    • training_api.py
    • train.py
    • dashboard build / server bundle
  2. restart:
    • magatama-dashboard
    • Mac-side training API if used
  3. verify lane-specific status:
    • tip_llm
    • fo_blogllm
    • magatamallm
  4. verify that a successful RunPod training now results in:
    • artifact found
    • adoption report present
    • lane-specific *-last_run.json
    • release alias incremented
    • stable alias repointed
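
For the "release alias incremented" check, comparing the -rN suffix before and after a run is enough. A minimal sketch with hypothetical helper names; an alias without a suffix, or a missing alias, is treated as release 0.

```python
import re


def release_number(alias) -> int:
    """Extract N from an alias shaped <active-alias>-rN; 0 if absent."""
    match = re.search(r"-r(\d+)$", alias or "")
    return int(match.group(1)) if match else 0


def release_incremented(before, after) -> bool:
    """True when the release alias after adoption is newer than before."""
    return release_number(after) > release_number(before)
```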