MAGATAMA Lane-Specific RunPod Adoption + Versioning
Date: 2026-05-07
Scope
Harden MAGATAMA training automation for:
- magatamallm
- fo_blogllm
- tip_llm
Goal:
- lane-specific training pools remain isolated
- RunPod COMPLETED counts only when model return/adoption is real
- active lane model gets a new release/version marker after successful adoption
- dashboard status and errors remain truthful
Problem
The data/build side of training already worked:
- lane-specific RunPod datasets were built
- RunPod jobs were submitted
- registry often showed IN_PROGRESS/COMPLETED
But the end of the chain remained weak:
- adoption/version truth still depended on one shared file: ~/magatama-llm/fine-tuning/last_run.json
- multiple lanes could therefore overwrite the same success marker
- the modal could degrade late-stream adoption failures into a generic network error
- the user requirement was stricter:
  - training pool -> RunPod -> artifact -> local import -> version bump -> active alias switch
  - all fully automatic
Code changes made locally
1. Lane-specific last-run metadata
File:
magatama/packages/fine-tuner/training_api.py
Added:
lane_last_run_file(lane)
Resulting files:
- ~/magatama-llm/fine-tuning/magatamallm-last_run.json
- ~/magatama-llm/fine-tuning/fo_blogllm-last_run.json
- ~/magatama-llm/fine-tuning/tip_llm-last_run.json
Compatibility:
- magatamallm still mirrors to legacy: ~/magatama-llm/fine-tuning/last_run.json
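The per-lane file naming and the legacy mirror can be sketched as follows. This is a minimal illustration, not the code in training_api.py: only the name lane_last_run_file and the file paths come from the notes above, while write_last_run, BASE_DIR, LEGACY_LANE and the base_dir parameter are hypothetical names introduced for the example.

```python
from __future__ import annotations

import json
from pathlib import Path

# Assumption: base directory taken from the paths listed above.
BASE_DIR = Path.home() / "magatama-llm" / "fine-tuning"
# Assumption: only this lane keeps mirroring to the shared legacy file.
LEGACY_LANE = "magatamallm"


def lane_last_run_file(lane: str, base_dir: Path = BASE_DIR) -> Path:
    """Return the per-lane marker path, e.g. <base_dir>/tip_llm-last_run.json."""
    return base_dir / f"{lane}-last_run.json"


def write_last_run(lane: str, payload: dict, base_dir: Path = BASE_DIR) -> list[Path]:
    """Write the lane marker; also write last_run.json for the legacy lane only."""
    base_dir.mkdir(parents=True, exist_ok=True)
    targets = [lane_last_run_file(lane, base_dir)]
    if lane == LEGACY_LANE:
        targets.append(base_dir / "last_run.json")
    text = json.dumps(payload, indent=2)
    for path in targets:
        path.write_text(text)
    return targets
```

With this shape, a tip_llm or fo_blogllm run never touches last_run.json, so it cannot overwrite another lane's success marker.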
2. Automatic release alias / version step
File:
magatama/packages/fine-tuner/training_api.py
Added:
- next_release_metadata(lane, active_model)
- release alias creation
New adoption sequence:
- RunPod artifact imported to candidate model
- candidate smoke tests pass
- release alias is created:
  - example shape: <active-alias>-rN
- stable active alias is repointed to that release alias
This means the lane now receives a concrete new release/version marker after successful adoption.
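One way to derive the next <active-alias>-rN name is sketched below. It assumes that N is one past the highest release number already present and that the list of existing aliases is available to the caller. The helper next_release_alias is hypothetical and is not the signature of next_release_metadata.

```python
from __future__ import annotations

import re


def next_release_alias(active_alias: str, existing_aliases: list[str]) -> str:
    """Return '<active_alias>-rN', with N one past the highest existing release."""
    pattern = re.compile(rf"^{re.escape(active_alias)}-r(\d+)$")
    numbers = []
    for alias in existing_aliases:
        match = pattern.match(alias)
        if match:
            numbers.append(int(match.group(1)))
    # No prior release for this alias means the first release is r1 (assumption).
    return f"{active_alias}-r{max(numbers, default=0) + 1}"
```

Aliases that belong to other lanes are ignored by the anchored pattern, so each lane keeps its own release counter.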
3. Dashboard lane status truth
File:
magatama/packages/dashboard/src/server.ts
Changed:
- /api/llm/status now reads lane-specific last-run metadata first
- release_alias is preferred as visible model version
- this prevents one lane from falsely inheriting another lane's last successful run marker
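The lookup order can be illustrated with a short sketch. The real change lives in server.ts; Python is used here only to show the logic. It assumes that only magatamallm may fall back to the shared last_run.json and that a "model" field exists as a secondary source. Both are assumptions, as is the helper name resolve_lane_version. Only release_alias is named in these notes.

```python
import json
from pathlib import Path
from typing import Optional

# Assumption: only the legacy lane may fall back to the shared file.
LEGACY_LANE = "magatamallm"


def _read_json(path: Path) -> Optional[dict]:
    """Return the parsed JSON object, or None if missing, unreadable or not an object."""
    try:
        data = json.loads(path.read_text())
    except (OSError, ValueError):
        return None
    return data if isinstance(data, dict) else None


def resolve_lane_version(lane: str, base_dir: Path) -> Optional[str]:
    """Prefer the lane-specific file, then release_alias over the plain model name."""
    data = _read_json(base_dir / f"{lane}-last_run.json")
    if data is None and lane == LEGACY_LANE:
        data = _read_json(base_dir / "last_run.json")
    if data is None:
        return None
    # "model" is a hypothetical fallback field, used only when release_alias is absent.
    return data.get("release_alias") or data.get("model")
```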
4. Truthful RunPod terminal failure messaging
Files:
- magatama/packages/dashboard/src/server.ts
- magatama/packages/dashboard/public/index-v2.html
Changed:
- if RunPod says COMPLETED but:
  - no model artifact exists
  - no HF repo appears
  - adoption fails
  the UI now reports that exact reason instead of collapsing into a vague generic failure
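The mapping from a COMPLETED job to a specific failure reason could look roughly like this. The function name, the boolean inputs and the message strings are all hypothetical. The sketch also assumes the three checks are evaluated in the order listed and that the first failing check is the one reported.

```python
from typing import Optional


def completed_failure_reason(
    artifact_found: bool, hf_repo_found: bool, adoption_ok: bool
) -> Optional[str]:
    """Return a specific reason why a COMPLETED job does not count, or None if it does."""
    if not artifact_found:
        return "RunPod reported COMPLETED but no model artifact exists"
    if not hf_repo_found:
        return "RunPod reported COMPLETED but no HF repo appeared"
    if not adoption_ok:
        return "RunPod reported COMPLETED but local adoption failed"
    return None  # the job completed and the model was really adopted
```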
Frontend hardening:
- avoid showing a misleading late network error after the server already emitted a terminal training event
- if the stream dies without a terminal event, the modal says so explicitly
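The decision the modal makes when the stream closes can be sketched as below. The actual logic runs in the browser (index-v2.html); this Python version only illustrates the rule, and the function name, parameters and message text are assumptions.

```python
from typing import Optional


def stream_end_message(terminal_event_seen: bool, stream_error: Optional[str]) -> Optional[str]:
    """Decide what to show when the event stream closes."""
    if terminal_event_seen:
        # The server already delivered the final verdict, so a late
        # connection error must not replace it.
        return None
    if stream_error:
        return f"Stream ended without a terminal training event ({stream_error})"
    return "Stream ended without a terminal training event"
```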
5. Local training metrics future-proofed
File:
magatama/packages/fine-tuner/train.py
Changed:
- metrics now also respect lane-specific last-run files via TRAINING_LANE
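A minimal sketch of how a training script might pick its marker file from TRAINING_LANE. Falling back to the legacy last_run.json when the variable is unset or empty is an assumption, and metrics_last_run_file is a hypothetical helper name.

```python
import os
from pathlib import Path
from typing import Mapping, Optional


def metrics_last_run_file(base_dir: Path, env: Optional[Mapping[str, str]] = None) -> Path:
    """Pick the last-run file for metrics based on the TRAINING_LANE variable."""
    env = os.environ if env is None else env
    lane = env.get("TRAINING_LANE", "").strip()
    if lane:
        return base_dir / f"{lane}-last_run.json"
    # Assumption: without a lane, keep writing to the legacy shared file.
    return base_dir / "last_run.json"
```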
Local verification
Passed:
- python3 -m py_compile .../training_api.py .../train.py
- pnpm -C .../packages/dashboard build
Live deployment state
Not yet completed in this step.
Reason:
- direct Erik access failed during this block:
  - ssh: connect to host 82.165.222.127 port 22: Connection refused
  - later also Operation timed out
Therefore:
- the automation fix is locally ready
- but not yet verified live against the currently running:
  - tip_llm
  - fo_blogllm
Operational next step
Once Erik SSH is reachable again:
- deploy updated:
  - training_api.py
  - train.py
  - dashboard build / server bundle
- restart:
  - magatama-dashboard
  - Mac-side training API if used
- verify lane-specific status:
  - tip_llm
  - fo_blogllm
  - magatamallm
- verify that a successful RunPod training now results in:
  - artifact found
  - adoption report present
  - lane-specific *-last_run.json
  - release alias incremented
  - stable alias repointed
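Part of this checklist can be automated once the host is reachable. The sketch below covers only what can be read from the lane's last-run file, namely that the file exists and that release_alias changed. The helper name, the previous_release parameter and the message strings are assumptions. Artifact presence, the adoption report and the stable alias target would still need separate checks.

```python
from __future__ import annotations

import json
from pathlib import Path
from typing import Optional


def verify_adoption(lane: str, base_dir: Path, previous_release: Optional[str] = None) -> list[str]:
    """Return a list of problems found in the lane's last-run marker (empty if none)."""
    path = base_dir / f"{lane}-last_run.json"
    if not path.exists():
        return [f"missing {path.name}"]
    data = json.loads(path.read_text())
    problems = []
    release = data.get("release_alias")
    if not release:
        problems.append("no release_alias recorded")
    elif release == previous_release:
        problems.append("release alias was not incremented")
    return problems
```

Running it once per lane (tip_llm, fo_blogllm, magatamallm) before and after a training run gives a quick signal on whether the version marker actually moved.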