sync: record lane-specific runpod adoption versioning
parent a6278a5041
commit 61328b0607
@@ -27,6 +27,77 @@ When work touches TIP, Magatama, LLM Gateway, bridges, auth, or shared Erik infr

## Latest Work

- MAGATAMA training automation was hardened locally on 2026-05-07 for all three lanes:
  - target lanes:
    - `magatamallm`
    - `fo_blogllm`
    - `tip_llm`
  - core root cause confirmed:
    - RunPod dataset refresh / lane export already worked
    - RunPod jobs often reached `COMPLETED`
    - but model adoption/version truth still depended on a single shared:
      - `~/magatama-llm/fine-tuning/last_run.json`
    - this made lane status and successful return/adoption ambiguous across models
    - the training modal could also collapse late stream/adoption failures into a generic `network error`
  - local code fixes now in place:
    - `magatama/packages/fine-tuner/training_api.py`
      - lane-specific last-run files added:
        - `~/magatama-llm/fine-tuning/magatamallm-last_run.json`
        - `~/magatama-llm/fine-tuning/fo_blogllm-last_run.json`
        - `~/magatama-llm/fine-tuning/tip_llm-last_run.json`
      - legacy `last_run.json` remains only as a backward-compatible mirror for `magatamallm`
      - successful RunPod adoption now creates:
        - a release alias per lane, e.g. `<active-alias>-rN`
      - active alias switching sequence is now:
        - candidate model imported
        - smoke-tested
        - release alias created
        - stable active alias repointed to that release alias
      - adoption report now includes:
        - `version_counter`
        - `release_alias`
    - `magatama/packages/fine-tuner/train.py`
      - local metrics writing now also respects lane-specific last-run files via `TRAINING_LANE`
    - `magatama/packages/dashboard/src/server.ts`
      - `/api/llm/status` now reads lane-specific last-run metadata first
      - `release_alias` is preferred as the visible model version when present
      - RunPod SSE catch now distinguishes:
        - real generic training failure
        - `COMPLETED` but no artifact / failed adoption
      - the latter is now rendered as a truthful return/adoption failure, not a vague dataset/network issue
    - `magatama/packages/dashboard/public/index-v2.html`
      - training modal now suppresses a misleading late generic `network error` if the server already emitted a terminal training status
      - if the stream ends without a final terminal server event, the UI now explicitly says the registry/adoption state must be checked
      - if the backend reports one of the following, the modal now shows that exact reason:
        - completed without artifact
        - completed without HF model
        - completed but adoption failed
  - local verification:
    - `python3 -m py_compile` passed for:
      - `training_api.py`
      - `train.py`
    - dashboard build passed:
      - `pnpm -C packages/dashboard build`
  - current operational blocker:
    - live deployment to Erik was **not yet completed in this step**
    - direct SSH checks returned:
      - `Connection refused`
      - then `Operation timed out`
    - because of that, the new lane-specific automation logic is locally ready, but not yet confirmed live on Erik for the currently running:
      - `tip_llm`
      - `fo_blogllm`
  - practical consequence:
    - the code path is now prepared for full automation:
      - pull from lane-specific training pool
      - train on RunPod
      - verify artifact existence
      - adopt locally
      - create new release alias/version
      - repoint stable active alias
      - show truthful status in UI
    - but the current live Erik run still needs redeploy + verification once SSH is reachable again

- MAGATAMA local MagatamaLLM training state was re-verified on 2026-05-07:
  - result:
    - the lane export / dataset refresh worked

@@ -0,0 +1,170 @@
# MAGATAMA Lane-Specific RunPod Adoption + Versioning

Date: 2026-05-07

## Scope

Harden MAGATAMA training automation for:

- `magatamallm`
- `fo_blogllm`
- `tip_llm`

Goal:

- lane-specific training pools remain isolated
- RunPod `COMPLETED` counts as success only when the model was actually returned and adopted
- active lane model gets a new release/version marker after successful adoption
- dashboard status and errors remain truthful

## Problem

The data/build side of training already worked:

- lane-specific RunPod datasets were built
- RunPod jobs were submitted
- registry often showed `IN_PROGRESS` / `COMPLETED`

But the end of the chain remained weak:

1. adoption/version truth still depended on one shared:
   - `~/magatama-llm/fine-tuning/last_run.json`
2. multiple lanes could therefore overwrite the same success marker
3. the modal could degrade late-stream adoption failures into a generic `network error`
4. the user requirement was stricter:
   - training pool -> RunPod -> artifact -> local import -> version bump -> active alias switch
   - all fully automatic

## Code changes made locally

### 1. Lane-specific last-run metadata

File:

- `magatama/packages/fine-tuner/training_api.py`

Added:

- `lane_last_run_file(lane)`

Resulting files:

- `~/magatama-llm/fine-tuning/magatamallm-last_run.json`
- `~/magatama-llm/fine-tuning/fo_blogllm-last_run.json`
- `~/magatama-llm/fine-tuning/tip_llm-last_run.json`

Compatibility:

- `magatamallm` still mirrors to legacy:
  - `~/magatama-llm/fine-tuning/last_run.json`

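The per-lane file naming and the legacy mirror for `magatamallm` can be sketched as below. This is a minimal illustration, not the code in `training_api.py`: the `base_dir` parameter, the `write_last_run` helper, and the `KNOWN_LANES` check are assumptions added so the example is self-contained.

```python
from __future__ import annotations

import json
import os
from pathlib import Path

# Directory and file names follow the paths listed above.
FINE_TUNING_DIR = Path(os.path.expanduser("~/magatama-llm/fine-tuning"))
KNOWN_LANES = ("magatamallm", "fo_blogllm", "tip_llm")
LEGACY_MIRROR_LANE = "magatamallm"


def lane_last_run_file(lane: str, base_dir: Path = FINE_TUNING_DIR) -> Path:
    """Return the per-lane last-run path, e.g. tip_llm-last_run.json."""
    if lane not in KNOWN_LANES:
        raise ValueError(f"unknown training lane: {lane!r}")
    return base_dir / f"{lane}-last_run.json"


def write_last_run(lane: str, payload: dict, base_dir: Path = FINE_TUNING_DIR) -> list[Path]:
    """Hypothetical helper: write the lane file and mirror only magatamallm."""
    targets = [lane_last_run_file(lane, base_dir)]
    if lane == LEGACY_MIRROR_LANE:
        # Backward compatibility: older readers still look at last_run.json.
        targets.append(base_dir / "last_run.json")
    base_dir.mkdir(parents=True, exist_ok=True)
    body = json.dumps(payload, indent=2, sort_keys=True)
    for path in targets:
        path.write_text(body, encoding="utf-8")
    return targets
```

With this shape, a `tip_llm` or `fo_blogllm` run can no longer overwrite the marker that `magatamallm` readers depend on.
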
### 2. Automatic release alias / version step

File:

- `magatama/packages/fine-tuner/training_api.py`

Added:

- `next_release_metadata(lane, active_model)`
- release alias creation

New adoption sequence:

1. RunPod artifact imported to candidate model
2. candidate smoke tests pass
3. release alias is created:
   - example shape: `<active-alias>-rN`
4. stable active alias is repointed to that release alias

This means the lane now receives a concrete new release/version marker after successful adoption.

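The adoption order above can be sketched as follows. This is an illustration under assumptions, not the actual implementation: the real `next_release_metadata(lane, active_model)` takes two arguments, so the `previous_counter` parameter, the `adopt_candidate` helper, and the injected `smoke_test` / `create_alias` / `repoint_alias` callables are hypothetical stand-ins for whatever the model registry provides.

```python
from __future__ import annotations

from typing import Callable


def next_release_metadata(lane: str, active_model: str, previous_counter: int = 0) -> dict:
    """Compute the next version counter and a release alias shaped <active-alias>-rN."""
    version_counter = previous_counter + 1
    return {
        "lane": lane,
        "version_counter": version_counter,
        "release_alias": f"{active_model}-r{version_counter}",
    }


def adopt_candidate(
    lane: str,
    active_model: str,
    candidate: str,
    previous_counter: int,
    smoke_test: Callable[[str], bool],
    create_alias: Callable[[str, str], None],
    repoint_alias: Callable[[str, str], None],
) -> dict:
    """Hypothetical driver: the candidate is assumed to be imported already."""
    # Step 2: a failing smoke test stops adoption before any alias changes.
    if not smoke_test(candidate):
        return {"lane": lane, "adopted": False, "reason": "smoke test failed"}
    # Step 3: create the immutable release alias for this candidate.
    release = next_release_metadata(lane, active_model, previous_counter)
    create_alias(release["release_alias"], candidate)
    # Step 4: only now repoint the stable alias to the new release.
    repoint_alias(active_model, release["release_alias"])
    return {**release, "adopted": True}
```

Because the stable alias is touched last, a failure in any earlier step leaves the currently active model untouched.
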
### 3. Dashboard lane status truth

File:

- `magatama/packages/dashboard/src/server.ts`

Changed:

- `/api/llm/status` now reads lane-specific last-run metadata first
- `release_alias` is preferred as the visible model version when present
- this prevents one lane from falsely inheriting another lane's last successful run marker

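The lookup order can be illustrated in Python, although the dashboard itself is TypeScript. Everything here is a sketch under assumptions: the `lane_status` function, its return shape, and the rule that only `magatamallm` falls back to the legacy `last_run.json` are not taken from `server.ts`.

```python
from __future__ import annotations

import json
from pathlib import Path
from typing import Optional


def read_json(path: Path) -> Optional[dict]:
    """Return the parsed JSON object, or None if missing or unreadable."""
    try:
        data = json.loads(path.read_text(encoding="utf-8"))
    except (OSError, ValueError):
        return None
    return data if isinstance(data, dict) else None


def lane_status(lane: str, base_dir: Path, active_alias: str) -> dict:
    """Prefer the lane's own last-run file, then release_alias as the version."""
    meta = read_json(base_dir / f"{lane}-last_run.json")
    if meta is None and lane == "magatamallm":
        # Assumed fallback: the legacy file only ever mirrors magatamallm.
        meta = read_json(base_dir / "last_run.json")
    if meta is None:
        return {"lane": lane, "last_run": None, "visible_version": active_alias}
    visible = meta.get("release_alias") or active_alias
    return {"lane": lane, "last_run": meta, "visible_version": visible}
```

Restricting the legacy fallback to one lane is what keeps `tip_llm` and `fo_blogllm` from reporting a success that belongs to `magatamallm`.
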
### 4. Truthful RunPod terminal failure messaging

Files:

- `magatama/packages/dashboard/src/server.ts`
- `magatama/packages/dashboard/public/index-v2.html`

Changed:

- if RunPod says `COMPLETED` but one of the following holds, the UI now reports that exact reason instead of collapsing into a vague generic failure:
  - no model artifact exists
  - no HF repo appears
  - adoption fails

Frontend hardening:

- avoid showing a misleading late `network error` after the server already emitted a terminal training event
- if the stream dies without a terminal event, the modal says so explicitly

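The distinction between a real training failure and a `COMPLETED` job whose model never came back can be sketched as a small classifier. The function names, outcome codes, and message strings below are hypothetical and written in Python for illustration only; the real logic lives in `server.ts` and `index-v2.html`.

```python
from __future__ import annotations

from typing import Optional, Tuple

Outcome = Tuple[str, str]  # (machine-readable code, human-readable reason)


def classify_terminal_state(
    runpod_status: str,
    artifact_found: bool,
    hf_model_found: bool,
    adoption_ok: bool,
) -> Outcome:
    """Map the end of a RunPod job to one exact, reportable outcome."""
    if runpod_status != "COMPLETED":
        return ("training_failed", f"RunPod job ended with status {runpod_status}")
    if not artifact_found:
        return ("completed_without_artifact", "RunPod reported COMPLETED but no model artifact exists")
    if not hf_model_found:
        return ("completed_without_hf_model", "RunPod reported COMPLETED but no HF model appeared")
    if not adoption_ok:
        return ("adoption_failed", "RunPod reported COMPLETED but local adoption failed")
    return ("adopted", "Model returned and adopted")


def modal_message(terminal: Optional[Outcome], stream_error: bool) -> str:
    """Choose what the training modal shows when the event stream closes."""
    if terminal is not None:
        # A terminal event already arrived, so a late stream error is noise.
        return terminal[1]
    if stream_error:
        return "Stream ended without a terminal training event; check the registry/adoption state"
    return "Training still in progress"
```

Checking the conditions in pipeline order means the message always names the first step that actually failed.
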
### 5. Local training metrics future-proofed

File:

- `magatama/packages/fine-tuner/train.py`

Changed:

- metrics now also respect lane-specific last-run files via `TRAINING_LANE`

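One way `train.py` could resolve its metrics target from `TRAINING_LANE` is sketched below. The `metrics_target` function and the fallback to the legacy `last_run.json` when the variable is unset are assumptions, not the confirmed behaviour.

```python
from __future__ import annotations

import os
from pathlib import Path
from typing import Mapping, Optional


def metrics_target(base_dir: Path, env: Optional[Mapping[str, str]] = None) -> Path:
    """Pick the last-run file for local metrics based on TRAINING_LANE."""
    env = os.environ if env is None else env
    lane = env.get("TRAINING_LANE", "").strip()
    if lane:
        return base_dir / f"{lane}-last_run.json"
    # Assumed fallback: no lane set means the legacy shared file.
    return base_dir / "last_run.json"
```

A run would then be started with something like `TRAINING_LANE=tip_llm` in its environment so that local metrics land in the lane's own file.
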
## Local verification

Passed:

- `python3 -m py_compile .../training_api.py .../train.py`
- `pnpm -C .../packages/dashboard build`

## Live deployment state

Not yet completed in this step.

Reason:

- direct Erik access failed during this block:
  - `ssh: connect to host 82.165.222.127 port 22: Connection refused`
  - later also `Operation timed out`

Therefore:

- the automation fix is locally ready
- but not yet verified live against the currently running:
  - `tip_llm`
  - `fo_blogllm`

## Operational next step

Once Erik SSH is reachable again:

1. deploy updated:
   - `training_api.py`
   - `train.py`
   - dashboard build / server bundle
2. restart:
   - `magatama-dashboard`
   - Mac-side training API if used
3. verify lane-specific status:
   - `tip_llm`
   - `fo_blogllm`
   - `magatamallm`
4. verify that a successful RunPod training now results in:
   - artifact found
   - adoption report present
   - lane-specific `*-last_run.json`
   - release alias incremented
   - stable alias repointed

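The checks in step 4 could be scripted along these lines. This is a hedged sketch: `check_adoption` and the `resolve_alias` callable are hypothetical, and it assumes the lane's `*-last_run.json` carries the `version_counter` and `release_alias` fields named in the adoption report.

```python
from __future__ import annotations

import json
from pathlib import Path
from typing import Callable, Optional


def check_adoption(
    lane: str,
    base_dir: Path,
    previous_counter: int,
    active_alias: str,
    resolve_alias: Callable[[str], Optional[str]],
) -> list[str]:
    """Return a list of problems; an empty list means adoption looks real."""
    path = base_dir / f"{lane}-last_run.json"
    if not path.is_file():
        return [f"missing {path.name}"]
    try:
        report = json.loads(path.read_text(encoding="utf-8"))
    except ValueError:
        return [f"{path.name} is not valid JSON"]
    if not isinstance(report, dict):
        return [f"{path.name} is not a JSON object"]

    problems = []
    counter = report.get("version_counter")
    if not isinstance(counter, int) or counter <= previous_counter:
        problems.append("version_counter was not incremented")
    release_alias = report.get("release_alias")
    if not release_alias:
        problems.append("release_alias missing")
    elif resolve_alias(active_alias) != release_alias:
        # resolve_alias stands in for a lookup against the model registry.
        problems.append(f"{active_alias} does not point at {release_alias}")
    return problems
```

Running it once per lane after a completed RunPod job gives a concrete pass/fail list instead of relying on the dashboard alone.
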