sync: record runpod heartbeat and terminal truth
parent 01d0365fbf
commit 21b56ead81
@@ -1,6 +1,6 @@
# Current TIP Sync State

-Updated: 2026-05-07 02:58 UTC
+Updated: 2026-05-07 08:05 UTC

## Active Policy

@@ -27,6 +27,60 @@ When work touches TIP, Magatama, LLM Gateway, bridges, auth, or shared Erik infr

## Latest Work

- RunPod/MAGATAMA training live follow-up on 2026-05-07:
  - latest `magatamallm` serverless run verified on Erik:
    - job id:
      - `ad003f90-3cf9-43f6-8960-bf6c1ea85097-e2`
    - registry truth in:
      - `/opt/magatama/training-data/model-registry/training-runs.json`
    - observed states:
      - `submitted`
      - then `completed_without_model_artifact`
    - exact recorded warning:
      - `RunPod meldete COMPLETED, aber das erwartete HuggingFace-Modellrepo wurde nicht gefunden.`
      - (English: "RunPod reported COMPLETED, but the expected HuggingFace model repo was not found.")
    - interpretation:
      - dataset build and RunPod submit are working
      - the worker still does not return a verifiable, adoptable model artifact
      - this is a real training return-path failure, not just a cosmetic UI issue
  - local training API truth rechecked:
    - `GET http://127.0.0.1:3214/health`
    - service responds with:
      - `status = ok`
      - `service = magatama-train-api`
      - `running = false`
      - `pid = null`
    - meaning:
      - API is healthy and reachable
      - currently idle
      - ready for adoption/import calls once a valid RunPod artifact exists
  - one UI bug in the training modal was fixed live:
    - root cause:
      - during long `IN_PROGRESS` and post-`COMPLETED` artifact verification phases, MAGATAMA sent no heartbeat for too long
      - the browser or proxy could then terminate the stream and surface only:
        - `network error`
      - even though Erik had already written the more truthful registry state
    - fix:
      - `magatama/packages/dashboard/src/server.ts`
      - added server-sent heartbeat messages while:
        - RunPod status remains unchanged
        - Hugging Face / artifact propagation checks are still running
      - concrete live strings now deployed in the Erik dashboard server:
        - `⏳ RunPod arbeitet weiter (...)` (English: "RunPod is still working (...)")
        - `⏳ Prüfe Modellartefakt ...` (English: "Checking model artifact ...")
    - deployment:
      - rebuilt dashboard
      - rsynced `packages/dashboard/dist/server.js` to Erik
      - restarted `pm2 magatama-dashboard`
      - remote `server.js` verified to contain the heartbeat strings
    - expected operator effect:
      - future training runs should no longer collapse into a late generic `network error` while RunPod/adoption checks are still active
      - the UI should stay alive long enough to show the real terminal result:
        - `completed_and_adopted`
        - or `completed_without_model_artifact`
        - or worker/adoption failure

- MAGATAMA live follow-up on 2026-05-07:
  - local Mac training API was rechecked after the lane-specific automation changes.
  - current live truth:

@@ -0,0 +1,125 @@
# MAGATAMA RunPod Heartbeat and Real Terminal Status

Date: 2026-05-07 UTC

## Scope

- MAGATAMA dashboard training modal
- RunPod serverless training status truth
- local Mac training API sanity check

## What Was Observed

Latest verified `magatamallm` run:

- job id:
  - `ad003f90-3cf9-43f6-8960-bf6c1ea85097-e2`

On Erik, the run registry recorded:

- `submitted`
- then:
  - `completed_without_model_artifact`

Registry source:

- `/opt/magatama/training-data/model-registry/training-runs.json`

Recorded warning:

- `RunPod meldete COMPLETED, aber das erwartete HuggingFace-Modellrepo wurde nicht gefunden.`
  - English: "RunPod reported COMPLETED, but the expected HuggingFace model repo was not found."

## Conclusion

This confirms:

- training dataset build worked
- RunPod submit worked
- the return path still failed because no adoptable model artifact was verified

This is a real return-path failure, not just a cosmetic issue.

## Separate UI Failure

There was also a UX/runtime bug in the MAGATAMA dashboard modal. The SSE stream could stay silent for too long:

- while RunPod stayed `IN_PROGRESS`
- or while MAGATAMA waited for artifact visibility after `COMPLETED`

Result:

- the browser or proxy would terminate the stream
- the user only saw:
  - `network error`

even though Erik already had the more truthful internal status.

## Fix Applied

File:

- `magatama/packages/dashboard/src/server.ts`

Changes:

- added periodic SSE heartbeat messages while the RunPod status remains unchanged:
  - `⏳ RunPod arbeitet weiter (...)` (English: "RunPod is still working (...)")
- added periodic SSE heartbeat messages while MAGATAMA checks artifact visibility:
  - `⏳ Prüfe Modellartefakt ...` (English: "Checking model artifact ...")

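The heartbeat behaviour listed under Changes can be sketched roughly as follows. This is a minimal illustration, not the code in `server.ts`: the names `PollState`, `sseFrame`, and `onPoll`, the 15-second interval, and the `RunPod status: ...` message format are all assumptions. Only the heartbeat text is copied from this note, with its elided `(...)` left as-is.

```typescript
// Minimal sketch, NOT the actual server.ts code. All names and the interval
// are hypothetical; only HEARTBEAT_TEXT is copied verbatim from this note.
const HEARTBEAT_TEXT = "⏳ RunPod arbeitet weiter (...)";
const HEARTBEAT_INTERVAL_MS = 15_000; // assumed value

interface PollState {
  lastStatus: string | null; // last RunPod status forwarded to the client
  lastSentAtMs: number; // timestamp of the last frame written to the stream
}

// One server-sent event frame: a single "data:" line ended by a blank line.
function sseFrame(message: string): string {
  return `data: ${message}\n\n`;
}

// Called after every poll of the RunPod job status. Returns the frame to write
// (or null when nothing needs to be sent yet) together with the updated state.
function onPoll(
  state: PollState,
  status: string,
  nowMs: number
): { frame: string | null; state: PollState } {
  if (status !== state.lastStatus) {
    // Status changed: forward it immediately.
    return {
      frame: sseFrame(`RunPod status: ${status}`),
      state: { lastStatus: status, lastSentAtMs: nowMs },
    };
  }
  if (nowMs - state.lastSentAtMs >= HEARTBEAT_INTERVAL_MS) {
    // Status unchanged for too long: send a heartbeat so the stream is not idle.
    return {
      frame: sseFrame(HEARTBEAT_TEXT),
      state: { ...state, lastSentAtMs: nowMs },
    };
  }
  return { frame: null, state };
}
```

In an HTTP handler, each non-null frame would be written to a response opened with `Content-Type: text/event-stream`. Keeping the decision in a pure function makes the timing easy to check without a live RunPod job.
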
## Local Training API Recheck

Local service:

- `http://127.0.0.1:3214/health`

Verified response:

- `status = ok`
- `service = magatama-train-api`
- `running = false`
- `pid = null`

Interpretation:

- the local training/adoption API is healthy and reachable
- it is currently idle, not broken
- it is ready for adoption once a valid RunPod artifact exists

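The interpretation above can be written down as a small check. This is a sketch under assumptions: the JSON shape of the `/health` payload is inferred from the four fields listed in this note, and `TrainApiHealth`, `HealthVerdict`, and `interpretHealth` are hypothetical names.

```typescript
// Sketch only: the payload shape is an assumption derived from the four
// fields listed in this note; the real /health response may differ.
interface TrainApiHealth {
  status: string;
  service: string;
  running: boolean;
  pid: number | null;
}

type HealthVerdict = "idle" | "busy" | "unhealthy";

function interpretHealth(h: TrainApiHealth): HealthVerdict {
  if (h.status !== "ok" || h.service !== "magatama-train-api") {
    return "unhealthy";
  }
  // running = false with pid = null is the idle state recorded in this note.
  return h.running ? "busy" : "idle";
}

// The response recorded during the recheck.
const observed: TrainApiHealth = {
  status: "ok",
  service: "magatama-train-api",
  running: false,
  pid: null,
};
console.log(interpretHealth(observed)); // prints "idle"
```
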
## Live Deployment

Deployed to Erik:

- rebuilt dashboard server
- rsynced:
  - `/opt/magatama/packages/dashboard/dist/server.js`
- restarted:
  - `pm2 restart magatama-dashboard`

Remote verification confirmed the new server bundle contains:

- `⏳ RunPod arbeitet weiter`

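A bundle check of this kind could look like the sketch below. The function name `missingMarkers` is hypothetical, and how the verification was actually run on Erik is not recorded here; the marker strings are the two heartbeat texts from this note.

```typescript
// Sketch: report which expected heartbeat strings are absent from a bundle.
// The function name is hypothetical; the marker strings come from this note.
const HEARTBEAT_MARKERS = ["⏳ RunPod arbeitet weiter", "⏳ Prüfe Modellartefakt"];

function missingMarkers(bundleText: string, markers: string[]): string[] {
  // Keep only the markers that do not occur anywhere in the bundle text.
  return markers.filter((marker) => !bundleText.includes(marker));
}
```

The bundle text would typically come from `readFileSync(path, "utf8")` in `node:fs`; an empty result means every marker was found. If the build tool escapes non-ASCII characters in string literals, a literal search for `⏳` or `ü` can miss, so searching for an ASCII-only part such as `RunPod arbeitet weiter` is the safer variant.
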
## Operational Impact

Future runs should no longer collapse into a misleading generic `network error` during long polling/verification silence.

The expected visible end states should now be the real ones:

- `completed_and_adopted`
- `completed_without_model_artifact`
- adoption failure
- worker failure

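The four end states can be thought of as the result of a small decision over what was observed for a run. The sketch below is illustrative only: `adoption_failed` and `worker_failed` are hypothetical labels (this note names those outcomes but records no identifier for them), the set of failed RunPod statuses is an assumption, and `RunObservation` and `terminalState` are invented names.

```typescript
// Sketch only. "adoption_failed" and "worker_failed" are hypothetical labels:
// the note names these outcomes but records no identifier for them.
type TerminalState =
  | "completed_and_adopted"
  | "completed_without_model_artifact"
  | "adoption_failed"
  | "worker_failed";

interface RunObservation {
  runpodStatus: string; // e.g. "IN_PROGRESS", "COMPLETED"
  artifactVerified: boolean; // the expected model repo was found
  adopted: boolean; // final result of the adoption/import attempt
}

// Assumed set of non-success terminal RunPod statuses.
const FAILED_STATUSES = new Set(["FAILED", "CANCELLED", "TIMED_OUT"]);

// Returns null while the run is still in flight.
function terminalState(o: RunObservation): TerminalState | null {
  if (FAILED_STATUSES.has(o.runpodStatus)) return "worker_failed";
  if (o.runpodStatus !== "COMPLETED") return null;
  if (!o.artifactVerified) return "completed_without_model_artifact";
  return o.adopted ? "completed_and_adopted" : "adoption_failed";
}
```

For the run recorded in this note (`COMPLETED`, no verified artifact), this yields `completed_without_model_artifact`.
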
## Remaining Hard Truth

The artifact/adoption problem itself is **not fixed yet**.

Current state:

- RunPod jobs can reach `COMPLETED`
- but MAGATAMA still has no verified returned model artifact to import and version-switch

That remains the next required fix block.