# MAGATAMA RunPod Heartbeat and Real Terminal Status

Date: 2026-05-07 UTC

## Scope

- MAGATAMA dashboard training modal
- RunPod serverless training status truth
- local Mac training API sanity check

## What Was Observed

Latest verified `magatamallm` run:

- job id: `ad003f90-3cf9-43f6-8960-bf6c1ea85097-e2`

On Erik, the run registry recorded:

- `submitted`
- then: `completed_without_model_artifact`

Registry source:

- `/opt/magatama/training-data/model-registry/training-runs.json`

Recorded warning:

- `RunPod meldete COMPLETED, aber das erwartete HuggingFace-Modellrepo wurde nicht gefunden.` (English: "RunPod reported COMPLETED, but the expected HuggingFace model repo was not found.")

## Conclusion

This shows that:

- the training dataset build worked
- the RunPod submit worked
- the return path still failed because no adoptable model artifact was verified

This was not just a cosmetic issue: the run ended without a model that MAGATAMA could adopt.

## Separate UI Failure

There was also a UX/runtime bug in the MAGATAMA dashboard modal. The SSE stream could stay silent for too long:

- while RunPod stayed `IN_PROGRESS`
- or while MAGATAMA waited for artifact visibility after `COMPLETED`

Result:

- the browser or proxy would terminate the stream
- the user only saw `network error`, even though Erik already had the more truthful internal status

## Fix Applied

File:

- `magatama/packages/dashboard/src/server.ts`

Changes:

- added periodic SSE heartbeat messages while the RunPod status remains unchanged:
  - `⏳ RunPod arbeitet weiter (...)` (English: "RunPod is still working (...)")
- added periodic SSE heartbeat messages while MAGATAMA checks artifact visibility:
  - `⏳ Prüfe Modellartefakt ...` (English: "Checking model artifact ...")

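The heartbeat idea can be sketched roughly as follows. This is a minimal illustration, not the code in `server.ts`: the names `formatSseMessage`, `HeartbeatGate`, and `onPollTick`, the `RunPod status: ...` line format, and any interval value are assumptions made for the example. It only shows the principle: write the status when it changes, and otherwise write a keep-alive message once the stream has been quiet for a full interval.

```typescript
// Hypothetical helpers for illustration; the real server.ts code is not shown in this note.

/** Format one SSE message. Each line of the text becomes its own "data:" field. */
function formatSseMessage(text: string): string {
  const lines = text.split(/\r?\n/).map((line) => `data: ${line}`);
  return lines.join("\n") + "\n\n"; // a blank line terminates the event
}

/** Tracks how long the stream has been quiet and decides when a heartbeat is due. */
class HeartbeatGate {
  private readonly intervalMs: number;
  private lastSentAt: number;

  constructor(intervalMs: number, startedAt: number) {
    this.intervalMs = intervalMs;
    this.lastSentAt = startedAt;
  }

  /** Call whenever a real status line was written to the stream. */
  markSent(now: number): void {
    this.lastSentAt = now;
  }

  /** Returns true (and resets the timer) once the quiet period reaches the interval. */
  isDue(now: number): boolean {
    if (now - this.lastSentAt < this.intervalMs) return false;
    this.lastSentAt = now;
    return true;
  }
}

/** One polling tick: emit the status on change, otherwise a heartbeat when due. */
function onPollTick(
  write: (chunk: string) => void,
  gate: HeartbeatGate,
  previousStatus: string | null,
  currentStatus: string,
  now: number,
  heartbeatText: string, // e.g. the "RunPod arbeitet weiter" message from the fix
): string {
  if (currentStatus !== previousStatus) {
    // Assumed line format for a status change.
    write(formatSseMessage(`RunPod status: ${currentStatus}`));
    gate.markSent(now);
  } else if (gate.isDue(now)) {
    write(formatSseMessage(heartbeatText));
  }
  return currentStatus;
}
```

In a real handler, `write` would typically be the response's write method and `now` would come from `Date.now()`. Passing the clock in explicitly keeps the timing logic easy to test. The same gate could be reused for the artifact-visibility wait with the second heartbeat text.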
## Local Training API Recheck

Local service:

- `http://127.0.0.1:3214/health`

Verified response:

- `status = ok`
- `service = magatama-train-api`
- `running = false`
- `pid = null`

Interpretation:

- the local training/adoption API is healthy and reachable
- it is currently idle, not broken
- it is ready for adoption once a valid RunPod artifact exists

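The interpretation above can be written down as a small check. This is a sketch under assumptions: the payload is assumed to be a flat JSON object with exactly the four fields listed, the names `TrainApiHealth`, `interpretHealth`, and `checkLocalTrainApi` are hypothetical, and reading `running = true` as "busy" is an inference rather than something this note verified.

```typescript
// Assumed payload shape, inferred from the four fields in the verified response.
interface TrainApiHealth {
  status: string;
  service: string;
  running: boolean;
  pid: number | null;
}

type TrainApiState = "idle" | "busy" | "unhealthy";

/** Map a health payload onto the three states this note distinguishes. */
function interpretHealth(h: TrainApiHealth): TrainApiState {
  if (h.status !== "ok" || h.service !== "magatama-train-api") {
    return "unhealthy";
  }
  // Assumption: running = true means a training/adoption process is active.
  return h.running ? "busy" : "idle";
}

/** Fetch the local health endpoint (global fetch, Node 18+) and interpret it. */
async function checkLocalTrainApi(
  url: string = "http://127.0.0.1:3214/health",
): Promise<TrainApiState> {
  const res = await fetch(url);
  if (!res.ok) return "unhealthy";
  return interpretHealth((await res.json()) as TrainApiHealth);
}
```

For the response recorded above, `interpretHealth` yields `"idle"`, which matches the reading that the service is healthy and waiting for a valid artifact.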
## Live Deployment

Deployed to Erik:

- rebuilt dashboard server
- rsynced: `/opt/magatama/packages/dashboard/dist/server.js`
- restarted: `pm2 restart magatama-dashboard`

Remote verification confirmed the new server bundle contains:

- `⏳ RunPod arbeitet weiter`

## Operational Impact

Future runs should no longer collapse into a misleading generic `network error` when polling or artifact verification produces no status change for a long time.

The expected visible end states should now be the real ones:

- `completed_and_adopted`
- `completed_without_model_artifact`
- adoption failure
- worker failure

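One way to picture how these end states relate to each other is a small classifier. This is a sketch only: `completed_and_adopted` and `completed_without_model_artifact` come from this note, while the labels `adoption_failed` and `worker_failed`, the `RunOutcome` shape, and the function name are hypothetical stand-ins for the two states the note names only in prose. It assumes the RunPod status passed in is already terminal.

```typescript
type EndState =
  | "completed_and_adopted"
  | "completed_without_model_artifact"
  | "adoption_failed" // hypothetical label for "adoption failure"
  | "worker_failed"; // hypothetical label for "worker failure"

// Hypothetical summary of one finished run.
interface RunOutcome {
  runpodStatus: string; // terminal RunPod status, e.g. "COMPLETED" or "FAILED"
  artifactFound: boolean; // was the expected model artifact verified?
  adoptionSucceeded: boolean; // did import and version switch succeed?
}

/** Classify a terminal run into one of the four visible end states. */
function classifyEndState(o: RunOutcome): EndState {
  // Any terminal status other than COMPLETED is treated as a worker failure here.
  if (o.runpodStatus !== "COMPLETED") return "worker_failed";
  if (!o.artifactFound) return "completed_without_model_artifact";
  return o.adoptionSucceeded ? "completed_and_adopted" : "adoption_failed";
}
```

Under these assumptions, the run observed in this note (RunPod `COMPLETED`, no artifact found) classifies as `completed_without_model_artifact`.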
## Remaining Hard Truth

The artifact/adoption problem itself is **not fixed yet**.

Current state:

- RunPod jobs can reach `COMPLETED`
- but MAGATAMA still has no verified returned model artifact to import and version-switch

That remains the next required fix block.