MAGATAMA RunPod Heartbeat and Real Terminal Status
Date: 2026-05-07 UTC
Scope
- MAGATAMA dashboard training modal
- RunPod serverless training status truth
- local Mac training API sanity check
What Was Observed
Latest verified magatamallm run:
- job id:
ad003f90-3cf9-43f6-8960-bf6c1ea85097-e2
On Erik, the run registry recorded:
- submitted
- then: completed_without_model_artifact
Registry source:
/opt/magatama/training-data/model-registry/training-runs.json
Recorded warning:
RunPod meldete COMPLETED, aber das erwartete HuggingFace-Modellrepo wurde nicht gefunden.
(English: "RunPod reported COMPLETED, but the expected HuggingFace model repo was not found.")
Conclusion
This proves:
- training dataset build worked
- RunPod submit worked
- the return path still failed because no adoptable model artifact was verified
This was not just a cosmetic issue.
Separate UI Failure
There was also a UX/runtime bug in the MAGATAMA dashboard modal:
- while RunPod stayed IN_PROGRESS
- or while MAGATAMA waited for artifact visibility after COMPLETED
the SSE stream could stay quiet for too long.
Result:
- browser/proxy would terminate the stream
- user only saw:
network error
even though Erik already had the more truthful internal status.
Fix Applied
File:
magatama/packages/dashboard/src/server.ts
Changes:
- added periodic SSE heartbeat messages while the RunPod status remains unchanged:
⏳ RunPod arbeitet weiter (...)
(English: "RunPod is still working")
- added periodic SSE heartbeat messages while MAGATAMA checks artifact visibility:
⏳ Prüfe Modellartefakt ...
(English: "Checking model artifact")
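The heartbeat logic described above can be sketched roughly as follows. This is not the code in server.ts: the names HeartbeatState, formatSseData and nextFrame, the "RunPod status: ..." wording, and the idea of passing the interval and heartbeat text as parameters are all assumptions made for illustration. The only fixed part is the SSE wire format, where each event is one or more "data:" lines followed by a blank line.

```typescript
// Illustrative sketch only. All names here are hypothetical, not from server.ts.

interface HeartbeatState {
  lastSentMs: number; // time the stream last carried any event
  lastStatus: string; // last RunPod status forwarded to the client
}

/** Format one SSE event. Multi-line text becomes several "data:" lines. */
function formatSseData(message: string): string {
  const lines = message.split(/\r?\n/).map((line) => `data: ${line}`);
  return lines.join("\n") + "\n\n";
}

/**
 * One polling tick. A status change is always forwarded. If the status is
 * unchanged, a heartbeat is emitted only once the stream has been quiet for
 * at least intervalMs. Returns the frame to write, or null for "send nothing".
 */
function nextFrame(
  state: HeartbeatState,
  currentStatus: string,
  nowMs: number,
  intervalMs: number,
  heartbeatText: string,
): string | null {
  if (currentStatus !== state.lastStatus) {
    state.lastStatus = currentStatus;
    state.lastSentMs = nowMs;
    // Hypothetical wording for a real status change.
    return formatSseData(`RunPod status: ${currentStatus}`);
  }
  if (nowMs - state.lastSentMs >= intervalMs) {
    state.lastSentMs = nowMs;
    return formatSseData(heartbeatText);
  }
  return null;
}
```

In a Node HTTP handler, a non-null frame would typically be passed to res.write(frame) on a response whose Content-Type is text/event-stream. The interval should be chosen well below the idle timeout of the browser or proxy in front of the dashboard; the actual value used in the fix is not recorded here.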
Local Training API Recheck
Local service:
http://127.0.0.1:3214/health
Verified response:
status = ok
service = magatama-train-api
running = false
pid = null
Interpretation:
- the local training/adoption API is healthy and reachable
- it is currently idle, not broken
- it is ready for adoption once a valid RunPod artifact exists
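The interpretation above can be written down as a small check. This is a sketch under assumptions: the TrainApiHealth shape simply mirrors the four fields in the verified response, and the function name and the "idle" / "busy" / "unhealthy" labels are hypothetical. The actual response encoding of the /health endpoint is not documented in this note.

```typescript
// Assumed shape, mirroring the four fields of the verified response.
interface TrainApiHealth {
  status: string;
  service: string;
  running: boolean;
  pid: number | null;
}

// Hypothetical labels for the three cases distinguished in this note.
type TrainApiState = "idle" | "busy" | "unhealthy";

/** Classify a health response: healthy and idle, healthy and busy, or broken. */
function interpretHealth(health: TrainApiHealth): TrainApiState {
  const healthy =
    health.status === "ok" && health.service === "magatama-train-api";
  if (!healthy) {
    return "unhealthy";
  }
  // running = false with pid = null means no training process is active.
  return health.running ? "busy" : "idle";
}
```

Applied to the verified response (status ok, running false, pid null), this yields "idle", which matches the reading that the service is reachable and waiting for a valid artifact rather than broken.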
Live Deployment
Deployed to Erik:
- rebuilt dashboard server
- rsynced:
/opt/magatama/packages/dashboard/dist/server.js
- restarted:
pm2 restart magatama-dashboard
Remote verification confirmed the new server bundle contains:
⏳ RunPod arbeitet weiter
Operational Impact
Future runs should no longer collapse into a misleading generic network error during long polling/verification silence.
The expected visible end states should now be the real ones:
- completed_and_adopted
- completed_without_model_artifact
- adoption failure
- worker failure
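One way to picture how these end states relate to each other is a single mapping from the run outcome to a terminal status. The sketch below is an assumption, not the dashboard's code: only completed_and_adopted and completed_without_model_artifact are identifiers taken from this note, while adoption_failed, worker_failed, the RunOutcome fields and the function name are hypothetical. IN_QUEUE is included as RunPod's non-terminal queue state alongside IN_PROGRESS.

```typescript
// Only the first two identifiers appear in this note; the last two are hypothetical.
type TerminalStatus =
  | "completed_and_adopted"
  | "completed_without_model_artifact"
  | "adoption_failed"
  | "worker_failed";

// Hypothetical summary of what is known about a run at decision time.
interface RunOutcome {
  runpodStatus: string; // status string as reported by RunPod
  artifactVerified: boolean; // expected model repo was found
  adoptionSucceeded: boolean; // local import and version switch worked
}

/** Map a run outcome to a terminal status, or null while still running. */
function resolveTerminalStatus(outcome: RunOutcome): TerminalStatus | null {
  if (outcome.runpodStatus === "IN_QUEUE" || outcome.runpodStatus === "IN_PROGRESS") {
    return null; // not terminal yet, keep polling
  }
  if (outcome.runpodStatus !== "COMPLETED") {
    return "worker_failed"; // assumption: any other status is a worker-side failure
  }
  if (!outcome.artifactVerified) {
    return "completed_without_model_artifact";
  }
  return outcome.adoptionSucceeded ? "completed_and_adopted" : "adoption_failed";
}
```

The run recorded above (RunPod reported COMPLETED, model repo not found) falls into the completed_without_model_artifact branch, independent of whether the SSE stream survived long enough to show it.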
Remaining Hard Truth
The artifact/adoption problem itself is not fixed yet.
Current state:
- RunPod jobs can reach COMPLETED
- but MAGATAMA still has no verified returned model artifact to import and version-switch
That remains the next required fix block.