2026-05-06 — MAGATAMA RunPod serverless materialization check
Summary
MAGATAMA's RunPod submit path was hardened and re-tested against the queue-based Axolotl serverless endpoint dheii186pfcuq7.
What changed
- Payload alignment was tightened toward the official Axolotl serverless schema:
  - added model_type=AutoModelForCausalLM
  - added tokenizer_type=AutoTokenizer
  - switched dataset split declaration to split: train
  - switched optimizer from adamw_8bit to adamw_torch_fused
- Both submit paths now distinguish between:
  - /run accepted
  - /status/{job} actually exists
- Updated files:
  - magatama/packages/dashboard/src/server.ts
  - magatama/scripts/submit_runpod_training.ts
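The distinction between "/run accepted" and "/status/{job} actually exists" can be sketched as a small classifier. This is a minimal illustration, not the code in server.ts or submit_runpod_training.ts: the type names, the function name, and every verdict label except not_found_after_submit are hypothetical.

```typescript
// Hypothetical sketch: names are illustrative, not MAGATAMA's actual code.
// Only "not_found_after_submit" is a label taken from the log itself.
type SubmitVerdict =
  | "submit_rejected"        // /run did not accept the payload
  | "accepted_unverified"    // /run accepted, /status not checked or inconclusive
  | "materialized"           // /status/{job} returned a real job status
  | "not_found_after_submit"; // /run accepted, but /status/{job} returned 404

interface RunResponse {
  httpStatus: number;
  jobId?: string;
}

interface StatusResponse {
  httpStatus: number;
  runpodStatus?: string; // e.g. "IN_QUEUE"
}

function classifySubmit(run: RunResponse, status?: StatusResponse): SubmitVerdict {
  const accepted = run.httpStatus >= 200 && run.httpStatus < 300 && !!run.jobId;
  if (!accepted) return "submit_rejected";
  // An accepted /run alone proves nothing about the job existing.
  if (!status) return "accepted_unverified";
  if (status.httpStatus === 404) return "not_found_after_submit";
  if (status.httpStatus === 200 && status.runpodStatus) return "materialized";
  // Any other response (5xx, empty body) leaves the question open.
  return "accepted_unverified";
}
```

Keeping the classifier free of network calls means both submit paths could share it and feed it whatever HTTP client they already use.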
Verified behavior
Full run attempt
- Submitted magatamallm 500-step run.
- Returned job id: 9bc4b16b-755b-465b-aadf-b46f2fe467a3-e2
- Reconcile result shortly after:
  - not_found_after_submit
  - HTTP 404 job not found
Canary run after payload/schema fix
- Submitted magatamallm seed-only canary.
- Returned job id: a4ac6951-7ed7-43cb-80d8-5ab61533c2da-e2
- Immediate submit-side verification saw real queue materialization:
  - runpod_status: IN_QUEUE
- Reconcile roughly 45 seconds later still observed:
  - not_found_after_submit
  - HTTP 404 job not found
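The canary pattern above (seen as IN_QUEUE, then 404 roughly 45 seconds later) can be detected mechanically from a list of poll results. The sketch below is an assumption about how such a reconcile step might be written; PollObservation, LifecycleVerdict, and detectVanishedJob are hypothetical names, and the submit-side check is modelled as just another observation.

```typescript
// Hypothetical sketch of a reconcile summary; not MAGATAMA's actual reconcile code.
interface PollObservation {
  atSeconds: number;     // seconds since submit
  httpStatus: number;    // HTTP status of the status lookup
  runpodStatus?: string; // present only when the job was found
}

type LifecycleVerdict = "never_materialized" | "vanished_after_seen" | "still_visible";

function detectVanishedJob(polls: PollObservation[]): LifecycleVerdict {
  // Sort a copy so callers may pass observations in any order.
  const sorted = polls.slice().sort((a, b) => a.atSeconds - b.atSeconds);
  let everSeen = false;
  let verdict: LifecycleVerdict = "never_materialized"; // also the result for no data
  for (const p of sorted) {
    if (p.httpStatus === 200 && p.runpodStatus !== undefined) {
      everSeen = true;
      verdict = "still_visible";
    } else if (p.httpStatus === 404) {
      // A 404 after a successful sighting means the job disappeared.
      verdict = everSeen ? "vanished_after_seen" : "never_materialized";
    }
    // Other HTTP statuses are treated as inconclusive and ignored.
  }
  return verdict;
}

// The canary timeline from this log, expressed as observations.
const canary: PollObservation[] = [
  { atSeconds: 0, httpStatus: 200, runpodStatus: "IN_QUEUE" },
  { atSeconds: 45, httpStatus: 404 },
];
const canaryVerdict = detectVanishedJob(canary); // "vanished_after_seen"
```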
Conclusion
The old MAGATAMA bug (blindly trusting /run) is fixed.
The remaining problem is now narrower and likely external to MAGATAMA itself:
- RunPod serverless currently accepts the submit and briefly materializes the job as IN_QUEUE,
- but the job disappears before a durable status/progress/completion lifecycle can be observed.
The endpoint/release therefore cannot be trusted for a full production training launch until it keeps a job alive beyond the initial queue stage.
Operational rule
Do not treat submitted or even a brief IN_QUEUE as proof of a usable serverless training run.
A MAGATAMA serverless training run is only trustworthy when at least one of these is true:
- status progresses to IN_PROGRESS, or
- a durable terminal state is observed with artifact evidence.
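The rule can be written as a single predicate. This is a sketch under stated assumptions: RunEvidence, isTrustworthyRun, and TERMINAL_STATES are hypothetical names, the set of terminal states is assumed to be COMPLETED, FAILED, CANCELLED, and TIMED_OUT, and what counts as "artifact evidence" is left to the caller as a boolean.

```typescript
// Hypothetical sketch of the operational rule; names are illustrative.
type RunpodStatus =
  | "IN_QUEUE"
  | "IN_PROGRESS"
  | "COMPLETED"
  | "FAILED"
  | "CANCELLED"
  | "TIMED_OUT";

// Assumption: these are the states treated as terminal.
const TERMINAL_STATES: RunpodStatus[] = ["COMPLETED", "FAILED", "CANCELLED", "TIMED_OUT"];

interface RunEvidence {
  statusesSeen: RunpodStatus[]; // every status observed for the job so far
  artifactFound: boolean;       // caller-defined: e.g. a checkpoint or log was located
}

function isTrustworthyRun(evidence: RunEvidence): boolean {
  const progressed = evidence.statusesSeen.some((s) => s === "IN_PROGRESS");
  const reachedTerminal = evidence.statusesSeen.some(
    (s) => TERMINAL_STATES.indexOf(s) !== -1
  );
  // A submit or a brief IN_QUEUE alone never satisfies either condition.
  return progressed || (reachedTerminal && evidence.artifactFound);
}
```

For example, evidence of only IN_QUEUE (the canary in this log) evaluates to false, while any sighting of IN_PROGRESS evaluates to true.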
Open next step
- Inspect the actual RunPod serverless endpoint/release configuration and worker-side logs in RunPod UI.
- Only launch the full MagatamaLLM run after a canary survives beyond queue materialization.