# 2026-05-06 — MAGATAMA RunPod serverless materialization check

## Summary

MAGATAMA's RunPod submit path was hardened and re-tested against the queue-based Axolotl serverless endpoint `dheii186pfcuq7`.

## What changed

- Payload alignment was tightened toward the official Axolotl serverless schema:
  - added `model_type=AutoModelForCausalLM`
  - added `tokenizer_type=AutoTokenizer`
  - switched dataset split declaration to `split: train`
  - switched optimizer from `adamw_8bit` to `adamw_torch_fused`
- Both submit paths now distinguish between:
  - `/run` accepted
  - `/status/{job}` actually exists
- Updated files:
  - `magatama/packages/dashboard/src/server.ts`
  - `magatama/scripts/submit_runpod_training.ts`

## Verified behavior

### Full run attempt

- Submitted `magatamallm` 500-step run.
- Returned job id: `9bc4b16b-755b-465b-aadf-b46f2fe467a3-e2`
- Reconcile result shortly after:
  - `not_found_after_submit`
  - HTTP `404`
  - `job not found`

### Canary run after payload/schema fix

- Submitted `magatamallm` seed-only canary.
- Returned job id: `a4ac6951-7ed7-43cb-80d8-5ab61533c2da-e2`
- Immediate submit-side verification saw real queue materialization:
  - `runpod_status: IN_QUEUE`
- Reconcile roughly 45 seconds later still observed:
  - `not_found_after_submit`
  - HTTP `404`
  - `job not found`

## Conclusion

The old MAGATAMA bug (blindly trusting `/run`) is fixed.

The remaining problem is now narrower and likely external to MAGATAMA itself:

- RunPod serverless currently accepts the submit and briefly materializes the job as `IN_QUEUE`,
- but the job disappears before a durable status/progress/completion lifecycle can be observed.

This means the endpoint/release is still not trustworthy enough for a full production training launch until it can keep a job alive beyond the initial queue stage.

## Operational rule

Do **not** treat `submitted` or even a brief `IN_QUEUE` as proof of a usable serverless training run.
A MAGATAMA serverless training run is only trustworthy when at least one of these is true:

- status progresses to `IN_PROGRESS`, or
- a durable terminal state is observed with artifact evidence.

## Open next step

- Inspect the actual RunPod serverless endpoint/release configuration and worker-side logs in RunPod UI.
- Only launch the full MagatamaLLM run after a canary survives beyond queue materialization.
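
The operational rule stated earlier can be sketched as a small pure function over the sequence of reconcile observations for one job. This is a minimal illustration, not the code in `server.ts` or `submit_runpod_training.ts`: the `Observation` and `Verdict` types, the `classifyRun` name, the `hasArtifacts` flag, and the set of terminal status strings are all assumptions introduced here. Only `IN_QUEUE`, `IN_PROGRESS`, and the 404 `job not found` case come from the log above.

```typescript
// Hypothetical shapes for what a reconcile pass might record (not from the MAGATAMA source).
type Observation =
  | { kind: "status"; status: string; hasArtifacts?: boolean }
  | { kind: "not_found" }; // HTTP 404 "job not found"

type Verdict = "trusted" | "unproven" | "vanished";

// Assumed terminal status names; adjust to whatever the endpoint actually reports.
const TERMINAL_STATUSES = new Set(["COMPLETED", "FAILED", "CANCELLED", "TIMED_OUT"]);

function classifyRun(observations: Observation[]): Verdict {
  for (const obs of observations) {
    if (obs.kind !== "status") continue;
    // Rule 1: the job progressed past the queue.
    if (obs.status === "IN_PROGRESS") return "trusted";
    // Rule 2: a terminal state only counts when artifact evidence accompanies it.
    if (TERMINAL_STATUSES.has(obs.status) && obs.hasArtifacts === true) return "trusted";
  }
  // Neither rule matched: separate "job disappeared" from "still waiting".
  const last = observations[observations.length - 1];
  if (last !== undefined && last.kind === "not_found") return "vanished";
  return "unproven";
}

// The canary from this log: IN_QUEUE at submit time, then 404 on reconcile.
const canary: Observation[] = [
  { kind: "status", status: "IN_QUEUE" },
  { kind: "not_found" },
];
console.log(classifyRun(canary)); // prints "vanished"
```

Under this sketch a lone `IN_QUEUE` observation yields `unproven` rather than `trusted`, which matches the rule that queue materialization alone is not evidence of a usable run.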