diff --git a/sync/CURRENT.md b/sync/CURRENT.md
index 918afe1..cdb59a6 100644
--- a/sync/CURRENT.md
+++ b/sync/CURRENT.md
@@ -70,6 +70,26 @@ When work touches TIP, Magatama, LLM Gateway, bridges, auth, or shared Erik infr
     - Hugging Face dataset source
     - or MAGATAMA URL-bundle dataset source
   - this avoids misleading `RunPod not reachable` wording when the actual failure is in dataset preparation.
+  - follow-up serverless verification on 2026-05-06 narrowed the remaining fault further:
+    - MAGATAMA submit logic now verifies that a RunPod job really exists under `/status/{jobId}` instead of trusting `/run`.
+    - payloads were aligned more closely with the official Axolotl serverless schema:
+      - `model_type=AutoModelForCausalLM`
+      - `tokenizer_type=AutoTokenizer`
+      - dataset `split: train`
+      - optimizer `adamw_torch_fused`
+    - verified full run attempt:
+      - job id `9bc4b16b-755b-465b-aadf-b46f2fe467a3-e2`
+      - disappeared as `not_found_after_submit` (`404 job not found`)
+    - verified canary after payload fix:
+      - job id `a4ac6951-7ed7-43cb-80d8-5ab61533c2da-e2`
+      - immediately materialized as `IN_QUEUE`
+      - then still disappeared on later reconcile as `not_found_after_submit`
+    - current conclusion:
+      - the old MAGATAMA bug is fixed.
+      - the remaining problem is now likely on the RunPod endpoint/release side: jobs are accepted and briefly queued, but do not survive long enough to produce a durable serverless status lifecycle.
+    - operational rule:
+      - do not treat `submitted` or a brief `IN_QUEUE` as proof of a usable serverless training run.
+      - only trust the run once it reaches `IN_PROGRESS` or a durable terminal state with artifact evidence.
 - MAGATAMA was repaired end-to-end to a clean operational baseline:
   - live guard host-audits for Erik, Mac Studio, and Proxmox were corrected and rerun.
diff --git a/sync/history/2026-05-06-magatama-runpod-serverless-materialization-check.md b/sync/history/2026-05-06-magatama-runpod-serverless-materialization-check.md
new file mode 100644
index 0000000..717e41e
--- /dev/null
+++ b/sync/history/2026-05-06-magatama-runpod-serverless-materialization-check.md
@@ -0,0 +1,65 @@
+# 2026-05-06 — MAGATAMA RunPod serverless materialization check
+
+## Summary
+
+MAGATAMA's RunPod submit path was hardened and re-tested against the queue-based Axolotl serverless endpoint `dheii186pfcuq7`.
+
+## What changed
+
+- Payload alignment was tightened toward the official Axolotl serverless schema:
+  - added `model_type=AutoModelForCausalLM`
+  - added `tokenizer_type=AutoTokenizer`
+  - switched dataset split declaration to `split: train`
+  - switched optimizer from `adamw_8bit` to `adamw_torch_fused`
+- Both submit paths now distinguish between:
+  - `/run` accepted
+  - `/status/{job}` actually exists
+- Updated files:
+  - `magatama/packages/dashboard/src/server.ts`
+  - `magatama/scripts/submit_runpod_training.ts`
+
+## Verified behavior
+
+### Full run attempt
+
+- Submitted `magatamallm` 500-step run.
+- Returned job id: `9bc4b16b-755b-465b-aadf-b46f2fe467a3-e2`
+- Reconcile result shortly after:
+  - `not_found_after_submit`
+  - HTTP `404`
+  - `job not found`
+
+### Canary run after payload/schema fix
+
+- Submitted `magatamallm` seed-only canary.
+- Returned job id: `a4ac6951-7ed7-43cb-80d8-5ab61533c2da-e2`
+- Immediate submit-side verification saw real queue materialization:
+  - `runpod_status: IN_QUEUE`
+- Reconcile roughly 45 seconds later still observed:
+  - `not_found_after_submit`
+  - HTTP `404`
+  - `job not found`
+
+## Conclusion
+
+The old MAGATAMA bug (blindly trusting `/run`) is fixed.
+
+The remaining problem is now narrower and likely external to MAGATAMA itself:
+
+- RunPod serverless currently accepts the submit and briefly materializes the job as `IN_QUEUE`,
+- but the job disappears before a durable status/progress/completion lifecycle can be observed.
+
+This means the endpoint/release is still not trustworthy enough for a full production training launch until it can keep a job alive beyond the initial queue stage.
+
+## Operational rule
+
+Do **not** treat `submitted` or even a brief `IN_QUEUE` as proof of a usable serverless training run.
+A MAGATAMA serverless training run is only trustworthy when at least one of these is true:
+
+- status progresses to `IN_PROGRESS`, or
+- a durable terminal state is observed with artifact evidence.
+
+## Open next step
+
+- Inspect the actual RunPod serverless endpoint/release configuration and worker-side logs in RunPod UI.
+- Only launch the full MagatamaLLM run after a canary survives beyond queue materialization.
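
The operational rule recorded in this change can be sketched as a small pure function. This is an illustration only, not the code in `server.ts` or `submit_runpod_training.ts`: the names `StatusObservation`, `Verdict`, and `classifyRun` are hypothetical, and the `hasArtifacts` flag assumes the caller checks artifact evidence separately. The status strings follow the job states the note mentions plus RunPod's usual terminal states.

```typescript
// Hypothetical types for illustration; none of these names come from the MAGATAMA code.
type RunpodStatus =
  | "IN_QUEUE"
  | "IN_PROGRESS"
  | "COMPLETED"
  | "FAILED"
  | "CANCELLED"
  | "TIMED_OUT";

interface StatusObservation {
  httpStatus: number;          // HTTP code returned by a /status/{jobId} poll
  runpodStatus?: RunpodStatus; // present only when the job was found
  hasArtifacts?: boolean;      // assumption: artifact evidence is checked by the caller
}

type Verdict = "trusted" | "pending" | "not_found_after_submit";

const TERMINAL: ReadonlySet<RunpodStatus> = new Set<RunpodStatus>([
  "COMPLETED",
  "FAILED",
  "CANCELLED",
  "TIMED_OUT",
]);

// Observations are ordered oldest to newest.
function classifyRun(observations: StatusObservation[]): Verdict {
  for (const obs of observations) {
    if (obs.runpodStatus === undefined) continue;
    // Rule 1: the run is trusted once it has been seen IN_PROGRESS.
    if (obs.runpodStatus === "IN_PROGRESS") return "trusted";
    // Rule 2: a terminal state counts only together with artifact evidence.
    if (TERMINAL.has(obs.runpodStatus) && obs.hasArtifacts === true) return "trusted";
  }
  // An accepted submit or a brief IN_QUEUE is never proof on its own.
  const last = observations[observations.length - 1];
  if (last !== undefined && last.httpStatus === 404) return "not_found_after_submit";
  return "pending";
}

// Mirrors the canary above: seen as IN_QUEUE, then 404 on the later reconcile.
const canary = classifyRun([
  { httpStatus: 200, runpodStatus: "IN_QUEUE" },
  { httpStatus: 404 },
]);
console.log(canary); // prints "not_found_after_submit"
```

Keeping the classification free of network calls makes the rule easy to test against recorded poll results. The actual polling of `/status/{jobId}` would sit in a separate layer.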