sync: record runpod serverless materialization check
parent b5d9b4df03
commit 9bc84a89ee
@ -70,6 +70,26 @@ When work touches TIP, Magatama, LLM Gateway, bridges, auth, or shared Erik infr
- Hugging Face dataset source
- or MAGATAMA URL-bundle dataset source
- this avoids misleading `RunPod not reachable` wording when the actual failure is in dataset preparation.
- follow-up serverless verification on 2026-05-06 narrowed the remaining fault further:
  - MAGATAMA submit logic now verifies that a RunPod job really exists under `/status/{jobId}` instead of trusting `/run`.
  - payloads were aligned more closely with the official Axolotl serverless schema:
    - `model_type=AutoModelForCausalLM`
    - `tokenizer_type=AutoTokenizer`
    - dataset `split: train`
    - optimizer `adamw_torch_fused`
  - verified full run attempt:
    - job id `9bc4b16b-755b-465b-aadf-b46f2fe467a3-e2`
    - disappeared as `not_found_after_submit` (`404 job not found`)
  - verified canary after payload fix:
    - job id `a4ac6951-7ed7-43cb-80d8-5ab61533c2da-e2`
    - immediately materialized as `IN_QUEUE`
    - then still disappeared on later reconcile as `not_found_after_submit`
  - current conclusion:
    - the old MAGATAMA bug is fixed.
    - the remaining problem is now likely on the RunPod endpoint/release side: jobs are accepted and briefly queued, but do not survive long enough to produce a durable serverless status lifecycle.
  - operational rule:
    - do not treat `submitted` or a brief `IN_QUEUE` as proof of a usable serverless training run.
    - only trust the run once it reaches `IN_PROGRESS` or a durable terminal state with artifact evidence.

- MAGATAMA was repaired end-to-end to a clean operational baseline:
  - live guard host-audits for Erik, Mac Studio, and Proxmox were corrected and rerun.
@ -0,0 +1,65 @@
# 2026-05-06 — MAGATAMA RunPod serverless materialization check

## Summary

MAGATAMA's RunPod submit path was hardened and re-tested against the queue-based Axolotl serverless endpoint `dheii186pfcuq7`.

## What changed

- Payload alignment was tightened toward the official Axolotl serverless schema:
  - added `model_type=AutoModelForCausalLM`
  - added `tokenizer_type=AutoTokenizer`
  - switched dataset split declaration to `split: train`
  - switched optimizer from `adamw_8bit` to `adamw_torch_fused`
- Both submit paths now distinguish between:
  - `/run` accepted
  - `/status/{job}` actually exists
- Updated files:
  - `magatama/packages/dashboard/src/server.ts`
  - `magatama/scripts/submit_runpod_training.ts`
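
The distinction between an accepted `/run` and an existing `/status/{job}` can be sketched as a small pure classifier. This is an illustration only, not the code in `server.ts` or `submit_runpod_training.ts`: the `HttpResult` shape, the `SubmitOutcome` labels, and the `classifySubmit` function are hypothetical names, and the sketch assumes the caller has already issued both HTTP requests and passes in their results.

```typescript
// Hypothetical sketch: none of these names come from the MAGATAMA codebase.
// Assumes the caller already performed POST /run and GET /status/{job}.
interface HttpResult {
  httpStatus: number;
  body: { id?: string; status?: string; error?: string } | null;
}

type SubmitOutcome =
  | { kind: "rejected"; reason: string }
  | { kind: "accepted_unverified"; jobId: string }
  | { kind: "materialized"; jobId: string; runpodStatus: string };

function classifySubmit(run: HttpResult, status: HttpResult | null): SubmitOutcome {
  const jobId = run.body?.id;
  const runOk = run.httpStatus >= 200 && run.httpStatus < 300;
  if (!runOk || !jobId) {
    // /run itself failed or returned no job id.
    return { kind: "rejected", reason: `/run returned HTTP ${run.httpStatus}` };
  }
  const runpodStatus = status?.body?.status;
  if (!status || status.httpStatus !== 200 || !runpodStatus) {
    // /run was accepted, but the job could not be confirmed under /status.
    return { kind: "accepted_unverified", jobId };
  }
  // The job is visible under /status with a concrete status such as IN_QUEUE.
  return { kind: "materialized", jobId, runpodStatus };
}
```

Keeping the classifier free of network calls makes it easy to share between the dashboard and the script, though whether the real code is factored this way is not stated in these notes.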

## Verified behavior

### Full run attempt

- Submitted `magatamallm` 500-step run.
- Returned job id: `9bc4b16b-755b-465b-aadf-b46f2fe467a3-e2`
- Reconcile result shortly after:
  - `not_found_after_submit`
  - HTTP `404`
  - `job not found`

### Canary run after payload/schema fix

- Submitted `magatamallm` seed-only canary.
- Returned job id: `a4ac6951-7ed7-43cb-80d8-5ab61533c2da-e2`
- Immediate submit-side verification saw real queue materialization:
  - `runpod_status: IN_QUEUE`
- Reconcile roughly 45 seconds later still observed:
  - `not_found_after_submit`
  - HTTP `404`
  - `job not found`
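
A reconcile step that produces the label seen above might look roughly like the following. This is a minimal sketch under stated assumptions: `TrackedJob`, `ReconcileResult`, `reconcile`, and the `unknown` fallback label are hypothetical, and only the `not_found_after_submit` label, the HTTP `404`, and the `IN_QUEUE` status are taken from these notes. The `everMaterialized` flag is an illustrative addition that separates a job that was seen in the queue and then vanished from one that was never observed at all.

```typescript
// Hypothetical sketch; only `not_found_after_submit`, HTTP 404 and `IN_QUEUE`
// come from the notes. All type and function names are illustrative.
interface TrackedJob {
  jobId: string;
  lastSeenStatus: string | null; // e.g. "IN_QUEUE" from submit-side verification
}

interface ReconcileResult {
  jobId: string;
  state: string;
  everMaterialized: boolean;
}

function reconcile(job: TrackedJob, httpStatus: number, runpodStatus?: string): ReconcileResult {
  const everMaterialized = job.lastSeenStatus !== null;
  if (httpStatus === 404) {
    // The job id was returned by /run but is no longer known to /status.
    return { jobId: job.jobId, state: "not_found_after_submit", everMaterialized };
  }
  if (httpStatus === 200 && runpodStatus) {
    // The job is visible; pass its reported status through unchanged.
    return { jobId: job.jobId, state: runpodStatus, everMaterialized: true };
  }
  // Anything else (5xx, missing status field) is left undecided.
  return { jobId: job.jobId, state: "unknown", everMaterialized };
}
```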

## Conclusion

The old MAGATAMA bug (blindly trusting `/run`) is fixed.

The remaining problem is now narrower and likely external to MAGATAMA itself:

- RunPod serverless currently accepts the submit and briefly materializes the job as `IN_QUEUE`,
- but the job disappears before a durable status/progress/completion lifecycle can be observed.

This means the endpoint/release is still not trustworthy enough for a full production training launch until it can keep a job alive beyond the initial queue stage.

## Operational rule

Do **not** treat `submitted` or even a brief `IN_QUEUE` as proof of a usable serverless training run.
A MAGATAMA serverless training run is only trustworthy when at least one of these is true:

- status progresses to `IN_PROGRESS`, or
- a durable terminal state is observed with artifact evidence.
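
The rule above can be expressed as a single predicate. The sketch below is illustrative rather than MAGATAMA code: `RunObservation`, `isTrustworthyRun`, and the `artifactPaths` field are hypothetical, and the list of terminal statuses is an assumption that should be checked against what the endpoint actually reports.

```typescript
// Hypothetical sketch of the operational rule; names are assumptions.
interface RunObservation {
  runpodStatus: string | null; // latest status seen, or null after a 404
  artifactPaths: string[];     // evidence such as checkpoints or training logs
}

// Assumed terminal statuses; adjust to the values the endpoint really returns.
const TERMINAL_STATES: readonly string[] = ["COMPLETED", "FAILED", "CANCELLED", "TIMED_OUT"];

function isTrustworthyRun(obs: RunObservation): boolean {
  // Condition 1: the job progressed past the queue.
  if (obs.runpodStatus === "IN_PROGRESS") return true;
  // Condition 2: a terminal state was observed AND there is artifact evidence.
  const terminal = obs.runpodStatus !== null && TERMINAL_STATES.indexOf(obs.runpodStatus) !== -1;
  return terminal && obs.artifactPaths.length > 0;
}
```

Under this predicate a job that is `submitted`, sitting in `IN_QUEUE`, or already gone after a `404` is never counted as a usable run.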

## Open next step

- Inspect the actual RunPod serverless endpoint/release configuration and worker-side logs in RunPod UI.
- Only launch the full MagatamaLLM run after a canary survives beyond queue materialization.