2026-05-07 MAGATAMA RunPod Managed-Endpoint Root Cause and Custom Worker Path
Summary
We continued the MAGATAMA RunPod training automation investigation live and closed the remaining ambiguity.
The current RunPod endpoint (dheii186pfcuq7) is the managed Axolotl serverless endpoint. It accepts jobs and reports lifecycle states such as IN_QUEUE, IN_PROGRESS, and COMPLETED, but it does not return a programmatically adoptable model artifact back to MAGATAMA.
That means:
- dataset refresh works
- lane-specific exports work
- training submit works
- local adoption API is healthy
But the full automation chain still breaks on the return path.
Live Findings
Attack Paths fix guidance
Open Fix Guidance on Attack Paths was a placebo button.
- Fixed in:
magatama/packages/dashboard/public/index-v2.html
- Live behavior now:
- opens the real finding/ticket drawer when the graph node maps to a finding
- otherwise shows an explicit warning
Local train API rechecked
GET http://127.0.0.1:3214/health
- result: status = ok
- service reachable
- service idle
Conclusion:
- local adoption/import service is not the current blocker
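The recheck above could be scripted roughly as follows. This is a minimal sketch, not the actual MAGATAMA check: it assumes the health endpoint answers with a JSON object carrying a `status` field (the note only records `status = ok`), and the function names are hypothetical.

```python
import json
import urllib.request

HEALTH_URL = "http://127.0.0.1:3214/health"  # URL taken from the check above


def is_healthy(body: str) -> bool:
    """Return True when the body is a JSON object with status == "ok".

    The response shape is an assumption; only `status = ok` was observed.
    """
    try:
        payload = json.loads(body)
    except json.JSONDecodeError:
        return False
    return isinstance(payload, dict) and payload.get("status") == "ok"


def check_health(url: str = HEALTH_URL, timeout: float = 5.0) -> bool:
    """Fetch the health endpoint; any network error counts as unhealthy."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return is_healthy(resp.read().decode("utf-8"))
    except OSError:  # URLError and HTTPError are OSError subclasses
        return False
```

Keeping the parsing in `is_healthy` separate from the network call makes the check testable without the service running.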
RunPod raw status canary
A tiny direct canary was executed against the same endpoint:
- lane: tip_llm
- steps: 1
- job: 33434e85-3cc1-4dea-9043-83c315aaeb9c-e2
Observed via raw /status/{job} polling:
- IN_QUEUE
- IN_PROGRESS
- COMPLETED
Critical detail:
- /status/{job} had no output
- /stream/{job} returned:
{"status":"COMPLETED","stream":[]}
This confirms that the current endpoint does not hand MAGATAMA the explicit artifact metadata it needs for automatic adoption.
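The distinction between "job finished" and "job finished with something adoptable" can be made explicit in code. The sketch below classifies a raw status payload; it assumes a custom worker would place `adapter_repo_id` inside the payload's `output` object, and the function names and result labels are hypothetical.

```python
from collections.abc import Mapping
from typing import Any, Optional


def extract_adapter_ref(payload: Mapping) -> Optional[str]:
    """Return the adapter reference if the job completed with one, else None.

    Assumption: the worker returns {"adapter_repo_id": "..."} and the
    platform surfaces it under the "output" key of the status payload.
    """
    if payload.get("status") != "COMPLETED":
        return None
    output: Any = payload.get("output")
    if not isinstance(output, Mapping):
        return None
    ref = output.get("adapter_repo_id")
    return ref if isinstance(ref, str) and ref else None


def classify(payload: Mapping) -> str:
    """Map a status payload to a coarse outcome label (labels are illustrative)."""
    status = payload.get("status")
    if status in ("IN_QUEUE", "IN_PROGRESS"):
        return "pending"
    if status == "COMPLETED":
        if extract_adapter_ref(payload) is not None:
            return "adoptable"
        return "completed_without_artifact"
    return "failed_or_unknown"
```

With this shape, the canary result `{"status":"COMPLETED","stream":[]}` classifies as `completed_without_artifact`, which is the case the automation has to treat as a failure rather than a success.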
HF token check
Erik was checked directly:
/opt/magatama/secrets/hf-token
- exists
- readable
Conclusion:
- the current failure is not a missing Hugging Face token on Erik
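A check of this kind is easy to automate as a preflight step. The following is a small sketch under the assumption that "exists" and "readable" are the only two conditions that matter; the function name and return labels are hypothetical.

```python
import os


def token_file_status(path: str) -> str:
    """Report whether a secret file exists and is readable by this process."""
    if not os.path.isfile(path):
        return "missing"
    if not os.access(path, os.R_OK):
        return "unreadable"
    return "ok"
```

On Erik this would be called as `token_file_status("/opt/magatama/secrets/hf-token")`. It deliberately does not read or validate the token contents.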
Root Cause
The managed Axolotl serverless endpoint is not enough for MAGATAMA's required end-to-end automation.
MAGATAMA needs a worker that can:
- train the lane-specific dataset
- upload the resulting adapter/model artifact explicitly
- return a machine-readable artifact reference
- let MAGATAMA adopt/import that artifact
- run smoke tests
- bump version
- switch the active alias
The managed Axolotl endpoint currently only gives lifecycle state, not an adoptable return artifact.
Code Completed
Prepared the correct custom-worker path in:
- magatama/packages/fine-tuner/train_cuda.py
- magatama/packages/fine-tuner/runpod_handler.py
- magatama/packages/fine-tuner/requirements-runpod.txt
- magatama/packages/dashboard/src/server.ts
What changed
- custom RunPod worker input now supports:
  - target_model
  - credentials.hf_token
- train_cuda.py now:
  - trains from the signed MAGATAMA lane bundle
  - uploads the resulting adapter folder to Hugging Face
  - returns adapter_repo_id
- dashboard custom-worker submit path now sends:
  - run_id
  - target_model
  - worker HF credential
- dashboard errors are now explicit when the managed endpoint completes without an adoptable artifact
Live Deployment Status
Deployed live to Erik:
- rebuilt and rsynced dashboard server
- synced updated custom worker source files into repo state on Erik
- restarted
pm2 magatama-dashboard
Not yet completed in infrastructure:
- the active RunPod endpoint itself is still the managed Axolotl endpoint
Required Final Infra Step
To get true full automation:
- build/publish:
magatama/packages/fine-tuner/Dockerfile.runpod
- create or switch to a custom RunPod serverless endpoint running:
runpod_handler.py
- set on Erik:
RUNPOD_WORKER_KIND=custom-magatama
RUNPOD_ENDPOINT_ID=<custom-endpoint-id>
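How the two variables might be consumed is sketched below. The dashboard server itself is TypeScript, so this Python version is illustrative only; the `RunpodConfig` type, the `"managed"` default, and the `expects_artifact` property are assumptions, not the dashboard's actual code.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class RunpodConfig:
    worker_kind: str
    endpoint_id: str

    @property
    def expects_artifact(self) -> bool:
        """Only the custom worker is expected to return adapter_repo_id."""
        return self.worker_kind == "custom-magatama"


def load_runpod_config(env: Mapping[str, str] = os.environ) -> RunpodConfig:
    """Read the RunPod settings from an environment-like mapping."""
    endpoint_id = env.get("RUNPOD_ENDPOINT_ID", "").strip()
    if not endpoint_id:
        raise ValueError("RUNPOD_ENDPOINT_ID is not set")
    # Assumption: an unset worker kind means the managed endpoint.
    worker_kind = env.get("RUNPOD_WORKER_KIND", "managed").strip()
    return RunpodConfig(worker_kind=worker_kind, endpoint_id=endpoint_id)
```

Branching on `expects_artifact` keeps the "completed without an adoptable artifact" error specific to the case where an artifact was actually promised.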
Only then will MAGATAMA be able to:
- pull the lane-specific training pool
- train on RunPod
- get back a real adapter artifact
- adopt it locally into Ollama
- write a new version number
- repoint the active alias after smoke tests