# 2026-05-07 MAGATAMA RunPod Managed-Endpoint Root Cause and Custom Worker Path

## Summary

We continued the MAGATAMA RunPod training automation investigation live and closed the remaining ambiguity.

The current RunPod endpoint (`dheii186pfcuq7`) is the managed Axolotl serverless endpoint. It accepts jobs and reports lifecycle states such as `IN_QUEUE`, `IN_PROGRESS`, and `COMPLETED`, but it does **not** return a programmatically adoptable model artifact back to MAGATAMA.

That means:

- dataset refresh works
- lane-specific exports work
- training submit works
- local adoption API is healthy

But the full automation chain still breaks on the return path.

## Live Findings

### Attack Paths fix guidance

- `Open Fix Guidance` on Attack Paths was a placebo button.
- Fixed in:
  - `magatama/packages/dashboard/public/index-v2.html`
- Live behavior now:
  - opens the real finding/ticket drawer when the graph node maps to a finding
  - otherwise shows an explicit warning

### Local train API rechecked

- `GET http://127.0.0.1:3214/health`
- result:
  - `status = ok`
  - service reachable
  - service idle

Conclusion:

- local adoption/import service is not the current blocker

### RunPod raw status canary

A tiny direct canary was executed against the same endpoint:

- lane: `tip_llm`
- steps: `1`
- job:
  - `33434e85-3cc1-4dea-9043-83c315aaeb9c-e2`

Observed via raw `/status/{job}` polling:

- `IN_QUEUE`
- `IN_PROGRESS`
- `COMPLETED`

Critical detail:

- `/status/{job}` had no `output`
- `/stream/{job}` returned:
  - `{"status":"COMPLETED","stream":[]}`

This confirms that the current endpoint does not hand MAGATAMA the explicit artifact metadata it needs for automatic adoption.

### HF token check

Erik was checked directly:

- `/opt/magatama/secrets/hf-token`
  - exists
  - readable

Conclusion:

- the current failure is not a missing Hugging Face token on Erik

## Root Cause

The managed Axolotl serverless endpoint is not enough for MAGATAMA's required end-to-end automation.

MAGATAMA needs a worker that can:

1. train the lane-specific dataset
2. upload the resulting adapter/model artifact explicitly
3. return a machine-readable artifact reference
4. let MAGATAMA adopt/import that artifact
5. run smoke tests
6. bump version
7. switch the active alias

The managed Axolotl endpoint currently only gives lifecycle state, not an adoptable return artifact.

## Code Completed

Prepared the correct custom-worker path in:

- `magatama/packages/fine-tuner/train_cuda.py`
- `magatama/packages/fine-tuner/runpod_handler.py`
- `magatama/packages/fine-tuner/requirements-runpod.txt`
- `magatama/packages/dashboard/src/server.ts`

### What changed

- custom RunPod worker input now supports:
  - `target_model`
  - `credentials.hf_token`
- `train_cuda.py` now:
  - trains from the signed MAGATAMA lane bundle
  - uploads the resulting adapter folder to Hugging Face
  - returns `adapter_repo_id`
- dashboard custom-worker submit path now sends:
  - `run_id`
  - `target_model`
  - worker HF credential
- dashboard errors are now explicit when the managed endpoint completes without an adoptable artifact

## Live Deployment Status

Deployed live to Erik:

- rebuilt and rsynced dashboard server
- synced updated custom worker source files into repo state on Erik
- restarted `pm2 magatama-dashboard`

Not yet completed in infrastructure:

- the active RunPod endpoint itself is still the managed Axolotl endpoint

## Required Final Infra Step

To get true full automation:

1. build/publish:
   - `magatama/packages/fine-tuner/Dockerfile.runpod`
2. create or switch to a custom RunPod serverless endpoint running:
   - `runpod_handler.py`
3. set on Erik:
   - `RUNPOD_WORKER_KIND=custom-magatama`
   - `RUNPOD_ENDPOINT_ID=`

Only then will MAGATAMA be able to:

- pull the lane-specific training pool
- train on RunPod
- get back a real adapter artifact
- adopt it locally into Ollama
- write a new version number
- repoint the active alias after smoke tests
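The return-path gap found by the raw status canary can be illustrated with a small check on the `/status/{job}` payload. This is a minimal sketch, not the code in `server.ts`: the function name `extract_adapter_repo_id` and the exception `NoAdoptableArtifact` are hypothetical, and it assumes a custom worker would place `adapter_repo_id` directly under `output`. Only the three lifecycle states observed in the canary are handled; anything else is reported as unexpected.

```python
from typing import Any, Mapping, Optional

# Lifecycle states observed during the canary run.
PENDING_STATES = {"IN_QUEUE", "IN_PROGRESS"}
DONE_STATE = "COMPLETED"


class NoAdoptableArtifact(Exception):
    """The job completed but returned no artifact reference (hypothetical name)."""


def extract_adapter_repo_id(status_payload: Mapping[str, Any]) -> Optional[str]:
    """Return the adapter repo id from a status payload.

    Returns None while the job is still queued or running.
    Raises NoAdoptableArtifact when the job is COMPLETED without a usable
    `output.adapter_repo_id` (assumed location of the artifact reference).
    """
    status = status_payload.get("status")
    if status in PENDING_STATES:
        return None
    if status != DONE_STATE:
        raise ValueError(f"unexpected job status: {status!r}")

    output = status_payload.get("output")
    repo_id = output.get("adapter_repo_id") if isinstance(output, Mapping) else None
    if not isinstance(repo_id, str) or not repo_id.strip():
        raise NoAdoptableArtifact("job completed without an adoptable artifact")
    return repo_id.strip()
```

Fed the payload seen on the managed endpoint (`{"status":"COMPLETED","stream":[]}`), this check raises `NoAdoptableArtifact`, which matches the explicit dashboard error described under "What changed".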
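On the worker side, the input contract listed under "What changed" (`run_id`, `target_model`, `credentials.hf_token` in, `adapter_repo_id` out) could look roughly like the sketch below. It is not the contents of `runpod_handler.py`: `validate_worker_input`, `make_handler`, and the injected `train_and_upload` callable are hypothetical names, the `error` field is this sketch's own convention, and it assumes the worker receives the submitted fields under the job's `input` key. Training and upload are stubbed out so the block runs without a GPU or network access.

```python
from typing import Any, Callable, Dict, Mapping

REQUIRED_FIELDS = ("run_id", "target_model")


def validate_worker_input(job_input: Mapping[str, Any]) -> Dict[str, str]:
    """Check the fields the dashboard is described as sending to the custom worker."""
    missing = [name for name in REQUIRED_FIELDS if not job_input.get(name)]
    credentials = job_input.get("credentials")
    hf_token = credentials.get("hf_token") if isinstance(credentials, Mapping) else None
    if not hf_token:
        missing.append("credentials.hf_token")
    if missing:
        raise ValueError("missing worker input: " + ", ".join(missing))
    return {
        "run_id": str(job_input["run_id"]),
        "target_model": str(job_input["target_model"]),
        "hf_token": str(hf_token),
    }


def make_handler(
    train_and_upload: Callable[[str, str, str], str],
) -> Callable[[Mapping[str, Any]], Dict[str, Any]]:
    """Build a handler around a (hypothetical) train-and-upload function.

    `train_and_upload(run_id, target_model, hf_token)` stands in for the real
    training step and must return the uploaded adapter's repo id.
    """

    def handler(job: Mapping[str, Any]) -> Dict[str, Any]:
        try:
            params = validate_worker_input(job.get("input") or {})
        except ValueError as exc:
            # Sketch convention: report bad input instead of raising.
            return {"error": str(exc)}
        repo_id = train_and_upload(
            params["run_id"], params["target_model"], params["hf_token"]
        )
        # Machine-readable artifact reference for the adoption step.
        return {"run_id": params["run_id"], "adapter_repo_id": repo_id}

    return handler
```

Injecting the training function keeps the input and output contract testable on its own; in a real worker the returned handler would be the function registered with the RunPod serverless runtime.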