transceiver-db/sync/history/2026-05-07-magatama-runpod-managed-endpoint-root-cause-and-custom-worker-path.md
2026-05-07 10:47:57 +02:00

# 2026-05-07 MAGATAMA RunPod Managed-Endpoint Root Cause and Custom Worker Path
## Summary
We continued the MAGATAMA RunPod training automation investigation live and closed the remaining ambiguity.
The current RunPod endpoint (`dheii186pfcuq7`) is the managed Axolotl serverless endpoint. It accepts jobs and reports lifecycle states such as `IN_QUEUE`, `IN_PROGRESS`, and `COMPLETED`, but it does **not** return a programmatically adoptable model artifact back to MAGATAMA.
The stages up to and including job submission work:
- dataset refresh works
- lane-specific exports work
- training submit works
- local adoption API is healthy
But the full automation chain still breaks on the return path.
## Live Findings
### Attack Paths fix guidance
- `Open Fix Guidance` on Attack Paths was a placebo button.
- Fixed in:
  - `magatama/packages/dashboard/public/index-v2.html`
- Live behavior now:
  - opens the real finding/ticket drawer when the graph node maps to a finding
  - otherwise shows an explicit warning
### Local train API rechecked
- `GET http://127.0.0.1:3214/health`
- result:
  - `status = ok`
  - service reachable
  - service idle

Conclusion:
- local adoption/import service is not the current blocker
### RunPod raw status canary
A tiny direct canary was executed against the same endpoint:
- lane: `tip_llm`
- steps: `1`
- job:
  - `33434e85-3cc1-4dea-9043-83c315aaeb9c-e2`

Observed via raw `/status/{job}` polling:
- `IN_QUEUE`
- `IN_PROGRESS`
- `COMPLETED`

Critical detail:
- `/status/{job}` had no `output`
- `/stream/{job}` returned:
  - `{"status":"COMPLETED","stream":[]}`

This confirms that the current endpoint does not hand MAGATAMA the explicit artifact metadata it needs for automatic adoption.
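
The canary observation can be expressed as a small check. This is a minimal sketch, not MAGATAMA's actual polling code: `poll_status`, `adoptable_reference`, and the injected `fetch_status` callable are hypothetical names, and the assumption that an adoptable job would carry `adapter_repo_id` under `output` is ours. The HTTP call itself is left to the caller so the sketch stays runnable offline.

```python
import time
from typing import Any, Callable, Dict, List, Optional

# Lifecycle states named in this note that mean "not finished yet".
PENDING_STATES = {"IN_QUEUE", "IN_PROGRESS"}


def poll_status(fetch_status: Callable[[], Dict[str, Any]],
                max_polls: int = 60,
                delay_s: float = 0.0) -> Dict[str, Any]:
    """Call fetch_status() until the job leaves the pending states.

    fetch_status is a hypothetical callable that returns the parsed JSON
    body of one raw /status/{job} request.
    """
    seen: List[str] = []
    payload: Dict[str, Any] = {}
    for _ in range(max_polls):
        payload = fetch_status()
        state = payload.get("status", "")
        if not seen or seen[-1] != state:
            seen.append(state)  # record each state transition once
        if state not in PENDING_STATES:
            break
        time.sleep(delay_s)
    return {"payload": payload, "states": seen}


def adoptable_reference(payload: Dict[str, Any]) -> Optional[str]:
    """Return the artifact reference of a completed job, or None.

    Assumption: an adoptable result exposes output.adapter_repo_id.
    """
    if payload.get("status") != "COMPLETED":
        return None
    output = payload.get("output")
    if isinstance(output, dict):
        ref = output.get("adapter_repo_id")
        if isinstance(ref, str) and ref:
            return ref
    return None


# Replay of the canary: three status payloads, none with an `output` field.
canned = iter([{"status": "IN_QUEUE"},
               {"status": "IN_PROGRESS"},
               {"status": "COMPLETED"}])
result = poll_status(lambda: next(canned))
print(result["states"])                        # ['IN_QUEUE', 'IN_PROGRESS', 'COMPLETED']
print(adoptable_reference(result["payload"]))  # None
```

A `COMPLETED` payload without a usable reference is exactly the case the managed endpoint produces, so a caller can treat `None` here as "completed but not adoptable" rather than as success.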
### HF token check
Erik was checked directly:
- `/opt/magatama/secrets/hf-token`
  - exists
  - readable

Conclusion:
- the current failure is not a missing Hugging Face token on Erik
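
The same two checks (exists, readable) can be scripted so they are repeatable. This is a sketch under assumptions: `token_file_status` is a hypothetical helper, and the extra `non_empty` check was not part of the manual verification described above.

```python
import os
from pathlib import Path
from typing import Dict


def token_file_status(path: str) -> Dict[str, bool]:
    """Report whether a token file exists, is readable, and has content."""
    p = Path(path)
    exists = p.is_file()
    # os.access checks read permission for the current process user.
    readable = exists and os.access(p, os.R_OK)
    non_empty = False
    if readable:
        # Assumption: the token is stored as plain UTF-8 text.
        non_empty = bool(p.read_text(encoding="utf-8").strip())
    return {"exists": exists, "readable": readable, "non_empty": non_empty}


if __name__ == "__main__":
    # Path taken from this note; run on the host that holds the secret.
    print(token_file_status("/opt/magatama/secrets/hf-token"))
```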
## Root Cause
The managed Axolotl serverless endpoint is not enough for MAGATAMA's required end-to-end automation.
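MAGATAMA needs a worker that supports the following chain end to end (steps 1 to 3 run on the worker, steps 4 to 7 run on the MAGATAMA side once the reference has been returned):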
1. train the lane-specific dataset
2. upload the resulting adapter/model artifact explicitly
3. return a machine-readable artifact reference
4. let MAGATAMA adopt/import that artifact
5. run smoke tests
6. bump version
7. switch the active alias
The managed Axolotl endpoint currently only gives lifecycle state, not an adoptable return artifact.
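
To make the break point concrete, the chain can be modelled as ordered steps that stop at the first failure. This is an illustrative sketch only: `run_chain`, the step names, and the `always_ok` placeholder are hypothetical, and only the artifact-reference step contains real logic.

```python
from typing import Callable, Dict, List, Tuple

Step = Tuple[str, Callable[[Dict], bool]]


def run_chain(steps: List[Step], ctx: Dict) -> Dict:
    """Run steps in order and stop at the first one that returns False."""
    done: List[str] = []
    for name, step in steps:
        if not step(ctx):
            return {"ok": False, "failed_at": name, "completed": done}
        done.append(name)
    return {"ok": True, "failed_at": None, "completed": done}


def has_artifact_reference(ctx: Dict) -> bool:
    # The return path: a machine-readable reference must have come back.
    return bool(ctx.get("adapter_repo_id"))


def always_ok(ctx: Dict) -> bool:
    # Placeholder for steps whose real implementation is out of scope here.
    return True


CHAIN: List[Step] = [
    ("train", always_ok),
    ("upload_artifact", always_ok),
    ("return_artifact_reference", has_artifact_reference),
    ("adopt_import", always_ok),
    ("smoke_tests", always_ok),
    ("bump_version", always_ok),
    ("switch_alias", always_ok),
]

# Managed endpoint: the job completes, but no reference comes back.
managed = run_chain(CHAIN, {})
print(managed["failed_at"])  # return_artifact_reference
```

Everything after the failing step (adoption, smoke tests, version bump, alias switch) never runs, which matches the observed behavior.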
## Code Completed
Prepared the correct custom-worker path in:
- `magatama/packages/fine-tuner/train_cuda.py`
- `magatama/packages/fine-tuner/runpod_handler.py`
- `magatama/packages/fine-tuner/requirements-runpod.txt`
- `magatama/packages/dashboard/src/server.ts`
### What changed
- custom RunPod worker input now supports:
  - `target_model`
  - `credentials.hf_token`
- `train_cuda.py` now:
  - trains from the signed MAGATAMA lane bundle
  - uploads the resulting adapter folder to Hugging Face
  - returns `adapter_repo_id`
- dashboard custom-worker submit path now sends:
  - `run_id`
  - `target_model`
  - worker HF credential
- dashboard errors are now explicit when the managed endpoint completes without an adoptable artifact
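
The input/output contract described above could look roughly like the following. This is a sketch, not the contents of `runpod_handler.py`: `handle_job` and the injected `train` and `upload` callables are hypothetical, the nesting of fields under `job["input"]` follows the usual serverless job shape but is an assumption for this worker, and the returned dictionary layout is ours apart from the `adapter_repo_id` key named in this note.

```python
from typing import Any, Callable, Dict


def handle_job(job: Dict[str, Any],
               train: Callable[[str, str], str],
               upload: Callable[[str, str, str], str]) -> Dict[str, Any]:
    """Validate the job input, train, upload, and return an artifact reference.

    train(run_id, target_model) -> local adapter directory (hypothetical)
    upload(adapter_dir, target_model, hf_token) -> repo id (hypothetical)
    """
    data = job.get("input") or {}
    run_id = data.get("run_id")
    target_model = data.get("target_model")
    hf_token = (data.get("credentials") or {}).get("hf_token")

    required = (("run_id", run_id),
                ("target_model", target_model),
                ("credentials.hf_token", hf_token))
    missing = [name for name, value in required if not value]
    if missing:
        # Fail loudly instead of completing without an adoptable artifact.
        return {"error": "missing input: " + ", ".join(missing)}

    adapter_dir = train(run_id, target_model)
    repo_id = upload(adapter_dir, target_model, hf_token)
    # The machine-readable reference the dashboard needs for adoption.
    return {"run_id": run_id, "adapter_repo_id": repo_id}
```

Keeping training and upload behind injected callables lets the contract be exercised locally with stubs before any image is built; in a deployed worker the function would be registered with the serverless runtime instead of being called directly.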
## Live Deployment Status
Deployed live to Erik:
- rebuilt and rsynced dashboard server
- synced updated custom worker source files into repo state on Erik
- restarted the `magatama-dashboard` process under pm2

Not yet completed in infrastructure:
- the active RunPod endpoint itself is still the managed Axolotl endpoint
## Required Final Infra Step
To get true full automation:
1. build and publish the worker image from:
   - `magatama/packages/fine-tuner/Dockerfile.runpod`
2. create or switch to a custom RunPod serverless endpoint running:
   - `runpod_handler.py`
3. set on Erik:
   - `RUNPOD_WORKER_KIND=custom-magatama`
   - `RUNPOD_ENDPOINT_ID=<custom-endpoint-id>`

Only then will MAGATAMA be able to:
- pull the lane-specific training pool
- train on RunPod
- get back a real adapter artifact
- adopt it locally into Ollama
- write a new version number
- repoint the active alias after smoke tests
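
The environment switch from step 3 can be sketched as a small resolver. This is written in Python for consistency with the worker code, although the dashboard server itself is TypeScript. `resolve_worker_config` and the returned keys are hypothetical, and treating any value other than `custom-magatama` as the managed endpoint is an assumption.

```python
from typing import Dict, Mapping, Union


def resolve_worker_config(env: Mapping[str, str]) -> Dict[str, Union[str, bool]]:
    """Decide which submit path to use from the two variables named above."""
    kind = env.get("RUNPOD_WORKER_KIND", "").strip()
    endpoint_id = env.get("RUNPOD_ENDPOINT_ID", "").strip()
    # Reject an unset value or a placeholder that was never filled in.
    if not endpoint_id or endpoint_id.startswith("<"):
        raise ValueError("RUNPOD_ENDPOINT_ID is not set")
    custom = kind == "custom-magatama"
    return {
        "endpoint_id": endpoint_id,
        "custom_worker": custom,
        # Only the custom worker is expected to return an adoptable artifact.
        "expects_artifact": custom,
    }


if __name__ == "__main__":
    import os
    print(resolve_worker_config(os.environ))
```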