# 2026-05-07 MAGATAMA RunPod Managed-Endpoint Root Cause and Custom Worker Path

## Summary

We continued the MAGATAMA RunPod training automation investigation live and closed the remaining ambiguity.

The current RunPod endpoint (`dheii186pfcuq7`) is the managed Axolotl serverless endpoint. It accepts jobs and reports lifecycle states such as `IN_QUEUE`, `IN_PROGRESS`, and `COMPLETED`, but it does **not** return a programmatically adoptable model artifact back to MAGATAMA.

That means:

- dataset refresh works
- lane-specific exports work
- training submit works
- local adoption API is healthy

But the full automation chain still breaks on the return path.

## Live Findings

### Attack Paths fix guidance

- `Open Fix Guidance` on Attack Paths was a placebo button.
- Fixed in:
  - `magatama/packages/dashboard/public/index-v2.html`
- Live behavior now:
  - opens the real finding/ticket drawer when the graph node maps to a finding
  - otherwise shows an explicit warning

### Local train API rechecked

- `GET http://127.0.0.1:3214/health`
- result:
  - `status = ok`
  - service reachable
  - service idle

Conclusion:

- local adoption/import service is not the current blocker

### RunPod raw status canary

A tiny direct canary was executed against the same endpoint:

- lane: `tip_llm`
- steps: `1`
- job:
  - `33434e85-3cc1-4dea-9043-83c315aaeb9c-e2`

Observed via raw `/status/{job}` polling:

- `IN_QUEUE`
- `IN_PROGRESS`
- `COMPLETED`

Critical detail:

- `/status/{job}` had no `output`
- `/stream/{job}` returned:
  - `{"status":"COMPLETED","stream":[]}`

This confirms that the current endpoint does not hand MAGATAMA the explicit artifact metadata it needs for automatic adoption.
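
The canary result suggests a simple guard on the polling side. The following is a minimal sketch, assuming the payload is the parsed JSON body of `/status/{job}` and that an adoptable result would carry an `output` object containing `adapter_repo_id` (the field the custom worker in this note is said to return). The function name `classify_status` and the returned labels are hypothetical, not part of MAGATAMA.

```python
from typing import Any


def classify_status(payload: dict[str, Any]) -> str:
    """Classify a parsed /status/{job} payload.

    Returns "pending", "adoptable", "completed_without_artifact",
    or "failed_or_unknown". The labels are illustrative only.
    """
    status = payload.get("status")
    if status in ("IN_QUEUE", "IN_PROGRESS"):
        return "pending"
    if status != "COMPLETED":
        return "failed_or_unknown"
    # Assumption: an adoptable job exposes output.adapter_repo_id.
    output = payload.get("output")
    if isinstance(output, dict) and output.get("adapter_repo_id"):
        return "adoptable"
    return "completed_without_artifact"
```

Applied to the canary above, a `COMPLETED` payload with no `output` falls into the `completed_without_artifact` case, which is the condition the dashboard needs to surface rather than treat as success.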

### HF token check

Erik was checked directly:

- `/opt/magatama/secrets/hf-token`
  - exists
  - readable

Conclusion:

- the current failure is not a missing Hugging Face token on Erik

## Root Cause

The managed Axolotl serverless endpoint is not enough for MAGATAMA's required end-to-end automation.

MAGATAMA needs a worker that can:

1. train the lane-specific dataset
2. upload the resulting adapter/model artifact explicitly
3. return a machine-readable artifact reference
4. let MAGATAMA adopt/import that artifact
5. run smoke tests
6. bump version
7. switch the active alias

The managed Axolotl endpoint currently only gives lifecycle state, not an adoptable return artifact.
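
The ordering of steps 4 to 7 matters: the active alias should only move once the imported artifact has passed its smoke tests. Below is a minimal sketch of that gate. Every name in it is hypothetical: `ArtifactRef`, `adopt`, and the injected callables stand in for MAGATAMA's real import, smoke-test, and alias logic, which this note does not show.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class ArtifactRef:
    """Machine-readable reference a worker would return (assumed shape)."""
    run_id: str
    adapter_repo_id: str


def adopt(
    ref: ArtifactRef,
    current_version: int,
    import_artifact: Callable[[ArtifactRef], str],
    smoke_test: Callable[[str], bool],
    switch_alias: Callable[[str, int], None],
) -> int:
    """Import, smoke-test, bump the version, then switch the alias.

    Returns the version that is active after the call.
    """
    model_name = import_artifact(ref)       # step 4: adopt/import
    if not smoke_test(model_name):          # step 5: smoke tests
        return current_version              # alias stays on the old version
    new_version = current_version + 1       # step 6: bump version
    switch_alias(model_name, new_version)   # step 7: switch the active alias
    return new_version
```

Passing the side-effecting steps in as callables keeps the gate itself easy to test without a running Ollama instance or Hugging Face access.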

## Code Completed

Prepared the correct custom-worker path in:

- `magatama/packages/fine-tuner/train_cuda.py`
- `magatama/packages/fine-tuner/runpod_handler.py`
- `magatama/packages/fine-tuner/requirements-runpod.txt`
- `magatama/packages/dashboard/src/server.ts`

### What changed

- custom RunPod worker input now supports:
  - `target_model`
  - `credentials.hf_token`
- `train_cuda.py` now:
  - trains from the signed MAGATAMA lane bundle
  - uploads the resulting adapter folder to Hugging Face
  - returns `adapter_repo_id`
- dashboard custom-worker submit path now sends:
  - `run_id`
  - `target_model`
  - worker HF credential
- dashboard errors are now explicit when the managed endpoint completes without an adoptable artifact
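
To make the input and output contract concrete, here is a minimal sketch of a handler body. It assumes RunPod's convention of passing the request under an `input` key. The field names `run_id`, `target_model`, `credentials.hf_token`, and `adapter_repo_id` come from the list above; `handle_job`, the injected `train` and `upload` callables, and the error shape are hypothetical and do not reproduce the actual `runpod_handler.py`.

```python
from typing import Any, Callable


def handle_job(
    job: dict[str, Any],
    train: Callable[[str], str],
    upload: Callable[[str, str, str], str],
) -> dict[str, Any]:
    """Validate the job input, train, upload, and return an artifact reference.

    train(target_model) -> path of the local adapter folder
    upload(adapter_dir, target_model, hf_token) -> Hugging Face repo id
    """
    payload = job.get("input") or {}
    target_model = payload.get("target_model")
    hf_token = (payload.get("credentials") or {}).get("hf_token")

    # Collect the names of required fields that are absent or empty.
    required = (("target_model", target_model), ("credentials.hf_token", hf_token))
    missing = [name for name, value in required if not value]
    if missing:
        return {"error": "missing input: " + ", ".join(missing)}

    adapter_dir = train(target_model)
    repo_id = upload(adapter_dir, target_model, hf_token)
    # The explicit artifact reference is what the managed endpoint never returned.
    return {"run_id": payload.get("run_id"), "adapter_repo_id": repo_id}
```

In a real worker, a single-argument `handler(job)` wrapping this function would be registered with `runpod.serverless.start({"handler": handler})`, and `upload` would typically wrap a Hugging Face upload call.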

## Live Deployment Status

Deployed live to Erik:

- rebuilt and rsynced dashboard server
- synced updated custom worker source files into repo state on Erik
- restarted `pm2 magatama-dashboard`

Not yet completed in infrastructure:

- the active RunPod endpoint itself is still the managed Axolotl endpoint

## Required Final Infra Step

To get true full automation:

1. build/publish:
   - `magatama/packages/fine-tuner/Dockerfile.runpod`
2. create or switch to a custom RunPod serverless endpoint running:
   - `runpod_handler.py`
3. set on Erik:
   - `RUNPOD_WORKER_KIND=custom-magatama`
   - `RUNPOD_ENDPOINT_ID=<custom-endpoint-id>`

Only then will MAGATAMA be able to:

- pull the lane-specific training pool
- train on RunPod
- get back a real adapter artifact
- adopt it locally into Ollama
- write a new version number
- repoint the active alias after smoke tests
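
The two settings from step 3 can be checked before a submit. This is a minimal sketch, assuming that any `RUNPOD_WORKER_KIND` other than `custom-magatama` means the managed endpoint. That fallback, the default value `managed`, and the names `WorkerConfig` and `load_worker_config` are assumptions for illustration; the real selection logic lives in `server.ts` and is not shown in this note.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class WorkerConfig:
    kind: str
    endpoint_id: str

    @property
    def returns_artifact(self) -> bool:
        # Only the custom worker hands back an adoptable artifact reference.
        return self.kind == "custom-magatama"


def load_worker_config(env: Optional[Mapping[str, str]] = None) -> WorkerConfig:
    """Read the RunPod worker settings from the environment (or a given mapping)."""
    env = os.environ if env is None else env
    endpoint_id = env.get("RUNPOD_ENDPOINT_ID", "")
    if not endpoint_id:
        raise ValueError("RUNPOD_ENDPOINT_ID is not set")
    # Assumption: an unset kind is treated as the managed endpoint.
    kind = env.get("RUNPOD_WORKER_KIND", "managed")
    return WorkerConfig(kind=kind, endpoint_id=endpoint_id)
```

A caller could refuse to start the adoption step when `returns_artifact` is false, which makes the current managed-endpoint limitation explicit at submit time instead of after a completed job.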