Rene Fichtmueller 2a3576135c sync: record runpod managed endpoint root cause

2026-05-07 10:47:57 +02:00

3.9 KiB

2026-05-07 MAGATAMA RunPod Managed-Endpoint Root Cause and Custom Worker Path

Summary

We continued the live investigation of the MAGATAMA RunPod training automation and resolved the remaining ambiguity about where the chain breaks.

The current RunPod endpoint (dheii186pfcuq7) is the managed Axolotl serverless endpoint. It accepts jobs and reports lifecycle states such as IN_QUEUE, IN_PROGRESS, and COMPLETED, but it does not return a programmatically adoptable model artifact back to MAGATAMA.

That means:

dataset refresh works
lane-specific exports work
training submit works
local adoption API is healthy

But the full automation chain still breaks on the return path.

Live Findings

Attack Paths fix guidance

The "Open Fix Guidance" button on Attack Paths was a placebo button: clicking it had no real effect.
Fixed in:
- magatama/packages/dashboard/public/index-v2.html
Live behavior now:
- opens the real finding/ticket drawer when the graph node maps to a finding
- otherwise shows an explicit warning

Local train API rechecked

GET http://127.0.0.1:3214/health
result:
- status = ok
- service reachable
- service idle

Conclusion:

local adoption/import service is not the current blocker
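
The health check above can be scripted for repeat runs. This is a minimal sketch, assuming the endpoint returns a JSON body with a `status` field (the notes only record "status = ok"); `is_healthy` and `fetch_health` are hypothetical helper names, not part of MAGATAMA.

```python
import json
import urllib.request

# URL taken from the notes above.
HEALTH_URL = "http://127.0.0.1:3214/health"


def is_healthy(payload: dict) -> bool:
    """Return True only when the payload reports status == "ok"."""
    return payload.get("status") == "ok"


def fetch_health(url: str = HEALTH_URL, timeout: float = 5.0) -> dict:
    """GET the health endpoint and decode the body as JSON (assumed format)."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


# Usage on the host running the service:
#   print(is_healthy(fetch_health()))
```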

RunPod raw status canary

A tiny direct canary was executed against the same endpoint:

lane: tip_llm
steps: 1
job:
- 33434e85-3cc1-4dea-9043-83c315aaeb9c-e2

Observed via raw /status/{job} polling:

IN_QUEUE
IN_PROGRESS
COMPLETED

Critical detail:

/status/{job} had no output
/stream/{job} returned:
- {"status":"COMPLETED","stream":[]}

This confirms that the current endpoint does not hand MAGATAMA the explicit artifact metadata it needs for automatic adoption.
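
The distinction that matters here is between "job finished" and "job finished with something adoptable". A small sketch of that check, assuming the status payload is a JSON object with a `status` field and, for a custom worker, an `output` object carrying `adapter_repo_id`; the function name and the returned labels are hypothetical.

```python
from typing import Any, Dict

# Lifecycle states observed during the canary that mean "keep polling".
PENDING_STATES = {"IN_QUEUE", "IN_PROGRESS"}


def classify_status(payload: Dict[str, Any]) -> str:
    """Map a raw status payload to one of four labels.

    "pending"                     job is still queued or running
    "adoptable"                   COMPLETED and output names an adapter repo
    "completed_without_artifact"  COMPLETED but nothing to adopt
    "failed"                      any other state
    """
    status = payload.get("status")
    if status in PENDING_STATES:
        return "pending"
    if status == "COMPLETED":
        output = payload.get("output")
        if isinstance(output, dict) and output.get("adapter_repo_id"):
            return "adoptable"
        return "completed_without_artifact"
    return "failed"
```

Applied to the canary result, `{"status":"COMPLETED","stream":[]}` falls into the `completed_without_artifact` case, which is exactly the gap described above.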

HF token check

Erik was checked directly:

/opt/magatama/secrets/hf-token
- exists
- readable

Conclusion:

the current failure is not a missing Hugging Face token on Erik
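
The same exists/readable check can be expressed as a short script. This sketch only inspects the file on disk; it does not validate the token against Hugging Face. `token_file_status` is a hypothetical helper name.

```python
import os

# Path taken from the notes above.
TOKEN_PATH = "/opt/magatama/secrets/hf-token"


def token_file_status(path: str = TOKEN_PATH) -> dict:
    """Report whether the token file exists and is readable by this user."""
    exists = os.path.isfile(path)
    readable = exists and os.access(path, os.R_OK)
    return {"exists": exists, "readable": readable}
```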

Root Cause

The managed Axolotl serverless endpoint is not enough for MAGATAMA's required end-to-end automation.

MAGATAMA needs a worker that can:

train the lane-specific dataset
upload the resulting adapter/model artifact explicitly
return a machine-readable artifact reference
let MAGATAMA adopt/import that artifact
run smoke tests
bump version
switch the active alias

The managed Axolotl endpoint currently returns only lifecycle state, not an adoptable artifact.

Code Completed

Prepared the correct custom-worker path in:

magatama/packages/fine-tuner/train_cuda.py
magatama/packages/fine-tuner/runpod_handler.py
magatama/packages/fine-tuner/requirements-runpod.txt
magatama/packages/dashboard/src/server.ts

What changed

custom RunPod worker input now supports:
- target_model
- credentials.hf_token
train_cuda.py now:
- trains from the signed MAGATAMA lane bundle
- uploads the resulting adapter folder to Hugging Face
- returns adapter_repo_id
dashboard custom-worker submit path now sends:
- run_id
- target_model
- worker HF credential
dashboard errors are now explicit when the managed endpoint completes without an adoptable artifact
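
To illustrate the input/output contract described above, here is a minimal sketch of a handler, not the actual contents of runpod_handler.py. It assumes the job arrives as a dict with an `input` object; `make_handler` and the injected `train_and_upload` callable are hypothetical stand-ins for the real training and upload code.

```python
from typing import Any, Callable, Dict

REQUIRED_FIELDS = ("run_id", "target_model")


def make_handler(train_and_upload: Callable[[str, str, str], str]):
    """Build a handler around an injected training function.

    train_and_upload(run_id, target_model, hf_token) is a hypothetical
    stand-in for the training step; it must return the uploaded repo id.
    """

    def handler(job: Dict[str, Any]) -> Dict[str, Any]:
        job_input = job.get("input") or {}
        # Collect every missing field so the caller sees one complete error.
        missing = [f for f in REQUIRED_FIELDS if not job_input.get(f)]
        hf_token = (job_input.get("credentials") or {}).get("hf_token")
        if not hf_token:
            missing.append("credentials.hf_token")
        if missing:
            return {"error": "missing input: " + ", ".join(missing)}
        repo_id = train_and_upload(
            job_input["run_id"], job_input["target_model"], hf_token
        )
        # Machine-readable artifact reference for the adoption step.
        return {"run_id": job_input["run_id"], "adapter_repo_id": repo_id}

    return handler
```

With the RunPod Python SDK, a handler like this would typically be registered with `runpod.serverless.start({"handler": handler})`. Injecting the training function keeps the contract testable without a GPU.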

Live Deployment Status

Deployed live to Erik:

rebuilt and rsynced dashboard server
synced updated custom worker source files into repo state on Erik
restarted pm2 magatama-dashboard

Not yet completed in infrastructure:

the active RunPod endpoint itself is still the managed Axolotl endpoint

Required Final Infra Step

To get true full automation:

build/publish:
- magatama/packages/fine-tuner/Dockerfile.runpod
create or switch to a custom RunPod serverless endpoint running:
- runpod_handler.py
set on Erik:
- RUNPOD_WORKER_KIND=custom-magatama
- RUNPOD_ENDPOINT_ID=<custom-endpoint-id>
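
The two variables above decide which submit path is used. The dashboard itself is TypeScript (server.ts); the following Python sketch only illustrates the selection logic under the assumption that any value other than `custom-magatama` means the managed endpoint. `resolve_worker` is a hypothetical name.

```python
import os
from typing import Mapping, Optional


def resolve_worker(env: Optional[Mapping[str, str]] = None) -> dict:
    """Read the RunPod worker settings from the environment.

    Returns the endpoint id and whether the custom worker path applies.
    Raises ValueError when no endpoint id is configured.
    """
    env = os.environ if env is None else env
    kind = env.get("RUNPOD_WORKER_KIND", "")
    endpoint_id = env.get("RUNPOD_ENDPOINT_ID", "")
    if not endpoint_id:
        raise ValueError("RUNPOD_ENDPOINT_ID is not set")
    return {"endpoint_id": endpoint_id, "custom": kind == "custom-magatama"}
```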

Only then will MAGATAMA be able to:

pull the lane-specific training pool
train on RunPod
get back a real adapter artifact
adopt it locally into Ollama
write a new version number
repoint the active alias after smoke tests
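
The ordering of the last three steps matters: the alias should move only after the smoke tests pass. A minimal sketch of that gate, assuming an integer version counter and three injected callables; `adopt_artifact`, `import_adapter`, `smoke_test`, and `repoint_alias` are hypothetical names, not MAGATAMA's API.

```python
from typing import Any, Callable, Dict


def adopt_artifact(
    adapter_repo_id: str,
    current_version: int,
    import_adapter: Callable[[str], Any],
    smoke_test: Callable[[Any], bool],
    repoint_alias: Callable[[Any, int], None],
) -> Dict[str, Any]:
    """Import an adapter, then bump the version and move the alias
    only if the smoke test passes."""
    candidate = import_adapter(adapter_repo_id)
    if not smoke_test(candidate):
        # Leave the active alias and version untouched on failure.
        return {"adopted": False, "version": current_version}
    new_version = current_version + 1
    repoint_alias(candidate, new_version)
    return {"adopted": True, "version": new_version}
```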
