transceiver-db/sync/history/2026-05-07-magatama-runpod-managed-endpoint-root-cause-and-custom-worker-path.md

# 2026-05-07 MAGATAMA RunPod Managed-Endpoint Root Cause and Custom Worker Path

## Summary

We continued the live investigation of MAGATAMA's RunPod training automation and resolved the remaining ambiguity about where the chain breaks.

The current RunPod endpoint (`dheii186pfcuq7`) is the managed Axolotl serverless endpoint. It accepts jobs and reports lifecycle states such as `IN_QUEUE`, `IN_PROGRESS`, and `COMPLETED`, but it does **not** return a programmatically adoptable model artifact back to MAGATAMA.

That means:

- dataset refresh works
- lane-specific exports work
- training submit works
- local adoption API is healthy

However, the full automation chain still breaks on the return path: a completed job yields no artifact that MAGATAMA can adopt.

## Live Findings

### Attack Paths fix guidance

- `Open Fix Guidance` on Attack Paths was a placebo button.
- Fixed in:
  - `magatama/packages/dashboard/public/index-v2.html`
- Live behavior now:
  - opens the real finding/ticket drawer when the graph node maps to a finding
  - otherwise shows an explicit warning

### Local train API rechecked

- `GET http://127.0.0.1:3214/health`
- result:
  - `status = ok`
  - service reachable
  - service idle

Conclusion:

- local adoption/import service is not the current blocker
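
For repeat checks, the health probe above can be scripted. This is a minimal sketch, assuming the endpoint answers with a JSON object that carries a `status` field (the log only records `status = ok`, so the exact response shape is an assumption). The function names are hypothetical and not part of the MAGATAMA codebase.

```python
import json
import urllib.request

# URL taken from the recheck recorded in this log.
HEALTH_URL = "http://127.0.0.1:3214/health"


def is_healthy(payload: dict) -> bool:
    """Return True when the health payload reports status == "ok".

    Assumption: the service responds with a JSON object that has a
    top-level "status" key.
    """
    return payload.get("status") == "ok"


def fetch_health(url: str = HEALTH_URL, timeout: float = 5.0) -> dict:
    """Fetch and decode the health payload (requires the service to be up)."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


if __name__ == "__main__":
    print("healthy:", is_healthy(fetch_health()))
```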

### RunPod raw status canary

A tiny direct canary was executed against the same endpoint:

- lane: `tip_llm`
- steps: `1`
- job:
  - `33434e85-3cc1-4dea-9043-83c315aaeb9c-e2`

Observed via raw `/status/{job}` polling:

- `IN_QUEUE`
- `IN_PROGRESS`
- `COMPLETED`

Critical detail:

- the `/status/{job}` response had no `output` field
- `/stream/{job}` returned:
  - `{"status":"COMPLETED","stream":[]}`

This confirms that the current endpoint does not hand MAGATAMA the explicit artifact metadata it needs for automatic adoption.
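
The adoptability check implied by this finding can be sketched as a small pure function. It assumes that a custom worker's return value would appear under `output` in the status payload and that it would carry the `adapter_repo_id` key named later in this log. The function name is hypothetical.

```python
from typing import Optional


def extract_adapter_repo_id(status_payload: dict) -> Optional[str]:
    """Return the adapter repo id from a finished job, or None.

    None means the job either has not completed or completed without an
    adoptable artifact reference (the managed-endpoint case seen here).
    Assumption: the artifact reference lives at output.adapter_repo_id.
    """
    if status_payload.get("status") != "COMPLETED":
        return None
    output = status_payload.get("output")
    if not isinstance(output, dict):
        return None
    repo_id = output.get("adapter_repo_id")
    if isinstance(repo_id, str) and repo_id:
        return repo_id
    return None
```

Applied to the canary's payloads, a `COMPLETED` status with no `output` returns `None`, which is the condition the dashboard needs to surface as an error instead of waiting for an artifact.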

### HF token check

The token file was checked directly on Erik:

- `/opt/magatama/secrets/hf-token`
  - exists
  - readable

Conclusion:

- the current failure is not a missing Hugging Face token on Erik

## Root Cause

The managed Axolotl serverless endpoint is insufficient for MAGATAMA's required end-to-end automation.

MAGATAMA needs a worker that can:

1. train the lane-specific dataset
2. upload the resulting adapter/model artifact explicitly
3. return a machine-readable artifact reference
4. let MAGATAMA adopt/import that artifact
5. run smoke tests
6. bump version
7. switch the active alias

The managed Axolotl endpoint currently returns only lifecycle state, not an adoptable artifact.
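
To illustrate why a missing artifact reference blocks everything downstream, the seven steps can be modelled as an ordered chain that stops at the first failure. This is only a sketch: the step names and the `run_chain` helper are hypothetical, and the real steps are not simple boolean callables.

```python
from typing import Callable, List, Tuple

# A step is a (name, action) pair; the action returns True on success.
Step = Tuple[str, Callable[[], bool]]


def run_chain(steps: List[Step]) -> Tuple[bool, List[str]]:
    """Run steps in order and stop at the first one that fails.

    Returns (all_succeeded, names_of_completed_steps).
    """
    completed: List[str] = []
    for name, action in steps:
        if not action():
            return False, completed
        completed.append(name)
    return True, completed


if __name__ == "__main__":
    # With the managed endpoint, the artifact reference never arrives,
    # so adoption, smoke tests, version bump and alias switch never run.
    steps: List[Step] = [
        ("train", lambda: True),
        ("upload_artifact", lambda: True),
        ("return_artifact_ref", lambda: False),
        ("adopt", lambda: True),
        ("smoke_test", lambda: True),
        ("bump_version", lambda: True),
        ("switch_alias", lambda: True),
    ]
    print(run_chain(steps))
```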

## Code Completed

Prepared the correct custom-worker path in:

- `magatama/packages/fine-tuner/train_cuda.py`
- `magatama/packages/fine-tuner/runpod_handler.py`
- `magatama/packages/fine-tuner/requirements-runpod.txt`
- `magatama/packages/dashboard/src/server.ts`

### What changed

- custom RunPod worker input now supports:
  - `target_model`
  - `credentials.hf_token`
- `train_cuda.py` now:
  - trains from the signed MAGATAMA lane bundle
  - uploads the resulting adapter folder to Hugging Face
  - returns `adapter_repo_id`
- dashboard custom-worker submit path now sends:
  - `run_id`
  - `target_model`
  - worker HF credential
- the dashboard now reports an explicit error when the managed endpoint completes without an adoptable artifact
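
The input contract described above could be validated on the worker side along these lines. This sketch assumes a flat job input with `run_id` and `target_model` at the top level and the token nested at `credentials.hf_token`. The actual layout in `runpod_handler.py` is not shown in this log, and `validate_job_input` is a hypothetical name.

```python
from typing import Dict, List


def validate_job_input(job_input: dict) -> Dict[str, str]:
    """Check the custom-worker input and return the normalized fields.

    Assumed layout (not confirmed by the log):
      {"run_id": ..., "target_model": ..., "credentials": {"hf_token": ...}}
    Raises ValueError listing every missing field.
    """
    missing: List[str] = []
    for key in ("run_id", "target_model"):
        if not job_input.get(key):
            missing.append(key)

    credentials = job_input.get("credentials")
    if not isinstance(credentials, dict) or not credentials.get("hf_token"):
        missing.append("credentials.hf_token")

    if missing:
        raise ValueError("missing required input: " + ", ".join(missing))

    return {
        "run_id": str(job_input["run_id"]),
        "target_model": str(job_input["target_model"]),
        "hf_token": str(credentials["hf_token"]),
    }
```

Failing fast on a missing token keeps a job from training to completion and only then discovering that the upload cannot happen.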

## Live Deployment Status

Deployed live to Erik:

- rebuilt and rsynced dashboard server
- synced updated custom worker source files into repo state on Erik
- restarted `pm2 magatama-dashboard`

Not yet completed in infrastructure:

- the active RunPod endpoint itself is still the managed Axolotl endpoint

## Required Final Infra Step

To get true full automation:

1. build/publish:
   - `magatama/packages/fine-tuner/Dockerfile.runpod`
2. create or switch to a custom RunPod serverless endpoint running:
   - `runpod_handler.py`
3. set on Erik:
   - `RUNPOD_WORKER_KIND=custom-magatama`
   - `RUNPOD_ENDPOINT_ID=<custom-endpoint-id>`

Only then will MAGATAMA be able to:

- pull the lane-specific training pool
- train on RunPod
- get back a real adapter artifact
- adopt it locally into Ollama
- write a new version number
- repoint the active alias after smoke tests
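
As an illustration of the final infra step, the two environment variables could be resolved as follows. The dashboard server is TypeScript, so this Python sketch only shows the intended logic. `resolve_worker_config` is a hypothetical helper, and treating any value other than `custom-magatama` as the managed endpoint is an assumption.

```python
import os
from typing import Mapping


def resolve_worker_config(env: Mapping[str, str] = os.environ) -> dict:
    """Read the RunPod worker settings from the environment.

    Assumption: any RUNPOD_WORKER_KIND other than "custom-magatama"
    means the managed endpoint, which cannot return an artifact.
    """
    endpoint_id = env.get("RUNPOD_ENDPOINT_ID", "").strip()
    if not endpoint_id:
        raise ValueError("RUNPOD_ENDPOINT_ID is not set")

    kind = env.get("RUNPOD_WORKER_KIND", "").strip()
    return {
        "endpoint_id": endpoint_id,
        "custom_worker": kind == "custom-magatama",
    }


if __name__ == "__main__":
    print(resolve_worker_config())
```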