transceiver-db/sync/history/2026-05-07-magatama-runpod-managed-endpoint-root-cause-and-custom-worker-path.md
2026-05-07 10:47:57 +02:00

2026-05-07 MAGATAMA RunPod Managed-Endpoint Root Cause and Custom Worker Path

Summary

We continued the MAGATAMA RunPod training automation investigation live and closed the remaining ambiguity: why completed jobs never yield a model that MAGATAMA can adopt.

The current RunPod endpoint (dheii186pfcuq7) is the managed Axolotl serverless endpoint. It accepts jobs and reports lifecycle states such as IN_QUEUE, IN_PROGRESS, and COMPLETED, but it does not return a programmatically adoptable model artifact back to MAGATAMA.

What already works:

  • dataset refresh works
  • lane-specific exports work
  • training submit works
  • local adoption API is healthy

But the full automation chain still breaks on the return path.

Live Findings

Attack Paths fix guidance

  • The "Open Fix Guidance" button on Attack Paths was a placebo.
  • Fixed in:
    • magatama/packages/dashboard/public/index-v2.html
  • Live behavior now:
    • opens the real finding/ticket drawer when the graph node maps to a finding
    • otherwise shows an explicit warning

Local train API rechecked

  • GET http://127.0.0.1:3214/health
  • result:
    • status = ok
    • service reachable
    • service idle

Conclusion:

  • local adoption/import service is not the current blocker
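The health recheck above can be repeated with a small script. This is a minimal sketch, assuming the endpoint returns a JSON body whose status field equals "ok" when healthy; the function names are hypothetical and not part of MAGATAMA.

```python
import json
import urllib.request

# URL taken from the recheck above.
HEALTH_URL = "http://127.0.0.1:3214/health"


def is_healthy(payload: dict) -> bool:
    """Return True when the payload reports status == "ok".

    Assumption: the health body is JSON with a top-level "status" field.
    """
    return payload.get("status") == "ok"


def fetch_health(url: str = HEALTH_URL, timeout: float = 5.0) -> dict:
    """Fetch and decode the health payload (needs the service to be running)."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

On Erik, `is_healthy(fetch_health())` would be expected to return True while the service is reachable and reports ok.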

RunPod raw status canary

A tiny direct canary was executed against the same endpoint:

  • lane: tip_llm
  • steps: 1
  • job:
    • 33434e85-3cc1-4dea-9043-83c315aaeb9c-e2

Observed via raw /status/{job} polling:

  • IN_QUEUE
  • IN_PROGRESS
  • COMPLETED

Critical detail:

  • the /status/{job} response carried no output payload
  • /stream/{job} returned:
    • {"status":"COMPLETED","stream":[]}

This confirms that the current endpoint does not hand MAGATAMA the explicit artifact metadata it needs for automatic adoption.
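The canary result suggests a simple check the dashboard side could apply to a raw status payload. The sketch below is illustrative only: it assumes a custom worker would place adapter_repo_id inside an output object, and the function name and return labels are hypothetical.

```python
from typing import Any, Mapping

# Lifecycle states observed during the canary that mean "keep polling".
PENDING_STATES = ("IN_QUEUE", "IN_PROGRESS")


def classify_job(status_payload: Mapping[str, Any]) -> str:
    """Classify a raw /status/{job} payload.

    Returns one of: "pending", "adoptable",
    "completed_without_artifact", "not_completed".
    """
    state = status_payload.get("status")
    if state in PENDING_STATES:
        return "pending"
    if state != "COMPLETED":
        # Any other state is treated as a non-successful terminal state.
        return "not_completed"
    output = status_payload.get("output")
    # Assumption: an adoptable job exposes output["adapter_repo_id"].
    if isinstance(output, Mapping) and output.get("adapter_repo_id"):
        return "adoptable"
    return "completed_without_artifact"
```

Applied to the canary, a payload of `{"status": "COMPLETED"}` with no output falls into the completed_without_artifact case, which is the condition the dashboard now reports explicitly.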

HF token check

The token file was checked directly on Erik:

  • /opt/magatama/secrets/hf-token
    • exists
    • readable

Conclusion:

  • the current failure is not a missing Hugging Face token on Erik
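The same "exists and readable" check can be scripted. A minimal sketch; the helper name is hypothetical and it verifies only file presence and read permission, not whether the token is valid.

```python
import os
from pathlib import Path


def token_file_ok(path: str) -> bool:
    """True when the path is a regular file the current user can read."""
    p = Path(path)
    return p.is_file() and os.access(p, os.R_OK)
```

On Erik this would be called as `token_file_ok("/opt/magatama/secrets/hf-token")`.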

Root Cause

The managed Axolotl serverless endpoint is not enough for MAGATAMA's required end-to-end automation.

MAGATAMA needs a worker that can:

  1. train the lane-specific dataset
  2. upload the resulting adapter/model artifact explicitly
  3. return a machine-readable artifact reference
  4. let MAGATAMA adopt/import that artifact
  5. run smoke tests
  6. bump version
  7. switch the active alias

The managed Axolotl endpoint currently returns only lifecycle state, not an adoptable artifact reference.
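Steps 4 to 7 of the list above form the return path that is currently unreachable. One way to sketch their ordering, with the alias switch gated on smoke tests, is shown below. All four callables are hypothetical placeholders for MAGATAMA internals that this note does not describe.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class AdoptionResult:
    adopted: bool
    completed_steps: List[str] = field(default_factory=list)


def run_return_path(
    artifact_ref: str,
    import_artifact: Callable[[str], str],    # artifact ref -> local model name
    smoke_test: Callable[[str], bool],        # local model name -> passed?
    bump_version: Callable[[str], int],       # local model name -> new version
    switch_alias: Callable[[str, int], None], # repoint the active alias
) -> AdoptionResult:
    """Run import, smoke test, version bump, alias switch in that order."""
    done: List[str] = []
    if not artifact_ref:
        # This is the managed-endpoint case: nothing to adopt.
        return AdoptionResult(False, done)
    model_name = import_artifact(artifact_ref)
    done.append("import")
    if not smoke_test(model_name):
        # Stop before touching the version or the active alias.
        return AdoptionResult(False, done)
    done.append("smoke_test")
    version = bump_version(model_name)
    done.append("bump_version")
    switch_alias(model_name, version)
    done.append("switch_alias")
    return AdoptionResult(True, done)
```

The design point is that an empty artifact reference or a failed smoke test leaves the active alias untouched.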

Code Completed

Prepared the correct custom-worker path in:

  • magatama/packages/fine-tuner/train_cuda.py
  • magatama/packages/fine-tuner/runpod_handler.py
  • magatama/packages/fine-tuner/requirements-runpod.txt
  • magatama/packages/dashboard/src/server.ts

What changed

  • custom RunPod worker input now supports:
    • target_model
    • credentials.hf_token
  • train_cuda.py now:
    • trains from the signed MAGATAMA lane bundle
    • uploads the resulting adapter folder to Hugging Face
    • returns adapter_repo_id
  • dashboard custom-worker submit path now sends:
    • run_id
    • target_model
    • worker HF credential
  • dashboard errors are now explicit when the managed endpoint completes without an adoptable artifact
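The worker contract described above can be illustrated with a handler skeleton. This is a sketch under assumptions, not the contents of runpod_handler.py: the validation rules, the error shape, and the injected train_and_upload callable are hypothetical, and it assumes the job payload arrives as a dict with an input key.

```python
from typing import Any, Callable, Dict, List, Mapping

# Top-level input fields named in this note.
REQUIRED_FIELDS = ("run_id", "target_model")


def missing_fields(job_input: Mapping[str, Any]) -> List[str]:
    """List required input fields that are absent or empty."""
    missing = [name for name in REQUIRED_FIELDS if not job_input.get(name)]
    creds = job_input.get("credentials") or {}
    if not creds.get("hf_token"):
        missing.append("credentials.hf_token")
    return missing


def make_handler(
    train_and_upload: Callable[[Mapping[str, Any]], str],
) -> Callable[[Mapping[str, Any]], Dict[str, Any]]:
    """Build a handler around a callable that trains, uploads the adapter,
    and returns the resulting repo id (hypothetical stand-in for train_cuda.py)."""

    def handler(job: Mapping[str, Any]) -> Dict[str, Any]:
        job_input = job.get("input") or {}
        missing = missing_fields(job_input)
        if missing:
            # Assumption: failures are reported as a dict with an "error" key.
            return {"error": "missing input: " + ", ".join(missing)}
        # The machine-readable artifact reference MAGATAMA needs for adoption.
        return {"adapter_repo_id": train_and_upload(job_input)}

    return handler
```

Injecting the training step keeps the input/output contract testable without a GPU or a Hugging Face upload.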

Live Deployment Status

Deployed live to Erik:

  • rebuilt and rsynced dashboard server
  • synced the updated custom-worker source files into the repo on Erik
  • restarted pm2 magatama-dashboard

Not yet completed in infrastructure:

  • the active RunPod endpoint itself is still the managed Axolotl endpoint

Required Final Infra Step

To get true full automation:

  1. build/publish:
    • magatama/packages/fine-tuner/Dockerfile.runpod
  2. create or switch to a custom RunPod serverless endpoint running:
    • runpod_handler.py
  3. set on Erik:
    • RUNPOD_WORKER_KIND=custom-magatama
    • RUNPOD_ENDPOINT_ID=<custom-endpoint-id>
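How the dashboard reads these two variables is not shown in this note. A minimal sketch of one plausible resolution step follows; the function name, the returned dict shape, and the rule that any other worker kind means "no artifact expected" are all assumptions.

```python
from typing import Any, Dict, Mapping

CUSTOM_KIND = "custom-magatama"  # value from the step above


def resolve_worker(env: Mapping[str, str]) -> Dict[str, Any]:
    """Derive worker settings from environment-style key/value pairs."""
    kind = env.get("RUNPOD_WORKER_KIND", "").strip()
    endpoint_id = env.get("RUNPOD_ENDPOINT_ID", "").strip()
    if not endpoint_id:
        raise ValueError("RUNPOD_ENDPOINT_ID is not set")
    return {
        "kind": kind,
        "endpoint_id": endpoint_id,
        # Assumption: only the custom worker returns an adoptable artifact.
        "expects_artifact": kind == CUSTOM_KIND,
    }
```

In a running process this would be called as `resolve_worker(os.environ)` after `import os`.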

Only then will MAGATAMA be able to:

  • pull the lane-specific training pool
  • train on RunPod
  • get back a real adapter artifact
  • adopt it locally into Ollama
  • write a new version number
  • repoint the active alias after smoke tests