transceiver-db/sync/history/2026-05-09-magatamallm-runpod-adoption-closure.md
2026-05-09 22:28:49 +02:00

MagatamaLLM RunPod Adoption Closure

Date: 2026-05-09 20:25 UTC

What Changed

  • Completed the MagatamaLLM RunPod training closure without launching a new paid RunPod job.
  • Recovered the local adoption path after the RunPod worker had already trained and uploaded the adapter successfully.
  • Deployed a MAGATAMA dashboard server fix so the live training status reflects the final adopted model instead of stale completed_not_adopted metadata.
  • Synced the adoption metadata back to Erik and verified the public MAGATAMA status endpoint.

Run Details

  • Lane: magatamallm
  • Endpoint: 0rmkf28w2g5gip
  • Job: a46de2ef-96e0-4adf-bbf8-d7a890e06c6f-e2
  • Run id: magatamallm-2026-05-09T19-22-53
  • HF artifact: renefichtmueller/magatama-magatamallm-magatamallm-2026-05-09t19-22-53
  • Worker summary: RunPod QLoRA complete · train=605 · valid=114
  • Local candidate: magatamallm-runpod-magatamallm-2026-05-09t19-22-53
  • Release alias: magatama-coder-r1
  • Active alias: magatama-coder:latest
  • Candidate smoke test: 4/5 passed (required threshold: 4)
  • Direct local smoke test: returned exactly MAGATAMA-R1-READY
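
The candidate smoke result above is a threshold comparison: adoption proceeds when the number of passing prompts reaches the required minimum. A minimal Python sketch of that gate follows. The helper name `smoke_gate` and its signature are hypothetical and not taken from the actual adoption tooling; only the numbers 4, 5, and 4 come from this run.

```python
def smoke_gate(passed: int, total: int, threshold: int) -> bool:
    """Return True when enough smoke prompts passed to allow adoption.

    Hypothetical helper: the real adoption tooling may gate differently.
    """
    if not 0 <= passed <= total:
        raise ValueError("passed must be between 0 and total")
    return passed >= threshold


# Values reported for this run: 4 of 5 prompts passed, threshold 4.
print(smoke_gate(passed=4, total=5, threshold=4))  # True
```

With a threshold of 4 out of 5, one failing prompt is tolerated; a second failure would block adoption.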

Failure Recovery

  • The first adoption attempt failed because the Mac Studio had too little free disk space for GGUF conversion after the merged model was written.
  • Removed only files that were safe to delete and were blocking the adoption:
    • the merged model.safetensors left behind by the failed MagatamaLLM adoption
    • FO_BlogLLM/TIP_LLM source GGUF import files that were already registered in Ollama
    • the old, inactive Ollama test model test-qwen32b:latest
  • Active aliases remained intact:
    • magatama-coder:latest
    • fo-blog-v7
    • tip-llm-v1
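
A pre-flight free-space check before merge and conversion could catch this failure earlier. The sketch below uses Python's standard `shutil.disk_usage`. The helper name, the `free_bytes` override (for testing), and the idea of passing an explicit byte requirement are assumptions for illustration; the note does not record how much space the conversion actually needs.

```python
import shutil
from typing import Optional


def has_room_for_conversion(
    path: str, required_bytes: int, free_bytes: Optional[int] = None
) -> bool:
    """Check that the volume holding `path` has at least `required_bytes` free.

    Hypothetical helper. `free_bytes` can be injected for testing; when it is
    omitted, the real free space is read from the filesystem.
    """
    if required_bytes < 0:
        raise ValueError("required_bytes must be non-negative")
    if free_bytes is None:
        free_bytes = shutil.disk_usage(path).free
    return free_bytes >= required_bytes


# Example with injected numbers (illustrative only, not measured values):
# 10 GiB free is not enough for a 30 GiB requirement.
GIB = 1024 ** 3
print(has_room_for_conversion(".", 30 * GIB, free_bytes=10 * GIB))  # False
```

Running such a check before writing the merged model would let the adoption abort cleanly instead of leaving a partial model.safetensors to clean up.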

Dashboard Fix

  • Registry ordering now uses recorded_at with fallback to completed_at, adopted_at, and created_at.
  • Successful adoption version selection now accepts top-level release_alias and candidate_model, not only nested adoption.* payloads.
  • Legacy MagatamaLLM baseline mismatch protection no longer invalidates the RunPod lane export.
  • Deployed the rebuilt packages/dashboard/dist/server.js to Erik and restarted magatama-dashboard.
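
The ordering and version-selection changes can be sketched as follows. This is an illustration in Python, not the dashboard's actual code (the deployed server is JavaScript). The field names recorded_at, completed_at, adopted_at, created_at, release_alias, candidate_model, and the nested adoption object come from the list above. The function names, the assumption that the fallback fields are tried in the listed order, the precedence of release_alias over candidate_model, and the sample timestamps are all assumptions.

```python
from typing import Any, Dict, Optional

# Timestamp fields tried in the order they are listed in the note (assumed priority).
TIMESTAMP_FIELDS = ("recorded_at", "completed_at", "adopted_at", "created_at")


def sort_key(entry: Dict[str, Any]) -> str:
    """Return the first non-empty timestamp of a registry entry.

    Assumes ISO-8601 strings in a single timezone, which sort lexicographically.
    """
    for field in TIMESTAMP_FIELDS:
        value = entry.get(field)
        if value:
            return value
    return ""  # entries without any timestamp sort first


def adopted_version(entry: Dict[str, Any]) -> Optional[str]:
    """Pick the adopted version from top-level fields or the nested adoption payload."""
    nested = entry.get("adoption") or {}
    return (
        entry.get("release_alias")
        or nested.get("release_alias")
        or entry.get("candidate_model")
        or nested.get("candidate_model")
    )


# Illustrative registry: timestamps are made up, identifiers are from this run.
entries = [
    {"run_id": "older-run", "created_at": "2026-05-08T10:00:00Z"},
    {
        "run_id": "magatamallm-2026-05-09T19-22-53",
        "recorded_at": "2026-05-09T20:25:00Z",
        "release_alias": "magatama-coder-r1",
        "candidate_model": "magatamallm-runpod-magatamallm-2026-05-09t19-22-53",
    },
]

latest = max(entries, key=sort_key)
print(latest["run_id"])         # magatamallm-2026-05-09T19-22-53
print(adopted_version(latest))  # magatama-coder-r1
```

Reading the top-level fields first matters here because the RunPod lane writes release_alias and candidate_model at the top level, so a lookup restricted to adoption.* would find no version for it.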

Live Verification

  • MAGATAMA magatamallm status:
    • activeProvider=ollama:magatama-coder:latest
    • modelVersion=magatama-coder-r1
    • lastRegistryRunStatus=completed_and_adopted
    • activeRun=null
    • hasTrustedTrainingBaseline=true
    • newSinceLastTraining=0
    • collectedExamples=1367
    • evalExamples=152
    • totalExamples=1519
  • FO_BlogLLM stayed healthy:
    • modelVersion=fo-blog-v7-r1
    • activeRun=null
    • newSinceLastTraining=0
  • TIP_LLM stayed healthy:
    • modelVersion=tip-llm-v1-r1
    • activeRun=null
    • newSinceLastTraining=0
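
The checks above can be expressed as a small validation over the status payload. The sketch below works on a plain dictionary holding the values reported for the magatamallm lane. It does not fetch anything, because the status endpoint URL is not recorded in this note. The function name and the choice of which fields to treat as health conditions are assumptions.

```python
from typing import Any, Dict, List


def check_lane_status(status: Dict[str, Any]) -> List[str]:
    """Return a list of problems found in a lane status; empty means healthy.

    Hypothetical helper. Optional fields are only checked when present.
    """
    problems = []
    if status.get("activeRun") is not None:
        problems.append("a training run is still active")
    if status.get("newSinceLastTraining") != 0:
        problems.append("untrained examples are pending")
    if "lastRegistryRunStatus" in status:
        if status["lastRegistryRunStatus"] != "completed_and_adopted":
            problems.append("last registry run was not adopted")
    if status.get("hasTrustedTrainingBaseline") is False:
        problems.append("training baseline is not trusted")
    counts = ("collectedExamples", "evalExamples", "totalExamples")
    if all(key in status for key in counts):
        if status["collectedExamples"] + status["evalExamples"] != status["totalExamples"]:
            problems.append("example counts do not add up")
    return problems


# Values reported for the magatamallm lane in this note (null becomes None).
magatamallm = {
    "activeProvider": "ollama:magatama-coder:latest",
    "modelVersion": "magatama-coder-r1",
    "lastRegistryRunStatus": "completed_and_adopted",
    "activeRun": None,
    "hasTrustedTrainingBaseline": True,
    "newSinceLastTraining": 0,
    "collectedExamples": 1367,
    "evalExamples": 152,
    "totalExamples": 1519,
}

print(check_lane_status(magatamallm))  # []
```

The count check confirms that the reported totals are consistent: 1367 collected plus 152 eval examples equals 1519.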

Open

  • Add more explicit MagatamaLLM examples for the rule that insufficient evidence means escalating to manual review rather than passively monitoring.
  • Complete dual-Gitea mirroring separately.