# MagatamaLLM RunPod Adoption Closure
Date: 2026-05-09 20:25 UTC
## What Changed
- Completed the MagatamaLLM RunPod training closure without launching a new paid RunPod job.
- Recovered the local adoption path after the RunPod worker had already trained and uploaded the adapter successfully.
- Deployed a MAGATAMA dashboard server fix so the live training status reflects the final adopted model instead of stale `completed_not_adopted` metadata.
- Synced the adoption metadata back to Erik and verified the public MAGATAMA status endpoint.
## Run Details
- Lane: `magatamallm`
- Endpoint: `0rmkf28w2g5gip`
- Job: `a46de2ef-96e0-4adf-bbf8-d7a890e06c6f-e2`
- Run id: `magatamallm-2026-05-09T19-22-53`
- HF artifact: `renefichtmueller/magatama-magatamallm-magatamallm-2026-05-09t19-22-53`
- Worker summary: `RunPod QLoRA complete · train=605 · valid=114`
- Local candidate: `magatamallm-runpod-magatamallm-2026-05-09t19-22-53`
- Release alias: `magatama-coder-r1`
- Active alias: `magatama-coder:latest`
- Candidate smoke: `4/5` with required threshold `4`
- Direct local smoke: exact `MAGATAMA-R1-READY`
## Failure Recovery
- The first adoption attempt failed because the Mac Studio had too little free disk space for GGUF conversion after writing the merged model.
- Removed only safe temporary/import blockers:
  - failed MagatamaLLM merged `model.safetensors`
  - FO_BlogLLM/TIP_LLM source GGUF import files that were already registered in Ollama
  - old non-active Ollama test model `test-qwen32b:latest`
- Active aliases remained intact:
  - `magatama-coder:latest`
  - `fo-blog-v7`
  - `tip-llm-v1`
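
The disk-space failure above could be caught before the merge with a preflight check. The following is only a minimal sketch: `hasRoomForConversion`, its parameters, and the default headroom factor are hypothetical and do not come from the adoption scripts.

```typescript
// Hypothetical preflight check; names and the default factor are illustrative assumptions.
// GGUF conversion writes a new file next to the merged model, so the free space
// should cover roughly another copy of the weights plus some headroom.
function hasRoomForConversion(
  freeBytes: number,
  mergedModelBytes: number,
  headroomFactor: number = 1.2,
): boolean {
  return freeBytes >= mergedModelBytes * headroomFactor;
}
```

A caller would read the free bytes from the volume that holds the merged model and skip the merge (or clean up first) when the check returns `false`.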
## Dashboard Fix
- Registry ordering now uses `recorded_at` with fallback to `completed_at`, `adopted_at`, and `created_at`.
- Successful adoption version selection now accepts top-level `release_alias` and `candidate_model`, not only nested `adoption.*` payloads.
- Legacy MagatamaLLM baseline mismatch protection no longer invalidates the RunPod lane export.
- Deployed rebuilt `packages/dashboard/dist/server.js` to Erik and restarted `magatama-dashboard`.
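
The ordering and version-selection rules above can be sketched as follows. This is an illustration under assumptions, not the code in `packages/dashboard`: the `RegistryEntry` shape, the function names, and the precedence between top-level and nested fields (and between `release_alias` and `candidate_model`) are hypothetical. Only the field names are taken from this note.

```typescript
// Hypothetical registry entry shape; only the field names come from the note.
interface RegistryEntry {
  recorded_at?: string;
  completed_at?: string;
  adopted_at?: string;
  created_at?: string;
  release_alias?: string;
  candidate_model?: string;
  adoption?: { release_alias?: string; candidate_model?: string };
}

// Use recorded_at, falling back to completed_at, adopted_at, then created_at.
function sortKey(entry: RegistryEntry): number {
  const raw =
    entry.recorded_at ?? entry.completed_at ?? entry.adopted_at ?? entry.created_at;
  const ms = raw ? Date.parse(raw) : NaN;
  // Entries without a usable timestamp are treated as oldest.
  return Number.isNaN(ms) ? 0 : ms;
}

// Return a copy of the registry ordered newest first.
function orderRegistry(entries: RegistryEntry[]): RegistryEntry[] {
  return [...entries].sort((a, b) => sortKey(b) - sortKey(a));
}

// Accept top-level fields as well as the nested adoption.* payload.
// The precedence chosen here is an assumption.
function adoptedVersion(entry: RegistryEntry): string | undefined {
  return (
    entry.release_alias ??
    entry.adoption?.release_alias ??
    entry.candidate_model ??
    entry.adoption?.candidate_model
  );
}
```

With this shape, an entry that carries only a top-level `release_alias` of `magatama-coder-r1` resolves to that version, where a lookup limited to `adoption.*` would find nothing.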
## Live Verification
- MAGATAMA `magatamallm` status:
  - `activeProvider=ollama:magatama-coder:latest`
  - `modelVersion=magatama-coder-r1`
  - `lastRegistryRunStatus=completed_and_adopted`
  - `activeRun=null`
  - `hasTrustedTrainingBaseline=true`
  - `newSinceLastTraining=0`
  - `collectedExamples=1367`
  - `evalExamples=152`
  - `totalExamples=1519`
- FO_BlogLLM stayed healthy:
  - `modelVersion=fo-blog-v7-r1`
  - `activeRun=null`
  - `newSinceLastTraining=0`
- TIP_LLM stayed healthy:
  - `modelVersion=tip-llm-v1-r1`
  - `activeRun=null`
  - `newSinceLastTraining=0`
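
The checks above could be automated along these lines. This is a sketch under assumptions: the `LaneStatus` interface and `verifyAdoptedLane` are hypothetical names, the payload shape is inferred from the fields listed in this note, and the rule that `totalExamples` equals `collectedExamples + evalExamples` is an assumption that happens to hold for the reported values (1367 + 152 = 1519).

```typescript
// Hypothetical status payload shape, inferred from the fields listed in this note.
interface LaneStatus {
  activeProvider: string;
  modelVersion: string;
  lastRegistryRunStatus: string;
  activeRun: string | null;
  hasTrustedTrainingBaseline: boolean;
  newSinceLastTraining: number;
  collectedExamples: number;
  evalExamples: number;
  totalExamples: number;
}

// Return a list of problems; an empty list means the lane looks fully adopted.
function verifyAdoptedLane(status: LaneStatus, expectedVersion: string): string[] {
  const problems: string[] = [];
  if (status.modelVersion !== expectedVersion) {
    problems.push(`modelVersion is ${status.modelVersion}, expected ${expectedVersion}`);
  }
  if (status.lastRegistryRunStatus !== "completed_and_adopted") {
    problems.push(`lastRegistryRunStatus is ${status.lastRegistryRunStatus}`);
  }
  if (status.activeRun !== null) {
    problems.push("a training run is still active");
  }
  if (!status.hasTrustedTrainingBaseline) {
    problems.push("no trusted training baseline");
  }
  if (status.newSinceLastTraining !== 0) {
    problems.push(`${status.newSinceLastTraining} examples are not yet trained on`);
  }
  // Assumed invariant: total = collected + eval.
  if (status.collectedExamples + status.evalExamples !== status.totalExamples) {
    problems.push("example counts do not add up");
  }
  return problems;
}
```

A stale dashboard that still reports `completed_not_adopted` would surface as a single entry in the returned list, which is the condition the dashboard fix was meant to clear.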
## Open
- Add more explicit MagatamaLLM examples for the rule: insufficient evidence means escalate/manual review rather than passive monitoring.
- Complete dual-Gitea mirroring separately.