sync: record magatamallm adoption closure

Rene Fichtmueller 2026-05-09 22:28:49 +02:00
parent 1af4f090f7
commit de2943ea79
2 changed files with 111 additions and 0 deletions


@@ -4,6 +4,48 @@ Updated: 2026-05-09 20:12 UTC
## Newest Work
- MAGATAMA MagatamaLLM RunPod training and adoption closure on 2026-05-09:
  - operator requirement:
    - RunPod success counts only once the artifact exists, the local Ollama import works, smoke tests pass, the aliases and version are switched, the remote registry is updated, and live MAGATAMA reports no stale active run
    - do not spend another RunPod run when the paid training already completed; recover adoption instead
  - RunPod job completed:
    - endpoint `0rmkf28w2g5gip`
    - job `a46de2ef-96e0-4adf-bbf8-d7a890e06c6f-e2`
    - run id `magatamallm-2026-05-09T19-22-53`
    - target artifact `renefichtmueller/magatama-magatamallm-magatamallm-2026-05-09t19-22-53`
    - worker summary `RunPod QLoRA complete · train=605 · valid=114`
  - adoption recovered:
    - initial local adoption failed because Mac Studio had too little free disk for GGUF conversion after the merged model was written
    - removed only temporary/import-safe blockers:
      - failed MagatamaLLM merged `model.safetensors`
      - already imported FO_BlogLLM and TIP_LLM source GGUF files
      - old non-active Ollama test model `test-qwen32b:latest`
    - kept active Ollama aliases intact: `magatama-coder:latest`, `fo-blog-v7`, `tip-llm-v1`
  - adoption completed:
    - local candidate `magatamallm-runpod-magatamallm-2026-05-09t19-22-53`
    - release alias `magatama-coder-r1`
    - active alias `magatama-coder:latest`
    - candidate smoke `4/5` passed with the required threshold `4`
    - direct local smoke returned exact `MAGATAMA-R1-READY`
  - dashboard/server correction:
    - deployed a MAGATAMA dashboard server fix so training registry ordering uses `recorded_at`, with `completed_at`, `adopted_at`, and `created_at` as fallbacks
    - release/version selection now accepts top-level `release_alias` and `candidate_model` on adoption events
    - legacy MagatamaLLM baseline mismatch guard no longer invalidates the new RunPod lane export
    - restarted `magatama-dashboard`
  - live verification:
    - `magatamallm` reports `activeProvider=ollama:magatama-coder:latest`
      - `modelVersion=magatama-coder-r1`
      - `lastRegistryRunStatus=completed_and_adopted`
      - `activeRun=null`
      - `hasTrustedTrainingBaseline=true`
      - `newSinceLastTraining=0`
      - lane export shows `1367` train, `152` eval, `1519` total
    - `fo_blogllm` remains `fo-blog-v7-r1`, `activeRun=null`, `newSinceLastTraining=0`
    - `tip_llm` remains `tip-llm-v1-r1`, `activeRun=null`, `newSinceLastTraining=0`
  - open:
    - add more explicit training pairs for the “insufficient evidence => escalate/manual review” behavior because the new MagatamaLLM passed the required smoke threshold but still answered that one eval too passively
    - complete dual-Gitea mirroring as a separate infrastructure closure item
- TIP verification artifact cleanup and vendor completion on 2026-05-09:
  - operator requirement:
    - continue until all source-backed verification work is exhausted


@@ -0,0 +1,69 @@
# MagatamaLLM RunPod Adoption Closure
Date: 2026-05-09 20:25 UTC
## What Changed
- Completed the MagatamaLLM RunPod training closure without launching a new paid RunPod job.
- Recovered the local adoption path after the RunPod worker had already trained and uploaded the adapter successfully.
- Deployed a MAGATAMA dashboard server fix so the live training status reflects the final adopted model instead of stale `completed_not_adopted` metadata.
- Synced the adoption metadata back to Erik and verified the public MAGATAMA status endpoint.
## Run Details
- Lane: `magatamallm`
- Endpoint: `0rmkf28w2g5gip`
- Job: `a46de2ef-96e0-4adf-bbf8-d7a890e06c6f-e2`
- Run id: `magatamallm-2026-05-09T19-22-53`
- HF artifact: `renefichtmueller/magatama-magatamallm-magatamallm-2026-05-09t19-22-53`
- Worker summary: `RunPod QLoRA complete · train=605 · valid=114`
- Local candidate: `magatamallm-runpod-magatamallm-2026-05-09t19-22-53`
- Release alias: `magatama-coder-r1`
- Active alias: `magatama-coder:latest`
- Candidate smoke: `4/5` with required threshold `4`
- Direct local smoke: exact `MAGATAMA-R1-READY`
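The two smoke gates recorded above (a pass count against a required threshold, and an exact marker reply) could be expressed roughly as follows. This is a minimal sketch, not the MAGATAMA implementation: `SmokeResult`, `meets_threshold`, and `exact_marker_ok` are hypothetical names, and trimming surrounding whitespace before the exact comparison is an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SmokeResult:
    """Outcome of one candidate smoke run (hypothetical structure)."""
    passed: int
    total: int


def meets_threshold(result: SmokeResult, required: int) -> bool:
    """Return True when enough smoke checks passed to allow adoption."""
    if not 0 <= result.passed <= result.total:
        raise ValueError("passed must be between 0 and total")
    return result.passed >= required


def exact_marker_ok(model_output: str, marker: str = "MAGATAMA-R1-READY") -> bool:
    """Direct smoke: the reply must equal the marker exactly.

    Assumption: leading/trailing whitespace (e.g. a final newline) is ignored.
    """
    return model_output.strip() == marker


# Values taken from this run: 4 of 5 passed, required threshold 4.
candidate = SmokeResult(passed=4, total=5)
print(meets_threshold(candidate, required=4))
print(exact_marker_ok("MAGATAMA-R1-READY\n"))
```

Keeping the threshold as an explicit argument makes it visible that `4/5` is a pass only because the required threshold is `4`; the remaining failed check is tracked under Open.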
## Failure Recovery
- First adoption failed because Mac Studio had too little free disk for GGUF conversion after writing the merged model.
- Removed only safe temporary/import blockers:
  - failed MagatamaLLM merged `model.safetensors`
  - FO_BlogLLM/TIP_LLM source GGUF import files that were already registered in Ollama
  - old non-active Ollama test model `test-qwen32b:latest`
- Active aliases remained intact:
  - `magatama-coder:latest`
  - `fo-blog-v7`
  - `tip-llm-v1`
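A disk preflight before the merge step would have caught this failure before any large file was written. The following is a hedged sketch only: the function names, the size estimates passed in, and the `margin` safety factor are all assumptions, and the adoption scripts may structure this differently.

```python
import shutil
from pathlib import Path


def enough_space(free_bytes: int, merged_bytes: int, gguf_bytes: int,
                 margin: float = 1.1) -> bool:
    """Return True when free space covers both artifacts plus a safety margin.

    margin is an assumed safety factor, not a measured value.
    """
    if min(free_bytes, merged_bytes, gguf_bytes) < 0:
        raise ValueError("sizes must be non-negative")
    return free_bytes >= (merged_bytes + gguf_bytes) * margin


def preflight(workdir: Path, merged_bytes: int, gguf_bytes: int) -> bool:
    """Check the volume holding workdir before writing the merged model."""
    free = shutil.disk_usage(workdir).free
    return enough_space(free, merged_bytes, gguf_bytes)


GIB = 1024 ** 3
# Illustrative sizes only; the real model sizes are not recorded in this note.
print(enough_space(free_bytes=20 * GIB, merged_bytes=15 * GIB, gguf_bytes=8 * GIB))
```

The check counts the merged model and the GGUF output together because both exist on disk at the same time during conversion, which is the point where the first adoption attempt ran out of space.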
## Dashboard Fix
- Registry ordering now uses `recorded_at` with fallback to `completed_at`, `adopted_at`, and `created_at`.
- Successful adoption version selection now accepts top-level `release_alias` and `candidate_model`, not only nested `adoption.*` payloads.
- Legacy MagatamaLLM baseline mismatch protection no longer invalidates the RunPod lane export.
- Deployed rebuilt `packages/dashboard/dist/server.js` to Erik and restarted `magatama-dashboard`.
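The ordering and selection rules above can be illustrated with a short sketch. It is written in Python for brevity rather than in the dashboard's own JavaScript, and it is not the deployed code: the helper names are hypothetical, timestamps are assumed to be ISO-8601 strings in one consistent format (so they sort lexicographically), and the precedence of `release_alias` over `candidate_model` is an assumption.

```python
from typing import Optional

# Fallback order described in the fix: recorded_at first, then the others.
TIMESTAMP_FIELDS = ("recorded_at", "completed_at", "adopted_at", "created_at")


def ordering_key(entry: dict) -> str:
    """Return the first available timestamp, or "" when none is set."""
    for field in TIMESTAMP_FIELDS:
        value = entry.get(field)
        if value:
            return value
    return ""


def latest_entry(entries: list) -> Optional[dict]:
    """Pick the most recent registry entry by the fallback ordering key."""
    return max(entries, key=ordering_key, default=None)


def select_release(event: dict) -> Optional[str]:
    """Accept top-level fields as well as the nested adoption.* payload."""
    nested = event.get("adoption") or {}
    return (event.get("release_alias") or nested.get("release_alias")
            or event.get("candidate_model") or nested.get("candidate_model"))


# Hypothetical registry entries; only the field names come from this note.
registry = [
    {"run_id": "a", "created_at": "2026-05-09T19:22:53Z"},
    {"run_id": "b", "recorded_at": "2026-05-09T20:10:00Z"},
]
print(latest_entry(registry)["run_id"])
print(select_release({"release_alias": "magatama-coder-r1"}))
```

An entry that lacks `recorded_at` still sorts by its next available timestamp, which is how a stale `completed_not_adopted` record stops outranking a later adoption event.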
## Live Verification
- MAGATAMA `magatamallm` status:
  - `activeProvider=ollama:magatama-coder:latest`
  - `modelVersion=magatama-coder-r1`
  - `lastRegistryRunStatus=completed_and_adopted`
  - `activeRun=null`
  - `hasTrustedTrainingBaseline=true`
  - `newSinceLastTraining=0`
  - `collectedExamples=1367`
  - `evalExamples=152`
  - `totalExamples=1519`
- FO_BlogLLM stayed healthy:
  - `modelVersion=fo-blog-v7-r1`
  - `activeRun=null`
  - `newSinceLastTraining=0`
- TIP_LLM stayed healthy:
  - `modelVersion=tip-llm-v1-r1`
  - `activeRun=null`
  - `newSinceLastTraining=0`
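The `magatamallm` checks listed above could be automated along these lines. This is a sketch under assumptions: the status payload is modeled as a flat dictionary using the field names reported here, and `check_magatamallm_status` is a hypothetical helper, not part of the dashboard.

```python
def check_magatamallm_status(status: dict) -> list:
    """Return a list of human-readable problems; an empty list means healthy."""
    problems = []
    if status.get("activeRun") is not None:
        problems.append("stale active run")
    if status.get("lastRegistryRunStatus") != "completed_and_adopted":
        problems.append("last run not adopted")
    if status.get("hasTrustedTrainingBaseline") is not True:
        problems.append("no trusted training baseline")
    if status.get("newSinceLastTraining") != 0:
        problems.append("untrained examples pending")
    # Consistency check: train + eval should add up to the reported total.
    collected = status.get("collectedExamples", 0)
    evaluated = status.get("evalExamples", 0)
    if collected + evaluated != status.get("totalExamples"):
        problems.append("example counts do not add up")
    return problems


# Values as reported in this verification (JSON null becomes None).
live = {
    "activeProvider": "ollama:magatama-coder:latest",
    "modelVersion": "magatama-coder-r1",
    "lastRegistryRunStatus": "completed_and_adopted",
    "activeRun": None,
    "hasTrustedTrainingBaseline": True,
    "newSinceLastTraining": 0,
    "collectedExamples": 1367,
    "evalExamples": 152,
    "totalExamples": 1519,
}
print(check_magatamallm_status(live))
```

Returning a list of problems rather than a single boolean keeps the failing condition visible, which matters when the dashboard reports stale metadata for only one field.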
## Open
- Add more explicit MagatamaLLM examples for the rule: insufficient evidence means escalate/manual review rather than passive monitoring.
- Complete dual-Gitea mirroring separately.