transceiver-db/sync/history/2026-05-10-magatama-all-lane-runpod-training-complete.md
2026-05-10 04:59:46 +02:00

# MAGATAMA All-Lane RunPod Training Complete
Date: 2026-05-10 02:58 UTC
## Result
All five MAGATAMA trainable LLM lanes completed a real RunPod training/adoption cycle and are now visible as adopted in the public MAGATAMA status API.
## Verified Lanes
- `magatamallm`: active `magatama-coder:latest`, model version `magatama-coder-r2`, `1375 train / 153 eval / 1528 total`
- `fo_blogllm`: active `fo-blog-v8`, model version `fo-blog-v8-r2`, `17342 train / 1929 eval / 19271 total`
- `tip_llm`: active `tip-llm-v2`, model version `tip-llm-v2-r2`, `276 train / 31 eval / 307 total`
- `pulso_llm`: active `pulso-llm-v1`, model version `pulso-llm-v1-r1`, `28 train / 5 eval / 33 total`
- `contact_llm`: active `contact-llm-v1`, model version `contact-llm-v1-r1`, `18 train / 4 eval / 22 total`
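The split figures above can be sanity-checked with a few lines of arithmetic. The following is a minimal sketch, not part of the MAGATAMA tooling: the counts are copied from the list above, while the names `LANE_COUNTS` and `check_split` are hypothetical and introduced only for illustration.

```python
# Counts copied from the lane list above, as (train, eval, total).
LANE_COUNTS = {
    "magatamallm": (1375, 153, 1528),
    "fo_blogllm": (17342, 1929, 19271),
    "tip_llm": (276, 31, 307),
    "pulso_llm": (28, 5, 33),
    "contact_llm": (18, 4, 22),
}


def check_split(train: int, eval_count: int, total: int) -> float:
    """Return the eval share of a lane, raising if train + eval != total."""
    if train + eval_count != total:
        raise ValueError(f"{train} + {eval_count} != {total}")
    return eval_count / total


for lane, counts in LANE_COUNTS.items():
    # Prints the eval share per lane; raises if any reported total is inconsistent.
    print(f"{lane}: eval share {check_split(*counts):.1%}")
```

All five totals are consistent with their train and eval counts. The sketch also makes it easy to see that `pulso_llm` and `contact_llm` are evaluated on only 5 and 4 examples respectively.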
## Fixes Made
- Added/verified first-class local adoption support for `pulso_llm` and `contact_llm`.
- Added authenticated adoption-report recovery endpoint on the Mac training/adoption service.
- Hardened dashboard adoption flow so transient network/fetch errors can recover from local adoption reports.
- Hardened RunPod reconciler so completed jobs can be adopted after a failed live SSE/browser path.
- Registry success events now include explicit active model, release alias, model version, version counter and candidate model.
- Rebuilt the MAGATAMA model registry and restarted `magatama-dashboard` after successful TIP and Contact adoption.
## Issues Resolved
- `pulso_llm` showed `unknown lane: pulso_llm` after RunPod finished; this was a local adoption mapping issue, not a training failure. Pulso is now active.
- `tip_llm` failed local adoption because Mac disk space dropped below the GGUF conversion threshold. Obsolete non-active Ollama versions and already imported intermediate GGUFs were removed, then TIP was reconciled successfully.
- `contact_llm` had never been trained before this block. It now has a first adopted version.
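The `tip_llm` failure suggests checking free disk space before starting GGUF conversion rather than failing partway through. The following is a rough sketch under stated assumptions: this note does not give the real threshold, so `MIN_FREE_BYTES` is a placeholder value, and `has_room_for_conversion` is a hypothetical helper, not a function of the adoption service.

```python
import shutil

# Placeholder threshold: the actual GGUF conversion threshold is not stated
# in this note, so 20 GiB is an assumption for illustration only.
MIN_FREE_BYTES = 20 * 1024**3


def has_room_for_conversion(path: str = "/", min_free_bytes: int = MIN_FREE_BYTES) -> bool:
    """Return True if the filesystem holding `path` has at least min_free_bytes free."""
    free = shutil.disk_usage(path).free
    return free >= min_free_bytes
```

A pre-flight call such as `has_room_for_conversion("/")` before adoption would surface the disk problem as a clear precondition failure instead of a failed adoption.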
## Evaluation Notes
- ContactLLM smoke test passed `4/5`.
- Open improvement: ContactLLM should consistently return provenance fields for public business contacts: source URL, timestamp, confidence and contact type.
## Operating Rule
Do not mark a RunPod training run as successful on the `COMPLETED` job status alone. A successful lane run must have:
- uploaded adapter artifact
- successful local Mac adoption
- Ollama candidate + release alias + active alias
- smoke tests meeting threshold
- registry entry with `completed_and_adopted`
- public MAGATAMA `/api/llm/status?lane=...` showing the new active model/version
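The checklist above could be encoded as a single gate so that no criterion is skipped. This is a minimal sketch only: `LaneRun`, its field names, and `unmet_criteria` are hypothetical and do not reflect the actual registry schema, and the smoke-test threshold is passed in because this note does not state its value.

```python
from dataclasses import dataclass


@dataclass
class LaneRun:
    """Hypothetical record of one lane run; field names are illustrative."""
    runpod_status: str
    adapter_uploaded: bool
    local_adoption_ok: bool
    ollama_candidate: bool
    release_alias: bool
    active_alias: bool
    smoke_passed: int
    smoke_total: int
    smoke_threshold: float  # assumed to be a pass ratio; real value not stated here
    registry_state: str
    status_api_shows_new_version: bool


def unmet_criteria(run: LaneRun) -> list[str]:
    """Return the checklist items that this run does not yet satisfy."""
    smoke_ok = (
        run.smoke_total > 0
        and run.smoke_passed / run.smoke_total >= run.smoke_threshold
    )
    checks = {
        "RunPod job COMPLETED": run.runpod_status == "COMPLETED",
        "uploaded adapter artifact": run.adapter_uploaded,
        "successful local Mac adoption": run.local_adoption_ok,
        "Ollama candidate + release alias + active alias": (
            run.ollama_candidate and run.release_alias and run.active_alias
        ),
        "smoke tests meeting threshold": smoke_ok,
        "registry entry with completed_and_adopted": (
            run.registry_state == "completed_and_adopted"
        ),
        "public status API shows new active model/version": (
            run.status_api_shows_new_version
        ),
    }
    return [name for name, ok in checks.items() if not ok]


def is_successful(run: LaneRun) -> bool:
    """A run counts as successful only when every criterion is met."""
    return not unmet_criteria(run)
```

Returning the list of unmet items, rather than a bare boolean, keeps the failure reason visible, which matters for cases like `pulso_llm` where RunPod finished but local adoption did not.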
No secrets, tokens or credentials are recorded in this handoff.