sync: record magatama all-lane training completion

Rene Fichtmueller 2026-05-10 04:59:46 +02:00
parent 0599991431
commit cf30735ef1
2 changed files with 76 additions and 1 deletions

@@ -1,9 +1,36 @@
# Current TIP Sync State
Updated: 2026-05-10 02:58 UTC (previously 2026-05-09 23:38 UTC)
## Newest Work
- MAGATAMA all-lane RunPod training completion on 2026-05-10:
  - RunPod training/adoption is now verified end-to-end for all five active MAGATAMA LLM lanes:
    - `magatamallm`: active `magatama-coder:latest`, model version `magatama-coder-r2`, dataset `1375 train / 153 eval / 1528 total`
    - `fo_blogllm`: active `fo-blog-v8`, model version `fo-blog-v8-r2`, dataset `17342 train / 1929 eval / 19271 total`
    - `tip_llm`: active `tip-llm-v2`, model version `tip-llm-v2-r2`, dataset `276 train / 31 eval / 307 total`
    - `pulso_llm`: active `pulso-llm-v1`, model version `pulso-llm-v1-r1`, dataset `28 train / 5 eval / 33 total`
    - `contact_llm`: active `contact-llm-v1`, model version `contact-llm-v1-r1`, dataset `18 train / 4 eval / 22 total`
  - strict adoption rule is now validated in production:
    - a RunPod `COMPLETED` status alone does not count as success
    - success requires an uploaded adapter artifact, local Mac adoption, Ollama model registration, smoke tests, registry write, dashboard registry rebuild and active alias switch
  - fixed/verified automation behavior:
    - local Mac adoption service exposes authenticated adoption reports per lane via `/adoption-report/{lane}`
    - dashboard adoption path can recover from transient network/fetch errors by reading the local adoption report
    - reconciler can adopt already-completed RunPod jobs when the live SSE path failed after artifact upload
    - registry events now include top-level `active_model`, `release_alias`, `model_version`, `version_counter` and `candidate_model`
  - resolved concrete failures:
    - `pulso_llm` training had succeeded, but an old local lane mapping caused `unknown lane: pulso_llm`; Pulso is now adopted and active
    - `tip_llm` training succeeded, but local adoption failed due to low Mac disk space before GGUF conversion; safe-to-remove obsolete Ollama versions and already imported intermediate GGUFs were deleted, then TIP was reconciled successfully
    - `contact_llm` was still `neverTrained`; it is now trained, adopted and active
  - ContactLLM smoke test result:
    - `4/5` checks passed
    - remaining improvement: provenance prompt should always include source URL, timestamp, confidence and contact type; add this as a next training/eval item
  - public Magatama `/api/llm/status?lane=...` checks after dashboard restart show all five lanes as `completed_and_adopted`
  - operational note:
    - keep enough free Mac disk space before another adoption; each new 7B adapter adoption needs merge + GGUF conversion workspace
    - obsolete non-active Ollama versions can be removed after verifying that active aliases and release aliases exist
- TIP price/source verification closure on 2026-05-10 local / 2026-05-09 UTC:
  - fixed SFPcables scraper to persist `product_page_url`
  - added product-page price fallback for SFPcables when listing pages omit price markup

@@ -0,0 +1,48 @@
# MAGATAMA All-Lane RunPod Training Complete
Date: 2026-05-10 02:58 UTC
## Result
All five MAGATAMA trainable LLM lanes completed a real RunPod training/adoption cycle and are now visible as adopted in the public MAGATAMA status API.
## Verified Lanes
- `magatamallm`: active `magatama-coder:latest`, model version `magatama-coder-r2`, `1375 train / 153 eval / 1528 total`
- `fo_blogllm`: active `fo-blog-v8`, model version `fo-blog-v8-r2`, `17342 train / 1929 eval / 19271 total`
- `tip_llm`: active `tip-llm-v2`, model version `tip-llm-v2-r2`, `276 train / 31 eval / 307 total`
- `pulso_llm`: active `pulso-llm-v1`, model version `pulso-llm-v1-r1`, `28 train / 5 eval / 33 total`
- `contact_llm`: active `contact-llm-v1`, model version `contact-llm-v1-r1`, `18 train / 4 eval / 22 total`
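The split counts above can be sanity-checked with a few lines of Python. This is a minimal sketch: the counts are copied from the list, while the names `LANE_SPLITS` and `eval_share` are illustrative and not taken from the MAGATAMA codebase.

```python
# Train / eval / total counts as reported in the Verified Lanes list.
LANE_SPLITS = {
    "magatamallm": (1375, 153, 1528),
    "fo_blogllm": (17342, 1929, 19271),
    "tip_llm": (276, 31, 307),
    "pulso_llm": (28, 5, 33),
    "contact_llm": (18, 4, 22),
}


def eval_share(train: int, eval_: int, total: int) -> float:
    """Return the eval fraction after checking that train + eval == total."""
    if train + eval_ != total:
        raise ValueError(f"{train} + {eval_} != {total}")
    return eval_ / total


if __name__ == "__main__":
    for lane, counts in LANE_SPLITS.items():
        print(f"{lane}: eval share {eval_share(*counts):.1%}")
```

All five rows add up. The three larger lanes hold out roughly 10% for evaluation, while `pulso_llm` and `contact_llm` hold out about 15% and 18% of very small datasets, so their eval results rest on only 5 and 4 examples.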
## Fixes Made
- Added/verified first-class local adoption support for `pulso_llm` and `contact_llm`.
- Added authenticated adoption-report recovery endpoint on the Mac training/adoption service.
- Hardened dashboard adoption flow so transient network/fetch errors can recover from local adoption reports.
- Hardened RunPod reconciler so completed jobs can be adopted after a failed live SSE/browser path.
- Registry success events now include explicit active model, release alias, model version, version counter and candidate model.
- Rebuilt the MAGATAMA model registry and restarted `magatama-dashboard` after successful TIP and Contact adoption.
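The recovery behavior described in the fixes above could look roughly like the following. This is a sketch under stated assumptions, not the dashboard's actual code: `run_adoption` and `fetch_report` are hypothetical callables standing in for the live adoption path and for an authenticated read of the adoption report, and the report is assumed to be a dict whose `status` field uses the same `completed_and_adopted` value as the registry.

```python
from typing import Callable, Optional

# Errors treated as transient in this sketch (an assumption).
TRANSIENT_ERRORS = (ConnectionError, TimeoutError)


def adopt_with_recovery(
    lane: str,
    run_adoption: Callable[[str], dict],
    fetch_report: Callable[[str], Optional[dict]],
) -> dict:
    """Run adoption for a lane; on a transient error, consult the local report.

    run_adoption: hypothetical callable for the live adoption path.
    fetch_report: hypothetical callable that reads the lane's adoption report
                  (authentication is left to the caller).
    """
    try:
        return run_adoption(lane)
    except TRANSIENT_ERRORS as exc:
        report = fetch_report(lane)
        # Assumed report shape: {"status": "completed_and_adopted", ...}
        if report and report.get("status") == "completed_and_adopted":
            return report
        raise RuntimeError(f"adoption for {lane} not confirmed") from exc


if __name__ == "__main__":
    def flaky_adoption(lane: str) -> dict:
        raise ConnectionError("fetch failed after artifact upload")

    def local_report(lane: str) -> dict:
        return {"lane": lane, "status": "completed_and_adopted"}

    print(adopt_with_recovery("tip_llm", flaky_adoption, local_report))
```

The point of the design is that a lost connection does not by itself decide the outcome: the locally written report is the source of truth, and the error is re-raised only when that report does not confirm adoption.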
## Issues Resolved
- `pulso_llm` showed `unknown lane: pulso_llm` after RunPod finished; this was a local adoption mapping issue, not a training failure. Pulso is now active.
- `tip_llm` failed local adoption because Mac disk space dropped below the GGUF conversion threshold. Obsolete non-active Ollama versions and already imported intermediate GGUFs were removed, then TIP was reconciled successfully.
- `contact_llm` had never been trained before this block. It now has a first adopted version.
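A pre-flight free-space check is one way to catch the `tip_llm` failure mode before merge and GGUF conversion start. The sketch below uses only the standard library; the function name and the byte threshold are illustrative, since this handoff does not state the actual threshold used by the adoption service.

```python
import shutil


def has_conversion_headroom(path: str, required_bytes: int) -> bool:
    """Return True if the filesystem holding `path` has enough free space."""
    return shutil.disk_usage(path).free >= required_bytes


if __name__ == "__main__":
    # 40 GiB is a placeholder value, not the service's real threshold.
    needed = 40 * 1024**3
    if has_conversion_headroom(".", needed):
        print("enough free space for merge + GGUF conversion")
    else:
        print("free space too low; clean up non-active models first")
```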
## Evaluation Notes
- ContactLLM smoke test passed `4/5` checks.
- Open improvement: ContactLLM should consistently return provenance fields for public business contacts: source URL, timestamp, confidence and contact type.
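The open provenance item could be turned into an eval check along these lines. This is a sketch that assumes the model's answer has already been parsed into a dict; the key names `source_url`, `timestamp`, `confidence` and `contact_type` are hypothetical renderings of the four fields listed above.

```python
# Hypothetical key names for the four provenance fields named in the note.
REQUIRED_PROVENANCE_FIELDS = ("source_url", "timestamp", "confidence", "contact_type")


def missing_provenance(record: dict) -> list[str]:
    """Return the required provenance fields that are absent or empty."""
    return [
        field
        for field in REQUIRED_PROVENANCE_FIELDS
        if record.get(field) in (None, "")
    ]


if __name__ == "__main__":
    answer = {"source_url": "https://example.com/contact", "confidence": 0.8}
    print(missing_provenance(answer))
```

A smoke test would then pass only when the returned list is empty, which makes the "always include" requirement measurable per response.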
## Operating Rule
Do not mark RunPod training successful on `COMPLETED` alone. A successful lane run must have:
- uploaded adapter artifact
- successful local Mac adoption
- Ollama candidate + release alias + active alias
- smoke tests meeting threshold
- registry entry with `completed_and_adopted`
- public MAGATAMA `/api/llm/status?lane=...` showing the new active model/version
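The rule above can be expressed as a single gate over explicit evidence flags. This is a minimal sketch, not the project's implementation: the `LaneRunEvidence` class and its field names are invented here to mirror the six bullets, plus the RunPod status that is insufficient on its own.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class LaneRunEvidence:
    """Hypothetical evidence record; one flag per requirement in the rule."""
    runpod_completed: bool   # RunPod reported COMPLETED
    adapter_uploaded: bool   # adapter artifact was uploaded
    local_adoption_ok: bool  # local Mac adoption succeeded
    ollama_aliases_ok: bool  # candidate + release alias + active alias exist
    smoke_tests_ok: bool     # smoke tests met the threshold
    registry_adopted: bool   # registry entry is `completed_and_adopted`
    public_status_ok: bool   # public status API shows the new model/version


def failed_gates(evidence: LaneRunEvidence) -> list[str]:
    """Return the names of all requirements that are not satisfied."""
    return [f.name for f in fields(evidence) if not getattr(evidence, f.name)]


def is_successful(evidence: LaneRunEvidence) -> bool:
    """A lane run counts as successful only when every gate is satisfied."""
    return not failed_gates(evidence)


if __name__ == "__main__":
    only_completed = LaneRunEvidence(True, False, False, False, False, False, False)
    print(is_successful(only_completed), failed_gates(only_completed))
```

Returning the list of failed gates, rather than a bare boolean, keeps the reason for a rejected run visible in logs and registry events.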
No secrets, tokens or credentials are recorded in this handoff.