diff --git a/sync/CURRENT.md b/sync/CURRENT.md
index d602b39..c34e992 100644
--- a/sync/CURRENT.md
+++ b/sync/CURRENT.md
@@ -1,9 +1,39 @@
 # Current TIP Sync State
 
-Updated: 2026-05-09 22:32 UTC
+Updated: 2026-05-09 23:09 UTC
 
 ## Newest Work
 
+- MAGATAMA all-lane RunPod training block started on 2026-05-09:
+  - user requested all trainable LLM lanes via RunPod
+  - lanes in scope:
+    - `magatamallm`
+    - `fo_blogllm`
+    - `tip_llm`
+    - `pulso_llm`
+    - `contact_llm`
+  - preflight:
+    - MAGATAMA services online on Erik
+    - active RunPod endpoint: `0rmkf28w2g5gip`
+    - worker kind: `custom-magatama`
+    - dataset source: URL lane export
+    - latest previous adopted runs existed for `magatamallm`, `fo_blogllm`, `tip_llm`
+    - `pulso_llm` and `contact_llm` had no previous adopted RunPod run
+  - fixed live/local helper script:
+    - `scripts/trigger_lane_training_once.py`
+    - API payload now uses `iters` and `seed_only` instead of stale `iterations` and `seedOnly`
+    - added `all` mode for sequential full-lane training
+    - streams SSE lines to the log instead of buffering until the response closes
+  - live sequence started on Erik:
+    - command: `python3 -u scripts/trigger_lane_training_once.py all 500 false`
+    - log: `/opt/magatama/logs/runpod-all-lanes-20260509T230549Z.log`
+    - first active lane: `magatamallm`
+    - first RunPod job: `89627e7e-8533-45db-9fe8-eca994018aa6-e2`
+    - `magatamallm` dataset at start: `1375 train`, `153 eval`, `1528 total`
+  - success rule remains strict:
+    - RunPod `COMPLETED` alone is not sufficient
+    - artifact must exist, import/adoption must succeed, smoke checks must pass, and active alias/version must update
+
 - TIP verification continuation on 2026-05-09:
   - expanded deterministic non-transceiver quarantine for GBICS and T&S Communication artifacts
   - live quarantine result:
diff --git a/sync/history/2026-05-09-magatama-all-lane-runpod-training-start.md b/sync/history/2026-05-09-magatama-all-lane-runpod-training-start.md
new file mode 100644
index 0000000..673c951
--- /dev/null
+++ b/sync/history/2026-05-09-magatama-all-lane-runpod-training-start.md
@@ -0,0 +1,63 @@
+# MAGATAMA All-Lane RunPod Training Start
+
+Date: 2026-05-09 23:09 UTC
+
+## Scope
+
+- Train all current MAGATAMA LLM lanes via RunPod:
+  - `magatamallm`
+  - `fo_blogllm`
+  - `tip_llm`
+  - `pulso_llm`
+  - `contact_llm`
+
+## Preflight
+
+- MAGATAMA services were online on Erik.
+- Active RunPod endpoint reported by MAGATAMA: `0rmkf28w2g5gip`.
+- RunPod worker kind: `custom-magatama`.
+- Dataset source: URL-based lane export.
+- Previous successful/adopted runs existed for:
+  - `magatamallm`
+  - `fo_blogllm`
+  - `tip_llm`
+- No previous run existed yet for:
+  - `pulso_llm`
+  - `contact_llm`
+
+## Runner Fix
+
+- Fixed `scripts/trigger_lane_training_once.py` locally and on Erik.
+- The script previously used stale API keys:
+  - `iterations`
+  - `seedOnly`
+- The MAGATAMA training API expects:
+  - `iters`
+  - `seed_only`
+- Added `all` mode to run all lanes sequentially.
+- Added streamed SSE logging so progress is visible during long RunPod runs.
+
+## Live Run
+
+- Started on Erik:
+  - `python3 -u scripts/trigger_lane_training_once.py all 500 false`
+- Log:
+  - `/opt/magatama/logs/runpod-all-lanes-20260509T230549Z.log`
+- First active lane:
+  - `magatamallm`
+- First RunPod job:
+  - `89627e7e-8533-45db-9fe8-eca994018aa6-e2`
+- Initial `magatamallm` dataset:
+  - `1375 train`
+  - `153 eval`
+  - `1528 total`
+
+## Success Rule
+
+- Do not treat RunPod `COMPLETED` as success by itself.
+- A lane is only successful when:
+  - the model artifact exists,
+  - MAGATAMA imports/adopts it locally,
+  - smoke checks pass,
+  - the active alias/version is updated.
+
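
Reviewer note: the Success Rule above is effectively a conjunction of independent checks, so it may be worth encoding that way rather than branching on the RunPod status alone. The following is a minimal sketch, not code from `scripts/trigger_lane_training_once.py`; `LaneOutcome`, its field names, `failed_checks`, and `lane_succeeded` are all hypothetical and assume each check has already been reduced to a boolean by the caller.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class LaneOutcome:
    """Hypothetical per-lane record; field names are illustrative only."""

    lane: str
    runpod_status: str          # status string as reported by RunPod
    artifact_exists: bool       # the model artifact was found
    adopted: bool               # MAGATAMA imported/adopted it locally
    smoke_checks_passed: bool   # smoke checks passed
    alias_updated: bool         # the active alias/version was updated


def failed_checks(outcome: LaneOutcome) -> list[str]:
    """Return the names of every check that did not pass, in a fixed order."""
    checks = {
        "runpod_completed": outcome.runpod_status == "COMPLETED",
        "artifact_exists": outcome.artifact_exists,
        "adopted": outcome.adopted,
        "smoke_checks_passed": outcome.smoke_checks_passed,
        "alias_updated": outcome.alias_updated,
    }
    return [name for name, ok in checks.items() if not ok]


def lane_succeeded(outcome: LaneOutcome) -> bool:
    """A lane counts as successful only when no check failed."""
    return not failed_checks(outcome)


if __name__ == "__main__":
    # RunPod reports COMPLETED, but nothing downstream has been verified yet.
    only_completed = LaneOutcome(
        lane="magatamallm",
        runpod_status="COMPLETED",
        artifact_exists=False,
        adopted=False,
        smoke_checks_passed=False,
        alias_updated=False,
    )
    print(lane_succeeded(only_completed))
    print(failed_checks(only_completed))
```

Returning the list of failed checks, rather than a bare boolean, would let the sequential `all` mode log exactly which step blocked a lane before moving on to the next one.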