# MAGATAMA All-Lane RunPod Training Start

Date: 2026-05-09 23:09 UTC

## Scope

- Train all current MAGATAMA LLM lanes via RunPod:
  - `magatamallm`
  - `fo_blogllm`
  - `tip_llm`
  - `pulso_llm`
  - `contact_llm`

## Preflight

- MAGATAMA services were online on Erik.
- Active RunPod endpoint reported by MAGATAMA: `0rmkf28w2g5gip`.
- RunPod worker kind: `custom-magatama`.
- Dataset source: URL-based lane export.
- Previous successful/adopted runs existed for:
  - `magatamallm`
  - `fo_blogllm`
  - `tip_llm`
- No previous run existed yet for:
  - `pulso_llm`
  - `contact_llm`

## Runner Fix

- Fixed `scripts/trigger_lane_training_once.py` locally and on Erik.
- The script previously sent stale API parameter names:
  - `iterations`
  - `seedOnly`
- The MAGATAMA training API expects:
  - `iters`
  - `seed_only`
- Added `all` mode to run all lanes sequentially.
- Added streamed SSE logging so progress is visible during long RunPod runs.

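The key rename, the `all` mode, and the SSE handling can be sketched roughly as follows. This is a minimal illustration, not the contents of `scripts/trigger_lane_training_once.py`: the function names and the `lane` payload field are hypothetical, and the payload is assumed to be a plain dict. Only the key names `iters` and `seed_only`, the lane names, and the use of SSE come from these notes.

```python
# Lane names as listed under Scope.
ALL_LANES = ["magatamallm", "fo_blogllm", "tip_llm", "pulso_llm", "contact_llm"]

# Stale key -> key the training API expects (taken from the notes).
KEY_RENAMES = {"iterations": "iters", "seedOnly": "seed_only"}


def build_payload(lane, iters, seed_only):
    """Build a payload with the current key names.

    The "lane" field name is hypothetical and used only for illustration.
    """
    return {"lane": lane, "iters": int(iters), "seed_only": seed_only}


def migrate_payload(payload):
    """Rename stale keys in an existing payload; other keys pass through."""
    return {KEY_RENAMES.get(key, key): value for key, value in payload.items()}


def expand_lanes(target):
    """Return every lane for "all", otherwise just the named lane."""
    return list(ALL_LANES) if target == "all" else [target]


def iter_sse_data(lines):
    """Yield the data of each SSE event from an iterable of text lines.

    An event's "data:" lines are joined with newlines, and the event is
    dispatched at a blank line. Comment lines (starting with ":") and other
    fields are ignored in this sketch. Data not followed by a blank line
    is dropped.
    """
    buffer = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line == "":
            if buffer:
                yield "\n".join(buffer)
                buffer = []
        elif line.startswith("data:"):
            value = line[len("data:"):]
            # A single leading space after the colon is not part of the value.
            buffer.append(value[1:] if value.startswith(" ") else value)
```

Running lanes sequentially then reduces to looping over `expand_lanes("all")` and building one payload per lane, for example `build_payload(lane, 500, False)`, while printing each item from `iter_sse_data(...)` as it arrives.
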
## Live Run

- Started on Erik:
  - `python3 -u scripts/trigger_lane_training_once.py all 500 false`
- Log:
  - `/opt/magatama/logs/runpod-all-lanes-20260509T230549Z.log`
- First active lane:
  - `magatamallm`
- First RunPod job:
  - `89627e7e-8533-45db-9fe8-eca994018aa6-e2`
- Initial `magatamallm` dataset:
  - `1375 train`
  - `153 eval`
  - `1528 total`

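The reported `magatamallm` dataset counts can be sanity-checked with a few lines. The variable names are illustrative; only the three counts come from the log.

```python
# Counts reported for the initial `magatamallm` dataset.
train_count = 1375
eval_count = 153
reported_total = 1528

# The split should account for every example.
assert train_count + eval_count == reported_total

# Share of examples held out for evaluation (roughly a 90/10 split).
eval_fraction = eval_count / reported_total
print(f"eval fraction: {eval_fraction:.4f}")  # prints "eval fraction: 0.1001"
```
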
## Success Rule

- Do not treat RunPod `COMPLETED` as success by itself.
- A lane is only successful when:
  - the model artifact exists,
  - MAGATAMA imports/adopts it locally,
  - smoke checks pass,
  - the active alias/version is updated.
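
One way to encode this rule is a single predicate over per-lane outcome flags. This is a sketch under stated assumptions: the `LaneResult` type and its field names are hypothetical, and treating `COMPLETED` as a required (but not sufficient) condition is an interpretation of the rule, not something the MAGATAMA code is known to do.

```python
from dataclasses import dataclass


@dataclass
class LaneResult:
    """Outcome flags for one lane; all field names are hypothetical."""
    runpod_status: str          # e.g. "COMPLETED"
    artifact_exists: bool       # the model artifact exists
    adopted_locally: bool       # MAGATAMA imported/adopted it locally
    smoke_checks_passed: bool   # smoke checks passed
    alias_updated: bool         # active alias/version was updated


def lane_succeeded(result):
    """Apply the success rule: every condition must hold.

    Assumption: RunPod COMPLETED is treated as required but never enough
    on its own.
    """
    return (
        result.runpod_status == "COMPLETED"
        and result.artifact_exists
        and result.adopted_locally
        and result.smoke_checks_passed
        and result.alias_updated
    )
```

For example, `lane_succeeded(LaneResult("COMPLETED", True, True, False, True))` evaluates to `False` because the smoke checks did not pass.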