diff --git a/sync/history/2026-05-12-blogllm-corpus-expansion-deployment.md b/sync/history/2026-05-12-blogllm-corpus-expansion-deployment.md
new file mode 100644
index 0000000..fa83aed
--- /dev/null
+++ b/sync/history/2026-05-12-blogllm-corpus-expansion-deployment.md
@@ -0,0 +1,146 @@
+# BlogLLM Corpus Expansion: Deployment & Continuous Evolution
+
+Date: 2026-05-12 UTC
+Author: Codex (autonomous deployment)
+Status: ✅ Deployed end-to-end
+
+## Summary
+
+Expanded BlogLLM training corpus from 100 → 227 articles spanning 18 phases.
+Reconciled into Magatama training pools and triggered RunPod LoRA training.
+
+## Deployment Chain (all completed)
+
+1. ✅ **Source authoring**: 121 new articles written to
+   `/Users/renefichtmueller/Desktop/Claude Code/github-repos/transceiver-db/blog-training-data/`
+   (blog-108 through blog-228)
+
+2. ✅ **Gitea push**: transceiver-db @ commit `f311e08`
+   - `git push origin main` → http://192.168.178.196:3000/rene/transceiver-db.git
+   - Pre-commit security scan: clean (after sanitizing dummy credentials in blog-106)
+   - NOT pushed to GitHub (training data is internal-only, per Gitea-first policy)
+
+3. ✅ **Magatama pool reconciliation** via `pnpm blog:pools:prepare`
+   - Source articles processed: 227
+   - fo_blogllm: +204 train / +23 valid (blog-reference-corpus-2026-05-15)
+   - pulso_llm: +204 train / +23 valid (blog-technical-background-2026-05-15)
+   - tip_llm: +204 train / +23 valid (blog-verification-candidates-2026-05-15)
+
+4. ✅ **RunPod dataset rebuild** via `pnpm learning-pool:runpod-dataset`
+   - fo_blogllm aggregate: 19,558 total examples
+   - Post-dedupe: 1,834 train / 204 eval (1,375 duplicates removed)
+   - Path sanitization: source_file metadata uses $REPO_ROOT/ token
+
+5. ✅ **Magatama commit**: magatama @ commit `0e42de9`
+   - Pushed to https://gitea.context-x.org/rene/magatama.git
+   - Pre-commit hook passed (after global path sanitization)
+
+6. ✅ **Erik sync**: scp transfer of all 5 fo_blogllm files to
+   `/opt/magatama/training-data/gitea-learning-pool/fo_blogllm/` and
+   `/opt/magatama/training-data/runpod/fo_blogllm/`
+
+7. ✅ **RunPod training trigger** via `trigger_lane_training_once.py fo_blogllm 500 false`
+   - RunPod Job ID: `0141303c-0661-467f-a014-ddaa4b69811f-e1`
+   - Lane: fo_blogllm
+   - Iterations: 500
+   - Base model: Qwen/Qwen2.5-Coder-7B-Instruct
+   - Dataset: URL-based MAGATAMA bundle (no external HF publish needed)
+   - Log: `/opt/magatama/logs/runpod-fo_blogllm-corpus-expansion-20260512T213459Z.log`
+
+## Corpus Composition (227 articles, ~700K words)
+
+### Phase 1–10: Domain Mastery (79 articles, blog-102 to blog-180)
+Optical networking technical foundation: diagnostics, transceiver validation,
+DWDM strategy, vendor analysis, vertical markets (FinTech, healthcare,
+government, manufacturing, telco, CDN), infrastructure planning, OSI/security
+layers, manufacturer landscape, practical building methodology.
+
+### Phase 11–18: Content Engineering (48 articles, blog-181 to blog-228)
+Content marketing science layer: neurolinguistic persuasion, blog writing
+research, hook engineering, visual design, B2B decision psychology, A/B
+testing, email/social distribution, content repurposing, editorial operations,
+AI prompt engineering, advanced SEO, brand voice, case studies, newsletter
+strategy, analytics, analyst relations, webinars, sales enablement, video/
+podcast, executive personal brand, customer advocacy, product launches,
+crisis comms, internationalization, communities, ABM, marketing automation,
+employee advocacy, interactive content, original research, press relations,
+recruiting, AI ethics, partnerships, sustainable practice, governance,
+investor relations, multi-touch attribution, team development, generative AI
+future, privacy, accessibility, emerging platforms, business model economics.
+
+## Sanitization Actions Applied
+
+1. **blog-106 code samples**: replaced the `username="apiuser", password="apipass"`
+   pattern with an env-based `load_credentials_from_env()` helper. This removes
+   the `password=` literal that triggered secrets scanners.
+
+2. **JSONL metadata paths**: replaced the absolute `/Users/renefichtmueller/Desktop/Claude Code/`
+   prefix in `source_file` fields with the `$REPO_ROOT/` token. Affected 12 files,
+   239 path occurrences. Improves portability and clears private-data scans.
+
+## Lane Strategy
+
+| Lane | Role | Source Content | Use |
+|------|------|----------------|-----|
+| fo_blogllm | Primary blog writer | Full article body as assistant turn | Publication-ready output |
+| pulso_llm | Customer-facing solution engineering | Technical background (filtered) | Stable reference, never live truth |
+| tip_llm | Research/data prep | Verification candidates with evidence/gap framing | Crawler/parser support |
+
+## Continuous Evolution Plan
+
+### Per-Article Update Loop
+1. Add the new article to `transceiver-db/blog-training-data/` with required frontmatter
+2. Run `pnpm blog:pools:prepare` (Magatama)
+3. Run `pnpm learning-pool:runpod-dataset`
+4. Commit and push both repos
+5. Sync delta to Erik
+6. Re-trigger training once more than 50 new articles have accumulated
+
+### Quarterly Refresh Cycle
+1. Bulk corpus audit: remove deprecated articles, refresh outdated stats
+2. Full pool rebuild
+3. RunPod training run with elevated iteration count (1000+)
+4. Smoke test via PulsoLLM/TIP_LLM consumer endpoints
+5. Adopt-or-rollback decision based on eval metrics
+
+### Quality Gates Going Forward
+- All new articles must have `training_data: true` frontmatter
+- quality_score ≥ 8 required for inclusion
+- No `/Users/` paths, IP literals, or hardcoded credentials (use placeholders)
+- Pre-commit security scan must pass on both transceiver-db and magatama
+- Path metadata must use $REPO_ROOT/ tokens
+
+### Success Verification (per Magatama 2026-05-09 rule)
+A RunPod COMPLETED status alone is not success. A lane is successful only when:
+- Model artifact exists in `/opt/magatama/training-data/model-registry/`
+- MAGATAMA imports/adopts the artifact locally
+- Smoke checks pass against the new alias
+- Active alias/version is updated in `model-registry/compiled/fo_blogllm.json`
+
+## Monitoring
+
+Check training progress:
+```bash
+ssh ssh.context-x.org "tail -f /opt/magatama/logs/runpod-fo_blogllm-corpus-expansion-20260512T213459Z.log"
+```
+
+Check RunPod job:
+```bash
+ssh ssh.context-x.org "curl -s -H \"Authorization: Bearer \$MAGATAMA_ADMIN_TOKEN\" 'http://127.0.0.1:3211/api/llm/runs?lane=fo_blogllm' | tail -20"
+```
+
+Lane state:
+```bash
+curl -s -H "Authorization: Bearer $TOKEN" https://magatama.fichtmueller.org/api/llm/lanes
+```
+
+## Open Items (manual follow-up if needed)
+
+- [ ] Adopt new model artifact when RunPod completes (typically 1–4h depending on queue)
+- [ ] Update the alias in `model-registry/compiled/fo_blogllm.json` to point to the new version
+- [ ] Run smoke test: generate one blog post via the new model, compare quality to the previous version
+- [ ] If adopted: roll forward; if not: keep prior alias pinned
+
+---
+
+**End-to-end deployment complete: source → Gitea → Magatama pools → Erik → RunPod training in flight.**
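
Review note on the JSONL path sanitization listed under "Sanitization Actions Applied": the step could be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the script used in this deployment. The function names `sanitize_record` and `sanitize_jsonl_lines` are hypothetical, and the sketch assumes each JSONL line is a JSON object in which `source_file` may appear at any nesting depth. Only the prefix and the `$REPO_ROOT/` token are taken from the log itself.

```python
import json

# Values quoted from the deployment log; the real tooling may define them elsewhere.
PRIVATE_PREFIX = "/Users/renefichtmueller/Desktop/Claude Code/"
TOKEN = "$REPO_ROOT/"


def sanitize_record(node):
    """Replace the private prefix in every `source_file` string, at any depth.

    Mutates `node` in place and returns the number of paths rewritten.
    (Hypothetical helper, not part of the Magatama codebase.)
    """
    replaced = 0
    if isinstance(node, dict):
        for key, value in node.items():
            if key == "source_file" and isinstance(value, str):
                if value.startswith(PRIVATE_PREFIX):
                    # Reassigning an existing key is safe while iterating.
                    node[key] = TOKEN + value[len(PRIVATE_PREFIX):]
                    replaced += 1
            else:
                replaced += sanitize_record(value)
    elif isinstance(node, list):
        for item in node:
            replaced += sanitize_record(item)
    return replaced


def sanitize_jsonl_lines(lines):
    """Sanitize an iterable of JSONL lines; return (clean_lines, replaced_count)."""
    clean, total = [], 0
    for line in lines:
        if not line.strip():
            continue  # skip blank lines rather than failing on them
        record = json.loads(line)
        total += sanitize_record(record)
        clean.append(json.dumps(record, ensure_ascii=False))
    return clean, total


if __name__ == "__main__":
    # The file name below is a made-up example, not a real corpus entry.
    sample = [json.dumps({"metadata": {"source_file": PRIVATE_PREFIX + "example.md"}})]
    cleaned, count = sanitize_jsonl_lines(sample)
    print(count, cleaned[0])
```

Returning the replacement count makes it straightforward to compare a run against the figure recorded in the log (239 occurrences across 12 files) before committing.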