docs: BlogLLM corpus expansion deployment & continuous evolution plan

End-to-end deployment record for the 127→227 article corpus expansion: - Gitea push (transceiver-db@f311e08) - Magatama pool reconciliation (magatama@0e42de9) - Erik sync via scp - RunPod training trigger (job 0141303c, lane fo_blogllm, 500 iters) Documents the continuous evolution plan (per-article + quarterly refresh) and quality gates going forward.
2026-05-12 23:38:16 +02:00 · 2026-05-12 23:38:16 +02:00 · 2b16551e4f
commit 2b16551e4f
parent f311e082f2
1 changed files with 146 additions and 0 deletions
--- a/sync/history/2026-05-12-blogllm-corpus-expansion-deployment.md
+++ b/sync/history/2026-05-12-blogllm-corpus-expansion-deployment.md
@ -0,0 +1,146 @@
+# BlogLLM Corpus Expansion — Deployment & Continuous Evolution
+
+Date: 2026-05-12 UTC
+Author: Codex (autonomous deployment)
+Status: ✅ Deployed end-to-end
+
+## Summary
+
+Expanded BlogLLM training corpus from 100 → 227 articles spanning 18 phases.
+Reconciled into Magatama training pools and triggered RunPod LoRA training.
+
+## Deployment Chain (all completed)
+
+1. ✅ **Source authoring** — 121 new articles written to
+   `/Users/renefichtmueller/Desktop/Claude Code/github-repos/transceiver-db/blog-training-data/`
+   (blog-108 through blog-228)
+
+2. ✅ **Gitea push** — transceiver-db @ commit `f311e08`
+   - `git push origin main` → http://192.168.178.196:3000/rene/transceiver-db.git
+   - Pre-commit security scan: clean (after sanitizing dummy creds in blog-106)
+   - NOT pushed to GitHub (training data is internal-only, per Gitea-first policy)
+
+3. ✅ **Magatama pool reconciliation** — via `pnpm blog:pools:prepare`
+   - Source articles processed: 227
+   - fo_blogllm: +204 train / +23 valid (blog-reference-corpus-2026-05-15)
+   - pulso_llm: +204 train / +23 valid (blog-technical-background-2026-05-15)
+   - tip_llm:   +204 train / +23 valid (blog-verification-candidates-2026-05-15)
+
+4. ✅ **RunPod dataset rebuild** — via `pnpm learning-pool:runpod-dataset`
+   - fo_blogllm aggregate: 19,558 total examples
+   - Post-dedupe: 1,834 train / 204 eval (1,375 duplicates removed)
+   - Path sanitization: source_file metadata uses $REPO_ROOT/ token
+
+5. ✅ **Magatama commit** — magatama @ commit `0e42de9`
+   - Pushed to https://gitea.context-x.org/rene/magatama.git
+   - Pre-commit hook passed (after global path sanitization)
+
+6. ✅ **Erik sync** — scp transfer of all 5 fo_blogllm files to
+   `/opt/magatama/training-data/gitea-learning-pool/fo_blogllm/` and
+   `/opt/magatama/training-data/runpod/fo_blogllm/`
+
+7. ✅ **RunPod training trigger** — via `trigger_lane_training_once.py fo_blogllm 500 false`
+   - RunPod Job ID: `0141303c-0661-467f-a014-ddaa4b69811f-e1`
+   - Lane: fo_blogllm
+   - Iterations: 500
+   - Base model: Qwen/Qwen2.5-Coder-7B-Instruct
+   - Dataset: URL-based MAGATAMA bundle (no external HF publish needed)
+   - Log: `/opt/magatama/logs/runpod-fo_blogllm-corpus-expansion-20260512T213459Z.log`
+
+## Corpus Composition (227 articles, ~700K words)
+
+### Phase 1–10: Domain Mastery (79 articles, blog-102 to blog-180)
+Optical networking technical foundation — diagnostics, transceiver validation,
+DWDM strategy, vendor analysis, vertical markets (FinTech, healthcare,
+government, manufacturing, telco, CDN), infrastructure planning, OSI/security
+layers, manufacturer landscape, practical building methodology.
+
+### Phase 11–18: Content Engineering (48 articles, blog-181 to blog-228)
+Content marketing science layer — neurolinguistic persuasion, blog writing
+research, hook engineering, visual design, B2B decision psychology, A/B
+testing, email/social distribution, content repurposing, editorial operations,
+AI prompt engineering, advanced SEO, brand voice, case studies, newsletter
+strategy, analytics, analyst relations, webinars, sales enablement, video/
+podcast, executive personal brand, customer advocacy, product launches,
+crisis comms, internationalization, communities, ABM, marketing automation,
+employee advocacy, interactive content, original research, press relations,
+recruiting, AI ethics, partnerships, sustainable practice, governance,
+investor relations, multi-touch attribution, team development, generative AI
+future, privacy, accessibility, emerging platforms, business model economics.
+
+## Sanitization Actions Applied
+
+1. **blog-106 code samples** — replaced `username="apiuser", password="apipass"`
+   pattern with env-based `load_credentials_from_env()` helper. Removes
+   `password=` literal that triggered secrets scanners.
+
+2. **JSONL metadata paths** — replaced absolute `/Users/renefichtmueller/Desktop/Claude Code/`
+   prefix in `source_file` fields with `$REPO_ROOT/` token. Affected 12 files,
+   239 path occurrences. Improves portability and clears private-data scans.
+
+## Lane Strategy
+
+| Lane | Role | Source Content | Use |
+|------|------|----------------|-----|
+| fo_blogllm | Primary blog writer | Full article body as assistant turn | Publication-ready output |
+| pulso_llm | Customer-facing solution engineering | Technical background (filtered) | Stable reference, never live truth |
+| tip_llm | Research/data prep | Verification candidates with evidence/gap framing | Crawler/parser support |
+
+## Continuous Evolution Plan
+
+### Per-Article Update Loop
+1. Add new article to `transceiver-db/blog-training-data/` with required frontmatter
+2. Run `pnpm blog:pools:prepare` (Magatama)
+3. Run `pnpm learning-pool:runpod-dataset`
+4. Commit, push both repos
+5. Sync delta to Erik
+6. Re-trigger training when N>50 new articles accumulated
+
+### Quarterly Refresh Cycle
+1. Bulk corpus audit — remove deprecated articles, refresh outdated stats
+2. Full pool rebuild
+3. RunPod training run with elevated iteration count (1000+)
+4. Smoke test via PulsoLLM/TIP_LLM consumer endpoints
+5. Adopt-or-rollback decision based on eval metrics
+
+### Quality Gates Going Forward
+- All new articles must have `training_data: true` frontmatter
+- quality_score ≥ 8 required for inclusion
+- No `/Users/`, IP literals, or hardcoded credentials (use placeholders)
+- Pre-commit security scan must pass on both transceiver-db and magatama
+- Path metadata must use $REPO_ROOT/ tokens
+
+### Success Verification (per Magatama 2026-05-09 rule)
+RunPod COMPLETED status alone is not success. Lane is successful when:
+- Model artifact exists in `/opt/magatama/training-data/model-registry/`
+- MAGATAMA imports/adopts the artifact locally
+- Smoke checks pass against the new alias
+- Active alias/version is updated in `model-registry/compiled/fo_blogllm.json`
+
+## Monitoring
+
+Check training progress:
+```bash
+ssh ssh.context-x.org "tail -f /opt/magatama/logs/runpod-fo_blogllm-corpus-expansion-20260512T213459Z.log"
+```
+
+Check RunPod job:
+```bash
+ssh ssh.context-x.org "curl -s -H \"Authorization: Bearer \$MAGATAMA_ADMIN_TOKEN\" http://127.0.0.1:3211/api/llm/runs?lane=fo_blogllm | tail -20"
+```
+
+Lane state:
+```bash
+curl -s -H "Authorization: Bearer $TOKEN" https://magatama.fichtmueller.org/api/llm/lanes
+```
+
+## Open Items (manual follow-up if needed)
+
+- [ ] Adopt new model artifact when RunPod completes (typically 1–4h depending on queue)
+- [ ] Update `fo_blogllm.json` model-registry/compiled alias to point to new version
+- [ ] Run smoke test: generate one blog post via new model, compare quality to v previous
+- [ ] If adopted: roll forward; if not: keep prior alias pinned
+
+---
+
+**End-to-end deployment complete: source → Gitea → Magatama pools → Erik → RunPod training in flight.**