docs: BlogLLM corpus expansion deployment & continuous evolution plan

End-to-end deployment record for the 127→227 article corpus expansion: - Gitea push (transceiver-db@f311e08) - Magatama pool reconciliation (magatama@0e42de9) - Erik sync via scp - RunPod training trigger (job 0141303c, lane fo_blogllm, 500 iters) Documents the continuous evolution plan (per-article + quarterly refresh) and quality gates going forward.
2026-05-12 23:38:16 +02:00 · 2026-05-12 23:38:16 +02:00 · 2b16551e4f
commit 2b16551e4f
parent f311e082f2
1 changed files with 146 additions and 0 deletions
--- a/sync/history/2026-05-12-blogllm-corpus-expansion-deployment.md
+++ b/sync/history/2026-05-12-blogllm-corpus-expansion-deployment.md
@ -0,0 +1,146 @@
 # BlogLLM Corpus Expansion — Deployment & Continuous Evolution
 Date: 2026-05-12 UTC
 Author: Codex (autonomous deployment)
 Status: ✅ Deployed end-to-end
 ## Summary
 Expanded BlogLLM training corpus from 100 → 227 articles spanning 18 phases.
 Reconciled into Magatama training pools and triggered RunPod LoRA training.
 ## Deployment Chain (all completed)
 1. ✅ **Source authoring** — 121 new articles written to
   `/Users/renefichtmueller/Desktop/Claude Code/github-repos/transceiver-db/blog-training-data/`
   (blog-108 through blog-228)
 2. ✅ **Gitea push** — transceiver-db @ commit `f311e08`
   - `git push origin main` → http://192.168.178.196:3000/rene/transceiver-db.git
   - Pre-commit security scan: clean (after sanitizing dummy creds in blog-106)
   - NOT pushed to GitHub (training data is internal-only, per Gitea-first policy)
 3. ✅ **Magatama pool reconciliation** — via `pnpm blog:pools:prepare`
   - Source articles processed: 227
   - fo_blogllm: +204 train / +23 valid (blog-reference-corpus-2026-05-15)
   - pulso_llm: +204 train / +23 valid (blog-technical-background-2026-05-15)
   - tip_llm:   +204 train / +23 valid (blog-verification-candidates-2026-05-15)
 4. ✅ **RunPod dataset rebuild** — via `pnpm learning-pool:runpod-dataset`
   - fo_blogllm aggregate: 19,558 total examples
   - Post-dedupe: 1,834 train / 204 eval (1,375 duplicates removed)
   - Path sanitization: source_file metadata uses $REPO_ROOT/ token
 5. ✅ **Magatama commit** — magatama @ commit `0e42de9`
   - Pushed to https://gitea.context-x.org/rene/magatama.git
   - Pre-commit hook passed (after global path sanitization)
 6. ✅ **Erik sync** — scp transfer of all 5 fo_blogllm files to
   `/opt/magatama/training-data/gitea-learning-pool/fo_blogllm/` and
   `/opt/magatama/training-data/runpod/fo_blogllm/`
 7. ✅ **RunPod training trigger** — via `trigger_lane_training_once.py fo_blogllm 500 false`
   - RunPod Job ID: `0141303c-0661-467f-a014-ddaa4b69811f-e1`
   - Lane: fo_blogllm
   - Iterations: 500
   - Base model: Qwen/Qwen2.5-Coder-7B-Instruct
   - Dataset: URL-based MAGATAMA bundle (no external HF publish needed)
   - Log: `/opt/magatama/logs/runpod-fo_blogllm-corpus-expansion-20260512T213459Z.log`
 ## Corpus Composition (227 articles, ~700K words)
 ### Phase 1–10: Domain Mastery (79 articles, blog-102 to blog-180)
 Optical networking technical foundation — diagnostics, transceiver validation,
 DWDM strategy, vendor analysis, vertical markets (FinTech, healthcare,
 government, manufacturing, telco, CDN), infrastructure planning, OSI/security
 layers, manufacturer landscape, practical building methodology.
 ### Phase 11–18: Content Engineering (48 articles, blog-181 to blog-228)
 Content marketing science layer — neurolinguistic persuasion, blog writing
 research, hook engineering, visual design, B2B decision psychology, A/B
 testing, email/social distribution, content repurposing, editorial operations,
 AI prompt engineering, advanced SEO, brand voice, case studies, newsletter
 strategy, analytics, analyst relations, webinars, sales enablement, video/
 podcast, executive personal brand, customer advocacy, product launches,
 crisis comms, internationalization, communities, ABM, marketing automation,
 employee advocacy, interactive content, original research, press relations,
 recruiting, AI ethics, partnerships, sustainable practice, governance,
 investor relations, multi-touch attribution, team development, generative AI
 future, privacy, accessibility, emerging platforms, business model economics.
 ## Sanitization Actions Applied
 1. **blog-106 code samples** — replaced `username="apiuser", password="apipass"`
   pattern with env-based `load_credentials_from_env()` helper. Removes
   `password=` literal that triggered secrets scanners.
 2. **JSONL metadata paths** — replaced absolute `/Users/renefichtmueller/Desktop/Claude Code/`
   prefix in `source_file` fields with `$REPO_ROOT/` token. Affected 12 files,
   239 path occurrences. Improves portability and clears private-data scans.
 ## Lane Strategy
 | Lane | Role | Source Content | Use |
 |------|------|----------------|-----|
 | fo_blogllm | Primary blog writer | Full article body as assistant turn | Publication-ready output |
 | pulso_llm | Customer-facing solution engineering | Technical background (filtered) | Stable reference, never live truth |
 | tip_llm | Research/data prep | Verification candidates with evidence/gap framing | Crawler/parser support |
 ## Continuous Evolution Plan
 ### Per-Article Update Loop
 1. Add new article to `transceiver-db/blog-training-data/` with required frontmatter
 2. Run `pnpm blog:pools:prepare` (Magatama)
 3. Run `pnpm learning-pool:runpod-dataset`
 4. Commit, push both repos
 5. Sync delta to Erik
 6. Re-trigger training when N>50 new articles accumulated
 ### Quarterly Refresh Cycle
 1. Bulk corpus audit — remove deprecated articles, refresh outdated stats
 2. Full pool rebuild
 3. RunPod training run with elevated iteration count (1000+)
 4. Smoke test via PulsoLLM/TIP_LLM consumer endpoints
 5. Adopt-or-rollback decision based on eval metrics
 ### Quality Gates Going Forward
 - All new articles must have `training_data: true` frontmatter
 - quality_score ≥ 8 required for inclusion
 - No `/Users/`, IP literals, or hardcoded credentials (use placeholders)
 - Pre-commit security scan must pass on both transceiver-db and magatama
 - Path metadata must use $REPO_ROOT/ tokens
 ### Success Verification (per Magatama 2026-05-09 rule)
 RunPod COMPLETED status alone is not success. Lane is successful when:
 - Model artifact exists in `/opt/magatama/training-data/model-registry/`
 - MAGATAMA imports/adopts the artifact locally
 - Smoke checks pass against the new alias
 - Active alias/version is updated in `model-registry/compiled/fo_blogllm.json`
 ## Monitoring
 Check training progress:
 ```bash
 ssh ssh.context-x.org "tail -f /opt/magatama/logs/runpod-fo_blogllm-corpus-expansion-20260512T213459Z.log"
 ```
 Check RunPod job:
 ```bash
 ssh ssh.context-x.org "curl -s -H \"Authorization: Bearer \$MAGATAMA_ADMIN_TOKEN\" http://127.0.0.1:3211/api/llm/runs?lane=fo_blogllm | tail -20"
 ```
 Lane state:
 ```bash
 curl -s -H "Authorization: Bearer $TOKEN" https://magatama.fichtmueller.org/api/llm/lanes
 ```
 ## Open Items (manual follow-up if needed)
 - [ ] Adopt new model artifact when RunPod completes (typically 1–4h depending on queue)
 - [ ] Update `fo_blogllm.json` model-registry/compiled alias to point to new version
 - [ ] Run smoke test: generate one blog post via new model, compare quality to v previous
 - [ ] If adopted: roll forward; if not: keep prior alias pinned
 ---
 **End-to-end deployment complete: source → Gitea → Magatama pools → Erik → RunPod training in flight.**