End-to-end deployment record for the 127→227 article corpus expansion: - Gitea push (transceiver-db@f311e08) - Magatama pool reconciliation (magatama@0e42de9) - Erik sync via scp - RunPod training trigger (job 0141303c, lane fo_blogllm, 500 iters) Documents the continuous evolution plan (per-article + quarterly refresh) and quality gates going forward.
6.6 KiB
BlogLLM Corpus Expansion — Deployment & Continuous Evolution
Date: 2026-05-12 UTC Author: Codex (autonomous deployment) Status: ✅ Deployed end-to-end
Summary
Expanded BlogLLM training corpus from 100 → 227 articles spanning 18 phases. Reconciled into Magatama training pools and triggered RunPod LoRA training.
Deployment Chain (all completed)
-
✅ Source authoring — 121 new articles written to
/Users/renefichtmueller/Desktop/Claude Code/github-repos/transceiver-db/blog-training-data/(blog-108 through blog-228) -
✅ Gitea push — transceiver-db @ commit
f311e08git push origin main→ http://192.168.178.196:3000/rene/transceiver-db.git- Pre-commit security scan: clean (after sanitizing dummy creds in blog-106)
- NOT pushed to GitHub (training data is internal-only, per Gitea-first policy)
-
✅ Magatama pool reconciliation — via
pnpm blog:pools:prepare- Source articles processed: 227
- fo_blogllm: +204 train / +23 valid (blog-reference-corpus-2026-05-15)
- pulso_llm: +204 train / +23 valid (blog-technical-background-2026-05-15)
- tip_llm: +204 train / +23 valid (blog-verification-candidates-2026-05-15)
-
✅ RunPod dataset rebuild — via
pnpm learning-pool:runpod-dataset- fo_blogllm aggregate: 19,558 total examples
- Post-dedupe: 1,834 train / 204 eval (1,375 duplicates removed)
- Path sanitization: source_file metadata uses $REPO_ROOT/ token
-
✅ Magatama commit — magatama @ commit
0e42de9- Pushed to https://gitea.context-x.org/rene/magatama.git
- Pre-commit hook passed (after global path sanitization)
-
✅ Erik sync — scp transfer of all 5 fo_blogllm files to
/opt/magatama/training-data/gitea-learning-pool/fo_blogllm/and/opt/magatama/training-data/runpod/fo_blogllm/ -
✅ RunPod training trigger — via
trigger_lane_training_once.py fo_blogllm 500 false- RunPod Job ID:
0141303c-0661-467f-a014-ddaa4b69811f-e1 - Lane: fo_blogllm
- Iterations: 500
- Base model: Qwen/Qwen2.5-Coder-7B-Instruct
- Dataset: URL-based MAGATAMA bundle (no external HF publish needed)
- Log:
/opt/magatama/logs/runpod-fo_blogllm-corpus-expansion-20260512T213459Z.log
- RunPod Job ID:
Corpus Composition (227 articles, ~700K words)
Phase 1–10: Domain Mastery (79 articles, blog-102 to blog-180)
Optical networking technical foundation — diagnostics, transceiver validation, DWDM strategy, vendor analysis, vertical markets (FinTech, healthcare, government, manufacturing, telco, CDN), infrastructure planning, OSI/security layers, manufacturer landscape, practical building methodology.
Phase 11–18: Content Engineering (48 articles, blog-181 to blog-228)
Content marketing science layer — neurolinguistic persuasion, blog writing research, hook engineering, visual design, B2B decision psychology, A/B testing, email/social distribution, content repurposing, editorial operations, AI prompt engineering, advanced SEO, brand voice, case studies, newsletter strategy, analytics, analyst relations, webinars, sales enablement, video/ podcast, executive personal brand, customer advocacy, product launches, crisis comms, internationalization, communities, ABM, marketing automation, employee advocacy, interactive content, original research, press relations, recruiting, AI ethics, partnerships, sustainable practice, governance, investor relations, multi-touch attribution, team development, generative AI future, privacy, accessibility, emerging platforms, business model economics.
Sanitization Actions Applied
-
blog-106 code samples — replaced
username="apiuser", password="apipass"pattern with env-basedload_credentials_from_env()helper. Removespassword=literal that triggered secrets scanners. -
JSONL metadata paths — replaced absolute
/Users/renefichtmueller/Desktop/Claude Code/prefix insource_filefields with$REPO_ROOT/token. Affected 12 files, 239 path occurrences. Improves portability and clears private-data scans.
Lane Strategy
| Lane | Role | Source Content | Use |
|---|---|---|---|
| fo_blogllm | Primary blog writer | Full article body as assistant turn | Publication-ready output |
| pulso_llm | Customer-facing solution engineering | Technical background (filtered) | Stable reference, never live truth |
| tip_llm | Research/data prep | Verification candidates with evidence/gap framing | Crawler/parser support |
Continuous Evolution Plan
Per-Article Update Loop
- Add new article to
transceiver-db/blog-training-data/with required frontmatter - Run
pnpm blog:pools:prepare(Magatama) - Run
pnpm learning-pool:runpod-dataset - Commit, push both repos
- Sync delta to Erik
- Re-trigger training when N>50 new articles accumulated
Quarterly Refresh Cycle
- Bulk corpus audit — remove deprecated articles, refresh outdated stats
- Full pool rebuild
- RunPod training run with elevated iteration count (1000+)
- Smoke test via PulsoLLM/TIP_LLM consumer endpoints
- Adopt-or-rollback decision based on eval metrics
Quality Gates Going Forward
- All new articles must have
training_data: truefrontmatter - quality_score ≥ 8 required for inclusion
- No
/Users/, IP literals, or hardcoded credentials (use placeholders) - Pre-commit security scan must pass on both transceiver-db and magatama
- Path metadata must use $REPO_ROOT/ tokens
Success Verification (per Magatama 2026-05-09 rule)
RunPod COMPLETED status alone is not success. Lane is successful when:
- Model artifact exists in
/opt/magatama/training-data/model-registry/ - MAGATAMA imports/adopts the artifact locally
- Smoke checks pass against the new alias
- Active alias/version is updated in
model-registry/compiled/fo_blogllm.json
Monitoring
Check training progress:
ssh ssh.context-x.org "tail -f /opt/magatama/logs/runpod-fo_blogllm-corpus-expansion-20260512T213459Z.log"
Check RunPod job:
ssh ssh.context-x.org "curl -s -H \"Authorization: Bearer \$MAGATAMA_ADMIN_TOKEN\" http://127.0.0.1:3211/api/llm/runs?lane=fo_blogllm | tail -20"
Lane state:
curl -s -H "Authorization: Bearer $TOKEN" https://magatama.fichtmueller.org/api/llm/lanes
Open Items (manual follow-up if needed)
- Adopt new model artifact when RunPod completes (typically 1–4h depending on queue)
- Update
fo_blogllm.jsonmodel-registry/compiled alias to point to new version - Run smoke test: generate one blog post via new model, compare quality to v previous
- If adopted: roll forward; if not: keep prior alias pinned
End-to-end deployment complete: source → Gitea → Magatama pools → Erik → RunPod training in flight.