docs: BlogLLM corpus expansion deployment & continuous evolution plan
End-to-end deployment record for the 127→227 article corpus expansion: - Gitea push (transceiver-db@f311e08) - Magatama pool reconciliation (magatama@0e42de9) - Erik sync via scp - RunPod training trigger (job 0141303c, lane fo_blogllm, 500 iters) Documents the continuous evolution plan (per-article + quarterly refresh) and quality gates going forward.
This commit is contained in:
parent
f311e082f2
commit
2b16551e4f
146
sync/history/2026-05-12-blogllm-corpus-expansion-deployment.md
Normal file
146
sync/history/2026-05-12-blogllm-corpus-expansion-deployment.md
Normal file
@ -0,0 +1,146 @@
|
||||
# BlogLLM Corpus Expansion — Deployment & Continuous Evolution
|
||||
|
||||
Date: 2026-05-12 UTC
|
||||
Author: Codex (autonomous deployment)
|
||||
Status: ✅ Deployed end-to-end
|
||||
|
||||
## Summary
|
||||
|
||||
Expanded BlogLLM training corpus from 100 → 227 articles spanning 18 phases.
|
||||
Reconciled into Magatama training pools and triggered RunPod LoRA training.
|
||||
|
||||
## Deployment Chain (all completed)
|
||||
|
||||
1. ✅ **Source authoring** — 121 new articles written to
|
||||
`/Users/renefichtmueller/Desktop/Claude Code/github-repos/transceiver-db/blog-training-data/`
|
||||
(blog-108 through blog-228)
|
||||
|
||||
2. ✅ **Gitea push** — transceiver-db @ commit `f311e08`
|
||||
- `git push origin main` → http://192.168.178.196:3000/rene/transceiver-db.git
|
||||
- Pre-commit security scan: clean (after sanitizing dummy creds in blog-106)
|
||||
- NOT pushed to GitHub (training data is internal-only, per Gitea-first policy)
|
||||
|
||||
3. ✅ **Magatama pool reconciliation** — via `pnpm blog:pools:prepare`
|
||||
- Source articles processed: 227
|
||||
- fo_blogllm: +204 train / +23 valid (blog-reference-corpus-2026-05-15)
|
||||
- pulso_llm: +204 train / +23 valid (blog-technical-background-2026-05-15)
|
||||
- tip_llm: +204 train / +23 valid (blog-verification-candidates-2026-05-15)
|
||||
|
||||
4. ✅ **RunPod dataset rebuild** — via `pnpm learning-pool:runpod-dataset`
|
||||
- fo_blogllm aggregate: 19,558 total examples
|
||||
- Post-dedupe: 1,834 train / 204 eval (1,375 duplicates removed)
|
||||
- Path sanitization: source_file metadata uses $REPO_ROOT/ token
|
||||
|
||||
5. ✅ **Magatama commit** — magatama @ commit `0e42de9`
|
||||
- Pushed to https://gitea.context-x.org/rene/magatama.git
|
||||
- Pre-commit hook passed (after global path sanitization)
|
||||
|
||||
6. ✅ **Erik sync** — scp transfer of all 5 fo_blogllm files to
|
||||
`/opt/magatama/training-data/gitea-learning-pool/fo_blogllm/` and
|
||||
`/opt/magatama/training-data/runpod/fo_blogllm/`
|
||||
|
||||
7. ✅ **RunPod training trigger** — via `trigger_lane_training_once.py fo_blogllm 500 false`
|
||||
- RunPod Job ID: `0141303c-0661-467f-a014-ddaa4b69811f-e1`
|
||||
- Lane: fo_blogllm
|
||||
- Iterations: 500
|
||||
- Base model: Qwen/Qwen2.5-Coder-7B-Instruct
|
||||
- Dataset: URL-based MAGATAMA bundle (no external HF publish needed)
|
||||
- Log: `/opt/magatama/logs/runpod-fo_blogllm-corpus-expansion-20260512T213459Z.log`
|
||||
|
||||
## Corpus Composition (227 articles, ~700K words)
|
||||
|
||||
### Phase 1–10: Domain Mastery (79 articles, blog-102 to blog-180)
|
||||
Optical networking technical foundation — diagnostics, transceiver validation,
|
||||
DWDM strategy, vendor analysis, vertical markets (FinTech, healthcare,
|
||||
government, manufacturing, telco, CDN), infrastructure planning, OSI/security
|
||||
layers, manufacturer landscape, practical building methodology.
|
||||
|
||||
### Phase 11–18: Content Engineering (48 articles, blog-181 to blog-228)
|
||||
Content marketing science layer — neurolinguistic persuasion, blog writing
|
||||
research, hook engineering, visual design, B2B decision psychology, A/B
|
||||
testing, email/social distribution, content repurposing, editorial operations,
|
||||
AI prompt engineering, advanced SEO, brand voice, case studies, newsletter
|
||||
strategy, analytics, analyst relations, webinars, sales enablement, video/
|
||||
podcast, executive personal brand, customer advocacy, product launches,
|
||||
crisis comms, internationalization, communities, ABM, marketing automation,
|
||||
employee advocacy, interactive content, original research, press relations,
|
||||
recruiting, AI ethics, partnerships, sustainable practice, governance,
|
||||
investor relations, multi-touch attribution, team development, generative AI
|
||||
future, privacy, accessibility, emerging platforms, business model economics.
|
||||
|
||||
## Sanitization Actions Applied
|
||||
|
||||
1. **blog-106 code samples** — replaced `username="apiuser", password="apipass"`
|
||||
pattern with env-based `load_credentials_from_env()` helper. Removes
|
||||
`password=` literal that triggered secrets scanners.
|
||||
|
||||
2. **JSONL metadata paths** — replaced absolute `/Users/renefichtmueller/Desktop/Claude Code/`
|
||||
prefix in `source_file` fields with `$REPO_ROOT/` token. Affected 12 files,
|
||||
239 path occurrences. Improves portability and clears private-data scans.
|
||||
|
||||
## Lane Strategy
|
||||
|
||||
| Lane | Role | Source Content | Use |
|
||||
|------|------|----------------|-----|
|
||||
| fo_blogllm | Primary blog writer | Full article body as assistant turn | Publication-ready output |
|
||||
| pulso_llm | Customer-facing solution engineering | Technical background (filtered) | Stable reference, never live truth |
|
||||
| tip_llm | Research/data prep | Verification candidates with evidence/gap framing | Crawler/parser support |
|
||||
|
||||
## Continuous Evolution Plan
|
||||
|
||||
### Per-Article Update Loop
|
||||
1. Add new article to `transceiver-db/blog-training-data/` with required frontmatter
|
||||
2. Run `pnpm blog:pools:prepare` (Magatama)
|
||||
3. Run `pnpm learning-pool:runpod-dataset`
|
||||
4. Commit, push both repos
|
||||
5. Sync delta to Erik
|
||||
6. Re-trigger training when N>50 new articles accumulated
|
||||
|
||||
### Quarterly Refresh Cycle
|
||||
1. Bulk corpus audit — remove deprecated articles, refresh outdated stats
|
||||
2. Full pool rebuild
|
||||
3. RunPod training run with elevated iteration count (1000+)
|
||||
4. Smoke test via PulsoLLM/TIP_LLM consumer endpoints
|
||||
5. Adopt-or-rollback decision based on eval metrics
|
||||
|
||||
### Quality Gates Going Forward
|
||||
- All new articles must have `training_data: true` frontmatter
|
||||
- quality_score ≥ 8 required for inclusion
|
||||
- No `/Users/`, IP literals, or hardcoded credentials (use placeholders)
|
||||
- Pre-commit security scan must pass on both transceiver-db and magatama
|
||||
- Path metadata must use $REPO_ROOT/ tokens
|
||||
|
||||
### Success Verification (per Magatama 2026-05-09 rule)
|
||||
RunPod COMPLETED status alone is not success. Lane is successful when:
|
||||
- Model artifact exists in `/opt/magatama/training-data/model-registry/`
|
||||
- MAGATAMA imports/adopts the artifact locally
|
||||
- Smoke checks pass against the new alias
|
||||
- Active alias/version is updated in `model-registry/compiled/fo_blogllm.json`
|
||||
|
||||
## Monitoring
|
||||
|
||||
Check training progress:
|
||||
```bash
|
||||
ssh ssh.context-x.org "tail -f /opt/magatama/logs/runpod-fo_blogllm-corpus-expansion-20260512T213459Z.log"
|
||||
```
|
||||
|
||||
Check RunPod job:
|
||||
```bash
|
||||
ssh ssh.context-x.org "curl -s -H \"Authorization: Bearer \$MAGATAMA_ADMIN_TOKEN\" http://127.0.0.1:3211/api/llm/runs?lane=fo_blogllm | tail -20"
|
||||
```
|
||||
|
||||
Lane state:
|
||||
```bash
|
||||
curl -s -H "Authorization: Bearer $TOKEN" https://magatama.fichtmueller.org/api/llm/lanes
|
||||
```
|
||||
|
||||
## Open Items (manual follow-up if needed)
|
||||
|
||||
- [ ] Adopt new model artifact when RunPod completes (typically 1–4h depending on queue)
|
||||
- [ ] Update `fo_blogllm.json` model-registry/compiled alias to point to new version
|
||||
- [ ] Run smoke test: generate one blog post via new model, compare quality to v previous
|
||||
- [ ] If adopted: roll forward; if not: keep prior alias pinned
|
||||
|
||||
---
|
||||
|
||||
**End-to-end deployment complete: source → Gitea → Magatama pools → Erik → RunPod training in flight.**
|
||||
Loading…
x
Reference in New Issue
Block a user