LLM Training Pool Research Expansion

Date: 2026-05-10 UTC
Owner: Codex

Summary

Codex expanded MAGATAMA's training pools with curated research input for all five active LLM lanes.

The work intentionally avoided copying third-party article/vendor-document bodies. Instead, it added source metadata, provenance-safe source seeds, crawler/evaluation policies, and original SFT examples that teach the desired behavior.

Files Added In Magatama

training-data/model-registry/external-ingest/llm-lane-research-seeds-2026-05-10.jsonl
training-data/gitea-learning-pool/magatamallm/curated-web-research-2026-05-10.train.jsonl
training-data/gitea-learning-pool/magatamallm/curated-web-research-2026-05-10.valid.jsonl
training-data/gitea-learning-pool/fo_blogllm/curated-web-research-2026-05-10.train.jsonl
training-data/gitea-learning-pool/fo_blogllm/curated-web-research-2026-05-10.valid.jsonl
training-data/gitea-learning-pool/tip_llm/curated-web-research-2026-05-10.train.jsonl
training-data/gitea-learning-pool/tip_llm/curated-web-research-2026-05-10.valid.jsonl
training-data/gitea-learning-pool/pulso_llm/curated-web-research-2026-05-10.train.jsonl
training-data/gitea-learning-pool/pulso_llm/curated-web-research-2026-05-10.valid.jsonl
training-data/gitea-learning-pool/contact_llm/curated-web-research-2026-05-10.train.jsonl
training-data/gitea-learning-pool/contact_llm/curated-web-research-2026-05-10.valid.jsonl

Builder Change

scripts/runpod_dataset_builder.ts now automatically reads supplemental *.train.jsonl and *.valid.jsonl files from each lane's Gitea learning-pool directory.

This means future curated research drops can be added as small lane-specific files without manually editing huge train.jsonl/valid.jsonl files.
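The discovery step can be sketched as follows. This is a minimal illustration and not the actual code in scripts/runpod_dataset_builder.ts: the helper name listSupplementalFiles, the sorted ordering, and the assumption that the base files are named exactly train.jsonl and valid.jsonl are all hypothetical.

```typescript
import * as fs from "node:fs";
import * as os from "node:os";
import * as path from "node:path";

type Split = "train" | "valid";

// Hypothetical helper: list supplemental files for one split in a lane directory.
// Assumption: supplemental files end in ".train.jsonl" / ".valid.jsonl", while the
// base files are named exactly "train.jsonl" / "valid.jsonl" and so do not match.
function listSupplementalFiles(laneDir: string, split: Split): string[] {
  const suffix = `.${split}.jsonl`;
  return fs
    .readdirSync(laneDir, { withFileTypes: true })
    .filter((entry) => entry.isFile() && entry.name.endsWith(suffix))
    .map((entry) => path.join(laneDir, entry.name))
    .sort(); // deterministic order so rebuilds are reproducible
}

// Demo against a throwaway directory that mimics one lane's learning pool.
const demoDir = fs.mkdtempSync(path.join(os.tmpdir(), "lane-"));
for (const name of [
  "train.jsonl",
  "valid.jsonl",
  "curated-web-research-2026-05-10.train.jsonl",
  "curated-web-research-2026-05-10.valid.jsonl",
]) {
  fs.writeFileSync(path.join(demoDir, name), "");
}

const demoTrain = listSupplementalFiles(demoDir, "train").map((p) => path.basename(p));
console.log(demoTrain);
```

Matching on the dotted suffix keeps the large base files out of the supplemental list, so a new research drop only needs a file with the right name in the lane directory.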

Rebuilt Dataset Counts

magatamallm: 1396 train / 156 eval / 1552 total
fo_blogllm: 17357 train / 1931 eval / 19288 total
tip_llm: 303 train / 35 eval / 338 total
pulso_llm: 54 train / 8 eval / 62 total
contact_llm: 33 train / 6 eval / 39 total
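As a sanity check on the counts above, a short script can confirm that train plus eval equals the reported total for every lane. The LaneCount type and both function names are illustrative assumptions; only the numbers come from this note.

```typescript
interface LaneCount {
  lane: string;
  train: number;
  evalCount: number;
  total: number;
}

// Counts copied from the rebuilt dataset list in this note.
const laneCounts: LaneCount[] = [
  { lane: "magatamallm", train: 1396, evalCount: 156, total: 1552 },
  { lane: "fo_blogllm", train: 17357, evalCount: 1931, total: 19288 },
  { lane: "tip_llm", train: 303, evalCount: 35, total: 338 },
  { lane: "pulso_llm", train: 54, evalCount: 8, total: 62 },
  { lane: "contact_llm", train: 33, evalCount: 6, total: 39 },
];

// Hypothetical check: return the lanes whose train + eval does not match total.
function findCountMismatches(counts: LaneCount[]): string[] {
  return counts
    .filter((c) => c.train + c.evalCount !== c.total)
    .map((c) => c.lane);
}

// Fraction of a lane's examples held out for evaluation.
function evalShare(c: LaneCount): number {
  return c.evalCount / c.total;
}

const mismatches = findCountMismatches(laneCounts);
console.log(mismatches.length === 0 ? "all lane totals are consistent" : mismatches);

for (const c of laneCounts) {
  console.log(`${c.lane}: eval share ${(evalShare(c) * 100).toFixed(1)}%`);
}
```

All five totals are consistent with their train and eval counts. The eval share is printed as well because the smaller lanes (pulso_llm, contact_llm) hold out only a handful of examples.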

Lane Policy Notes

magatamallm: cybersecurity, AI security, infrastructure security, proof-before-close, safe remediation, artifact-gated training adoption.
fo_blogllm: Rene/Flexoptix technical and founder voice, source-backed blogs, no copied source bodies, no fabricated numbers.
tip_llm: crawler, scraper, parser, robots, search patterns, switch/transceiver/vendor/advisory/forum research.
pulso_llm: customer-facing Flexoptix/switch/transceiver support, product planning, troubleshooting and quote preparation; never invent SKU, price, stock, warranty or compatibility.
contact_llm: public business/contact research with source provenance, schema.org/RDAP/PeeringDB/security.txt awareness, robots/privacy guardrails.

Verification

JSONL validation passed for all new files.
pnpm exec tsx scripts/runpod_dataset_builder.ts passed when run outside the sandbox, because tsx IPC needed host permissions.
pnpm exec tsx scripts/model_registry_build.ts passed.
git diff --check passed.
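This note does not say how the JSONL validation in the first check was performed. One plausible minimal form, sketched here under stated assumptions, is to require that every non-empty line parses as a single JSON object. The function name validateJsonl, the JsonlIssue type, and the rule that records must be objects are all assumptions.

```typescript
interface JsonlIssue {
  line: number; // 1-based line number within the file
  reason: string;
}

// Hypothetical validator: each line must be one JSON object.
// A single trailing newline at the end of the file is tolerated.
function validateJsonl(text: string): JsonlIssue[] {
  const issues: JsonlIssue[] = [];
  const lines = text.split(/\r?\n/);

  lines.forEach((raw, index) => {
    if (raw.trim() === "") {
      // Only the final (empty) segment after a trailing newline is allowed.
      if (index !== lines.length - 1) {
        issues.push({ line: index + 1, reason: "blank line" });
      }
      return;
    }
    try {
      const value: unknown = JSON.parse(raw);
      if (typeof value !== "object" || value === null || Array.isArray(value)) {
        issues.push({ line: index + 1, reason: "not a JSON object" });
      }
    } catch (err) {
      issues.push({ line: index + 1, reason: `invalid JSON: ${(err as Error).message}` });
    }
  });

  return issues;
}

// Usage: an empty result means the content passed this check.
console.log(validateJsonl('{"a":1}\n{"b":2}\n'));
```

A check like this only confirms that the file is well-formed JSONL. It does not confirm that each record has the fields the trainer expects.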

Follow-Up

Push Magatama training-pool changes to Gitea.
Deploy or pull the updated builder and training-data on Erik before starting the next training run.
For the next RunPod training run, verify artifact adoption and the version bump per the strict success rule.
