diff --git a/sync/CURRENT.md b/sync/CURRENT.md
index 426246d..d528f8e 100644
--- a/sync/CURRENT.md
+++ b/sync/CURRENT.md
@@ -1,9 +1,33 @@
 # Current TIP Sync State
 
-Updated: 2026-05-10 07:38 UTC
+Updated: 2026-05-10 07:54 UTC
 
 ## Newest Work
 
+- MAGATAMA LLM training-pool research expansion on 2026-05-10 UTC:
+  - added a new curated external source ingest file in Magatama:
+    - `training-data/model-registry/external-ingest/llm-lane-research-seeds-2026-05-10.jsonl`
+    - `55` source metadata records across MAGATAMA, FO_BlogLLM, TIP_LLM, PulsoLLM and ContactLLM
+  - added lane-specific curated training/eval supplements:
+    - `magatamallm`: `3 train`, `1 eval`
+    - `fo_blogllm`: `3 train`, `1 eval`
+    - `tip_llm`: `3 train`, `1 eval`
+    - `pulso_llm`: `3 train`, `1 eval`
+    - `contact_llm`: `3 train`, `1 eval`
+  - changed `scripts/runpod_dataset_builder.ts` so every lane automatically picks up supplemental `*.train.jsonl` and `*.valid.jsonl` files in its Gitea learning-pool directory
+  - rebuilt RunPod datasets and model registry locally:
+    - `magatamallm`: `1396 train / 156 eval / 1552 total`
+    - `fo_blogllm`: `17357 train / 1931 eval / 19288 total`
+    - `tip_llm`: `303 train / 35 eval / 338 total`
+    - `pulso_llm`: `54 train / 8 eval / 62 total`
+    - `contact_llm`: `33 train / 6 eval / 39 total`
+  - policy decisions:
+    - no bulk copying of third-party blogs/vendor docs into training pools
+    - use official/OSS/web sources as metadata, provenance, crawler planning, eval, and original SFT behavior examples
+    - TIPLLM remains crawler/research/parser lane
+    - PulsoLLM shares network/transceiver/switch knowledge core but stays customer/support/quote behavior lane
+    - ContactLLM must preserve provenance and avoid private-data overreach
+
 - TIP active-base cleanup continuation on 2026-05-10 UTC:
   - fixed FS.com category leakage:
     - new FS.com `/c/` category/landing rows quarantined
diff --git a/sync/history/2026-05-10-llm-training-pool-research-expansion.md b/sync/history/2026-05-10-llm-training-pool-research-expansion.md
new file mode 100644
index 0000000..b0e4146
--- /dev/null
+++ b/sync/history/2026-05-10-llm-training-pool-research-expansion.md
@@ -0,0 +1,59 @@
+# LLM Training Pool Research Expansion
+
+Date: 2026-05-10 UTC
+Owner: Codex
+
+## Summary
+
+Codex expanded MAGATAMA's training pools with curated research input for all five active LLM lanes.
+
+The work intentionally avoided copying third-party article/vendor-document bodies. Instead, it added source metadata, provenance-safe source seeds, crawler/evaluation policies, and original SFT examples that teach the desired behavior.
+
+## Files Added In Magatama
+
+- `training-data/model-registry/external-ingest/llm-lane-research-seeds-2026-05-10.jsonl`
+- `training-data/gitea-learning-pool/magatamallm/curated-web-research-2026-05-10.train.jsonl`
+- `training-data/gitea-learning-pool/magatamallm/curated-web-research-2026-05-10.valid.jsonl`
+- `training-data/gitea-learning-pool/fo_blogllm/curated-web-research-2026-05-10.train.jsonl`
+- `training-data/gitea-learning-pool/fo_blogllm/curated-web-research-2026-05-10.valid.jsonl`
+- `training-data/gitea-learning-pool/tip_llm/curated-web-research-2026-05-10.train.jsonl`
+- `training-data/gitea-learning-pool/tip_llm/curated-web-research-2026-05-10.valid.jsonl`
+- `training-data/gitea-learning-pool/pulso_llm/curated-web-research-2026-05-10.train.jsonl`
+- `training-data/gitea-learning-pool/pulso_llm/curated-web-research-2026-05-10.valid.jsonl`
+- `training-data/gitea-learning-pool/contact_llm/curated-web-research-2026-05-10.train.jsonl`
+- `training-data/gitea-learning-pool/contact_llm/curated-web-research-2026-05-10.valid.jsonl`
+
+## Builder Change
+
+`scripts/runpod_dataset_builder.ts` now automatically reads supplemental `*.train.jsonl` and `*.valid.jsonl` files from each lane's Gitea learning-pool directory.
+
+This means future curated research drops can be added as small lane-specific files without manually editing huge `train.jsonl`/`valid.jsonl` files.
+
+## Rebuilt Dataset Counts
+
+- `magatamallm`: `1396 train / 156 eval / 1552 total`
+- `fo_blogllm`: `17357 train / 1931 eval / 19288 total`
+- `tip_llm`: `303 train / 35 eval / 338 total`
+- `pulso_llm`: `54 train / 8 eval / 62 total`
+- `contact_llm`: `33 train / 6 eval / 39 total`
+
+## Lane Policy Notes
+
+- `magatamallm`: cybersecurity, AI security, infrastructure security, proof-before-close, safe remediation, artifact-gated training adoption.
+- `fo_blogllm`: Rene/Flexoptix technical and founder voice, source-backed blogs, no copied source bodies, no fabricated numbers.
+- `tip_llm`: crawler, scraper, parser, robots, search patterns, switch/transceiver/vendor/advisory/forum research.
+- `pulso_llm`: customer-facing Flexoptix/switch/transceiver support, product planning, troubleshooting and quote preparation; never invent SKU, price, stock, warranty or compatibility.
+- `contact_llm`: public business/contact research with source provenance, schema.org/RDAP/PeeringDB/security.txt awareness, robots/privacy guardrails.
+
+## Verification
+
+- JSONL validation passed for all new files.
+- `pnpm exec tsx scripts/runpod_dataset_builder.ts` passed outside sandbox after `tsx` IPC needed host permissions.
+- `pnpm exec tsx scripts/model_registry_build.ts` passed.
+- `git diff --check` passed.
+
+## Follow-Up
+
+- Push Magatama training-pool changes to Gitea.
+- Deploy or pull the updated builder and training-data on Erik before starting the next training run.
+- For the next RunPod training run, verify artifact adoption and version bump per strict success rule.
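
Review note on the builder change: this diff does not include the body of `scripts/runpod_dataset_builder.ts`, so the following is only a minimal sketch of how supplemental `*.train.jsonl` and `*.valid.jsonl` files in a lane directory could be discovered, validated, and merged. The names `splitOf`, `parseJsonl`, and `collectSupplemental` are hypothetical and are not taken from the real script.

```typescript
import { readdirSync, readFileSync } from "node:fs";
import { join } from "node:path";

type Split = "train" | "valid";

// Hypothetical helper: classify a file name by its supplemental suffix.
// The base files `train.jsonl` / `valid.jsonl` do not match, because the
// suffix checked here includes the leading dot.
function splitOf(fileName: string): Split | null {
  if (fileName.endsWith(".train.jsonl")) return "train";
  if (fileName.endsWith(".valid.jsonl")) return "valid";
  return null;
}

// Hypothetical helper: parse JSONL text and fail loudly on a malformed line,
// so a bad supplement is caught before it reaches a training run.
function parseJsonl(text: string, label: string): unknown[] {
  const records: unknown[] = [];
  text.split(/\r?\n/).forEach((line, index) => {
    if (line.trim() === "") return; // tolerate blank lines and a trailing newline
    try {
      records.push(JSON.parse(line));
    } catch {
      throw new Error(`${label}:${index + 1}: invalid JSON`);
    }
  });
  return records;
}

// Hypothetical helper: gather every supplemental record in one lane directory.
function collectSupplemental(laneDir: string): Record<Split, unknown[]> {
  const out: Record<Split, unknown[]> = { train: [], valid: [] };
  // Sort file names so the merged order is deterministic across runs.
  for (const name of readdirSync(laneDir).sort()) {
    const split = splitOf(name);
    if (split === null) continue;
    const text = readFileSync(join(laneDir, name), "utf8");
    out[split].push(...parseJsonl(text, name));
  }
  return out;
}
```

Under these assumptions, calling `collectSupplemental` on a lane directory such as `training-data/gitea-learning-pool/tip_llm` would return the supplemental records grouped by split, ready to be appended to whatever the builder already loads from the base files.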