init: TIPLLM training data repository structure

Auto-generated training data from TIP intelligent crawlers. Pipeline: Crawler → LLM extraction → Validation → SFT pairs → Fine-tuning → Smarter TIPLLM
commit b8ec33a09b · 2026-04-28 23:36:13 +02:00
5 changed files with 47 additions and 0 deletions
--- a/README.md
+++ b/README.md
@@ -0,0 +1,36 @@
 # TIP Training Data
 Auto-generated training dataset for TIPLLM fine-tuning.
 Generated by TIP Intelligent Crawlers — validated, structured, SFT-ready.
 ## Structure
 | Directory | Content |
 |-----------|---------|
 | `crawl-extractions/` | Raw LLM extractions from vendor product pages (JSONL) |
 | `validated-specs/` | Validated transceiver specs with confidence ≥ 0.7 (JSONL) |
 | `qa-pairs/` | SFT question-answer training pairs (JSONL) |
 | `raw-html/` | Cached HTML snippets for offline re-training (gzipped) |
 | `stats/` | Dataset statistics and coverage reports |
 ## SFT Format
 Each JSONL line in `qa-pairs/` follows the SFT format:
 ```json
 {
  "id": "uuid",
  "source": "crawler:vendor-name:url",
  "kind": "sft-jsonl",
  "crawled_at": "2026-04-28T...",
  "confidence": 0.92,
  "messages": [
    {"role": "system", "content": "You are TIP_LLM..."},
    {"role": "user", "content": "Extract transceiver specs from: ..."},
    {"role": "assistant", "content": "{\"part_number\": \"...\", ...}"}
  ]
 }
 ```
 ## Stats
 Updated automatically after each crawler run.
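The SFT format described in the README above could be checked per line with something like the following. This is a minimal sketch, not the repository's validator: the function name `validate_sft_line`, the requirement that `confidence` lie in [0, 1], the fixed system/user/assistant role order, and the check that the assistant content parses as JSON are all assumptions inferred from the example record.

```python
import json

# Keys shown in the README's example record.
REQUIRED_KEYS = {"id", "source", "kind", "crawled_at", "confidence", "messages"}
# Assumption: every record carries exactly these three roles, in this order.
EXPECTED_ROLES = ["system", "user", "assistant"]


def validate_sft_line(line: str) -> list[str]:
    """Return the problems found in one qa-pairs JSONL line (empty list if none)."""
    try:
        record = json.loads(line)
    except json.JSONDecodeError as exc:
        return [f"not valid JSON: {exc.msg}"]
    if not isinstance(record, dict):
        return ["record is not a JSON object"]

    problems = []
    missing = REQUIRED_KEYS - record.keys()
    if missing:
        problems.append(f"missing keys: {sorted(missing)}")

    # Assumption: confidence is a number in the closed interval [0, 1].
    confidence = record.get("confidence")
    if (
        isinstance(confidence, bool)
        or not isinstance(confidence, (int, float))
        or not 0.0 <= confidence <= 1.0
    ):
        problems.append("confidence must be a number in [0, 1]")

    messages = record.get("messages")
    if not isinstance(messages, list):
        problems.append("messages must be a list")
        return problems

    roles = [m.get("role") if isinstance(m, dict) else None for m in messages]
    if roles != EXPECTED_ROLES:
        problems.append(f"unexpected roles: {roles}")
        return problems

    # Assumption: the assistant reply is itself a JSON-encoded spec object.
    content = messages[-1].get("content")
    if not isinstance(content, str):
        problems.append("assistant content must be a string")
    else:
        try:
            json.loads(content)
        except json.JSONDecodeError:
            problems.append("assistant content is not valid JSON")
    return problems
```

Returning a list of problems rather than raising on the first one lets a crawler run log every defect in a record before deciding whether to drop it.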
--- a/crawl-extractions/.gitkeep
+++ b/crawl-extractions/.gitkeep
@@ -0,0 +1 @@
 # crawl-extractions — raw LLM extractions from vendor product pages
--- a/qa-pairs/.gitkeep
+++ b/qa-pairs/.gitkeep
@@ -0,0 +1 @@
 # qa-pairs — SFT training pairs for TIPLLM fine-tuning
--- a/stats/dataset-stats.json
+++ b/stats/dataset-stats.json
@@ -0,0 +1,8 @@
 {
  "total_extractions": 0,
  "validated_specs": 0,
  "qa_pairs": 0,
  "vendors_covered": [],
  "confidence_distribution": {"high": 0, "medium": 0, "low": 0},
  "last_updated": "2026-04-28T00:00:00Z"
 }
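The fields of `stats/dataset-stats.json` above could be filled after a crawler run roughly as follows. This is a hedged sketch under stated assumptions: `build_stats` is a hypothetical helper, the `high` cutoff of 0.9 is invented for illustration (the commit only defines the 0.7 validation threshold), the distribution is assumed to cover all extractions, and the vendor name is assumed to be the second `:`-separated field of `source`, following the `crawler:vendor-name:url` pattern in the README.

```python
from datetime import datetime, timezone

VALIDATED_CUTOFF = 0.7  # from the README: validated specs have confidence >= 0.7
HIGH_CUTOFF = 0.9       # assumption: not defined anywhere in the commit


def build_stats(extractions, qa_pair_count, now=None):
    """Build a dataset-stats dict from extraction records (hypothetical helper)."""
    buckets = {"high": 0, "medium": 0, "low": 0}
    vendors = set()
    validated = 0

    for record in extractions:
        confidence = record["confidence"]
        if confidence >= HIGH_CUTOFF:
            buckets["high"] += 1
        elif confidence >= VALIDATED_CUTOFF:
            buckets["medium"] += 1
        else:
            buckets["low"] += 1
        if confidence >= VALIDATED_CUTOFF:
            validated += 1

        # "crawler:vendor-name:url" -> split at most twice so the URL's own
        # colons (as in "https://") stay inside the third field.
        parts = record["source"].split(":", 2)
        if len(parts) == 3:
            vendors.add(parts[1])

    now = now or datetime.now(timezone.utc)
    return {
        "total_extractions": len(extractions),
        "validated_specs": validated,
        "qa_pairs": qa_pair_count,
        "vendors_covered": sorted(vendors),
        "confidence_distribution": buckets,
        "last_updated": now.strftime("%Y-%m-%dT%H:%M:%SZ"),
    }
```

Sorting `vendors_covered` keeps the JSON output stable between runs, so the file only shows a diff when coverage actually changes.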
--- a/validated-specs/.gitkeep
+++ b/validated-specs/.gitkeep
@@ -0,0 +1 @@
 # validated-specs — confidence >= 0.7 validated transceiver specs