# TIP Training Data

Auto-generated training dataset for TIPLLM fine-tuning.
Generated by TIP Intelligent Crawlers: validated, structured, SFT-ready.

## Structure

| Directory | Content |
|-----------|---------|
| `crawl-extractions/` | Raw LLM extractions from vendor product pages (JSONL) |
| `validated-specs/` | Validated transceiver specs with confidence ≥ 0.7 (JSONL) |
| `qa-pairs/` | SFT question-answer training pairs (JSONL) |
| `raw-html/` | Cached HTML snippets for offline re-training (gzipped) |
| `stats/` | Dataset statistics and coverage reports |

## SFT Format

Each JSONL line in `qa-pairs/` follows the SFT format:

```json
{
  "id": "uuid",
  "source": "crawler:vendor-name:url",
  "kind": "sft-jsonl",
  "crawled_at": "2026-04-28T...",
  "confidence": 0.92,
  "messages": [
    {"role": "system", "content": "You are TIP_LLM..."},
    {"role": "user", "content": "Extract transceiver specs from: ..."},
    {"role": "assistant", "content": "{\"part_number\": \"...\", ...}"}
  ]
}
```

## Stats

Updated automatically after each crawler run.
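
As a rough illustration of the record layout described under SFT Format, the sketch below checks a single `qa-pairs/` line before it is used for training. It is not part of the crawler tooling: `check_record` is a hypothetical helper, and treating every field of the example record as required, expecting the `system`, `user`, `assistant` role order, and bounding `confidence` to [0, 1] are all assumptions drawn from the example rather than documented rules.

```python
import json

# Field names come from the example record in "SFT Format". Treating all of
# them as required is an assumption of this sketch, not a documented rule.
REQUIRED_FIELDS = ("id", "source", "kind", "crawled_at", "confidence", "messages")
# Assumed role order, mirroring the example record.
EXPECTED_ROLES = ["system", "user", "assistant"]


def check_record(line: str) -> list[str]:
    """Return a list of problems found in one JSONL line (empty if none)."""
    try:
        record = json.loads(line)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc.msg}"]
    if not isinstance(record, dict):
        return ["record is not a JSON object"]

    problems = [f"missing field: {name}" for name in REQUIRED_FIELDS if name not in record]

    if "confidence" in record:
        value = record["confidence"]
        is_number = isinstance(value, (int, float)) and not isinstance(value, bool)
        if not is_number or not 0.0 <= value <= 1.0:
            problems.append("confidence is not a number in [0, 1]")

    if "messages" in record:
        messages = record["messages"]
        if not isinstance(messages, list) or not all(isinstance(m, dict) for m in messages):
            problems.append("messages is not a list of objects")
        elif [m.get("role") for m in messages] != EXPECTED_ROLES:
            problems.append("messages roles are not system, user, assistant")

    return problems


if __name__ == "__main__":
    # Placeholder values only; real records are produced by the crawlers.
    sample = json.dumps({
        "id": "uuid",
        "source": "crawler:vendor-name:url",
        "kind": "sft-jsonl",
        "crawled_at": "2026-04-28T...",
        "confidence": 0.92,
        "messages": [
            {"role": "system", "content": "You are TIP_LLM..."},
            {"role": "user", "content": "Extract transceiver specs from: ..."},
            {"role": "assistant", "content": "{}"},
        ],
    })
    print(check_record(sample))
```

Returning a list of problems instead of raising on the first one makes it easier to report every issue in a line at once, which may be convenient when scanning a whole JSONL file.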