TIP Training Data
Auto-generated training dataset for TIPLLM fine-tuning, produced by TIP Intelligent Crawlers. Records are validated, structured, and SFT-ready.
Structure
| Directory | Content |
|---|---|
| crawl-extractions/ | Raw LLM extractions from vendor product pages (JSONL) |
| validated-specs/ | Validated transceiver specs with confidence ≥ 0.7 (JSONL) |
| qa-pairs/ | SFT question-answer training pairs (JSONL) |
| raw-html/ | Cached HTML snippets for offline re-training (gzipped) |
| stats/ | Dataset statistics and coverage reports |
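
The 0.7 confidence cut-off for validated-specs/ can be illustrated with a small filter. This is only a sketch: it assumes each JSONL record carries a numeric `confidence` field like the one shown in the QA-pair format, and the function name `filter_validated` is hypothetical, not part of the crawler code.

```python
import json
from typing import Iterable, Iterator

# Threshold taken from the table above (confidence >= 0.7).
CONFIDENCE_THRESHOLD = 0.7


def filter_validated(
    lines: Iterable[str], threshold: float = CONFIDENCE_THRESHOLD
) -> Iterator[dict]:
    """Yield parsed JSONL records whose confidence meets the threshold.

    Assumption: every record has a numeric "confidence" key; records
    without one are treated as confidence 0.0 and dropped.
    """
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines between records
        record = json.loads(line)
        if record.get("confidence", 0.0) >= threshold:
            yield record
```

Because the function accepts any iterable of strings, it can be fed an open file handle from crawl-extractions/ directly, for example `filter_validated(open(path, encoding="utf-8"))`, where `path` is whichever JSONL file you want to screen.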
SFT Format
Each JSONL line in qa-pairs/ follows the SFT format:
{
  "id": "uuid",
  "source": "crawler:vendor-name:url",
  "kind": "sft-jsonl",
  "crawled_at": "2026-04-28T...",
  "confidence": 0.92,
  "messages": [
    {"role": "system", "content": "You are TIP_LLM..."},
    {"role": "user", "content": "Extract transceiver specs from: ..."},
    {"role": "assistant", "content": "{\"part_number\": \"...\", ...}"}
  ]
}
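
A loader for these records might check the structure before training. The sketch below is an illustration under stated assumptions: the required keys and the system/user/assistant role order are inferred from the example above, the assistant `content` is assumed to be a JSON-encoded string in real records, and `validate_qa_record` is a hypothetical helper name.

```python
import json

# Keys and role order inferred from the example record above (assumption).
REQUIRED_KEYS = {"id", "source", "kind", "crawled_at", "confidence", "messages"}
EXPECTED_ROLES = ["system", "user", "assistant"]


def validate_qa_record(line: str) -> dict:
    """Parse one qa-pairs JSONL line and return the decoded assistant answer.

    Raises ValueError if a required key is missing or the message roles
    are not system, user, assistant in that order.
    """
    record = json.loads(line)

    missing = REQUIRED_KEYS - record.keys()
    if missing:
        raise ValueError(f"missing keys: {sorted(missing)}")

    roles = [message.get("role") for message in record["messages"]]
    if roles != EXPECTED_ROLES:
        raise ValueError(f"unexpected role sequence: {roles}")

    # The assistant content is a JSON document stored as a string,
    # so it needs a second decoding step.
    return json.loads(record["messages"][-1]["content"])
```

Returning the decoded answer rather than the whole record keeps the double decoding step explicit; callers that need the metadata can parse the line themselves with `json.loads`.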
Stats
Updated automatically after each crawler run.
Description
TIPLLM training data — crawler extractions, validated specs, SFT QA pairs for fine-tuning