| Directory | Content |
|---|---|
| `crawl-extractions/` | Raw LLM extractions from vendor product pages (JSONL) |
| `validated-specs/` | Validated transceiver specs with confidence ≥ 0.7 (JSONL) |
| `qa-pairs/` | SFT question-answer training pairs (JSONL) |
| `raw-html/` | Cached HTML snippets for offline re-training (gzipped) |
| `stats/` | Dataset statistics and coverage reports |
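
The 0.7 confidence cutoff for `validated-specs/` can be pictured as a simple filter over JSONL lines. This is a minimal sketch, not the project's actual validation step: it assumes each extraction record carries a top-level `confidence` field like the one in the `qa-pairs/` records, and the function name `filter_by_confidence` is hypothetical.

```python
import json
from typing import Iterable, Iterator

MIN_CONFIDENCE = 0.7  # threshold stated for validated-specs/


def filter_by_confidence(
    lines: Iterable[str], threshold: float = MIN_CONFIDENCE
) -> Iterator[dict]:
    """Yield parsed JSONL records whose confidence is at least `threshold`.

    Assumption: every record has a top-level numeric "confidence" field;
    records without one are treated as confidence 0.0 and dropped.
    """
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines between records
        record = json.loads(line)
        if record.get("confidence", 0.0) >= threshold:
            yield record
```

Because the function accepts any iterable of strings, it can be fed an open file handle directly, so a large extraction file never has to be loaded into memory at once.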

SFT Format

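Each line of a JSONL file in `qa-pairs/` is one JSON object in the SFT format below (pretty-printed here for readability; on disk each record occupies a single line):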

{
  "id": "uuid",
  "source": "crawler:vendor-name:url",
  "kind": "sft-jsonl",
  "crawled_at": "2026-04-28T...",
  "confidence": 0.92,
  "messages": [
    {"role": "system", "content": "You are TIP_LLM..."},
    {"role": "user", "content": "Extract transceiver specs from: ..."},
    {"role": "assistant", "content": "{\"part_number\": \"...\", ...}"}
  ]
}
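
A record in this layout could be read and sanity-checked roughly as follows. This is a sketch under stated assumptions rather than tooling shipped with the dataset: it assumes every record has exactly the three messages shown (system, user, assistant, in that order) and that the assistant content is always a JSON-encoded string. The function name `parse_sft_line` and the added `specs` key are hypothetical.

```python
import json

REQUIRED_KEYS = {"id", "source", "kind", "crawled_at", "confidence", "messages"}
EXPECTED_ROLES = ["system", "user", "assistant"]  # assumed fixed order


def parse_sft_line(line: str) -> dict:
    """Parse one JSONL line and check it against the record layout shown above."""
    record = json.loads(line)
    missing = REQUIRED_KEYS - record.keys()
    if missing:
        raise ValueError(f"missing keys: {sorted(missing)}")
    roles = [message["role"] for message in record["messages"]]
    if roles != EXPECTED_ROLES:
        raise ValueError(f"unexpected message roles: {roles}")
    # The assistant content is itself a JSON string, so it is decoded a second time.
    # "specs" is a convenience key added by this sketch, not part of the dataset.
    record["specs"] = json.loads(record["messages"][-1]["content"])
    return record
```

The second `json.loads` call is the step that is easy to miss: the assistant answer is stored as an escaped JSON string inside the outer JSON object, so it stays a plain string until it is decoded separately.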

Stats

The reports in `stats/` are updated automatically after each crawler run.