Rene Fichtmueller b8ec33a09b init: TIPLLM training data repository structure
Auto-generated training data from TIP intelligent crawlers.
Crawler → LLM extraction → Validation → SFT pairs → Fine-tuning → Smarter TIPLLM
2026-04-28 23:36:13 +02:00

TIP Training Data

Auto-generated training dataset for TIPLLM fine-tuning, produced by the TIP Intelligent Crawlers. Records are validated, structured, and SFT-ready.

Structure

| Directory | Content |
|-----------|---------|
| crawl-extractions/ | Raw LLM extractions from vendor product pages (JSONL) |
| validated-specs/ | Validated transceiver specs with confidence ≥ 0.7 (JSONL) |
| qa-pairs/ | SFT question-answer training pairs (JSONL) |
| raw-html/ | Cached HTML snippets for offline re-training (gzipped) |
| stats/ | Dataset statistics and coverage reports |
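
The confidence cutoff for validated-specs/ can be illustrated with a small filter. This is a minimal sketch, not the crawler's actual validation code: it assumes each extraction record is a JSON object with a top-level `confidence` field (as the SFT records have), and the function name `filter_validated` is hypothetical.

```python
import json
from typing import Iterable, Iterator

# Threshold taken from the validated-specs/ description (confidence >= 0.7).
CONFIDENCE_THRESHOLD = 0.7


def filter_validated(
    lines: Iterable[str], threshold: float = CONFIDENCE_THRESHOLD
) -> Iterator[dict]:
    """Yield JSONL records whose confidence meets the threshold.

    Assumption: every record carries a numeric top-level "confidence" key.
    Records without one are treated as confidence 0.0 and dropped.
    """
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between records
        record = json.loads(line)
        if record.get("confidence", 0.0) >= threshold:
            yield record
```

Because the function accepts any iterable of strings, it can be fed an open file handle for a JSONL file in crawl-extractions/ or an in-memory list during testing.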

SFT Format

Each line of a JSONL file in qa-pairs/ is one JSON object in the SFT format (shown pretty-printed here for readability):

{
  "id": "uuid",
  "source": "crawler:vendor-name:url",
  "kind": "sft-jsonl",
  "crawled_at": "2026-04-28T...",
  "confidence": 0.92,
  "messages": [
    {"role": "system", "content": "You are TIP_LLM..."},
    {"role": "user", "content": "Extract transceiver specs from: ..."},
    {"role": "assistant", "content": "{\"part_number\": \"...\", ...}"}
  ]
}
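
A record in this shape can be sanity-checked before it is used for fine-tuning. The following is a minimal sketch under stated assumptions: the required keys and the system/user/assistant role order are read off the example above, the [0, 1] range for `confidence` is assumed, and `validate_sft_line` is a hypothetical helper rather than part of the TIP tooling.

```python
import json

# Keys and role order as they appear in the example record above.
REQUIRED_KEYS = {"id", "source", "kind", "crawled_at", "confidence", "messages"}
EXPECTED_ROLES = ["system", "user", "assistant"]


def validate_sft_line(line: str) -> list[str]:
    """Return a list of problems found in one JSONL line (empty if none)."""
    try:
        record = json.loads(line)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc}"]
    if not isinstance(record, dict):
        return ["record is not a JSON object"]

    problems = []
    missing = REQUIRED_KEYS - record.keys()
    if missing:
        problems.append(f"missing keys: {sorted(missing)}")

    # Assumption: confidence is a number in [0, 1]; booleans are rejected.
    confidence = record.get("confidence")
    is_number = isinstance(confidence, (int, float)) and not isinstance(confidence, bool)
    if not is_number or not 0.0 <= confidence <= 1.0:
        problems.append("confidence must be a number in [0, 1]")

    # Assumption: exactly one system, user, and assistant message, in that order.
    messages = record.get("messages")
    if not isinstance(messages, list):
        messages = []
    roles = [m.get("role") for m in messages if isinstance(m, dict)]
    if roles != EXPECTED_ROLES:
        problems.append(f"unexpected roles: {roles}")

    return problems
```

Returning a list of problems instead of raising on the first one makes it easier to report every issue in a file in a single pass.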

Stats

Updated automatically after each crawler run.
