
Rene Fichtmueller b8ec33a09b init: TIPLLM training data repository structure

Auto-generated training data from TIP intelligent crawlers.
Crawler → LLM extraction → Validation → SFT pairs → Fine-tuning → Smarter TIPLLM

2026-04-28 23:36:13 +02:00

crawl-extractions
qa-pairs
stats
validated-specs
README.md

README.md

TIP Training Data

Auto-generated training dataset for TIPLLM fine-tuning, produced by the TIP Intelligent Crawlers. Records are validated, structured, and SFT-ready.

Structure

| Directory | Content |
|---|---|
| `crawl-extractions/` | Raw LLM extractions from vendor product pages (JSONL) |
| `validated-specs/` | Validated transceiver specs with confidence ≥ 0.7 (JSONL) |
| `qa-pairs/` | SFT question-answer training pairs (JSONL) |
| `raw-html/` | Cached HTML snippets for offline re-training (gzipped) |
| `stats/` | Dataset statistics and coverage reports |
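
The 0.7 confidence cut-off for `validated-specs/` can be illustrated with a small filter. This is a minimal sketch, not the crawler's actual validation step: it assumes that each extraction record carries a numeric `confidence` field like the SFT records do, and the function name `filter_validated` and the sample part numbers are hypothetical.

```python
import json
from typing import Iterable, Iterator

CONFIDENCE_THRESHOLD = 0.7  # threshold stated for validated-specs/


def filter_validated(
    lines: Iterable[str], threshold: float = CONFIDENCE_THRESHOLD
) -> Iterator[dict]:
    """Yield parsed JSONL records whose confidence meets the threshold."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines in the JSONL input
        record = json.loads(line)
        # Assumption: a record without a confidence value is treated as 0.0.
        if float(record.get("confidence", 0.0)) >= threshold:
            yield record


# Hypothetical sample records, for illustration only.
sample = [
    '{"part_number": "EXAMPLE-A", "confidence": 0.92}',
    '{"part_number": "EXAMPLE-B", "confidence": 0.41}',
    '{"part_number": "EXAMPLE-C", "confidence": 0.7}',
]
kept = [r["part_number"] for r in filter_validated(sample)]
print(kept)  # ['EXAMPLE-A', 'EXAMPLE-C']
```

The comparison is inclusive, matching the "≥ 0.7" wording in the table.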

SFT Format

Each line in `qa-pairs/` is a single JSON object in the SFT format below (pretty-printed here for readability):

{
  "id": "uuid",
  "source": "crawler:vendor-name:url",
  "kind": "sft-jsonl",
  "crawled_at": "2026-04-28T...",
  "confidence": 0.92,
  "messages": [
    {"role": "system", "content": "You are TIP_LLM..."},
    {"role": "user", "content": "Extract transceiver specs from: ..."},
    {"role": "assistant", "content": "{\"part_number\": \"...\", ...}"}
  ]
}
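
A record in this shape can be checked before it is used for training. The sketch below is one possible validator, not part of the repository: the name `validate_sft_line` is hypothetical, and it assumes that `confidence` lies in [0, 1], that messages always appear in system, user, assistant order, and that the assistant content is itself a JSON string, as in the example above.

```python
import json

REQUIRED_KEYS = {"id", "source", "kind", "crawled_at", "confidence", "messages"}
EXPECTED_ROLES = ["system", "user", "assistant"]  # assumed fixed order


def validate_sft_line(line: str) -> list[str]:
    """Return a list of problems found in one JSONL line (empty if none)."""
    try:
        record = json.loads(line)
    except json.JSONDecodeError as exc:
        return [f"not valid JSON: {exc.msg}"]
    if not isinstance(record, dict):
        return ["top-level value is not an object"]

    problems = []
    missing = REQUIRED_KEYS - record.keys()
    if missing:
        problems.append(f"missing keys: {sorted(missing)}")

    confidence = record.get("confidence")
    if not isinstance(confidence, (int, float)) or not 0.0 <= confidence <= 1.0:
        problems.append("confidence must be a number in [0, 1]")

    messages = record.get("messages")
    if not isinstance(messages, list) or not all(
        isinstance(m, dict) for m in messages
    ):
        problems.append("messages must be a list of objects")
        return problems

    roles = [m.get("role") for m in messages]
    if roles != EXPECTED_ROLES:
        problems.append(f"unexpected roles: {roles}")
    else:
        # Assumption: the assistant turn holds the extracted specs as JSON text.
        try:
            json.loads(messages[-1].get("content", ""))
        except (TypeError, json.JSONDecodeError):
            problems.append("assistant content is not valid JSON")
    return problems


# Hypothetical record for illustration.
good = json.dumps({
    "id": "example-id",
    "source": "crawler:vendor-a:https://example.com/p1",
    "kind": "sft-jsonl",
    "crawled_at": "2026-04-28T00:00:00Z",
    "confidence": 0.92,
    "messages": [
        {"role": "system", "content": "system prompt"},
        {"role": "user", "content": "Extract transceiver specs from: sample"},
        {"role": "assistant", "content": json.dumps({"part_number": "EXAMPLE-A"})},
    ],
})
bad = '{"id": "x"}'
print(validate_sft_line(good))
print(validate_sft_line(bad))
```

Returning a list of problems rather than raising on the first one makes it easier to report every issue in a file in a single pass.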

Stats

The reports in `stats/` are updated automatically after each crawler run.
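
The README does not specify what the statistics contain. As a rough sketch of what a coverage summary over `qa-pairs/` records might compute, the following counts records per vendor and averages confidence. The function name `summarize`, the output keys, and the sample sources are all hypothetical; the vendor is parsed from the `source` field assuming the `crawler:vendor-name:url` layout shown in the SFT format.

```python
import json
from collections import Counter
from statistics import mean


def summarize(lines):
    """Build a small summary dict from an iterable of JSONL lines."""
    vendors = Counter()
    confidences = []
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        record = json.loads(line)
        # The URL part contains colons itself, so split at most twice.
        parts = record.get("source", "").split(":", 2)
        vendor = parts[1] if len(parts) == 3 else "unknown"
        vendors[vendor] += 1
        confidences.append(float(record["confidence"]))
    return {
        "records": len(confidences),
        "mean_confidence": round(mean(confidences), 3) if confidences else None,
        "by_vendor": dict(vendors),
    }


# Hypothetical records for illustration.
sample = [
    '{"source": "crawler:vendor-a:https://example.com/p1", "confidence": 0.9}',
    '{"source": "crawler:vendor-a:https://example.com/p2", "confidence": 0.8}',
    '{"source": "crawler:vendor-b:https://example.org/x", "confidence": 0.7}',
]
summary = summarize(sample)
print(summary)
```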