init: TIPLLM training data repository structure
Auto-generated training data from TIP intelligent crawlers. Crawler → LLM extraction → Validation → SFT pairs → Fine-tuning → Smarter TIPLLM
commit b8ec33a09b
README.md
@@ -0,0 +1,36 @@
# TIP Training Data

Auto-generated training dataset for TIPLLM fine-tuning.
Generated by TIP Intelligent Crawlers: validated, structured, SFT-ready.

## Structure

| Directory | Content |
|-----------|---------|
| `crawl-extractions/` | Raw LLM extractions from vendor product pages (JSONL) |
| `validated-specs/` | Validated transceiver specs with confidence ≥ 0.7 (JSONL) |
| `qa-pairs/` | SFT question-answer training pairs (JSONL) |
| `raw-html/` | Cached HTML snippets for offline re-training (gzipped) |
| `stats/` | Dataset statistics and coverage reports |
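
The `validated-specs/` threshold in the table above could be applied with a filter along these lines. This is a minimal sketch, not the crawler's actual validation step: it assumes each extraction record carries a numeric `confidence` field (as the SFT records do), and the name `filter_validated` is hypothetical.

```python
import json

CONFIDENCE_THRESHOLD = 0.7  # from the table: validated specs have confidence >= 0.7


def filter_validated(jsonl_lines, threshold=CONFIDENCE_THRESHOLD):
    """Yield parsed records whose confidence meets the threshold.

    Assumes one JSON object per line with a numeric "confidence" field;
    blank lines and records without a usable score are skipped.
    """
    for line in jsonl_lines:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        score = record.get("confidence")
        is_number = isinstance(score, (int, float)) and not isinstance(score, bool)
        if is_number and score >= threshold:
            yield record


# Illustrative records only; the part numbers are placeholders.
sample = [
    '{"part_number": "A", "confidence": 0.92}',
    '{"part_number": "B", "confidence": 0.55}',
    '',
    '{"part_number": "C", "confidence": 0.7}',
]
kept = [record["part_number"] for record in filter_validated(sample)]
print(kept)
```

The comparison is inclusive, so a record scoring exactly 0.7 is kept, matching the "≥ 0.7" wording in the table.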

## SFT Format

Each line of a JSONL file in `qa-pairs/` is one record in the SFT format below (pretty-printed here; in the files each record occupies a single line):
```json
{
  "id": "uuid",
  "source": "crawler:vendor-name:url",
  "kind": "sft-jsonl",
  "crawled_at": "2026-04-28T...",
  "confidence": 0.92,
  "messages": [
    {"role": "system", "content": "You are TIP_LLM..."},
    {"role": "user", "content": "Extract transceiver specs from: ..."},
    {"role": "assistant", "content": "{\"part_number\": \"...\", ...}"}
  ]
}
```
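
A record in this shape can be sanity-checked before it is used for fine-tuning. The sketch below is one possible check, not part of the repository: the function name `check_sft_line` is hypothetical, the required keys are taken from the example above, and the rule that a pair must end with an assistant message is an assumption.

```python
import json

REQUIRED_KEYS = ("id", "source", "kind", "crawled_at", "confidence", "messages")
ALLOWED_ROLES = {"system", "user", "assistant"}


def check_sft_line(line):
    """Return a list of problems found in one JSONL line (empty list means OK)."""
    try:
        record = json.loads(line)
    except json.JSONDecodeError as exc:
        return [f"not valid JSON: {exc.msg}"]
    if not isinstance(record, dict):
        return ["top-level value is not an object"]

    problems = [f"missing key: {key}" for key in REQUIRED_KEYS if key not in record]

    messages = record.get("messages")
    if not isinstance(messages, list) or not messages:
        problems.append("messages must be a non-empty list")
        return problems

    for i, msg in enumerate(messages):
        if not isinstance(msg, dict):
            problems.append(f"messages[{i}] is not an object")
        elif msg.get("role") not in ALLOWED_ROLES:
            problems.append(f"messages[{i}] has unexpected role: {msg.get('role')!r}")
        elif not isinstance(msg.get("content"), str):
            problems.append(f"messages[{i}] content is not a string")

    # Assumption: a training pair should end with the assistant's answer.
    last = messages[-1]
    if isinstance(last, dict) and last.get("role") != "assistant":
        problems.append("last message is not from the assistant")
    return problems


# Placeholder values for illustration only.
good = json.dumps({
    "id": "example-id",
    "source": "crawler:example-vendor:example-url",
    "kind": "sft-jsonl",
    "crawled_at": "2026-04-28T00:00:00Z",
    "confidence": 0.92,
    "messages": [
        {"role": "system", "content": "system prompt"},
        {"role": "user", "content": "question"},
        {"role": "assistant", "content": "{\"part_number\": \"X\"}"},
    ],
})
bad = json.dumps({"id": "example-id", "messages": [{"role": "user", "content": "question"}]})

print(check_sft_line(good))
print(check_sft_line(bad))
```

Returning a list of problems instead of raising on the first one makes it easier to report every defect in a crawler run at once.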

## Stats

Updated automatically after each crawler run.
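
As an illustration of what such an update might compute, the sketch below rebuilds the fields of `stats/dataset-stats.json` from in-memory records. Several things here are assumptions: the source does not define the high/medium/low boundaries (0.9 and 0.7 are placeholders), it does not say which records feed `confidence_distribution` and `vendors_covered` (extraction records are used here), and the helper names `bucket`, `vendor_of`, and `build_stats` are hypothetical.

```python
import json
from datetime import datetime, timezone

# Assumed bucket boundaries; the source does not define high/medium/low.
HIGH_MIN = 0.9
MEDIUM_MIN = 0.7


def bucket(confidence):
    """Map a confidence score to a bucket name."""
    if confidence >= HIGH_MIN:
        return "high"
    if confidence >= MEDIUM_MIN:
        return "medium"
    return "low"


def vendor_of(source):
    """Pull the vendor out of a 'crawler:vendor-name:url' source string."""
    parts = source.split(":", 2)  # maxsplit=2 keeps colons inside the URL intact
    return parts[1] if len(parts) == 3 and parts[0] == "crawler" else None


def build_stats(extractions, validated_specs, qa_pairs, now=None):
    """Build a dict with the same keys as stats/dataset-stats.json."""
    distribution = {"high": 0, "medium": 0, "low": 0}
    vendors = set()
    for record in extractions:
        distribution[bucket(record["confidence"])] += 1
        vendor = vendor_of(record.get("source", ""))
        if vendor:
            vendors.add(vendor)
    now = now or datetime.now(timezone.utc)  # expects a timezone-aware datetime
    return {
        "total_extractions": len(extractions),
        "validated_specs": len(validated_specs),
        "qa_pairs": len(qa_pairs),
        "vendors_covered": sorted(vendors),
        "confidence_distribution": distribution,
        "last_updated": now.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
    }


# Illustrative records only; vendors and URLs are placeholders.
extractions = [
    {"source": "crawler:vendor-a:https://example.com/p/1", "confidence": 0.95},
    {"source": "crawler:vendor-b:https://example.com/p/2", "confidence": 0.75},
    {"source": "crawler:vendor-a:https://example.com/p/3", "confidence": 0.40},
]
stats = build_stats(
    extractions,
    validated_specs=extractions[:2],
    qa_pairs=extractions[:1],
    now=datetime(2026, 4, 28, tzinfo=timezone.utc),
)
print(json.dumps(stats, indent=2))
```

Sorting `vendors_covered` keeps the file stable between runs, so diffs of the stats file only change when the data does.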

crawl-extractions/.gitkeep
@@ -0,0 +1 @@
# crawl-extractions — raw LLM extractions from vendor product pages

qa-pairs/.gitkeep
@@ -0,0 +1 @@
# qa-pairs — SFT training pairs for TIPLLM fine-tuning

stats/dataset-stats.json
@@ -0,0 +1,8 @@
{
  "total_extractions": 0,
  "validated_specs": 0,
  "qa_pairs": 0,
  "vendors_covered": [],
  "confidence_distribution": {"high": 0, "medium": 0, "low": 0},
  "last_updated": "2026-04-28T00:00:00Z"
}

validated-specs/.gitkeep
@@ -0,0 +1 @@
# validated-specs — confidence >= 0.7 validated transceiver specs