# TIP Crawlee Python Worker

Optional Python crawler worker for Pi/Proxmox/residential nodes.

The TypeScript scraper package remains the production crawler core. This package
exists for isolated worker experiments where Python extraction libraries are a
better fit. It writes JSONL artifacts; it does not write directly to TIP
PostgreSQL.

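To illustrate what "writes JSONL artifacts" means in practice, here is a minimal sketch of appending records to a JSONL file, one JSON object per line. The function name `append_jsonl` and the `url`/`title` fields in the usage note are hypothetical; the worker's real record schema is not described here.

```python
import json
from pathlib import Path
from typing import Any, Iterable, Mapping


def append_jsonl(path: Path, records: Iterable[Mapping[str, Any]]) -> int:
    """Append each record as one JSON object per line; return the count written."""
    path.parent.mkdir(parents=True, exist_ok=True)
    written = 0
    with path.open("a", encoding="utf-8") as fh:
        for record in records:
            # ensure_ascii=False keeps non-ASCII page text readable in the artifact.
            fh.write(json.dumps(dict(record), ensure_ascii=False) + "\n")
            written += 1
    return written
```

For example, `append_jsonl(Path("/tmp/example.jsonl"), [{"url": "https://crawlee.dev", "title": "Crawlee"}])` would add one line to the file. Appending keeps earlier evidence intact, and nothing in this path touches a database.
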
## Install

```bash
cd packages/crawlee-python
python3 -m venv .venv
. .venv/bin/activate
python -m pip install -U pip
python -m pip install -e ".[beautifulsoup]"
```

For browser-based Python workers:

```bash
python -m pip install -e ".[playwright]"
python -m playwright install chromium
```

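After installing, a quick check that the optional dependencies are importable can save a failed run. The sketch below assumes the extras pull in the top-level modules `crawlee`, `bs4`, and `playwright`; that mapping is an assumption, not something this README states.

```python
from importlib.util import find_spec


def installed_modules(names=("crawlee", "bs4", "playwright")) -> dict[str, bool]:
    """Map each top-level module name to whether it can be found in this environment."""
    # The default module names are assumptions about what the extras install.
    return {name: find_spec(name) is not None for name in names}


if __name__ == "__main__":
    for name, present in installed_modules().items():
        print(f"{name}: {'ok' if present else 'missing'}")
```

Run it inside the activated `.venv`; a `missing` entry for `playwright` is expected if only the `beautifulsoup` extra was installed.
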
## Smoke Run

```bash
python -m tip_crawlee_worker \
  --mode beautifulsoup \
  --url https://crawlee.dev \
  --out /tmp/tip-crawlee-python-smoke.jsonl \
  --max-requests 1
```

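Once the smoke run finishes, the output file can be inspected without assuming any particular schema. The following sketch counts records and tallies the top-level keys it finds; `summarize_jsonl` is a hypothetical helper, not part of `tip_crawlee_worker`.

```python
import json
from collections import Counter
from pathlib import Path


def summarize_jsonl(path: Path) -> tuple[int, Counter]:
    """Return (record count, how often each top-level key appears)."""
    count = 0
    keys: Counter = Counter()
    with path.open(encoding="utf-8") as fh:
        for line in fh:
            if not line.strip():
                continue  # tolerate blank lines
            record = json.loads(line)
            count += 1
            if isinstance(record, dict):
                keys.update(record.keys())
    return count, keys
```

Calling `summarize_jsonl(Path("/tmp/tip-crawlee-python-smoke.jsonl"))` after the command above should report at most one record, given `--max-requests 1`.
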
## TIP Policy

- Use this on Pi/Proxmox/residential nodes first, not as an Erik-heavy crawler.
- Keep output as JSONL evidence until a deterministic importer validates it.
- Record useful crawler outcomes in the TIPLLM training pool.
- Use TIPLLM only for planning/extraction feedback; no external AI.
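
As a rough illustration of the "deterministic importer validates it" step, the sketch below checks a JSONL file line by line and reports problems in file order, so the same input always yields the same report. The `REQUIRED_FIELDS` tuple and the function name are hypothetical; the real importer's schema is not documented here.

```python
import json
from pathlib import Path

# Hypothetical requirement; replace with the importer's real schema.
REQUIRED_FIELDS = ("url",)


def validate_jsonl(path: Path, required=REQUIRED_FIELDS) -> list[tuple[int, str]]:
    """Return (line number, message) pairs for every line that fails validation."""
    problems: list[tuple[int, str]] = []
    with path.open(encoding="utf-8") as fh:
        for lineno, line in enumerate(fh, start=1):
            if not line.strip():
                continue  # blank lines carry no record
            try:
                record = json.loads(line)
            except json.JSONDecodeError as exc:
                problems.append((lineno, f"invalid JSON: {exc.msg}"))
                continue
            if not isinstance(record, dict):
                problems.append((lineno, "record is not a JSON object"))
                continue
            missing = [name for name in required if name not in record]
            if missing:
                problems.append((lineno, "missing fields: " + ", ".join(missing)))
    return problems
```

An empty result means every line parsed and carried the required fields; anything else is a reason to keep the file as evidence rather than import it.
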