6 Commits

Author SHA1 Message Date
Rene Fichtmueller
f4afe14af4 feat: add 12 new vendor URL builders to Playwright image scraper
- Nokia, Huawei, Ciena, Moxa, D-Link, Alcatel-Lucent Enterprise,
  Asterfusion, Brocade: passthrough builders (use stored product_page_url)
- NVIDIA Networking: SN-series URL builder (sn5600 → /ethernet-switching/sn5600/)
- Netgear: lowercase model slug builder for /business/wired/switches/fully-managed/
- UfiSpace: hardcoded sitemap-verified URL map (all 6 S9xxx models)
- QCT: hardcoded URL map for T3048-LY8 and T7032-IX1
- Add Nokia banner / Huawei marketing image patterns to GENERIC_IMAGE_PATTERNS
2026-04-21 07:24:11 +02:00
Rene Fichtmueller
8f36eff956 fix(scraper): filter OneTrust/cookie-consent images + skip in img fallback
cdn.cookielaw.org logos appear as the largest DOM image on Dell/Extreme
product pages when the cookie consent overlay is present. Added to both
GENERIC_IMAGE_PATTERNS (isGenericImage filter) and img fallback skipPattern
so the next-largest actual product image can be found.
2026-04-21 06:45:41 +02:00
Rene Fichtmueller
d67fbe31da fix(scraper): fall through to img fallback when og:image is generic/logo
Previously: if og:image existed (even as a Dell logo URL), page.evaluate() returned
early and the img fallback was never tried. Now: meta tags are extracted first, then
isGenericImage() is checked in Node.js, and the img fallback runs if meta image is null
or generic. This allows vendors like Dell (og:image = logo) to still get product images
via the DOM fallback.
2026-04-21 06:36:12 +02:00
Rene Fichtmueller
09d3a60b7c fix(scraper): fix Edgecore/Extreme URL builders, broaden img fallback, fix ENOENT
- buildEdgecoreUrl: /product/<slug>/ (WooCommerce, no .html) with EDGECORE_SLUG_MAP
  for AS7712-32X→as7712-32x-ec, Minipack2→minipack-as8000-open-modular-platform
- buildFortinetUrl: returns null (all pages redirect to generic, no usable og:image)
- buildExtremeUrl: direct product URL (extremenetworks.com/product/<slug>)
- img fallback: remove strict 'product/switch/router/hardware' path requirement;
  now takes largest image >=200x150px excluding flags/icons/spinners — isGenericImage()
  filters hero/banner/logo afterward
- ENOENT fix: unique per-run Crawlee storage dir (timestamp suffix) prevents
  stale request-queue file contamination between back-to-back vendor runs
2026-04-21 06:33:32 +02:00
Rene Fichtmueller
87b9416592 fix(scraper): fix Arista series-level URL builder + bypass Crawlee URL deduplication
- buildAristaUrl() now extracts series prefix (7060X5-32QS → 7060x5-series)
  instead of individual model URLs that lack og:image
- Strip trailing sub-variant 'A' so R3A → R3 series page
- Add uniqueKey: row.id to each request — prevents Crawlee from deduplicating
  models that share the same series URL (e.g. 7060x5-series)
- For Arista: always prefer fresh builder URL over stored product_page_url
  so stale individual-model URLs don't override correct series pages
2026-04-21 06:22:41 +02:00
Rene Fichtmueller
18a9e1346e feat: Playwright image scraper for bot-blocked vendors (Arista/Dell/Edgecore/Fortinet/Extreme) 2026-04-21 06:16:05 +02:00