
Web Scraper — Fast & Precise Data Extractors

Custom site-specific scrapers that are fast because they fetch only what matters, and precise because they lock onto stable anchors in the markup.
August 8, 2025
Python · Playwright · Selenium · aiohttp · httpx · BeautifulSoup · lxml · XPath · Pandas · SQLite
Most scrapers are slow because they load whole pages in a headless browser and then sift through noise.
My approach: identify the exact info targets first (stable anchors, semantic labels, microdata), then pick the cheapest path to them (raw HTTP when it suffices, headless only when the data requires JavaScript). That is what keeps these bots fast and gentle; a sketch of the cheap path follows the pipeline diagram below.
Scheduler → Fetcher (aiohttp/httpx | Playwright) → Parser (CSS/XPath)
         → Normalizer (pydantic) → Deduper → Storage (CSV/Parquet/SQLite)
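A minimal sketch of the cheap path, assuming the target ships JSON-LD structured data; the URL and the fallback step are placeholders. Try a plain HTTP fetch first and only escalate to a headless browser if nothing usable comes back.

# httpx + lxml: try raw HTTP and JSON-LD before reaching for a headless browser
import json
import httpx
from lxml import html

def fetch_product_cheap(url: str) -> dict | None:
    # Plain GET: no browser process, no JS execution
    resp = httpx.get(url, timeout=10, follow_redirects=True)
    resp.raise_for_status()
    tree = html.fromstring(resp.text)

    # Structured data is the most stable anchor when the site provides it
    for script in tree.xpath("//script[@type='application/ld+json']/text()"):
        try:
            data = json.loads(script)
        except json.JSONDecodeError:
            continue
        if isinstance(data, dict) and data.get("@type") == "Product":
            return data
    return None  # caller decides whether to escalate to the headless path

if fetch_product_cheap("https://example.com/product/123") is None:
    pass  # fall back to Playwright, as in the extractor example below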
  • Speed: async IO; concurrent fetch/parse pipeline; conditional headless (JS only when required).
  • Precision: CSS/XPath tuned to semantic anchors; fallbacks on structured data (JSON-LD, microdata); schema validation with pydantic.
  • Politeness: robots.txt compliance, rate limits & backoff, ETag/Last-Modified caching, retries (see the sketch after this list).
  • Outputs: JSON/CSV/Parquet/SQLite; optional push to Google Sheets/Notion.
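A sketch of the politeness layer under illustrative assumptions (the user agent string, in-memory ETag cache, and retry limits are placeholders): check robots.txt, revalidate with the cached ETag, and back off exponentially when the site throttles.

# Politeness: robots.txt check, conditional requests (ETag), exponential backoff
import time
from urllib.parse import urljoin
from urllib.robotparser import RobotFileParser
import httpx

AGENT = "example-scraper/1.0"     # placeholder user agent
etag_cache: dict[str, str] = {}   # url -> last ETag seen for that url

def allowed(url: str) -> bool:
    rp = RobotFileParser()
    rp.set_url(urljoin(url, "/robots.txt"))
    rp.read()
    return rp.can_fetch(AGENT, url)

def polite_get(client: httpx.Client, url: str, retries: int = 3) -> httpx.Response | None:
    headers = {"User-Agent": AGENT}
    if url in etag_cache:
        headers["If-None-Match"] = etag_cache[url]    # revalidate instead of re-downloading
    for attempt in range(retries):
        resp = client.get(url, headers=headers)
        if resp.status_code == 304:
            return None                               # unchanged since the last crawl
        if resp.status_code in (429, 503):
            time.sleep(2 ** attempt)                  # back off when throttled
            continue
        resp.raise_for_status()
        if "ETag" in resp.headers:
            etag_cache[url] = resp.headers["ETag"]
        return resp
    return None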
Example (site-specific extractor)


# Playwright + lxml: load the page once, then parse only the product block
from playwright.sync_api import sync_playwright
from lxml import html
import pydantic as p

class Product(p.BaseModel):
    title: str
    price: float
    sku: str

with sync_playwright() as pw:
    b = pw.chromium.launch(headless=True)
    pctx = b.new_context()
    page = pctx.new_page()
    page.goto("https://example.com/product/123", wait_until="domcontentloaded")
    tree = html.fromstring(page.content())
    b.close()  # release the browser early; parsing works on the HTML snapshot

    # Anchor on a stable attribute rather than brittle positional selectors
    node = tree.xpath("//section[@data-testid='product']")[0]
    item = Product(
        title=node.xpath(".//h1/text()")[0].strip(),
        price=float(node.xpath(".//meta[@itemprop='price']/@content")[0]),
        sku=node.xpath(".//*[@data-sku]/@data-sku")[0],
    )
    print(item.model_dump_json())

What I typically deliver
  • A repeatable scraper package with project config, proxies if needed, and a CLI.
  • Incremental crawl, change detection, and webhooks on updates (see the sketch after this list).
  • Clean, validated datasets ready for analytics.
  • Skills demonstrated: async Python, Playwright/Selenium, robust parsing, data modeling, pipelines, deployment hygiene.
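A minimal sketch of the change-detection piece, assuming a SQLite table named seen and a placeholder webhook endpoint: hash each normalized record, compare against the last stored digest, and notify only when something actually changed.

# Change detection: hash normalized records in SQLite, notify a webhook on change
import hashlib
import json
import sqlite3
import httpx

WEBHOOK_URL = "https://example.com/hooks/scraper"   # placeholder endpoint

def record_changed(db: sqlite3.Connection, key: str, record: dict) -> bool:
    digest = hashlib.sha256(json.dumps(record, sort_keys=True).encode()).hexdigest()
    row = db.execute("SELECT digest FROM seen WHERE key = ?", (key,)).fetchone()
    if row and row[0] == digest:
        return False                                  # unchanged: skip storage and webhook
    db.execute(
        "INSERT INTO seen (key, digest) VALUES (?, ?) "
        "ON CONFLICT(key) DO UPDATE SET digest = excluded.digest",
        (key, digest),
    )
    db.commit()
    return True

db = sqlite3.connect("scraper.db")
db.execute("CREATE TABLE IF NOT EXISTS seen (key TEXT PRIMARY KEY, digest TEXT)")
if record_changed(db, "product-123", {"title": "Widget", "price": 9.99}):
    httpx.post(WEBHOOK_URL, json={"key": "product-123", "event": "updated"})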