
Web Scraper — Fast & Precise Data Extractors

Custom site-specific scrapers that are fast because they fetch only what matters, and precise because they lock onto stable anchors in the markup.
August 8, 2025
Python · Playwright · Selenium · aiohttp · httpx · BeautifulSoup · lxml · XPath · Pandas · SQLite
Most scrapers are slow because they load whole pages in a headless browser and then sift through noise.
My approach: identify the exact info targets first (stable anchors, semantic labels, microdata), then pick the cheapest path to them (raw HTTP when it suffices, headless only when the data requires JavaScript). That is what keeps these bots fast and gentle; a sketch of the cheap path follows the pipeline diagram below.
Scheduler → Fetcher (aiohttp/httpx | Playwright) → Parser (CSS/XPath)
         → Normalizer (pydantic) → Deduper → Storage (CSV/Parquet/SQLite)
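A minimal sketch of the cheap path, assuming the target ships JSON-LD structured data; the URL and the fallback step are placeholders. Try a plain HTTP fetch first and only escalate to a headless browser if nothing usable comes back.

# httpx + lxml: try raw HTTP and JSON-LD before reaching for a headless browser
import json
import httpx
from lxml import html

def fetch_product_cheap(url: str) -> dict | None:
    # Plain GET: no browser process, no JS execution
    resp = httpx.get(url, timeout=10, follow_redirects=True)
    resp.raise_for_status()
    tree = html.fromstring(resp.text)

    # Structured data is the most stable anchor when the site provides it
    for script in tree.xpath("//script[@type='application/ld+json']/text()"):
        try:
            data = json.loads(script)
        except json.JSONDecodeError:
            continue
        if isinstance(data, dict) and data.get("@type") == "Product":
            return data
    return None  # caller decides whether to escalate to the headless path

if fetch_product_cheap("https://example.com/product/123") is None:
    pass  # fall back to Playwright, as in the extractor example below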
  • Speed: async IO; concurrent fetch/parse pipeline; conditional headless (JS only when required).
  • Precision: CSS/XPath tuned to semantic anchors; fallbacks on structured data (JSON-LD, microdata); schema validation with pydantic.
  • Politeness: robots.txt compliance, rate limits & backoff, ETag/Last-Modified caching, retries (see the sketch after this list).
  • Outputs: JSON/CSV/Parquet/SQLite; optional push to Google Sheets/Notion.
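A sketch of the politeness layer under illustrative assumptions (the user agent string, in-memory ETag cache, and retry limits are placeholders): check robots.txt, revalidate with the cached ETag, and back off exponentially when the site throttles.

# Politeness: robots.txt check, conditional requests (ETag), exponential backoff
import time
from urllib.parse import urljoin
from urllib.robotparser import RobotFileParser
import httpx

AGENT = "example-scraper/1.0"     # placeholder user agent
etag_cache: dict[str, str] = {}   # url -> last ETag seen for that url

def allowed(url: str) -> bool:
    rp = RobotFileParser()
    rp.set_url(urljoin(url, "/robots.txt"))
    rp.read()
    return rp.can_fetch(AGENT, url)

def polite_get(client: httpx.Client, url: str, retries: int = 3) -> httpx.Response | None:
    headers = {"User-Agent": AGENT}
    if url in etag_cache:
        headers["If-None-Match"] = etag_cache[url]    # revalidate instead of re-downloading
    for attempt in range(retries):
        resp = client.get(url, headers=headers)
        if resp.status_code == 304:
            return None                               # unchanged since the last crawl
        if resp.status_code in (429, 503):
            time.sleep(2 ** attempt)                  # back off when throttled
            continue
        resp.raise_for_status()
        if "ETag" in resp.headers:
            etag_cache[url] = resp.headers["ETag"]
        return resp
    return None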
Example (site-specific extractor)


# Playwright + lxml: load the page once, then parse only the product block
from playwright.sync_api import sync_playwright
from lxml import html
import pydantic as p

class Product(p.BaseModel):
    title: str
    price: float
    sku: str

with sync_playwright() as pw:
    b = pw.chromium.launch(headless=True)
    pctx = b.new_context()
    page = pctx.new_page()
    page.goto("https://example.com/product/123", wait_until="domcontentloaded")
    tree = html.fromstring(page.content())
    b.close()  # release the browser early; parsing works on the HTML snapshot

    # Anchor on a stable attribute rather than brittle positional selectors
    node = tree.xpath("//section[@data-testid='product']")[0]
    item = Product(
        title=node.xpath(".//h1/text()")[0].strip(),
        price=float(node.xpath(".//meta[@itemprop='price']/@content")[0]),
        sku=node.xpath(".//*[@data-sku]/@data-sku")[0],
    )
    print(item.model_dump_json())

What I typically deliver
  • A repeatable scraper package with project config, proxies if needed, and a CLI.
  • Incremental crawl, change detection, and webhooks on updates (see the sketch after this list).
  • Clean, validated datasets ready for analytics.
  • Skills demonstrated: async Python, Playwright/Selenium, robust parsing, data modeling, pipelines, deployment hygiene.
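A minimal sketch of the change-detection piece, assuming a SQLite table named seen and a placeholder webhook endpoint: hash each normalized record, compare against the last stored digest, and notify only when something actually changed.

# Change detection: hash normalized records in SQLite, notify a webhook on change
import hashlib
import json
import sqlite3
import httpx

WEBHOOK_URL = "https://example.com/hooks/scraper"   # placeholder endpoint

def record_changed(db: sqlite3.Connection, key: str, record: dict) -> bool:
    digest = hashlib.sha256(json.dumps(record, sort_keys=True).encode()).hexdigest()
    row = db.execute("SELECT digest FROM seen WHERE key = ?", (key,)).fetchone()
    if row and row[0] == digest:
        return False                                  # unchanged: skip storage and webhook
    db.execute(
        "INSERT INTO seen (key, digest) VALUES (?, ?) "
        "ON CONFLICT(key) DO UPDATE SET digest = excluded.digest",
        (key, digest),
    )
    db.commit()
    return True

db = sqlite3.connect("scraper.db")
db.execute("CREATE TABLE IF NOT EXISTS seen (key TEXT PRIMARY KEY, digest TEXT)")
if record_changed(db, "product-123", {"title": "Widget", "price": 9.99}):
    httpx.post(WEBHOOK_URL, json={"key": "product-123", "event": "updated"})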