Scraping that does not break every week: how to design more resilient extractors

Technical guide to designing maintainable scrapers in Python: selectors, fallback, validation, logs and real best practices.

If you have maintained a scraper for more than three months, you know exactly what this article is about. You build an extractor that works perfectly. You leave it running. Two weeks later, the site changes a CSS class and your pipeline starts returning empty fields. Or worse: it returns incorrect data without you noticing until someone tells you.

I have maintained scrapers in Rolsfera for months and the clearest lesson I have drawn is this: scraping is not hard to set up, it is hard to maintain. The difference between a scraper that survives and one that breaks every week is not in the library you use, but in how you design the extractor to absorb changes.

This article is not an intro to BeautifulSoup or Playwright. I assume you already know how to use both. What I want to share are design patterns, validation and monitoring approaches that I have applied in production and that make the difference when the scraper has to run without your constant supervision.


The underlying problem

A scraper is code that depends on the structure of a system you do not control, which makes it one of the most fragile kinds of software you can write. Any change to the target site’s HTML can break your extractor: a div that changes its class, a data-* attribute that disappears, pagination that moves from server-side to client-side, a JavaScript wrapper that did not exist before.

Scraping does not break because of programming errors. It breaks because the web changes, and your code assumes it will not.

Most scraping tutorials end when the extractor works the first time. But the first time is the easy part. What matters is what happens on attempt number 50, when something has changed and your system has to keep working or, at the very least, let you know it has stopped.


Anatomy of a maintainable extractor

Before getting into specific patterns, this is the structure I use for each extractor in Rolsfera:

# extractors/base.py
from abc import ABC, abstractmethod
from dataclasses import dataclass
from datetime import datetime
import logging

logger = logging.getLogger(__name__)

@dataclass
class ExtractedArticle:
    title: str
    url: str
    content: str
    published_at: str | None
    author: str | None
    source_name: str
    raw_html: str  # always keep the original HTML

    def is_valid(self) -> bool:
        """Minimal validation: a URL and a title longer than 5 characters."""
        return bool(self.title and self.url and len(self.title) > 5)


class BaseExtractor(ABC):
    def __init__(self, source_name: str):
        self.source_name = source_name
        self.logger = logging.getLogger(f"extractor.{source_name}")

    @abstractmethod
    def extract(self) -> list[ExtractedArticle]:
        pass

    def run(self) -> list[ExtractedArticle]:
        self.logger.info(f"Starting extraction for {self.source_name}")
        try:
            articles = self.extract()
            valid = [a for a in articles if a.is_valid()]
            invalid_count = len(articles) - len(valid)

            if invalid_count > 0:
                self.logger.warning(
                    f"{invalid_count} invalid articles discarded"
                )

            self.logger.info(
                f"Extraction finished: {len(valid)} valid articles"
            )
            return valid

        except Exception as e:
            # A failure in one source is logged and contained here
            self.logger.error(
                f"Extraction error for {self.source_name}: {e}",
                exc_info=True,
            )
            return []

Each source has its own class that inherits from BaseExtractor. The base class handles validation, logging and error handling. The concrete class only needs to implement the site-specific extraction logic.

This pattern has an obvious advantage: when one extractor fails, it does not drag down the rest. Uniform logging also makes problems quick to spot.


Selectors: the first line of defense

How you select elements from the HTML largely determines the scraper’s fragility. These are the principles I follow:

Prefer semantic attributes over CSS classes

CSS classes change frequently, especially on sites whose build tooling generates or hashes class names (CSS Modules, styled-components) or whose utility classes are reshuffled with every redesign (Tailwind). Semantic attributes (data-*, role, aria-label) tend to be more stable.

from bs4 import BeautifulSoup

# `html` holds the page source you have already downloaded
soup = BeautifulSoup(html, "html.parser")

# ❌ Fragile: depends on specific CSS classes
soup.select("div.article-card__title > h2.heading-md")

# ✅ Better: uses semantic attributes
soup.select("[data-testid='article-title']")

# ✅ Also good: semantic HTML structure
soup.select("article > header > h2")

Selector chain with fallback

I never rely on a single selector. For each field, I define a chain of selectors ordered by reliability:

def extract_title(self, article_element) -> str | None:
    """Try several strategies to extract the title."""
    selectors = [
        "[data-testid='article-title']",
        "article > header h1",
        "h1.entry-title",
        "h2.post-title",
        ".article-title",
    ]

    for selector in selectors:
        element = article_element.select_one(selector)
        if element and element.get_text(strip=True):
            return element.get_text(strip=True)

    # Last resort: the first h1 or h2 found
    for tag in ["h1", "h2"]:
        element = article_element.find(tag)
        if element and element.get_text(strip=True):
            return element.get_text(strip=True)

    return None

When one selector stops working, the next one in the chain takes over. This does not prevent the scraper from eventually breaking, but it gives it resilience against minor changes.

Log which selector works

A detail that makes a difference in maintenance: logging which selector was used for each extraction.

def extract_with_fallback(self, element, selectors: list[str], field_name: str) -> str | None:
    for i, selector in enumerate(selectors):
        result = element.select_one(selector)
        if result and result.get_text(strip=True):
            if i > 0:
                self.logger.warning(
                    f"Field '{field_name}': primary selector failed, "
                    f"using fallback #{i}: {selector}"
                )
            return result.get_text(strip=True)

    self.logger.error(f"Field '{field_name}': all selectors failed")
    return None

When the primary selector fails and a fallback takes over, the log alerts me. That gives me time to update the selectors before the entire chain stops working.


Validation of extracted data

Extracting data is only half the work. The other half is making sure the data makes sense. I have seen scrapers that kept running (no errors) but returning garbage because the HTML changed in a way that did not break the selectors but did break the content’s semantics.

from datetime import datetime, timedelta, timezone

def validate_article(article: ExtractedArticle) -> list[str]:
    """Return the list of problems found. An empty list means OK."""
    issues = []

    if not article.title or len(article.title) < 10:
        issues.append("Title is empty or suspiciously short")
    if not article.url or not article.url.startswith("http"):
        issues.append("URL is empty or has an invalid format")
    if not article.content or len(article.content) < 100:
        issues.append("Content is empty or suspiciously short")
    elif article.content.count("<") > 10:
        issues.append("Content appears to contain uncleaned HTML")

    if article.published_at:
        try:
            pub_date = datetime.fromisoformat(article.published_at)
            # Treat naive timestamps as UTC so the comparison never mixes
            # naive and aware datetimes (which raises TypeError)
            if pub_date.tzinfo is None:
                pub_date = pub_date.replace(tzinfo=timezone.utc)
            if pub_date > datetime.now(timezone.utc) + timedelta(days=1):
                issues.append("Publication date is in the future")
        except ValueError:
            issues.append(f"Invalid date format: {article.published_at}")

    return issues

Validation does not only detect obvious errors. It also detects subtle problems: content that looks like uncleaned HTML, a date in the future (probably a parsing error), a title so short that it could be a fragment.

If the validator finds issues, the article is not automatically discarded. It gets flagged for manual review. Sometimes a validation problem is a signal that the scraper needs adjustments, not that the article is bad.
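
The flag-instead-of-discard rule can be sketched as a small triage step. This is a minimal sketch, not the Rolsfera implementation: `triage` and `TriageResult` are hypothetical names, and the validator is any function that returns a list of issues, as validate_article does above.

```python
from dataclasses import dataclass, field
from typing import Callable, Generic, Iterable, TypeVar

T = TypeVar("T")


@dataclass
class TriageResult(Generic[T]):
    """Hypothetical container: clean items plus items flagged for review."""
    accepted: list[T] = field(default_factory=list)
    flagged: list[tuple[T, list[str]]] = field(default_factory=list)


def triage(items: Iterable[T], validate: Callable[[T], list[str]]) -> TriageResult[T]:
    """Split items into accepted and flagged; nothing is dropped."""
    result: TriageResult[T] = TriageResult()
    for item in items:
        issues = validate(item)
        if issues:
            # Keep the item together with its issues for manual review
            result.flagged.append((item, issues))
        else:
            result.accepted.append(item)
    return result
```

Keeping the issues next to each flagged item means a reviewer can tell at a glance whether the article is bad or the extractor needs adjusting.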


Log and alert strategy

A scraper’s logs need to answer three questions:

  1. Did the extractor run correctly?
  2. How many articles did it extract (and is it a reasonable number)?
  3. Was there anything unusual that I should review?

class ExtractionReport:
    def __init__(self, source_name: str):
        self.source_name = source_name
        self.articles_found = 0
        self.articles_valid = 0
        self.fallbacks_used = 0
        self.errors = []

    def health(self) -> str:
        if self.errors:
            return "error"
        if self.articles_found == 0:
            return "warning"
        if self.fallbacks_used > 0:
            return "degraded"
        return "healthy"

The value returned by health() is what I use for alerts. If an extractor stays in error or degraded for more than two consecutive runs, n8n sends me a Telegram message. I do not need to actively review logs: the system alerts me when something goes wrong.
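
The "more than two consecutive runs" rule could be tracked along these lines. This is a sketch under assumptions: `HealthTracker` is a hypothetical in-memory helper, a real setup would persist the history between runs, and the actual notification (n8n and Telegram in my case) is left out; the method only says whether an alert is due.

```python
from collections import defaultdict, deque

ALERT_STATES = {"error", "degraded"}


class HealthTracker:
    """Hypothetical in-memory tracker of recent health values per source."""

    def __init__(self, threshold: int = 2):
        self.threshold = threshold
        # Keep only the most recent (threshold + 1) results per source
        self._history = defaultdict(lambda: deque(maxlen=threshold + 1))

    def record(self, source_name: str, health: str) -> bool:
        """Store a run result; return True when an alert should fire."""
        history = self._history[source_name]
        history.append(health)
        # Alert only when the window is full and every run in it is
        # unhealthy, i.e. more than `threshold` consecutive bad runs
        return (
            len(history) > self.threshold
            and all(state in ALERT_STATES for state in history)
        )
```

Note that this version keeps returning True on every further unhealthy run; whether to alert once or repeatedly is a design choice.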

A metric that turned out to be very useful is the historical article count per source. If an extractor that normally returns 5-10 articles suddenly returns 0 for three consecutive runs, something has changed. The site probably modified its structure or blocked the scraper.

def check_extraction_anomaly(source_name: str, current_count: int) -> bool:
    """Compare against the historical average to detect anomalies."""
    # get_historical_average: storage-backed helper (implementation not shown)
    avg = get_historical_average(source_name, days=7)
    if avg == 0:
        return False
    # A count below 20% of the average is treated as anomalous
    return current_count < avg * 0.2

When to use RSS before scraping

Before building a scraper, I always check whether the site offers an RSS feed. It is a rule I imposed on myself after wasting time maintaining scrapers for sites that had a perfectly functional RSS feed.

Criterion            | RSS                             | Scraping
---------------------|---------------------------------|------------------------------------
Stability            | High (standard format)          | Low (depends on HTML)
Maintenance          | Minimal                         | Constant
Full content         | Sometimes (depends on the feed) | Yes (if the selector is correct)
Ethics/legality      | Always permitted                | Gray area
Structured data      | Limited (title, date, summary)  | Flexible (whatever you can extract)
Implementation speed | Minutes                         | Hours

My rule is: RSS first, scraping only when there is no alternative or when I need data the feed does not include.

In practice, I often use both: RSS to detect new articles (it is the most reliable source for that) and scraping to extract the full content when the feed only includes an excerpt.

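from datetime import datetime, timezone

import feedparser

class HybridExtractor(BaseExtractor):
    """Use RSS to discover URLs and scraping to extract the content."""

    def __init__(self, source_name: str, feed_url: str):
        super().__init__(source_name)
        self.feed_url = feed_url

    def extract(self) -> list[ExtractedArticle]:
        feed = feedparser.parse(self.feed_url)
        articles = []
        for entry in feed.entries:
            url = entry.get("link", "")
            # is_already_extracted: deduplication helper defined elsewhere
            if not url or is_already_extracted(url):
                continue
            # RSS for discovery, scraping for the full content
            full_content = self._scrape_full_article(url)
            # feedparser exposes the parsed date as a UTC struct_time;
            # convert it to ISO 8601 so validate_article can parse it
            parsed = entry.get("published_parsed")
            published_at = (
                datetime(*parsed[:6], tzinfo=timezone.utc).isoformat()
                if parsed else None
            )
            articles.append(ExtractedArticle(
                title=entry.get("title", ""),
                url=url,
                content=full_content or entry.get("summary", ""),
                published_at=published_at,
                author=entry.get("author"),
                source_name=self.source_name,
                raw_html=full_content or "",
            ))
        return articles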

Best practices (the real ones, not the textbook ones)

Realistic headers

A beginner mistake I still see is not setting HTTP headers. Many sites block requests without a User-Agent or with the requests default (which identifies itself as python-requests).

HEADERS = {
    "User-Agent": (
        "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
        "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36"
    ),
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9",
    "Accept-Language": "en-US,en;q=0.9,es;q=0.8",
    # Add "br" only if the brotli package is installed; without it,
    # requests cannot decode Brotli-compressed responses
    "Accept-Encoding": "gzip, deflate",
}

Delays between requests

I do not hammer servers. Between requests to the same domain I use a 2-5 second delay. It is a matter of ethics and survival: if a site detects aggressive scraping patterns, it blocks you.

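import random
import time

import requests

def polite_request(url: str, min_delay: float = 2.0, max_delay: float = 5.0):
    # Random pause before each request so the traffic pattern is not uniform
    time.sleep(random.uniform(min_delay, max_delay))
    # HEADERS is the dictionary defined in the previous snippet
    return requests.get(url, headers=HEADERS, timeout=15)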
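
The function above pauses before every request, whatever the host. If the goal is a delay between requests to the same domain, one possible refinement is to remember the last request time per host. This is a sketch under assumptions: `DomainThrottle` is a hypothetical name, and the clock and sleep functions are injectable only so the behavior can be tested without waiting.

```python
import random
import time
from urllib.parse import urlparse


class DomainThrottle:
    """Hypothetical helper: enforce a random gap between hits to one host."""

    def __init__(self, min_delay: float = 2.0, max_delay: float = 5.0,
                 clock=time.monotonic, sleep=time.sleep):
        self.min_delay = min_delay
        self.max_delay = max_delay
        self._clock = clock
        self._sleep = sleep
        self._last_hit: dict[str, float] = {}

    def wait(self, url: str) -> float:
        """Sleep if this host was hit too recently; return seconds slept."""
        host = urlparse(url).netloc
        gap = random.uniform(self.min_delay, self.max_delay)
        last = self._last_hit.get(host)
        pause = 0.0
        if last is not None:
            # Only wait for the part of the gap that has not elapsed yet
            pause = max(0.0, gap - (self._clock() - last))
        if pause > 0:
            self._sleep(pause)
        self._last_hit[host] = self._clock()
        return pause
```

Calling `throttle.wait(url)` right before each request keeps a single domain from being hit in bursts while leaving requests to different domains unaffected.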

Respect robots.txt

I check the robots.txt before building a scraper for a new site. Python has urllib.robotparser for this. If the path is blocked, I do not scrape it. It is not just a legal matter. It is respect for other people’s work. If a site explicitly says it does not want to be scraped, you need to find an alternative (RSS, API, direct contact).
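
A minimal sketch of that check with urllib.robotparser. The `build_robots_checker` wrapper is a hypothetical name, and the rules are parsed from a string so the example runs offline; against a live site you would call `set_url()` and `read()` on the parser instead.

```python
from urllib.robotparser import RobotFileParser


def build_robots_checker(robots_txt: str, user_agent: str = "*"):
    """Return a function that tells whether a URL may be fetched."""
    parser = RobotFileParser()
    # parse() takes the file's lines, so no network access is needed here
    parser.parse(robots_txt.splitlines())

    def is_allowed(url: str) -> bool:
        return parser.can_fetch(user_agent, url)

    return is_allowed


# Illustrative rules, not taken from any real site
SAMPLE_ROBOTS = """\
User-agent: *
Disallow: /private/
"""

is_allowed = build_robots_checker(SAMPLE_ROBOTS)
print(is_allowed("https://example.com/blog/post"))
print(is_allowed("https://example.com/private/data"))
```

Running the check once per source and caching the result avoids fetching robots.txt on every request.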

Always save the original HTML

I save the raw HTML from each extraction. It takes up space, but it has saved me several times: when a selector breaks, I can reprocess the stored HTML with the new selector without having to re-download the pages.
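
One way to store the pages, as a sketch rather than the layout I actually use: the `save_raw_html` name, the `raw_html` directory and the source/date/hash path scheme are all assumptions for illustration.

```python
import hashlib
from datetime import datetime, timezone
from pathlib import Path


def save_raw_html(html: str, url: str, source_name: str,
                  base_dir: Path = Path("raw_html")) -> Path:
    """Write the page to <base_dir>/<source>/<date>/<url-hash>.html."""
    day = datetime.now(timezone.utc).strftime("%Y-%m-%d")
    # A hash of the URL gives a stable, filesystem-safe file name
    name = hashlib.sha256(url.encode("utf-8")).hexdigest()[:16]
    path = base_dir / source_name / day / f"{name}.html"
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(html, encoding="utf-8")
    return path
```

Reprocessing is then a matter of reading the file back with `path.read_text(encoding="utf-8")` and running the updated selectors over it.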


Fallback pattern when the structure changes

When a site changes its HTML, the extractor breaks. The question is: how does it recover?

My pattern has three levels:

Level 1: Selector chain. Already explained above. Multiple selectors ordered by priority. Covers minor changes.

Level 2: Generic extraction. If all specific selectors fail, I fall back to a generic extractor that looks for content in standard semantic tags (article, main, [role='main'], .content). If that does not work either, it looks for the longest text block on the page. It is rough, but in many cases it extracts reasonable content.

Level 3: Alert and degradation. If the generic extraction also fails to get reasonable content, the article is saved with only the RSS data (title, URL, short summary) and the system alerts me to update the extractor manually.

def extract_with_fallback_levels(self, url: str, rss_data: dict) -> ExtractedArticle:
    # Level 1: site-specific selectors
    content = self._extract_with_selectors(url)
    if content:
        return self._build_article(content, rss_data, method="specific")
    # Level 2: generic extraction
    content = self._extract_generic(url)
    if content:
        self.logger.warning(f"Using generic extraction for {url}")
        return self._build_article(content, rss_data, method="generic")
    # Level 3: RSS data only, plus an alert
    self.logger.error(f"Extraction failed for {url}, using RSS data only")
    self._send_alert(f"Extractor broken for {self.source_name}")
    return self._build_article(rss_data.get("summary", ""), rss_data, method="rss_only")

This pattern does not solve the root problem (you need to update the selectors eventually), but it buys you time. Instead of the pipeline stopping completely, it degrades progressively and alerts you.


When to use Playwright instead of BeautifulSoup

The rule is simple: if the content is in the HTML the server returns, BeautifulSoup. If the content loads with JavaScript after the initial page load, Playwright.

import logging

from playwright.sync_api import sync_playwright

logger = logging.getLogger(__name__)

def scrape_with_playwright(url: str) -> str | None:
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()

        try:
            page.goto(url, wait_until="networkidle", timeout=15000)
            # Wait for the main content to load
            page.wait_for_selector("article", timeout=5000)
            content = page.query_selector("article")
            return content.inner_text() if content else None
        except Exception as e:
            logger.error(f"Playwright error: {e}")
            return None
        finally:
            browser.close()

Playwright is slower and consumes more resources. I reserve it for the sites that truly need it. In Rolsfera, only 3 of the 40+ sources require Playwright. The rest work with BeautifulSoup and standard HTTP requests.


Final thoughts

Scraping is a powerful but uncomfortable tool. It works until it stops working, and the question is not whether it will break, but when and how much it will cost you to fix it.

What I have learned maintaining scrapers for months is that the investment is not in writing the initial extractor, but in building the layers around it: validation, fallbacks, logs, alerts and the discipline of treating each extractor as a component that will fail.

If you take away just one idea from this article: do not build scrapers that assume stability. Build scrapers that assume change and are designed to degrade gracefully when that change arrives.

OshyTech

Copyright 2026 OshyTech. All Rights Reserved