Using AI Agents for Data Scraping: Bypassing Modern Captchas Ethically

Build AI-powered scraping agents that handle CAPTCHAs ethically using browser automation, solver APIs, and respectful rate limiting.

Problem: Modern Sites Block Your Scrapers Immediately

You built a scraper. It worked for five minutes. Then it hit a CAPTCHA, a Cloudflare wall, or a bot-detection fingerprint check — and stopped dead.

Modern anti-bot systems are genuinely hard to beat with naive requests scripts. But AI agents with browser automation, human-like behavior, and ethical CAPTCHA solvers can handle most of them — without breaking terms of service or the law.

You'll learn:

  • How to build a browser-based scraping agent using Playwright and Python
  • How to use a CAPTCHA solver API (2captcha / CapSolver) the right way
  • How to stay ethical: rate limits, robots.txt checks, and knowing when to stop

Time: 30 min | Level: Advanced


Why This Happens

Static HTTP scrapers (requests, httpx) skip JavaScript execution entirely. Modern sites detect this instantly — no browser fingerprint, no cookie behavior, no JS challenge response. Cloudflare's Turnstile and hCaptcha specifically probe for browser APIs that only real browsers expose.

Common symptoms:

  • 403s or infinite loading after the first few requests
  • HTML that just says "Checking your browser..."
  • CAPTCHAs appearing on every request, not just occasionally
  • requests works locally but fails on cloud IPs (AWS, GCP ranges are blocklisted)
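
A cheap first diagnostic is checking whether the HTML you got back is a challenge interstitial rather than real content. A minimal heuristic sketch — the marker strings are assumptions based on common Cloudflare interstitial text, so adjust them for your target:

```python
CHALLENGE_MARKERS = (
    "Checking your browser",
    "Just a moment...",
    "cf-chl",
)


def looks_like_challenge(html: str) -> bool:
    # Heuristic: anti-bot interstitials contain telltale phrases
    # in the title or body instead of the content you asked for
    return any(marker in html for marker in CHALLENGE_MARKERS)


print(looks_like_challenge("<title>Just a moment...</title>"))           # True
print(looks_like_challenge("<html><body>Product list</body></html>"))    # False
```

Run this on every response before parsing: it turns a silent empty-data failure into an explicit "we got blocked" signal.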

The ethical angle matters here too. "Bypassing" a CAPTCHA isn't inherently wrong — it depends entirely on what you're scraping and why. Public data for research, price comparison, or accessibility tooling is legitimate. Scraping behind a login wall, ignoring robots.txt, or hammering a server at 1000 req/s is not.


Solution

Step 1: Set Up the Environment

You need Python 3.12+, Playwright, and an optional solver API key. CapSolver has a free tier suitable for testing.

pip install playwright httpx python-dotenv capsolver
playwright install chromium

Create a .env file:

CAPSOLVER_API_KEY=your_key_here

Step 2: Build the Base Agent

This agent launches a real Chromium browser, mimics human timing, and checks robots.txt before touching any URL.

import asyncio
import os
from urllib.robotparser import RobotFileParser
from urllib.parse import urlparse
from playwright.async_api import async_playwright
from dotenv import load_dotenv

load_dotenv()


def can_scrape(url: str, user_agent: str = "*") -> bool:
    """Check robots.txt before scraping. Respect the rules."""
    parsed = urlparse(url)
    robots_url = f"{parsed.scheme}://{parsed.netloc}/robots.txt"
    rp = RobotFileParser()
    rp.set_url(robots_url)
    try:
        rp.read()
        return rp.can_fetch(user_agent, url)
    except Exception:
        # If robots.txt is unreachable, proceed cautiously
        return True


class ScrapingAgent:
    def __init__(self, headless: bool = True):
        self.headless = headless
        self._pw = None
        self.browser = None
        self.context = None

    async def start(self):
        self._pw = await async_playwright().start()
        self.browser = await self._pw.chromium.launch(headless=self.headless)
        self.context = await self.browser.new_context(
            # Real browser UA — avoids the "HeadlessChrome" giveaway
            user_agent=(
                "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                "AppleWebKit/537.36 (KHTML, like Gecko) "
                "Chrome/122.0.0.0 Safari/537.36"
            ),
            viewport={"width": 1280, "height": 800},
            locale="en-US",
        )
        # Mask navigator.webdriver — the single most common bot signal
        await self.context.add_init_script(
            "Object.defineProperty(navigator, 'webdriver', {get: () => undefined})"
        )

    async def stop(self):
        await self.browser.close()
        await self._pw.stop()

    async def fetch(self, url: str, wait_ms: int = 2000) -> str:
        """Fetch a page with human-like delay. Returns page HTML."""
        if not can_scrape(url):
            raise PermissionError(f"robots.txt disallows scraping: {url}")

        page = await self.context.new_page()
        try:
            await page.goto(url, wait_until="networkidle")
            # Human-like pause — critical for evading timing-based detection
            await page.wait_for_timeout(wait_ms)
            return await page.content()
        finally:
            await page.close()

Expected: No errors on import. can_scrape returns False for URLs blocked in robots.txt.

If it fails:

  • playwright install hangs: Run playwright install-deps chromium first (Linux only)
  • ModuleNotFoundError: Confirm you're in the right virtual environment
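
You can also sanity-check the robots.txt logic offline by feeding RobotFileParser literal rules instead of fetching over the network (the rules below are hypothetical):

```python
from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
# Parse hypothetical rules directly — no network involved
rp.parse([
    "User-agent: *",
    "Disallow: /private/",
])

print(rp.can_fetch("*", "https://example.com/public/page"))   # True
print(rp.can_fetch("*", "https://example.com/private/page"))  # False
```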

Step 3: Handle CAPTCHAs with a Solver API

Some sites show CAPTCHAs even to clean browser sessions. For legitimate use cases, solver APIs (which use human workers or ML models) handle these programmatically.

import capsolver

capsolver.api_key = os.getenv("CAPSOLVER_API_KEY")


async def solve_recaptcha_v2(page, site_key: str, page_url: str) -> bool:
    """
    Solve a reCAPTCHA v2 and inject the token.
    Returns True if solved, False if solver quota exceeded.
    """
    try:
        solution = capsolver.solve({
            "type": "ReCaptchaV2Task",
            "websiteURL": page_url,
            "websiteKey": site_key,
        })
        token = solution["gRecaptchaResponse"]

        # Inject the token into the hidden textarea reCAPTCHA uses.
        # Pass it as an evaluate() argument rather than interpolating
        # it into an f-string, which breaks if the token contains quotes.
        await page.evaluate(
            "token => {"
            " document.getElementById('g-recaptcha-response').value = token;"
            " }",
            token,
        )
        # Note: grecaptcha.execute(siteKey, {action}) is the v3 API, not v2.
        # For v2, invoke the widget's data-callback if the site registers one,
        # or submit the enclosing form after injecting the token.
        return True

    except capsolver.exceptions.ApiException as e:
        print(f"Solver failed: {e}")
        return False


async def solve_cloudflare_turnstile(page, site_key: str, page_url: str) -> bool:
    """Handle Cloudflare Turnstile specifically."""
    try:
        solution = capsolver.solve({
            "type": "AntiTurnstileTaskProxyLess",
            "websiteURL": page_url,
            "websiteKey": site_key,
        })
        token = solution["token"]

        # Pass the token as an evaluate() argument to avoid f-string quoting issues
        await page.evaluate(
            "token => {"
            " document.querySelector('[name=cf-turnstile-response]').value = token;"
            " }",
            token,
        )
        return True

    except Exception as e:
        print(f"Turnstile failed: {e}")
        return False

Why this is ethical: Solver APIs cost real money per solve (typically $1–3 per 1000 solves). The cost naturally limits abuse. You're also not circumventing auth — you're solving the same challenge a human would.

If it fails:

  • ApiException: ERROR_ZERO_BALANCE: Top up your solver account
  • Token injection does nothing: The site may use a newer CAPTCHA version; check their JS for the widget config

Step 4: Add Rate Limiting and Retry Logic

Hammering a site is both unethical and counterproductive — it triggers blocks faster. Use exponential backoff.

import asyncio
import random
from typing import Awaitable, Callable, TypeVar

T = TypeVar("T")


async def with_backoff(
    fn: Callable[[], Awaitable[T]],
    max_retries: int = 3,
    base_delay: float = 2.0,
) -> T:
    """
    Retry with exponential backoff + jitter.
    Jitter prevents thundering herd if running multiple agents.
    """
    for attempt in range(max_retries):
        try:
            return await fn()
        except Exception as e:
            if attempt == max_retries - 1:
                raise
            delay = base_delay * (2 ** attempt) + random.uniform(0, 1)
            print(f"Attempt {attempt + 1} failed: {e}. Retrying in {delay:.1f}s")
            await asyncio.sleep(delay)


class RateLimiter:
    """Minimum-interval rate limiter — keeps requests/sec under control."""

    def __init__(self, requests_per_second: float = 0.5):
        # Default: 1 request every 2 seconds — polite for most sites
        self.min_interval = 1.0 / requests_per_second
        self._last_call = 0.0

    async def acquire(self):
        now = asyncio.get_running_loop().time()
        wait = self.min_interval - (now - self._last_call)
        if wait > 0:
            await asyncio.sleep(wait)
        self._last_call = asyncio.get_running_loop().time()
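
The delay schedule the retry helper produces is easy to reason about in isolation. A standalone sketch of the same formula:

```python
import random


def backoff_delay(attempt: int, base: float = 2.0) -> float:
    # Exponential growth plus up to 1 second of jitter
    return base * (2 ** attempt) + random.uniform(0, 1)


for attempt in range(3):
    print(f"attempt {attempt}: wait up to {backoff_delay(attempt):.1f}s")
```

With the defaults, attempts wait roughly 2s, 4s, then 8s (each plus jitter) — slow enough to back off a stressed server, fast enough not to stall the run.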

Step 5: Put It All Together

async def main():
    urls = [
        "https://example.com/public-data/page-1",
        "https://example.com/public-data/page-2",
    ]

    agent = ScrapingAgent(headless=True)
    limiter = RateLimiter(requests_per_second=0.3)  # ~1 request per 3 seconds

    await agent.start()
    results = []

    try:
        for url in urls:
            await limiter.acquire()

            async def fetch_url():
                return await agent.fetch(url)

            try:
                html = await with_backoff(fetch_url)
                results.append({"url": url, "html": html})
                print(f"✓ Scraped: {url}")
            except PermissionError as e:
                print(f"✗ Blocked by robots.txt: {e}")
            except Exception as e:
                print(f"✗ Failed after retries: {e}")
    finally:
        await agent.stop()

    return results


if __name__ == "__main__":
    asyncio.run(main())

Expected output:

✓ Scraped: https://example.com/public-data/page-1
✓ Scraped: https://example.com/public-data/page-2

[Screenshot: terminal showing successful scrape output — clean, rate-limited requests with no blocks triggered]


Verification

Run a quick sanity check against a known-public test site:

python -c "
import asyncio
from your_module import ScrapingAgent

async def test():
    agent = ScrapingAgent(headless=True)
    await agent.start()
    html = await agent.fetch('https://books.toscrape.com')
    await agent.stop()
    print('Success' if '<title>' in html else 'Failed')

asyncio.run(test())
"

You should see: Success. (books.toscrape.com is a purpose-built scraping sandbox.)


The Ethics Checklist

Before running this against any real site, check these off:

  • robots.txt allows it — the code enforces this, but verify manually too
  • Terms of Service permit it — read them; some allow non-commercial access to public data, while many prohibit automated access outright
  • You're not behind a login — scraping authenticated content is a different legal territory entirely
  • Rate is reasonable — 0.3–0.5 req/s is polite; anything over 2 req/s needs explicit permission
  • Data use is legitimate — research, price comparison, accessibility, and archiving are generally defensible

If a site returns a 429 or explicitly asks your scraper to stop, honor it. The rate limiter and robots.txt check handle the automated version of this — but human judgment still applies.
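
Honoring a 429 programmatically usually means reading the Retry-After header before retrying. A minimal sketch — it handles only the delay-seconds form of the header, not the HTTP-date form:

```python
def retry_after_seconds(headers: dict, default: float = 60.0) -> float:
    # Retry-After is commonly an integer number of seconds;
    # fall back to a conservative default when absent or unparseable
    value = headers.get("Retry-After")
    try:
        return float(value)
    except (TypeError, ValueError):
        return default


print(retry_after_seconds({"Retry-After": "30"}))  # 30.0
print(retry_after_seconds({}))                     # 60.0
```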


What You Learned

  • Playwright with a masked webdriver property clears most basic bot detections
  • CAPTCHA solver APIs handle the cases browser automation alone can't
  • Exponential backoff + jitter keeps you under rate limits without thrashing
  • Checking robots.txt first is both ethical and legally relevant in many jurisdictions

Limitation: This won't beat every anti-bot system. Akamai Bot Manager and Kasada use deep behavioral fingerprinting that requires significantly more effort — and at that point, you should seriously question whether scraping is the right approach vs. an official API.

When NOT to use this: Don't run this against sites that offer a free API for the same data. Don't use it to scrape personal data. Don't use it to circumvent paywalls.


Tested on Python 3.12, Playwright 1.42, CapSolver 1.0.5, Ubuntu 22.04 and macOS Sonoma