Scrapy integration — Docs

Proxy middleware

Create a middleware that sets request.meta['proxy'] on every outgoing request. The built-in HttpProxyMiddleware (priority 750) reads that key and applies it.

myspider/middlewares.py (Python)

import os

class JustProxiesMiddleware:
    """Injects proxy URL into every request before HttpProxyMiddleware."""

    PROXY = os.getenv(
        "PROXY_URL",
        # Placeholder fallback; set the PROXY_URL environment variable instead.
        "http://USER:PASS@PROXY_HOST:8080",
    )

    def process_request(self, request, spider):
        # Don't override a proxy already set by the spider.
        request.meta.setdefault("proxy", self.PROXY)

myspider/settings.py (Python)

DOWNLOADER_MIDDLEWARES = {
    # Run our middleware before HttpProxyMiddleware (default priority 750).
    "myspider.middlewares.JustProxiesMiddleware": 350,
    # Enabled by default; listed here at its default priority for clarity.
    "scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware": 750,
}

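Store the proxy URL in an environment variable or a custom PROXY_URL setting rather than hard-coding credentials in the middleware. PROXY_URL is not a built-in Scrapy setting: the middleware above reads only the environment variable, so a value placed in settings.py takes effect only if the middleware reads it from crawler.settings.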
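
One way to make a PROXY_URL entry in settings.py take effect is to read it in from_crawler. The following is a minimal sketch, not the documented middleware: the class name SettingsProxyMiddleware and the choice to raise ValueError when nothing is configured are illustrative assumptions.

```python
import os


class SettingsProxyMiddleware:
    """Hypothetical variant: prefers the PROXY_URL Scrapy setting and
    falls back to the PROXY_URL environment variable."""

    def __init__(self, proxy_url):
        self.proxy_url = proxy_url

    @classmethod
    def from_crawler(cls, crawler):
        # Scrapy calls from_crawler() when it builds the middleware.
        # settings.get() returns None if the setting is absent.
        proxy_url = crawler.settings.get("PROXY_URL") or os.getenv("PROXY_URL")
        if not proxy_url:
            # Assumption: fail fast rather than crawl without a proxy.
            raise ValueError("PROXY_URL is not set in settings or environment")
        return cls(proxy_url)

    def process_request(self, request, spider):
        # Don't override a proxy already set by the spider.
        request.meta.setdefault("proxy", self.proxy_url)
```

Register it in DOWNLOADER_MIDDLEWARES at a priority below 750 so it still runs before HttpProxyMiddleware.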

settings.py reference

Full settings block for a typical rotating-proxy scrape job. Adjust concurrency and retry values to your target's tolerance.

myspider/settings.py (Python)

# ── Proxy ────────────────────────────────────────────────────────────────────
DOWNLOADER_MIDDLEWARES = {
    "myspider.middlewares.JustProxiesMiddleware": 350,
    "scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware": 750,
}
# Custom setting (not built into Scrapy); placeholder credentials shown.
PROXY_URL = "http://USER:PASS@PROXY_HOST:8080"

# ── Retry ─────────────────────────────────────────────────────────────────────
RETRY_ENABLED    = True
RETRY_TIMES      = 4
RETRY_HTTP_CODES = [429, 500, 502, 503, 504, 522]

# ── Concurrency ───────────────────────────────────────────────────────────────
CONCURRENT_REQUESTS            = 32
CONCURRENT_REQUESTS_PER_DOMAIN = 16
DOWNLOAD_DELAY                 = 0     # rotation handles IP distribution
RANDOMIZE_DOWNLOAD_DELAY       = False

# ── Timeouts ─────────────────────────────────────────────────────────────────
DOWNLOAD_TIMEOUT = 20

# ── Robots ───────────────────────────────────────────────────────────────────
ROBOTSTXT_OBEY = False

Sticky sessions

For flows that need the same exit IP across several requests — login then dashboard, multi-step forms — generate a session token and embed it in the username. Set request.meta['proxy'] explicitly on those requests to override the default.

myspider/spiders/shop.py (Python)

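import secrets
import scrapy

class ShopSpider(scrapy.Spider):
    name = "shop"

    # Fill with dicts like {"u": "name", "p": "password"}.
    accounts = []

    def start_requests(self):
        for account in self.accounts:
            token = secrets.token_hex(8)
            proxy = f"http://USER-session-{token}:PASS@PROXY_HOST:8080"

            # FormRequest URL-encodes the fields and sets Content-Type.
            yield scrapy.FormRequest(
                "https://shop.com/login",
                formdata={"user": account["u"], "pass": account["p"]},
                meta={
                    "proxy": proxy,
                    # HttpProxyMiddleware strips credentials from
                    # meta["proxy"], so keep the full URL separately.
                    "session_proxy": proxy,
                    # One cookie jar per session keeps accounts apart.
                    "cookiejar": token,
                    "account": account,
                },
                callback=self.after_login,
            )

    def after_login(self, response):
        # Same proxy token → same exit IP for the session.
        session_proxy = response.meta["session_proxy"]
        yield scrapy.Request(
            "https://shop.com/orders",
            meta={
                "proxy": session_proxy,
                "session_proxy": session_proxy,
                "cookiejar": response.meta["cookiejar"],
                "account": response.meta["account"],
            },
            callback=self.parse_orders,
            # Every account requests the same URL; skip the dupefilter.
            dont_filter=True,
        )

    def parse_orders(self, response):
        ...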
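
Building the sticky URL by hand is easy to get wrong once the base URL comes from configuration. The helper below is a small sketch of how the session suffix could be spliced into the username of an existing proxy URL. The function name sticky_proxy_url and the host proxy.example are hypothetical; the USER-session-TOKEN username format is the one used in the spider above.

```python
import secrets
from urllib.parse import urlsplit


def sticky_proxy_url(base_url, token=None):
    """Return base_url with '-session-<token>' appended to the username."""
    token = token or secrets.token_hex(8)
    parts = urlsplit(base_url)
    # netloc looks like 'user:password@host:port'.
    userinfo, at, hostport = parts.netloc.rpartition("@")
    if not at or not userinfo:
        raise ValueError("base_url must include a username")
    user, sep, password = userinfo.partition(":")
    netloc = f"{user}-session-{token}{sep}{password}@{hostport}"
    return parts._replace(netloc=netloc).geturl()


if __name__ == "__main__":
    # Hypothetical host and placeholder credentials.
    print(sticky_proxy_url("http://USER:PASS@proxy.example:8080", "abc123"))
```

Passing the same token again yields the same URL, which is what keeps a multi-step flow on one session.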

Retry policy

Scrapy's built-in retry middleware handles the status codes in RETRY_HTTP_CODES. On rotating products each retry fetches a new exit IP automatically — no extra code needed. For finer control, add a custom RetryMiddleware subclass that implements backoff:

myspider/middlewares.py (Python)

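import random
import time

from scrapy.downloadermiddlewares.retry import RetryMiddleware as _Base

class BackoffRetryMiddleware(_Base):
    def process_response(self, request, response, spider):
        attempt = request.meta.get("retry_times", 0)
        # Sleep only when the base class is actually going to retry.
        will_retry = (
            response.status in self.retry_http_codes
            and not request.meta.get("dont_retry", False)
            and attempt < self.max_retry_times
        )
        if will_retry:
            # Exponential backoff capped at 8 s, plus up to 0.4 s of jitter.
            wait = min(8, 0.4 * (2 ** attempt)) + random.uniform(0, 0.4)
            time.sleep(wait)  # blocks the reactor; see the note below
        return super().process_response(request, response, spider)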
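
For the subclass to take effect it has to be registered, and the built-in retry middleware should be disabled so that responses are not retried by both. A settings fragment along these lines should work, assuming the class lives in myspider/middlewares.py as shown above; 550 is the built-in RetryMiddleware's default priority.

```python
# myspider/settings.py (fragment)
DOWNLOADER_MIDDLEWARES = {
    "myspider.middlewares.JustProxiesMiddleware": 350,
    # Disable the built-in retry middleware to avoid double retries.
    "scrapy.downloadermiddlewares.retry.RetryMiddleware": None,
    # Take over the built-in middleware's default slot.
    "myspider.middlewares.BackoffRetryMiddleware": 550,
}
```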

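time.sleep blocks Scrapy's reactor thread, so every in-flight request stalls for the duration of the wait. That is acceptable only for short waits (< 2 s), and the formula above is capped at 8.4 s, so lower the cap if you keep time.sleep. For longer waits use a non-blocking delay instead, such as a Deferred fired by reactor.callLater or, on the asyncio reactor, an async process_response that awaits asyncio.sleep.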
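
To see which attempts stay inside the 2 s guideline, the backoff formula can be evaluated on its own. This is an illustrative helper, not part of Scrapy; the name backoff_bounds is hypothetical and its defaults mirror the constants in BackoffRetryMiddleware.

```python
def backoff_bounds(attempt, base=0.4, cap=8.0, jitter=0.4):
    """Return (shortest, longest) sleep in seconds for a retry attempt."""
    core = min(cap, base * (2 ** attempt))
    return core, core + jitter


if __name__ == "__main__":
    # With RETRY_TIMES = 4, retries are issued for attempts 0 to 3.
    for attempt in range(4):
        low, high = backoff_bounds(attempt)
        print(f"attempt {attempt}: {low:.1f}-{high:.1f} s")
```

With the defaults this gives 0.4-0.8 s, 0.8-1.2 s, 1.6-2.0 s and 3.2-3.6 s for attempts 0 to 3, so the final retry already exceeds the blocking guideline.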

Concurrency settings

Optimal concurrency depends on the pool type and the target's tolerance. Starting points:

Datacenter rotating: CONCURRENT_REQUESTS=64, CONCURRENT_REQUESTS_PER_DOMAIN=32. Fast pool, low latency.
Residential rotating: CONCURRENT_REQUESTS=32, CONCURRENT_REQUESTS_PER_DOMAIN=16. Per-request latency is higher, so expect lower throughput at these values; matching datacenter throughput would take more parallelism.
Mobile: CONCURRENT_REQUESTS=16. Priced by bandwidth; keep concurrency lower to avoid burning it on failed retries.
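
The starting points above can be kept as presets and applied per spider. The sketch below assumes hypothetical names (POOL_PRESETS, concurrency_settings, estimated_rps); the numbers are the ones listed above, and the throughput estimate is the usual rough bound of in-flight requests divided by average latency, not a measured figure.

```python
# Starting-point presets taken from the list above.
POOL_PRESETS = {
    "datacenter": {
        "CONCURRENT_REQUESTS": 64,
        "CONCURRENT_REQUESTS_PER_DOMAIN": 32,
    },
    "residential": {
        "CONCURRENT_REQUESTS": 32,
        "CONCURRENT_REQUESTS_PER_DOMAIN": 16,
    },
    "mobile": {"CONCURRENT_REQUESTS": 16},
}


def concurrency_settings(pool_type):
    """Return a copy of the preset for pool_type, suitable for custom_settings."""
    try:
        return dict(POOL_PRESETS[pool_type])
    except KeyError:
        raise ValueError(f"unknown pool type: {pool_type!r}") from None


def estimated_rps(concurrency, avg_latency_s):
    """Rough upper bound on requests per second at a given average latency."""
    if avg_latency_s <= 0:
        raise ValueError("avg_latency_s must be positive")
    return concurrency / avg_latency_s
```

In a spider this could be used as custom_settings = concurrency_settings("residential"). Feeding your own measured latency into estimated_rps shows how much parallelism a slower pool would need to match a faster one.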

See high-throughput tuning for the full OS-level and connection-pool discussion.