Proxy middleware
Create a middleware that sets request.meta['proxy'] on every outgoing request. The built-in HttpProxyMiddleware (priority 750) reads that key and applies it.
import os


class JustProxiesMiddleware:
    """Injects proxy URL into every request before HttpProxyMiddleware."""

    PROXY = os.getenv(
        "PROXY_URL",
        "http://USER:PASS@HOST:8080",  # placeholder credentials and host
    )

    def process_request(self, request, spider):
        # Don't override a proxy already set by the spider.
        request.meta.setdefault("proxy", self.PROXY)
DOWNLOADER_MIDDLEWARES = {
    # Run our middleware before HttpProxyMiddleware (default priority 750,
    # re-registered at 400 here).
    "myspider.middlewares.JustProxiesMiddleware": 350,
    "scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware": 400,
}
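Note that os.getenv reads the process environment, not Scrapy's settings, so a PROXY_URL entry in settings.py is not visible to the class above. The following is a minimal sketch of a variant that reads the setting first and falls back to the environment variable. The class name SettingsProxyMiddleware is hypothetical; from_crawler is the standard Scrapy hook for handing settings to a middleware.

```python
import os


class SettingsProxyMiddleware:
    """Variant of the middleware that takes the proxy URL from Scrapy settings."""

    def __init__(self, proxy_url):
        self.proxy_url = proxy_url

    @classmethod
    def from_crawler(cls, crawler):
        # Scrapy calls from_crawler() when it builds the middleware.
        # Prefer the PROXY_URL setting, then the PROXY_URL environment variable.
        proxy_url = crawler.settings.get("PROXY_URL") or os.getenv("PROXY_URL")
        return cls(proxy_url)

    def process_request(self, request, spider):
        # Leave the request untouched when no proxy is configured, and
        # never override a proxy the spider set explicitly.
        if self.proxy_url:
            request.meta.setdefault("proxy", self.proxy_url)
        return None
```

If this variant is used, it would be registered in DOWNLOADER_MIDDLEWARES in place of JustProxiesMiddleware, at the same priority.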
Use the PROXY_URL setting rather than hard-coding credentials in the middleware.
settings.py reference
Full settings block for a typical rotating-proxy scrape job. Adjust concurrency and retry values to your target's tolerance.
# ── Proxy ────────────────────────────────────────────────────────────────────
DOWNLOADER_MIDDLEWARES = {
    "myspider.middlewares.JustProxiesMiddleware": 350,
    "scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware": 400,
}
PROXY_URL = "http://USER:PASS@HOST:8080"  # placeholder credentials and host
# ── Retry ─────────────────────────────────────────────────────────────────────
RETRY_ENABLED = True
RETRY_TIMES = 4
RETRY_HTTP_CODES = [429, 500, 502, 503, 504, 522]
# ── Concurrency ───────────────────────────────────────────────────────────────
CONCURRENT_REQUESTS = 32
CONCURRENT_REQUESTS_PER_DOMAIN = 16
DOWNLOAD_DELAY = 0 # rotation handles IP distribution
RANDOMIZE_DOWNLOAD_DELAY = False
# ── Timeouts ─────────────────────────────────────────────────────────────────
DOWNLOAD_TIMEOUT = 20
# ── Robots ───────────────────────────────────────────────────────────────────
ROBOTSTXT_OBEY = False
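The block above hard-codes PROXY_URL for readability. To keep credentials out of version control, settings.py could load the value from the environment instead. This is a sketch only: it assumes the full proxy URL is exported as a PROXY_URL environment variable and that the middleware in use reads the PROXY_URL setting.

```python
import os

# Assumption: the full proxy URL (including credentials) is exported as the
# PROXY_URL environment variable, for example from a CI secret store.
# An empty string means "no proxy configured".
PROXY_URL = os.environ.get("PROXY_URL", "")
```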
Sticky sessions
For flows that need the same exit IP across several requests — login then dashboard, multi-step forms — generate a session token and embed it in the username. Set request.meta['proxy'] explicitly on those requests to override the default.
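import secrets

import scrapy


class ShopSpider(scrapy.Spider):
    name = "shop"

    def start_requests(self):
        # self.accounts is assumed to be a list of {"u": ..., "p": ...} dicts.
        for account in self.accounts:
            token = secrets.token_hex(8)
            proxy = f"http://USER-session-{token}:PASS@HOST:8080"
            # FormRequest URL-encodes the fields and sends them as a POST
            # with the form Content-Type header.
            yield scrapy.FormRequest(
                "https://shop.com/login",
                formdata={"user": account["u"], "pass": account["p"]},
                # A separate cookiejar per token keeps the accounts'
                # login cookies from overwriting each other.
                meta={"proxy": proxy, "token": token, "account": account,
                      "cookiejar": token},
                callback=self.after_login,
            )

    def after_login(self, response):
        # Same proxy token → same exit IP for the session.
        yield scrapy.Request(
            "https://shop.com/orders",
            meta={
                "proxy": response.meta["proxy"],
                "account": response.meta["account"],
                # The cookiejar key is not sticky; pass it on explicitly.
                "cookiejar": response.meta["cookiejar"],
            },
            callback=self.parse_orders,
        )

    def parse_orders(self, response):
        ...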
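The proxy URL built inline in start_requests can be factored into a small helper. The sketch below is hypothetical (the function name and parameters are not part of any library); it follows the USER-session-TOKEN username format shown above and percent-encodes the credentials so characters such as @ or : in a password do not break URL parsing.

```python
import secrets
from urllib.parse import quote


def sticky_proxy_url(user, password, host, port=8080, token=None):
    """Build a proxy URL whose username carries a session token.

    Returns (proxy_url, token) so the caller can reuse the token later.
    """
    token = token or secrets.token_hex(8)
    username = quote(f"{user}-session-{token}", safe="")
    secret = quote(password, safe="")
    return f"http://{username}:{secret}@{host}:{port}", token
```

A spider would call it once per account and carry the returned URL through request.meta['proxy'] on every request in that flow.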
Retry policy
Scrapy's built-in retry middleware handles the status codes in RETRY_HTTP_CODES. On rotating products each retry fetches a new exit IP automatically — no extra code needed. For finer control, add a custom RetryMiddleware subclass that implements backoff:
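import random
import time

from scrapy.downloadermiddlewares.retry import RetryMiddleware as _Base


# Register this class in DOWNLOADER_MIDDLEWARES at priority 550 and set
# "scrapy.downloadermiddlewares.retry.RetryMiddleware" to None so the two
# do not both retry.
class BackoffRetryMiddleware(_Base):
    def process_response(self, request, response, spider):
        if response.status in self.retry_http_codes:
            # retry_times is the number of retries already made for this request.
            attempt = request.meta.get("retry_times", 0)
            # Exponential backoff capped at 8 s, plus up to 0.4 s of jitter.
            wait = min(8, 0.4 * (2 ** attempt)) + random.uniform(0, 0.4)
            time.sleep(wait)
        return super().process_response(request, response, spider)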
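For waits longer than a short jitter, the delay can be awaited rather than slept. The sketch below reuses the same formula in a coroutine helper. The names backoff_delay and wait_before_retry are hypothetical, and awaiting asyncio.sleep inside Scrapy assumes the asyncio reactor is enabled through the TWISTED_REACTOR setting.

```python
import asyncio
import random


def backoff_delay(attempt, base=0.4, cap=8.0, jitter=0.4):
    """Exponential backoff capped at `cap` seconds, plus uniform jitter."""
    return min(cap, base * (2 ** attempt)) + random.uniform(0, jitter)


async def wait_before_retry(attempt):
    """Pause only the calling coroutine; other tasks keep running."""
    delay = backoff_delay(attempt)
    await asyncio.sleep(delay)
    return delay
```

Under those assumptions, an async def process_response in a downloader middleware could await wait_before_retry(attempt) before delegating to the parent class.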
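time.sleep blocks the reactor thread in Scrapy, so every in-flight request pauses while one retry waits. That is acceptable for small jitter values (< 2 s), but the formula above exceeds 2 s once retry_times reaches 3. For longer waits, use a non-blocking delay such as reactor.callLater to defer retries.
Concurrency settings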
Optimal concurrency depends on the pool type and the target's tolerance. Starting points:
- Datacenter rotating: CONCURRENT_REQUESTS=64, CONCURRENT_REQUESTS_PER_DOMAIN=32. Fast pool, low latency.
- Residential rotating: CONCURRENT_REQUESTS=32, CONCURRENT_REQUESTS_PER_DOMAIN=16. Higher per-request latency means you need more parallelism to match datacenter throughput.
- Mobile: CONCURRENT_REQUESTS=16. Priced by bandwidth; keep concurrency lower to avoid burning it on failed retries.
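The starting points above can be kept in one place and applied per spider. This is a sketch under assumptions: CONCURRENCY_PROFILES and concurrency_settings are hypothetical names, and the values are only the starting points listed here, not tuned recommendations.

```python
# Starting points from the list above; tune per target.
CONCURRENCY_PROFILES = {
    "datacenter": {"CONCURRENT_REQUESTS": 64, "CONCURRENT_REQUESTS_PER_DOMAIN": 32},
    "residential": {"CONCURRENT_REQUESTS": 32, "CONCURRENT_REQUESTS_PER_DOMAIN": 16},
    "mobile": {"CONCURRENT_REQUESTS": 16},
}


def concurrency_settings(pool_type):
    """Return a copy of the starting-point settings for a pool type."""
    try:
        return dict(CONCURRENCY_PROFILES[pool_type])
    except KeyError:
        raise ValueError(f"unknown pool type: {pool_type!r}") from None
```

A spider could then set custom_settings = concurrency_settings("residential") as a class attribute to override the project-wide values.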
See high-throughput tuning for the full OS-level and connection-pool discussion.