It is quite easy to write a basic crawler in Python:

import requests

# the whole "crawler", more or less: fetch a page and read the body
html = requests.get("https://example.com").text


The Internet is full of broken promises (cough, cough) and dead links, and in a production environment a bare requests.get() like that is not enough.
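The obvious per-call fix is to pass a timeout and wrap each request in error handling, roughly like this (the URL is just a placeholder):

import requests

url = "https://example.com/some/page"  # placeholder
try:
    # without an explicit timeout, a dead host can hang this call indefinitely
    response = requests.get(url, timeout=10)
    response.raise_for_status()
except requests.exceptions.RequestException as exc:
    print(f"request failed: {exc}")

That works, but sprinkling it around every call gets old quickly.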

After Googling & iterating a few times, I came up with this one.

import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

DEFAULT_TIMEOUT = 10


class TimeoutHTTPAdapter(HTTPAdapter):
    """Transport adapter that applies a default timeout to every request."""

    def __init__(self, *args, **kwargs):
        self.timeout = DEFAULT_TIMEOUT
        if "timeout" in kwargs:
            self.timeout = kwargs["timeout"]
            del kwargs["timeout"]
        super().__init__(*args, **kwargs)

    def send(self, request, **kwargs):
        # apply the default timeout only when the caller did not pass one
        timeout = kwargs.get("timeout")
        if timeout is None:
            kwargs["timeout"] = self.timeout
        return super().send(request, **kwargs)


def get_http_session(timeout=DEFAULT_TIMEOUT, retry_count=1):
    retry_strategy = Retry(
        total=retry_count,
        raise_on_redirect=True,
        status_forcelist=[429, 500, 502, 503, 504],
    )
    adapter = TimeoutHTTPAdapter(timeout=timeout, max_retries=retry_strategy)
    session = requests.Session()
    session.mount("https://", adapter)
    session.mount("http://", adapter)
    return session

get_http_session(timeout=3, retry_count=3).get("https://google.com")
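One nice property of the adapter: because send() only fills in the timeout when the caller did not pass one, individual requests can still override the session-wide default.

session = get_http_session(timeout=3, retry_count=3)
session.get("https://google.com")              # uses the 3-second default
session.get("https://google.com", timeout=30)  # explicit per-request timeout wins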


But in reality, there are a few cases where that timeout is not honored.

It turns out that during the retry process, if urllib3 receives a Retry-After response header from the server, it will honor that header and sleep for the advertised amount before the next attempt.

def sleep(self, response=None):
    """Sleep between retry attempts.

    This method will respect a server's Retry-After response header
    and sleep the duration of the time requested. If that is not present, it
    will use an exponential backoff. By default, the backoff factor is 0 and
    this method will return immediately.
    """

    slept = self.sleep_for_retry(response)
    if slept:
        return

    self._sleep_backoff()
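A quick way to watch this happen outside of a real crawl (a toy setup, not the actual crawl; the port is arbitrary): a throwaway local server that always answers 503 with Retry-After: 5, plus a timed request through the session helper from above.

import http.server
import threading
import time

import requests


class AlwaysBusyHandler(http.server.BaseHTTPRequestHandler):
    def do_GET(self):
        # always ask the client to come back in 5 seconds
        self.send_response(503)
        self.send_header("Retry-After", "5")
        self.end_headers()

    def log_message(self, *args):
        pass  # silence request logging


server = http.server.HTTPServer(("127.0.0.1", 8099), AlwaysBusyHandler)  # arbitrary port
threading.Thread(target=server.serve_forever, daemon=True).start()

# get_http_session is the helper defined above
session = get_http_session(timeout=1, retry_count=2)
start = time.monotonic()
try:
    session.get("http://127.0.0.1:8099/")
except requests.exceptions.RequestException:
    pass  # retries exhausted
print(f"took {time.monotonic() - start:.1f}s")  # roughly 10s, despite the 1-second timeout

Two retries, each preceded by the advertised five-second sleep, add up to about ten seconds of waiting that no timeout setting controls.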


In my case, the server was sending Retry-After: 600, telling my crawler to wait ten minutes before retrying, and urllib3 was honoring that header by default.

I fixed my code by adding respect_retry_after_header=False to the Retry instance.

retry_strategy = Retry(
    total=retry_count,
    raise_on_redirect=True,
    status_forcelist=[429, 500, 502, 503, 504],
    respect_retry_after_header=False,
)
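With that change in place, pointing the same helper at the toy 503 server from the earlier sketch shows the difference: the retries now happen back to back (backoff_factor defaults to 0) instead of sleeping on the server's schedule.

# assumes the toy Retry-After server from the earlier sketch is still running
session = get_http_session(timeout=1, retry_count=2)
start = time.monotonic()
try:
    session.get("http://127.0.0.1:8099/")
except requests.exceptions.RequestException:
    pass  # retries exhausted, as before
print(f"took {time.monotonic() - start:.1f}s")  # now fractions of a second, not ~10s

The worst case per URL is now bounded by roughly (retry_count + 1) attempts, each capped by the timeout, instead of whatever the server feels like asking for.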