It is quite easy to write a basic crawler with Python:

import requests
requests.get("https://google.com")

The Internet is full of broken promises (cough, cough) and dead links, so in a production environment a bare requests.get() is not enough: you need timeouts and retries.
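
One obvious stopgap is to pass a timeout on every call. A minimal sketch, using nothing beyond plain requests:

import requests

try:
    # Without a timeout, a dead host can make requests.get() block for a very long time.
    requests.get("https://google.com", timeout=10)  # seconds
except requests.exceptions.RequestException as exc:
    print(f"request failed: {exc}")

But you have to remember the timeout on every single call, and it does nothing about servers that are merely flaky and deserve a retry.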

After Googling & iterating a few times, I came up with this:

import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

DEFAULT_TIMEOUT = 10  # seconds


class TimeoutHTTPAdapter(HTTPAdapter):
    """An HTTPAdapter that applies a default timeout to every request."""

    def __init__(self, *args, **kwargs):
        self.timeout = DEFAULT_TIMEOUT
        if "timeout" in kwargs:
            self.timeout = kwargs["timeout"]
            del kwargs["timeout"]
        super().__init__(*args, **kwargs)

    def send(self, request, **kwargs):
        # Only fill in the default if the caller didn't pass an explicit timeout.
        timeout = kwargs.get("timeout")
        if timeout is None:
            kwargs["timeout"] = self.timeout
        return super().send(request, **kwargs)


def get_http_session(timeout=DEFAULT_TIMEOUT, retry_count=1):
    retry_strategy = Retry(
        total=retry_count,
        raise_on_redirect=True,
        status_forcelist=[429, 500, 502, 503, 504],
        allowed_methods=["HEAD", "GET", "OPTIONS"]
    )
    session = requests.Session()
    adapter = TimeoutHTTPAdapter(timeout=timeout, max_retries=retry_strategy)
    session.mount("https://", adapter)
    session.mount("http://", adapter)
    return session


get_http_session(timeout=3, retry_count=3).get("https://google.com")
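
Because send() only fills in the timeout when the caller didn't pass one, you can still override it per request. A small usage sketch (not part of the original snippet):

session = get_http_session(timeout=3, retry_count=3)
session.get("https://google.com")              # uses the 3 second default
session.get("https://google.com", timeout=10)  # an explicit timeout still wins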

But in reality, there are a few cases where urllib3's timeout is not honored: my crawler would sometimes hang for 600 seconds.

It turns out that during the retry process, if urllib3 receives a Retry-After response header from the server, it will try to honor that header and wait for the advertised amount of time. Here is the relevant sleep() method from urllib3's Retry class:

def sleep(self, response=None):
    """Sleep between retry attempts.

    This method will respect a server's ``Retry-After`` response header
    and sleep the duration of the time requested. If that is not present, it
    will use an exponential backoff. By default, the backoff factor is 0 and
    this method will return immediately.
    """

    if self.respect_retry_after_header and response:
        slept = self.sleep_for_retry(response)
        if slept:
            return

    self._sleep_backoff()

In my case, the server was answering with a Retry-After: 600 header, telling my crawler to wait 600 seconds before retrying, and urllib3 was honoring it by default.
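
A small reproduction sketch (the local test server and port here are made up, purely to simulate what the real server was doing):

import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

class AlwaysBusyHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Always answer 503 and ask the client to come back in 600 seconds.
        self.send_response(503)
        self.send_header("Retry-After", "600")
        self.end_headers()

server = HTTPServer(("127.0.0.1", 8765), AlwaysBusyHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()

# Even with timeout=3, each retry sleeps ~600 seconds before the next attempt:
# the timeout only covers connect/read, not the sleep between retries.
get_http_session(timeout=3, retry_count=3).get("http://127.0.0.1:8765/")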

I fixed my code by adding respect_retry_after_header=False to the Retry instance.

retry_strategy = Retry(
    total=retry_count,
    raise_on_redirect=True,
    status_forcelist=[429, 500, 502, 503, 504],
    respect_retry_after_header=False,
    allowed_methods=["HEAD", "GET", "OPTIONS"]
)
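
With that change the retries fail fast instead of napping. Again a sketch, reusing the made-up test server from above:

from requests.exceptions import RetryError

session = requests.Session()
session.mount("http://", TimeoutHTTPAdapter(timeout=3, max_retries=retry_strategy))

try:
    session.get("http://127.0.0.1:8765/")
except RetryError:
    # The retry budget is exhausted almost immediately (backoff_factor defaults to 0)
    # instead of sleeping 600 seconds per attempt.
    print("gave up quickly")

If you still want to be polite to the server, a small backoff_factor on the Retry gives you a predictable, bounded pause between attempts instead of whatever the server advertises.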