Additional requests

Some websites load certain information only after a page interaction, such as clicking a button, scrolling down, or hovering over an element. These interactions usually trigger background requests, whose responses are then rendered into the page using JavaScript.

To extract such data, reproduce those background requests with HttpClient: declare HttpClient among the inputs of your page object, and call one of its methods from an asynchronous field or method.

For example, simulating a click on a button that loads product images could look like:

import attrs
from web_poet import HttpClient, HttpError, field
from zyte_common_items import Image, ProductPage


@attrs.define
class MyProductPage(ProductPage):
    http: HttpClient

    @field
    def productId(self):
        return self.css("::attr(product-id)").get()

    @field
    async def images(self):
        url = f"https://api.example.com/v2/images?id={self.productId}"
        try:
            response = await self.http.get(url)
        except HttpError:
            return []
        else:
            urls = response.css(".product-images img::attr(src)").getall()
            return [Image(url=url) for url in urls]

Warning

HttpClient should only be used to handle the type of scenarios mentioned above. Using HttpClient for crawling logic would defeat the purpose of web-poet.

Making a request

HttpClient provides multiple asynchronous request methods, such as:

http = HttpClient()
response = await http.get(url)
response = await http.post(url, body=b"...")
response = await http.request(url, method="...")
response = await http.execute(HttpRequest(url, method="..."))

Request methods also accept custom headers and body, for example:

import json

response = await http.post(
    url,
    headers={"Content-Type": "application/json;charset=UTF-8"},
    body=json.dumps({"foo": "bar"}).encode("utf-8"),
)

Request methods may either raise an HttpError or return an HttpResponse. See Working with HttpResponse.

Note

HttpClient methods are expected to follow any redirection except when the request method is HEAD. This means that the HttpResponse that you get is already the end of any redirection trail.
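For example, a minimal sketch of inspecting the redirect that a HEAD request is left with (whether the target URL is exposed through a Location response header here is an assumption):

response = await http.request(url, method="HEAD")
if 300 <= response.status < 400:
    # HEAD requests are not followed, so the response itself may be a
    # redirect.
    redirect_target = response.headers.get("Location")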

Concurrent requests

To send multiple requests concurrently, use HttpClient.batch_execute, which accepts any number of HttpRequest instances as input, and returns HttpResponse instances (and HttpError instances when using return_exceptions=True) in the input order. For example:

import attrs
from web_poet import HttpClient, HttpError, HttpRequest, field
from zyte_common_items import ProductPage, ProductVariant


@attrs.define
class MyProductPage(ProductPage):
    http: HttpClient

    max_variants = 10

    @field
    def productId(self):
        return self.css("::attr(product-id)").get()

    @field
    async def variants(self):
        requests = [
            HttpRequest(f"https://example.com/api/variant/{self.productId}/{index}")
            for index in range(self.max_variants)
        ]
        responses = await self.http.batch_execute(*requests, return_exceptions=True)
        return [
            ProductVariant(color=response.css("::attr(color)").get())
            for response in responses
            if not isinstance(response, HttpError)
        ]

You can alternatively use asyncio together with HttpClient to handle multiple requests. For example, you can use asyncio.as_completed() to process the first response from a group of requests as early as possible.
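For example, a minimal sketch of the variants field above rewritten with asyncio.as_completed(), so that each response is processed as soon as it completes (the variant API URL is the same hypothetical one used above):

import asyncio

import attrs
from web_poet import HttpClient, HttpError, HttpRequest, field
from zyte_common_items import ProductPage, ProductVariant


@attrs.define
class MyProductPage(ProductPage):
    http: HttpClient

    max_variants = 10

    @field
    def productId(self):
        return self.css("::attr(product-id)").get()

    @field
    async def variants(self):
        # One coroutine per variant request; none of them has been awaited
        # yet.
        coroutines = [
            self.http.execute(
                HttpRequest(f"https://example.com/api/variant/{self.productId}/{index}")
            )
            for index in range(self.max_variants)
        ]
        variants = []
        # asyncio.as_completed() yields the requests as they finish, not in
        # the order they were created.
        for coroutine in asyncio.as_completed(coroutines):
            try:
                response = await coroutine
            except HttpError:
                continue
            variants.append(ProductVariant(color=response.css("::attr(color)").get()))
        return variants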

Error handling

HttpClient methods may raise an exception of type HttpError or a subclass.

If the response HTTP status code (response.status) is 400 or higher, HttpResponseError is raised. In case of connection errors, TLS errors and similar, HttpRequestError is raised.

HttpError provides access to the offending request, and HttpResponseError also provides access to the offending response.
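For example, a minimal sketch of telling the two subclasses apart (this assumes, as in the examples above, that both can be imported from web_poet, and that the offending request and response are exposed as request and response attributes):

from web_poet import HttpClient, HttpRequestError, HttpResponseError


async def fetch_or_none(http: HttpClient, url: str):
    try:
        return await http.get(url)
    except HttpResponseError as error:
        # The server answered, but with a 400+ status code.
        print(f"Got status {error.response.status} from {url}")
    except HttpRequestError:
        # The request never completed: connection error, TLS error, and
        # similar.
        print(f"Could not reach {url}")
    return None

Catching HttpResponseError before HttpRequestError keeps the branches unambiguous regardless of how the exception hierarchy is nested.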

Retrying additional requests

Input validation allows retrying all inputs from a page object. To retry only additional requests, you must handle retries on your own.

Your code is responsible for retrying additional requests until good response data is received, or until some maximum number of retries is exceeded.

It is up to you to decide what the maximum number of retries should be for a given additional request, based on your experience with the target website.

It is also up to you to decide how to implement retries of additional requests.

One option is the tenacity library. For example, to try an additional request 3 times before giving up:

import attrs
from tenacity import retry, stop_after_attempt
from web_poet import HttpClient, HttpError, field
from zyte_common_items import Image, ProductPage


@attrs.define
class MyProductPage(ProductPage):
    http: HttpClient

    @field
    def productId(self):
        return self.css("::attr(product-id)").get()

    @retry(stop=stop_after_attempt(3))
    async def get_images(self):
        return await self.http.get(f"https://api.example.com/v2/images?id={self.productId}")

    @field
    async def images(self):
        try:
            response = await self.get_images()
        except HttpError:
            return []
        else:
            urls = response.css(".product-images img::attr(src)").getall()
            return [Image(url=url) for url in urls]

If your additional request fails because data from the page object input is outdated or missing, do not try to reproduce the original input request as an additional request. Request fresh input instead.
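For example, assuming your framework honors web-poet retries, one way to request fresh input is to raise web_poet.exceptions.Retry when the input looks stale; the missing-product-ID check below is purely illustrative:

import attrs
from web_poet import HttpClient, HttpError, field
from web_poet.exceptions import Retry
from zyte_common_items import Image, ProductPage


@attrs.define
class MyProductPage(ProductPage):
    http: HttpClient

    @field
    def productId(self):
        return self.css("::attr(product-id)").get()

    @field
    async def images(self):
        if not self.productId:
            # The product ID comes from the page object input; if it is
            # missing, ask for fresh input rather than guessing a URL for
            # the additional request.
            raise Retry
        url = f"https://api.example.com/v2/images?id={self.productId}"
        try:
            response = await self.http.get(url)
        except HttpError:
            return []
        urls = response.css(".product-images img::attr(src)").getall()
        return [Image(url=url) for url in urls]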