Input validation

Sometimes the data that your page object receives as input may be invalid.

You can define a validate_input method in a page object class to check its input data and determine how to handle invalid input.

validate_input is called on the first execution of ItemPage.to_item() or the first access to a field. In both cases validation happens early; in the case of fields, it happens before field evaluation.

validate_input is a synchronous method that expects no parameters, and its outcome may be any of the following:

  • Return None, indicating that the input is valid.

  • Raise Retry, indicating that the input looks like the result of a temporary issue, and that trying to fetch similar input again may result in valid input.

    See also Retrying additional requests.

  • Raise UseFallback, indicating that the page object does not support the input, and that an alternative parsing implementation should be tried instead.

    For example, imagine you have a page object for website commerce.example, and that commerce.example is built with a popular e-commerce web framework. You could have a generic page object for products of websites using that framework, FrameworkProductPage, and a more specific page object for commerce.example, EcommerceExampleProductPage. If EcommerceExampleProductPage cannot parse a product page, but it looks like it might be a valid product page, you would raise UseFallback to try to parse the same product page with FrameworkProductPage, in case it works.

    Note

    web-poet does not dictate how to define or use an alternative parsing implementation as fallback. It is up to web-poet frameworks to choose how they implement fallback handling.

  • Return an item to override the output of the to_item method and of fields.

    When the input does not match the expected type of data, it is recommended to return an item that says so.

    For example, if your page object parses an e-commerce product, and the input data corresponds to a list of products rather than a single product, you could return a product item that flags itself as not being a valid product, such as Product(is_valid=False).

For example:

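def validate_input(self):
    # css() returns a selector list, never None, so test its truthiness.
    if self.css('.product-id::text'):
        return  # Valid input.
    if self.css('.http-503-error'):
        raise Retry()  # Looks like a temporary issue.
    if self.css('.product'):
        raise UseFallback()  # A product page, but not one we can parse.
    if self.css('.product-list'):
        return Product(is_valid=False)  # Not a single-product page.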
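Because web-poet leaves retry and fallback handling to frameworks, the following is only a minimal sketch of what a framework-side loop might look like. parse_with_fallback, SpecificPage, GenericPage, and their synchronous to_item methods are hypothetical stand-ins and not part of web-poet (the real to_item is asynchronous); the sketch uses the real exception classes when web-poet is installed and local stand-ins otherwise.

```python
try:
    from web_poet.exceptions import Retry, UseFallback
except ImportError:
    # Stand-ins so that the sketch also runs without web-poet installed.
    class Retry(ValueError):
        pass

    class UseFallback(ValueError):
        pass


def parse_with_fallback(fetch, page_classes, max_retries=3):
    """Hypothetical framework loop.

    fetch: callable that returns fresh input data.
    page_classes: page object classes, most specific first.
    """
    for _ in range(max_retries + 1):
        data = fetch()
        for page_cls in page_classes:
            try:
                return page_cls(data).to_item()
            except UseFallback:
                continue  # Try the next, more generic implementation.
            except Retry:
                break  # Fetch the input again.
        else:
            # Every implementation asked for a fallback.
            raise LookupError("no page object supports this input")
    raise RuntimeError("input still invalid after retries")


class SpecificPage:
    """Hypothetical site-specific page object."""

    def __init__(self, html):
        self.html = html

    def to_item(self):
        if not self.html:
            raise Retry()
        if "data-shop" not in self.html:
            raise UseFallback()
        return {"parsed_by": "specific"}


class GenericPage:
    """Hypothetical generic page object that accepts any input."""

    def __init__(self, html):
        self.html = html

    def to_item(self):
        return {"parsed_by": "generic"}


# The first fetch is empty (retry); the second is parsed by the fallback.
responses = iter(["", "<div class='product'>Mug</div>"])
item = parse_with_fallback(lambda: next(responses), [SpecificPage, GenericPage])
print(item)  # {'parsed_by': 'generic'}
```

Ordering page_classes from most to least specific mirrors the EcommerceExampleProductPage and FrameworkProductPage scenario described above; an actual framework may choose a different strategy.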

You may use fields in your implementation of the validate_input method, but only synchronous fields are supported. For example:

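from web_poet import WebPage, field
from web_poet.exceptions import UseFallback


class Page(WebPage[Item]):  # Item stands for your item class.
    def validate_input(self):
        if not self.name:
            raise UseFallback()

    @field(cached=True)
    def name(self):
        # get() returns the first matching text, or None if there is none.
        return self.css(".product-name ::text").get()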

Tip

Cache fields used in the validate_input method, so that when they are used from to_item they are not evaluated again.

If you implement a custom to_item method, you can enable input validation by decorating it with validates_input(), as long as your class inherits from ItemPage:

from web_poet import ItemPage, validates_input

class Page(ItemPage[Item]):
    @validates_input
    async def to_item(self):
        ...

Retry and UseFallback may also be raised from the to_item method. This could come in handy, for example, if after you execute some asynchronous code, such as an additional request, you find out that you need to retry the original request or use a fallback.

Input Validation Exceptions

exception web_poet.exceptions.PageObjectAction

Base class for exceptions that can be raised from a page object to indicate something to be done about that page object.

exception web_poet.exceptions.Retry

The page object found that the input data is partial or empty, and a request retry may provide better input.

exception web_poet.exceptions.UseFallback

The page object cannot extract data from the input, but the input seems valid, so an alternative data extraction implementation for the same item type may succeed.
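
As a rough illustration of why a shared base class is useful, the sketch below catches PageObjectAction once and reports which action was requested. It assumes that Retry and UseFallback derive from PageObjectAction, as the description of the base class suggests; requested_action and the two raise_* helpers are hypothetical names, and local stand-in classes are used when web-poet is not installed.

```python
try:
    from web_poet.exceptions import PageObjectAction, Retry, UseFallback
except ImportError:
    # Stand-ins mirroring the assumed hierarchy, so that the sketch
    # also runs without web-poet installed.
    class PageObjectAction(ValueError):
        pass

    class Retry(PageObjectAction):
        pass

    class UseFallback(PageObjectAction):
        pass


def requested_action(validate):
    """Run a validation callable and name the action it requests, if any."""
    try:
        validate()
    except PageObjectAction as action:
        # A single clause covers Retry, UseFallback, and other subclasses.
        return type(action).__name__
    return None


def raise_retry():
    raise Retry()


def raise_fallback():
    raise UseFallback()


print(requested_action(raise_retry))     # Retry
print(requested_action(raise_fallback))  # UseFallback
print(requested_action(lambda: None))    # None
```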