Sometimes the data that your page object receives as input may be invalid.
You can define a
validate_input method in a page object class to check its
input data and determine how to handle invalid input.
validate_input is called on the first execution of the to_item method
or the first access to a field. In both cases validation
happens early; in the case of fields, it happens before field evaluation.
validate_input is a synchronous method that expects no parameters, and its
outcome may be any of the following:
None, indicating that the input is valid.
Retry, indicating that the input looks like the result of a temporary issue, and that trying to fetch similar input again may result in valid input.
See also Retrying Additional Requests.
UseFallback, indicating that the page object does not support the input, and that an alternative parsing implementation should be tried instead.
For example, imagine you have a page object for website commerce.example, and that commerce.example is built with a popular e-commerce web framework. You could have a generic page object for products of websites using that framework, FrameworkProductPage, and a more specific page object for commerce.example, EcommerceExampleProductPage. If EcommerceExampleProductPage cannot parse a product page, but it looks like it might be a valid product page, you would raise UseFallback to try to parse the same product page with FrameworkProductPage, in case it works.
web-poet does not dictate how to define or use an alternative parsing implementation as fallback. It is up to web-poet frameworks to choose how they implement fallback handling.
An item, to override the output of the to_item method and of fields.
For input not matching the expected type of data, returning an item that indicates so is recommended.
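For example, if your page object parses an e-commerce product, and the input data corresponds to a list of products rather than a single product, you could return a product item that somehow indicates that it is not a valid product item, such as Product(is_valid=False).

The following implementation covers all four outcomes:

    def validate_input(self):
        # A product ID is present: the input is valid.
        if self.css('.product-id::text'):
            return
        # The page looks like a temporary server error: ask for a retry.
        if self.css('.http-503-error'):
            raise Retry()
        # It looks like a product page that this class cannot parse.
        if self.css('.product'):
            raise UseFallback()
        # A product list was received instead of a single product.
        if self.css('.product-list'):
            return Product(is_valid=False)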
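Since web-poet leaves the handling of these outcomes to frameworks, the following is only a rough sketch of how a caller could react to Retry and UseFallback. It uses the standard library only: the two exception classes are stand-ins for the ones in web_poet.exceptions, and parse_with_fallback, fetch and the parser functions are hypothetical names invented for this illustration, not part of web-poet or of any framework.

```python
class Retry(Exception):
    """Stand-in for web_poet.exceptions.Retry."""


class UseFallback(Exception):
    """Stand-in for web_poet.exceptions.UseFallback."""


def parse_with_fallback(fetch, parsers, max_retries=2):
    """Hypothetical driver: fetch input, then try each parser in order.

    fetch   -- callable that returns fresh input data
    parsers -- callables ordered from most specific to most generic
    """
    for _ in range(max_retries + 1):
        data = fetch()
        try:
            for parse in parsers:
                try:
                    return parse(data)
                except UseFallback:
                    continue  # try the next, more generic parser
            raise LookupError("no parser supports this input")
        except Retry:
            continue  # fetch the input again
    raise RuntimeError("input still invalid after retries")


# Simulated inputs: a temporary error first, then a page that only
# the generic parser understands.
responses = iter(["<503>", "<generic product>"])


def fetch():
    return next(responses)


def specific(data):
    if data == "<503>":
        raise Retry()
    raise UseFallback()


def generic(data):
    return {"source": "generic", "raw": data}


result = parse_with_fallback(fetch, [specific, generic])
```

In this run the first input triggers a retry, and the second input is rejected by the specific parser and handled by the generic one. A real framework would decide for itself how many retries to allow and how fallback page objects are registered.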
You may use fields in your implementation of the
validate_input method, but
only synchronous fields are supported. For example:
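    from web_poet import WebPage, field
    from web_poet.exceptions import UseFallback


    class Page(WebPage[Item]):
        def validate_input(self):
            # Reading the field here evaluates it during validation.
            if not self.name:
                raise UseFallback()

        @field(cached=True)
        def name(self):
            return self.css(".product-name ::text")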
Cache fields used in the validate_input method, so that when they are used
from to_item they are not evaluated again.
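To illustrate why caching matters here, the sketch below counts how often a value is extracted when it is read both during validation and while building the item. It is a toy model using only the standard library: functools.cached_property stands in for a cached web-poet field, and the Page class, its html attribute and the name_evaluations counter are assumptions made for the example.

```python
from functools import cached_property


class Page:
    """Toy page; cached_property stands in for a cached field."""

    def __init__(self, html):
        self.html = html
        self.name_evaluations = 0

    @cached_property
    def name(self):
        self.name_evaluations += 1  # count how often extraction runs
        return self.html.strip() or None

    def validate_input(self):
        if not self.name:  # first access: evaluates and caches the value
            raise ValueError("no product name")

    def to_item(self):
        self.validate_input()
        return {"name": self.name}  # second access: served from the cache


page = Page("  Chair  ")
item = page.to_item()
```

After this run, page.name_evaluations is 1 even though name was read twice. Without caching, the extraction would run once for validation and once more for the item.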
If you implement a custom to_item method, as long as you are inheriting from
ItemPage, you can enable input validation by decorating your custom
to_item method with validates_input:
    from web_poet import ItemPage, validates_input


    class Page(ItemPage[Item]):
        @validates_input
        async def to_item(self):
            ...
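As a mental model only, a decorator of this kind can be pictured as running validate_input before the wrapped method and returning the override item when there is one. The sketch below is not web-poet's implementation: validates_input_sketch and the Page class are hypothetical, and unlike the behavior described above, this toy version validates on every call rather than only on the first one.

```python
import asyncio
from functools import wraps


def validates_input_sketch(to_item):
    """Hypothetical decorator: run validate_input() before to_item()."""

    @wraps(to_item)
    async def wrapper(self):
        override = self.validate_input()
        if override is not None:
            return override  # the item from validate_input replaces the output
        return await to_item(self)

    return wrapper


class Page:
    def __init__(self, data):
        self.data = data

    def validate_input(self):
        # A product list instead of a single product: return a marker item.
        if "products" in self.data:
            return {"is_valid": False}
        return None

    @validates_input_sketch
    async def to_item(self):
        return {"is_valid": True, "name": self.data["name"]}


valid = asyncio.run(Page({"name": "Chair"}).to_item())
invalid = asyncio.run(Page({"products": []}).to_item())
```

Here valid comes from the custom to_item method, while invalid is the item returned by validate_input, so to_item's own body never runs for the product list.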
Retry and UseFallback may also be raised from the to_item method. This could
come in handy, for example, if after you execute some asynchronous code, such
as an additional request, you find out that you need to retry the original
request or use a fallback.
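The scenario of discovering a problem only after asynchronous work can be sketched as follows. Everything here is an assumption for illustration: Retry is a stand-in for web_poet.exceptions.Retry, _fetch_extra simulates an additional request with a queue of canned results, and run_with_retries is a hypothetical caller, since how a retry is carried out depends on the framework in use.

```python
import asyncio


class Retry(Exception):
    """Stand-in for web_poet.exceptions.Retry."""


class Page:
    def __init__(self, extra_responses):
        # Hypothetical queue of results for the simulated additional request.
        self._extra_responses = extra_responses

    async def _fetch_extra(self):
        await asyncio.sleep(0)  # placeholder for a real additional request
        return self._extra_responses.pop(0)

    async def to_item(self):
        extra = await self._fetch_extra()
        if extra is None:
            # The problem only becomes visible after the asynchronous call.
            raise Retry()
        return {"price": extra}


async def run_with_retries(make_page, max_retries=2):
    """Hypothetical caller: build a fresh page object for every attempt."""
    for _ in range(max_retries + 1):
        try:
            return await make_page().to_item()
        except Retry:
            continue
    raise RuntimeError("gave up after retries")


responses = [None, "9.99"]
item = asyncio.run(run_with_retries(lambda: Page(responses)))
```

The first attempt gets an empty result and raises Retry; the second attempt succeeds and produces the item.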
Input Validation Exceptions
- exception web_poet.exceptions.PageObjectAction
Base class for exceptions that can be raised from a page object to indicate something to be done about that page object.
- exception web_poet.exceptions.Retry
The page object found that the input data is partial or empty, and a request retry may provide better input.
- exception web_poet.exceptions.UseFallback
The page object cannot extract data from the input, but the input seems valid, so an alternative data extraction implementation for the same item type may succeed.