Input validation
Sometimes the data that your page object receives as input may be invalid.
You can define a validate_input
method in a page object class to check its
input data and determine how to handle invalid input.
validate_input
is called on the first execution of ItemPage.to_item()
or the first access to a field. In both cases validation
happens early; in the case of fields, it happens before field evaluation.
validate_input
is a synchronous method that expects no parameters, and its
outcome may be any of the following:
- Return None, indicating that the input is valid.
- Raise Retry, indicating that the input looks like the result of a temporary issue, and that trying to fetch similar input again may result in valid input.
  See also Retrying additional requests.
- Raise UseFallback, indicating that the page object does not support the input, and that an alternative parsing implementation should be tried instead.
  For example, imagine you have a page object for website commerce.example, and that commerce.example is built with a popular e-commerce web framework. You could have a generic page object for products of websites using that framework, FrameworkProductPage, and a more specific page object for commerce.example, EcommerceExampleProductPage. If EcommerceExampleProductPage cannot parse a product page, but it looks like it might be a valid product page, you would raise UseFallback to try to parse the same product page with FrameworkProductPage, in case it works.
  Note
  web-poet does not dictate how to define or use an alternative parsing implementation as fallback. It is up to web-poet frameworks to choose how they implement fallback handling.
- Return an item to override the output of the to_item method and of fields.
  For input not matching the expected type of data, returning an item that indicates so is recommended.
  For example, if your page object parses an e-commerce product, and the input data corresponds to a list of products rather than a single product, you could return a product item that somehow indicates that it is not a valid product item, such as Product(is_valid=False).
For example:
def validate_input(self):
    # css() returns a (possibly empty) selector list, never None,
    # so check its truthiness instead of comparing against None.
    if self.css('.product-id::text'):
        return
    if self.css('.http-503-error'):
        raise Retry()
    if self.css('.product'):
        raise UseFallback()
    if self.css('.product-list'):
        return Product(is_valid=False)
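To see the four outcomes from the calling side, here is a minimal sketch of how a framework might react to them. It is not web-poet's or any framework's actual implementation: the exception classes are local stand-ins for the ones in web_poet.exceptions, and parse_with_validation, SpecificPage, GenericPage, and the dictionary-based input are all hypothetical names invented for this illustration.

```python
# Stand-ins so the sketch runs without web-poet installed; in real code
# these exceptions come from web_poet.exceptions.
class Retry(Exception):
    pass


class UseFallback(Exception):
    pass


def parse_with_validation(page_classes, fetch, max_retries=2):
    """Hypothetical framework loop.

    page_classes: page object classes, most specific first.
    fetch: callable that returns fresh input data on every call.
    """
    for page_cls in page_classes:
        for _ in range(max_retries + 1):
            page = page_cls(fetch())
            try:
                override = page.validate_input()
            except Retry:
                continue  # fetch the input again and re-validate
            except UseFallback:
                break  # give up on this class, try the next one
            # None means the input is valid; anything else replaces to_item().
            return page.to_item() if override is None else override
        else:
            # The inner loop ended without break or return: retries ran out.
            raise RuntimeError("input still invalid after retries")
    raise RuntimeError("no page object supports this input")


class SpecificPage:
    def __init__(self, data):
        self.data = data

    def validate_input(self):
        if self.data.get("status") == 503:
            raise Retry()
        if "products" in self.data:
            return {"is_valid": False}  # wrong kind of page: override the item
        if "product_id" not in self.data:
            raise UseFallback()
        return None

    def to_item(self):
        return {"id": self.data["product_id"], "parser": "specific"}


class GenericPage:
    def __init__(self, data):
        self.data = data

    def validate_input(self):
        return None

    def to_item(self):
        return {"name": self.data.get("name"), "parser": "generic"}
```

In this sketch a Retry triggers a new fetch for the same page object class, while UseFallback moves on to the next class in the list; a real framework may order, limit, or combine these steps differently.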
You may use fields in your implementation of the validate_input
method, but
only synchronous fields are supported. For example:
class Page(WebPage[Item]):
    def validate_input(self):
        if not self.name:
            raise UseFallback()

    @field(cached=True)
    def name(self):
        return self.css(".product-name ::text")
Tip
Cache fields used in the validate_input
method, so that when they are used from to_item
they are not
evaluated again.
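The effect of caching can be illustrated without web-poet. The sketch below is an analogy under stated assumptions, not web-poet code: functools.cached_property plays the role of @field(cached=True), UseFallback is a local stand-in exception, and the name_evaluations counter and dictionary input are hypothetical additions that exist only to make the number of evaluations visible.

```python
from functools import cached_property


class UseFallback(Exception):
    """Stand-in for web_poet.exceptions.UseFallback."""


class Page:
    def __init__(self, data):
        self.data = data
        self.name_evaluations = 0  # how many times the field body has run

    def validate_input(self):
        # First access to the field: its body runs and the value is stored.
        if not self.name:
            raise UseFallback()

    @cached_property  # analogous in spirit to @field(cached=True)
    def name(self):
        self.name_evaluations += 1
        return self.data.get("name")

    def to_item(self):
        self.validate_input()
        # Second access: the stored value is reused, the body does not run.
        return {"name": self.name}


page = Page({"name": "Lamp"})
item = page.to_item()
```

After to_item() returns, page.name_evaluations is 1 even though the field was read twice. Without caching, the same extraction would run once for validation and once more for the item.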
If you implement a custom to_item method, as long as you are inheriting from ItemPage, you can enable input validation by decorating your custom to_item method with validates_input():
from web_poet import ItemPage, validates_input

class Page(ItemPage[Item]):
    @validates_input
    async def to_item(self):
        ...
Retry
and UseFallback
may also be raised from the to_item
method. This could come in handy, for
example, if after you execute some asynchronous code, such as an
additional request, you find out that you need to
retry the original request or use a fallback.
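As a rough illustration of raising Retry from to_item after asynchronous work, consider the sketch below. It makes several assumptions: Retry is a local stand-in for web_poet.exceptions.Retry, fetch_stock is a hypothetical coroutine that simulates an additional request, and Page is a plain class rather than a real page object.

```python
import asyncio


class Retry(Exception):
    """Stand-in for web_poet.exceptions.Retry."""


async def fetch_stock(product_id):
    # Hypothetical additional request; a real page object would send an
    # actual HTTP request here instead of returning canned data.
    await asyncio.sleep(0)
    if product_id == "busy":
        return {"error": "temporarily unavailable"}
    return {"stock": 3}


class Page:
    def __init__(self, product_id):
        self.product_id = product_id

    async def to_item(self):
        extra = await fetch_stock(self.product_id)
        if "error" in extra:
            # The problem only became visible after the additional request,
            # so it could not have been caught by validate_input.
            raise Retry()
        return {"id": self.product_id, "stock": extra["stock"]}
```

Calling asyncio.run(Page("busy").to_item()) raises the stand-in Retry, which the surrounding framework would then be expected to handle.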
Input Validation Exceptions
- exception web_poet.exceptions.PageObjectAction
Base class for exceptions that can be raised from a page object to indicate something to be done about that page object.