Page objects

A page object is a code wrapper for a webpage, or for a part of a webpage, that implements the logic to parse the raw webpage data into structured data.

To use web-poet, define page object classes for your target websites, and get the output item using a web-poet framework.

Defining a page object class

A page object class is a Python class that:

  • Subclasses ItemPage.

  • Declares typed input parameters in its __init__ method.

  • Uses fields.

    Alternatively, you can implement a to_item method, which can be synchronous or asynchronous, and returns the webpage content as an item.

For example:

from web_poet import HttpResponse, ItemPage, field


class FooPage(ItemPage[MyItem]):
    def __init__(self, response: HttpResponse):
        self.response = response

    @field
    def foo(self) -> str:
        return self.response.css(".foo").get()

Note

MyItem in the code examples of this page is a placeholder for an item class.

Minimizing boilerplate

There are a few ways for you to minimize boilerplate when defining a page object class.

For example, you can use attrs to remove the need for a custom __init__ method:

from attrs import define

from web_poet import HttpResponse, ItemPage, field


@define
class FooPage(ItemPage[MyItem]):
    response: HttpResponse

    @field
    def foo(self) -> str:
        return self.response.css(".foo").get()

If your page object class needs HttpResponse as input, there is also WebPage, an ItemPage subclass that declares an HttpResponse input and provides helper methods to use it:

from web_poet import WebPage, field


class FooPage(WebPage[MyItem]):
    @field
    def foo(self) -> str:
        return self.css(".foo").get()

Getting the output item

You should add your page object classes to a page object registry, e.g. by decorating them with handle_urls():

from web_poet import WebPage, field, handle_urls


@handle_urls("example.com")
class FooPage(WebPage[MyItem]):
    @field
    def foo(self) -> str:
        return self.css(".foo").get()

Then, provided your page object class code is imported (see consume_modules()), your framework can build the output item after you provide the target URL and the desired output item class, as shown in the tutorial.

Your framework chooses the right page object class based on your input parameters, downloads the required data, builds a page object, and calls the to_item method of that page object.

Note that, while a generic type like dict can serve as an output item, using more specific item classes is recommended. That way, you can use different page object classes, with different output items, for the same website.

Getting a page object

Alternatively, frameworks can return a page object instead of an item, and you can call to_item yourself.

However, there are drawbacks to this approach:

  • to_item can be synchronous or asynchronous, so you need to use ensure_awaitable():

    from web_poet.utils import ensure_awaitable
    
    item = await ensure_awaitable(foo_page.to_item())
    
  • to_item may raise certain exceptions, like Retry or UseFallback, which, depending on your framework, may not be handled automatically when getting a page object instead of an item.

Building a page object manually

It is possible to create a page object from a page object class by passing its inputs as parameters. For example, to manually create an instance of the FooPage page object class defined above:

from web_poet import HttpResponse

# FooPage is the page object class defined earlier on this page.
foo_page = FooPage(
    response=HttpResponse(
        "https://example.com",
        b"<!DOCTYPE html>\n<title>Foo</title>",
    ),
)

However, your code will break if the page object class changes its inputs. Building page objects using frameworks prevents that.