Page objects
A page object is a code wrapper for a webpage, or for a part of a webpage, that implements the logic to parse the raw webpage data into structured data.
To use web-poet, define page object classes for your target websites, and get the output item using a web-poet framework.
Defining a page object class
A page object class is a Python class that:
- Subclasses ItemPage.
- Declares typed input parameters in its __init__ method.
- Uses fields. Alternatively, you can implement a to_item method, which can be synchronous or asynchronous, and returns the webpage content as an item.
For example:
from web_poet import HttpResponse, ItemPage, field


class FooPage(ItemPage[MyItem]):
    def __init__(self, response: HttpResponse):
        self.response = response

    @field
    def foo(self) -> str:
        return self.response.css(".foo").get()
Note: MyItem in the code examples of this page is a placeholder for an item class.
Minimizing boilerplate
There are a few ways for you to minimize boilerplate when defining a page object class.
For example, you can use attrs to remove the need for a custom __init__ method:
from attrs import define
from web_poet import HttpResponse, ItemPage, field


@define
class FooPage(ItemPage[MyItem]):
    response: HttpResponse

    @field
    def foo(self) -> str:
        return self.response.css(".foo").get()
If your page object class needs HttpResponse as input, there is also WebPage, an ItemPage subclass that declares an HttpResponse input and provides helper methods to use it:
from web_poet import WebPage, field


class FooPage(WebPage[MyItem]):
    @field
    def foo(self) -> str:
        return self.css(".foo").get()
Getting the output item
You should include your page object classes in a page object registry, e.g. by decorating them with handle_urls():
from web_poet import WebPage, field, handle_urls


@handle_urls("example.com")
class FooPage(WebPage[MyItem]):
    @field
    def foo(self) -> str:
        return self.css(".foo").get()
Then, provided your page object class code is imported (see consume_modules()), your framework can build the output item after you provide the target URL and the desired output item class, as shown in the tutorial.
Your framework chooses the right page object class based on your input parameters, downloads the required data, builds a page object, and calls the to_item method of that page object.
Note that, while a generic class such as dict can serve as the output item class, using less generic item classes is recommended. That way, you can use different page object classes, with different output items, for the same website.
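The selection step described above can be pictured with a toy registry keyed by domain and item class. This is only a standard-library sketch of the idea, not how web-poet or any framework implements it: Product, Review, the page classes, REGISTRY and get_item are all hypothetical names, and the page content is passed in directly instead of being downloaded.

```python
from dataclasses import dataclass
from urllib.parse import urlparse


@dataclass
class Product:  # hypothetical item class
    name: str


@dataclass
class Review:  # hypothetical item class
    text: str


class ProductPage:  # hypothetical page object class
    def __init__(self, html: str):
        self.html = html

    def to_item(self) -> Product:
        return Product(name=self.html.upper())


class ReviewPage:  # hypothetical page object class
    def __init__(self, html: str):
        self.html = html

    def to_item(self) -> Review:
        return Review(text=self.html.lower())


# Toy registry: the same domain maps to different page object classes
# depending on the requested item class.
REGISTRY = {
    ("example.com", Product): ProductPage,
    ("example.com", Review): ReviewPage,
}


def get_item(url: str, item_cls: type, html: str):
    """Pick a page object class for the URL and item class, then build the item."""
    domain = urlparse(url).netloc
    page_cls = REGISTRY[(domain, item_cls)]
    return page_cls(html).to_item()


product = get_item("https://example.com/p/1", Product, "Foo")
review = get_item("https://example.com/p/1", Review, "Foo")
```

Because the item class is part of the lookup key, two page object classes can coexist for the same website, which is not possible when every class outputs the same generic type.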
Getting a page object
Alternatively, frameworks can return a page object instead of an item, and you can call to_item yourself.
However, there are drawbacks to this approach:
- to_item can be synchronous or asynchronous, so you need to use ensure_awaitable():
    from web_poet.utils import ensure_awaitable
    item = await ensure_awaitable(foo_page.to_item())
- to_item may raise certain exceptions, like Retry or UseFallback, which, depending on your framework, may not be handled automatically when getting a page object instead of an item.
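To see why the first drawback matters, the following standard-library sketch handles both kinds of to_item with one helper. The resolve function is a simplified stand-in for the idea behind ensure_awaitable, not web-poet's implementation, and SyncPage and AsyncPage are hypothetical classes.

```python
import asyncio
import inspect


async def resolve(value):
    """Await value if it is awaitable, otherwise return it unchanged."""
    if inspect.isawaitable(value):
        return await value
    return value


class SyncPage:  # hypothetical page object with a synchronous to_item
    def to_item(self):
        return {"foo": "sync"}


class AsyncPage:  # hypothetical page object with an asynchronous to_item
    async def to_item(self):
        return {"foo": "async"}


async def main():
    # The caller does not need to know which kind of to_item it is calling.
    return [await resolve(page.to_item()) for page in (SyncPage(), AsyncPage())]


items = asyncio.run(main())
```

Without such a helper, awaiting the result of a synchronous to_item raises a TypeError, while skipping the await on an asynchronous one yields a coroutine object instead of the item.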
Building a page object manually
It is possible to create a page object from a page object class by passing its inputs as parameters. For example, to manually create an instance of the FooPage page object class defined above:
from web_poet import HttpResponse

foo_page = FooPage(
    response=HttpResponse(
        "https://example.com",
        b"<!DOCTYPE html>\n<title>Foo</title>",
    ),
)
However, your code will break if the page object class changes its inputs. Building page objects using frameworks prevents that.
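One way to picture why frameworks avoid that breakage: they can read the type hints of __init__ and supply each input by type, so the calling code never names the inputs. The sketch below uses only the standard library; Response, PageParams, build and the providers mapping are hypothetical, and real frameworks may resolve inputs differently.

```python
from typing import get_type_hints


class Response:  # hypothetical stand-in for an input such as HttpResponse
    def __init__(self, url: str, body: bytes):
        self.url = url
        self.body = body


class PageParams(dict):  # hypothetical second input type
    pass


class FooPage:  # hypothetical page object class with two typed inputs
    def __init__(self, response: Response, params: PageParams):
        self.response = response
        self.params = params


def build(page_cls, providers):
    """Create page_cls, supplying each __init__ parameter from its type hint."""
    hints = get_type_hints(page_cls.__init__)
    hints.pop("return", None)  # ignore a return annotation if present
    kwargs = {name: providers[hint]() for name, hint in hints.items()}
    return page_cls(**kwargs)


# Each provider knows how to produce one input type.
providers = {
    Response: lambda: Response("https://example.com", b"<title>Foo</title>"),
    PageParams: lambda: PageParams(page=1),
}
foo_page = build(FooPage, providers)
```

If FooPage later adds or drops an input, the build call stays the same as long as a provider exists for every declared type.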