Fields
Background
It is common for Page Objects not to put all the extraction code in the to_item() method, but to define a separate property or method for each attribute:
import attrs
from web_poet import ItemPage, HttpResponse


@attrs.define
class MyPage(ItemPage):
    response: HttpResponse

    @property
    def name(self):
        return self.response.css(".name").get()

    @property
    def price(self):
        return self.response.css(".price").get()

    def to_item(self) -> dict:
        return {
            'name': self.name,
            'price': self.price
        }
This approach has 2 main advantages:
Often the code looks cleaner this way and is easier to follow.
The resulting page object becomes more flexible and reusable: if not all the data extracted in the to_item() method is needed, the user can access the properties for individual attributes. This is more efficient than running to_item() and only using part of the result.
However, writing and maintaining the to_item() method can get tedious, especially if there are a lot of properties.
@field decorator
To aid writing Page Objects in this style, web-poet provides the @web_poet.field decorator:
import attrs
from web_poet import ItemPage, HttpResponse, field


@attrs.define
class MyPage(ItemPage):
    response: HttpResponse

    @field
    def name(self):
        return self.response.css(".name").get()

    @field
    def price(self):
        return self.response.css(".price").get()
ItemPage has a default to_item() implementation: it uses all the properties created with the @field decorator, and returns a dict with the result, where keys are method names and values are property values. In the example above, to_item() returns a {"name": ..., "price": ...} dict with the extracted data.
Methods annotated with the @field decorator become properties; for a page = MyPage(...) instance you can access them as page.name.
Note that the default ItemPage.to_item() implementation is an async def function, so make sure to await its result:
item = await page.to_item()
Asynchronous fields
The reason ItemPage provides an async to_item method by default is that both regular and async def fields are supported. For example, you might need to send additional requests to extract some of the attributes:
import attrs
from web_poet import ItemPage, HttpResponse, HttpClient, field


@attrs.define
class MyPage(ItemPage):
    response: HttpResponse
    http: HttpClient

    @field
    def name(self):
        return self.response.css(".name").get()

    @field
    async def price(self):
        resp = await self.http.get("...")
        return resp.json()['price']
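To make the mechanism more concrete, here is a minimal, standard-library-only sketch of how a default to_item() could collect both sync and async fields. It is an illustration under stated assumptions, not web-poet's implementation: simple_field, SimplePage and BookPage are hypothetical names, and the page takes a plain dict instead of an HttpResponse.

```python
import asyncio
import inspect


class simple_field(property):
    """Hypothetical marker: a property that to_item() should collect."""


class SimplePage:
    def __init__(self, data: dict):
        self.data = data

    async def to_item(self) -> dict:
        item = {}
        for name in dir(type(self)):
            # Looking the name up on the class returns the property object itself.
            if isinstance(getattr(type(self), name), simple_field):
                value = getattr(self, name)
                # Async fields return an awaitable; sync fields return the value.
                if inspect.isawaitable(value):
                    value = await value
                item[name] = value
        return item


class BookPage(SimplePage):
    @simple_field
    def name(self):
        return self.data["name"]

    @simple_field
    async def price(self):
        await asyncio.sleep(0)  # placeholder for an additional request
        return self.data["price"]


page = BookPage({"name": "Dune", "price": "9.99"})
item = asyncio.run(page.to_item())
```

Because this to_item() checks whether each value is awaitable, a field in the sketch can switch between def and async def without any change to to_item() itself.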
Using Page Objects with async fields
If you want to use a Page Object with async fields without calling its to_item method, make sure to await the async fields and not to await the sync ones:
page = MyPage(...)
name = page.name
price = await page.price
This is not ideal, because the code that uses a page object now has to know whether each field is sync or async. If a field needs to be changed from sync to async def (or the other way around), e.g. because of a website change, all the code that uses this page object must be updated.
One way to solve this is to always define all fields as async def. It works, but it makes the page objects harder to use in non-async environments.
Instead, you can use the ensure_awaitable() utility function when accessing the fields:
from web_poet.utils import ensure_awaitable
page = MyPage(...)
name = await ensure_awaitable(page.name)
price = await ensure_awaitable(page.price)
Now any field can be converted from sync to async, or the other way around, and the code would keep working.
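As a rough illustration of why this works, a helper with the behaviour described above could look like the sketch below. The name ensure_awaitable_sketch is hypothetical and the code is an assumption about the general technique (check for awaitability, then await), not a copy of web-poet's source.

```python
import asyncio
import inspect


async def ensure_awaitable_sketch(obj):
    """Return obj, awaiting it first if it is awaitable."""
    if inspect.isawaitable(obj):
        return await obj
    return obj


async def async_price():
    return "9.99"


async def main():
    name = await ensure_awaitable_sketch("Dune")          # plain value
    price = await ensure_awaitable_sketch(async_price())  # coroutine
    return name, price


result = asyncio.run(main())
```

The calling code always awaits the helper, so it does not need to know which kind of value it received.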
Field processors
Field values often need to be cleaned or processed with reusable functions. @field takes an optional out argument with a list of such functions, which are applied in order to the field value before it is returned:
import attrs
from web_poet import ItemPage, HttpResponse, field


def clean_tabs(s):
    return s.replace('\t', ' ')


def add_brand(s, page):
    return f"{page.brand} - {s}"


# attrs.define turns the annotated inputs into constructor arguments
@attrs.define
class MyPage(ItemPage):
    response: HttpResponse

    @field(out=[clean_tabs, str.strip, add_brand])
    def name(self):
        return self.response.css(".name ::text").get()

    @field(cached=True)
    def brand(self):
        return self.response.css(".brand ::text").get()
If a processor takes an argument named page, it will receive the page object instance in it, so that values of other fields can be used. You should enable caching for fields accessed in processors to avoid unnecessary recomputations. Be careful of circular references: accessing a field runs its processors, so if two fields reference each other, a RecursionError will be raised.
Note that while processors can be applied to async fields, they need to be sync functions themselves. This also means that only values of sync fields can be accessed in processors.
It’s also possible to implement field cleaning and processing in to_item, but in that case accessing a field directly will return the value without processing, so it’s preferable to use field processors instead.
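The processor chain described in this section can be pictured with a small standard-library sketch. It assumes processors run in list order and that the page is passed only to processors with a parameter named page; apply_processors and FakePage are hypothetical names used for illustration, not web-poet API.

```python
import inspect


def clean_tabs(s):
    return s.replace("\t", " ")


def add_brand(s, page):
    return f"{page.brand} - {s}"


def apply_processors(value, processors, page):
    """Apply each processor in order; pass `page` only to those that ask for it."""
    for processor in processors:
        try:
            params = inspect.signature(processor).parameters
        except (TypeError, ValueError):
            # Some built-in callables do not expose a signature.
            params = {}
        if "page" in params:
            value = processor(value, page=page)
        else:
            value = processor(value)
    return value


class FakePage:
    """Stand-in for a page object with a `brand` field."""
    brand = "Acme"


cleaned = apply_processors(
    "\tWidget\t ", [clean_tabs, str.strip, add_brand], FakePage()
)
```

Here the raw value first has its tabs replaced, is then stripped, and is finally prefixed with the brand taken from the page.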
Default processors
You can also define processors on the page level by defining a nested class named Processors:
import attrs
from web_poet import ItemPage, HttpResponse, field


def clean_tabs(s):
    return s.replace('\t', ' ')


@attrs.define
class MyPage(ItemPage):
    response: HttpResponse

    class Processors:
        name = [clean_tabs, str.strip]

    @field
    def name(self):
        return self.response.css(".name ::text").get()
If Processors contains an attribute with the same name as a field, its value will be used as processors for that field, unless processors are specified in the out argument for it.
You can also reuse and extend the processors defined in a base class by explicitly accessing or subclassing the Processors class:
import attrs
from web_poet import ItemPage, HttpResponse, field


def clean_tabs(s):
    return s.replace('\t', ' ')


@attrs.define
class MyPage(ItemPage):
    response: HttpResponse

    class Processors:
        name = [str.strip]

    @field
    def name(self):
        return self.response.css(".name ::text").get()


class MyPage2(MyPage):
    class Processors(MyPage.Processors):
        # name uses the processors in MyPage.Processors.name
        # description now also uses them and also clean_tabs
        description = MyPage.Processors.name + [clean_tabs]

    @field
    def description(self):
        return self.response.css(".description ::text").get()

    # brand uses the same processors as name
    @field(out=MyPage.Processors.name)
    def brand(self):
        return self.response.css(".brand ::text").get()
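The precedence rule described above (an explicit out argument wins, otherwise the matching Processors attribute is used, including inherited ones) can be sketched with plain classes. resolve_processors, BasePage and ChildPage are hypothetical names; this is a simplified model of the lookup, not web-poet's code.

```python
def clean_tabs(s):
    return s.replace("\t", " ")


def resolve_processors(page_cls, field_name, out=None):
    """Pick processors for a field: explicit `out` first, then Processors.<field_name>."""
    if out is not None:
        return out
    processors_cls = getattr(page_cls, "Processors", None)
    return getattr(processors_cls, field_name, [])


class BasePage:
    class Processors:
        name = [str.strip]


class ChildPage(BasePage):
    class Processors(BasePage.Processors):
        # Reuse the base processors and extend them for another field.
        description = BasePage.Processors.name + [clean_tabs]


name_procs = resolve_processors(ChildPage, "name")                # inherited
description_procs = resolve_processors(ChildPage, "description")  # extended
override_procs = resolve_processors(ChildPage, "name", out=[clean_tabs])
missing_procs = resolve_processors(ChildPage, "brand")            # none defined
```

Subclassing the nested Processors class is what makes the name entry visible on ChildPage without redefining it.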
Processors for nested fields
Some item fields contain nested items (e.g. a product can contain a list of variants), and it’s useful to have processors for fields of these nested items. You can use the same logic for them as for normal fields if you define an extractor class that produces these nested items. Such classes should inherit from Extractor. In the simplest cases you need to pass a selector to them:
import attrs
from parsel import Selector
from web_poet import Extractor, ItemPage, HttpResponse, field


@attrs.define
class MyPage(ItemPage):
    response: HttpResponse

    @field
    async def variants(self):
        variants = []
        for color_sel in self.response.css(".color"):
            variant = await VariantExtractor(color_sel).to_item()
            variants.append(variant)
        return variants


@attrs.define
class VariantExtractor(Extractor):
    sel: Selector

    @field(out=[str.strip])
    def color(self):
        # .get() returns a str, which is what str.strip expects
        return self.sel.css(".name::text").get()
In such cases you can also use SelectorExtractor as a shortcut that provides css() and xpath():
class VariantExtractor(SelectorExtractor):
    @field(out=[str.strip])
    def color(self):
        # .get() returns a str, which is what str.strip expects
        return self.css(".name::text").get()
You can also pass other data in addition to, or instead of, selectors, such as dictionaries with some data:
@attrs.define
class VariantExtractor(Extractor):
    variant_data: dict

    @field(out=[str.strip])
    def color(self):
        return self.variant_data["color"]
Item Classes
In all the previous examples, to_item methods return dict instances. It is common to use item classes (e.g. dataclasses or attrs instances) instead of unstructured dicts to hold the data:
import attrs
from web_poet import ItemPage, HttpResponse, validates_input


@attrs.define
class Product:
    name: str
    price: str


@attrs.define
class ProductPage(ItemPage):
    # ...

    @validates_input
    def to_item(self) -> Product:
        return Product(
            name=self.name,
            price=self.price
        )
web_poet.fields supports this by allowing you to parametrize ItemPage with an item class:
@attrs.define
class ProductPage(ItemPage[Product]):
    # ...
When ItemPage is parametrized with an item class, its to_item() method returns item instances instead of dict instances. In the example above, the ProductPage.to_item method returns Product instances.
Defining an item class may be overkill if you only have a single Page Object, but item classes are a great help when you need to extract data in the same format from multiple websites, or if you want to define the schema upfront.
Error prevention
Item classes play particularly well with the @field decorator, preventing some errors that can go unnoticed when results are plain dicts.
Consider the following badly written page object:
import attrs
from web_poet import ItemPage, HttpResponse, field


@attrs.define
class Product:
    name: str
    price: str


@attrs.define
class ProductPage(ItemPage[Product]):
    response: HttpResponse

    @field
    def nane(self):
        return self.response.css(".name").get()
Because the Product item class is used, the typo (“nane” instead of “name”) is detected at runtime: the creation of a Product instance fails with a TypeError because of the unexpected keyword argument “nane”.
After fixing it (renaming the “nane” method to “name”), another error is detected: the price argument is required, but there is no extraction method for this attribute, so Product.__init__ raises another TypeError, indicating that a required argument is missing.
Without an item class, none of these errors are detected.
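Both failure modes can be reproduced without web-poet by building the item directly from a dict of field values. The sketch below uses a standard-library dataclass in place of an attrs class (an assumption made to keep it dependency-free), and build is a hypothetical helper.

```python
from dataclasses import dataclass


@dataclass
class Product:
    name: str
    price: str


def build(values):
    """Try to build a Product; return the item or the error message."""
    try:
        return Product(**values)
    except TypeError as exc:
        return f"TypeError: {exc}"


typo_result = build({"nane": "Dune"})     # unexpected keyword argument
missing_result = build({"name": "Dune"})  # required argument is missing
dict_result = {"nane": "Dune"}            # a plain dict accepts the typo silently
```

The item class turns both mistakes into immediate errors, while the plain dict carries the misspelled key along unnoticed.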
Changing Item Class
Let’s say there is a Page Object implemented, which outputs some standard item. Maybe there is a library of such Page Objects available. But for a particular project we might want to output an item of a different type:
some attributes of the standard item might not be needed;
there might be a need to implement extra attributes, which are not available in the standard item;
names of attributes might be different.
There are a few ways to approach it. If items are very different, using the original Page Object as a dependency is a good approach:
import attrs
from my_library import FooPage, StandardItem
from web_poet import ItemPage, HttpResponse, field
from web_poet.utils import ensure_awaitable


@attrs.define
class CustomItem:
    new_name: str
    new_price: str


@attrs.define
class CustomFooPage(ItemPage[CustomItem]):
    response: HttpResponse
    standard: FooPage

    @field
    async def new_name(self):
        orig_name = await ensure_awaitable(self.standard.name)
        orig_brand = await ensure_awaitable(self.standard.brand)
        return f"{orig_brand}: {orig_name}"

    @field
    async def new_price(self):
        ...
However, if items are similar, and share many attributes, this approach could lead to boilerplate code. For example, you might be extending an item with a new field, and it’d be required to duplicate definitions for all other fields.
Instead of using dependency injection you can make your Page Object a subclass of the original Page Object; that’s a nice way to add a new field to the item:
import attrs
from my_library import FooPage, StandardItem
from web_poet import field, Returns


@attrs.define
class CustomItem(StandardItem):
    new_field: str


@attrs.define
class CustomFooPage(FooPage, Returns[CustomItem]):

    @field
    def new_field(self) -> str:
        # ...
Note how Returns is used as one of the base classes of CustomFooPage; it allows you to change the item class returned by a page object.
Removing fields (as well as renaming them) is a bit more tricky.
The caveat is that by default ItemPage uses all fields defined with @field to produce an item, passing all these values to the item’s __init__ method. So, if you follow the previous example and inherit from the “base”, “standard” Page Object, there could be a @field from the base class which is not present in CustomItem. It would still be passed to CustomItem.__init__, causing an exception.
One way to solve this is to make the original Page Object a dependency instead of inheriting from it, as explained in the beginning.
Alternatively, you can use the skip_nonitem_fields=True class argument; it tells to_item() to skip fields which are not defined in the item:
@attrs.define
class CustomItem:
    # let's pick only 1 attribute from StandardItem, nothing more
    name: str


class CustomFooPage(FooPage, Returns[CustomItem], skip_nonitem_fields=True):
    pass
Here, CustomFooPage.to_item only uses the name field of FooPage, ignoring all other fields defined in FooPage, because skip_nonitem_fields=True is passed and name is the only field CustomItem supports.
To recap:
Use Returns[NewItemType] to change the item class in a subclass.
Don’t use skip_nonitem_fields=True when your Page Object corresponds to an item exactly, or when you’re only adding fields. This is a safe approach, which allows typos in field names to be detected, even for optional fields.
Use skip_nonitem_fields=True when it’s possible for the Page Object to contain more fields than are defined in the item class, e.g. because the Page Object is inherited from some other base Page Object.
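The difference between the strict default and skip_nonitem_fields=True can be modelled in a few lines. The sketch assumes the item is built by passing field values as keyword arguments, as described above; build_item is a hypothetical helper and a dataclass stands in for the item class.

```python
from dataclasses import dataclass, fields


@dataclass
class CustomItem:
    name: str


def build_item(item_cls, field_values, skip_nonitem_fields=False):
    """Build an item, optionally dropping values the item class does not declare."""
    if skip_nonitem_fields:
        allowed = {f.name for f in fields(item_cls)}
        field_values = {k: v for k, v in field_values.items() if k in allowed}
    return item_cls(**field_values)


# Values that a base page object with `name` and `price` fields would produce.
page_fields = {"name": "Dune", "price": "9.99"}

item = build_item(CustomItem, page_fields, skip_nonitem_fields=True)

try:
    build_item(CustomItem, page_fields)  # strict mode: `price` is unexpected
    strict_failed = False
except TypeError:
    strict_failed = True
```

Strict mode is what catches misspelled field names, which is why skipping is best reserved for page objects that intentionally define more fields than the item.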
Caching
When writing extraction code for Page Objects, it’s common that several attributes reuse some computation. For example, you might need to do an additional request to get an API response, and then fill several attributes from this response:
import attrs
from web_poet import ItemPage, HttpResponse, HttpClient, validates_input


@attrs.define
class MyPage(ItemPage):
    response: HttpResponse
    http: HttpClient

    @validates_input
    async def to_item(self):
        api_url = self.response.css("...").get()
        # await the request first, then parse the JSON body of the response
        api_response = (await self.http.get(api_url)).json()
        return {
            'name': self.response.css(".name ::text").get(),
            'price': api_response["price"],
            'sku': api_response["sku"],
        }
When converting such Page Objects to use fields, be careful not to make an API call (or some other heavy computation) multiple times. You can do it by extracting the heavy operation to a method, and caching the results:
import attrs
from web_poet import ItemPage, HttpResponse, HttpClient, field, cached_method


@attrs.define
class MyPage(ItemPage):
    response: HttpResponse
    http: HttpClient

    @cached_method
    async def api_response(self):
        api_url = self.response.css("...").get()
        # await the request first, then parse the JSON body of the response
        return (await self.http.get(api_url)).json()

    @field
    def name(self):
        return self.response.css(".name ::text").get()

    @field
    async def price(self):
        api_response = await self.api_response()
        return api_response["price"]

    @field
    async def sku(self):
        api_response = await self.api_response()
        return api_response["sku"]
As you can see, web-poet provides the cached_method() decorator, which memoizes the method results. It supports both sync and async methods, i.e. you can use it on regular methods (def foo(self)) as well as on async methods (async def foo(self)).
The refactored example, with per-attribute fields, is more verbose than the original one, where a single to_item method is used. However, it has an advantage: if only a subset of attributes is needed, the Page Object can be used without doing unnecessary work. For example, if the user only needs the name field in the example above, no additional requests (API calls) are made.
Sometimes you might want to cache a @field, i.e. a property which computes an attribute of the final item. In such cases, use the @field(cached=True) decorator instead of @field.
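To show the effect of caching a shared async computation, here is a simplified stand-in written with the standard library only. cache_on_instance is a hypothetical decorator that stores the awaited result on the instance; it is not how cached_method() is implemented, and unlike a production-grade cache it does not deduplicate calls that are awaited concurrently.

```python
import asyncio
import functools


def cache_on_instance(method):
    """Hypothetical decorator: store the awaited result on the instance."""
    attr = f"_cached_{method.__name__}"

    @functools.wraps(method)
    async def wrapper(self):
        if not hasattr(self, attr):
            setattr(self, attr, await method(self))
        return getattr(self, attr)

    return wrapper


class Page:
    def __init__(self):
        self.api_calls = 0

    @cache_on_instance
    async def api_response(self):
        self.api_calls += 1
        await asyncio.sleep(0)  # placeholder for an HTTP request
        return {"price": "9.99", "sku": "A-1"}

    async def price(self):
        return (await self.api_response())["price"]

    async def sku(self):
        return (await self.api_response())["sku"]


async def main():
    page = Page()
    return await page.price(), await page.sku(), page.api_calls


price, sku, api_calls = asyncio.run(main())
```

Both attributes are served from a single simulated API call, and a caller that never touches price or sku triggers no call at all.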
cached_method vs lru_cache vs cached_property
If you’re an experienced Python developer, you might wonder why the cached_method() decorator is needed if Python already provides functools.lru_cache(). For example, one can write this:
from functools import lru_cache
from web_poet import ItemPage


class MyPage(ItemPage):
    # ...

    @lru_cache
    def heavy_method(self):
        # ...
Don’t do it! There are two issues with functools.lru_cache() which make it unsuitable here:
It doesn’t work properly on methods, because self is used as a part of the cache key. It means a reference to the instance is kept in the cache, so page objects are not deallocated while their entries remain cached, causing a memory leak.
functools.lru_cache() doesn’t work on async def methods (it caches the coroutine object rather than its result), so you can’t cache e.g. results of API calls using functools.lru_cache().
cached_method() solves both of these issues. You may also use functools.cached_property(), or an external package like async_property with async versions of the @property and @cached_property decorators; unlike functools.lru_cache(), they all work fine for this use case.
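Both problems with functools.lru_cache() can be observed with a short standard-library experiment. SyncPage and AsyncPage are hypothetical classes, and the memory behaviour shown relies on CPython's reference counting.

```python
import asyncio
import functools
import gc
import weakref


class SyncPage:
    @functools.lru_cache
    def heavy_method(self):
        return 42


page = SyncPage()
page.heavy_method()
page_ref = weakref.ref(page)
del page
gc.collect()
kept_alive = page_ref() is not None  # the cache still references the instance

SyncPage.heavy_method.cache_clear()
gc.collect()
freed_after_clear = page_ref() is None  # released once the cache entry is gone


class AsyncPage:
    @functools.lru_cache
    async def heavy_method(self):
        return 42


async def call_twice():
    page = AsyncPage()
    first = await page.heavy_method()
    try:
        # The cache hands back the same, already awaited coroutine object.
        await page.heavy_method()
    except RuntimeError as exc:
        return first, str(exc)
    return first, None


first, error = asyncio.run(call_twice())
```

The first half shows the instance outliving its last regular reference until the cache is cleared; the second half shows that the second await fails instead of returning a cached result.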
Exceptions caching
Note that exceptions are not cached: neither by cached_method(), nor by @field(cached=True), nor by functools.lru_cache(), nor by functools.cached_property().
This is usually not an issue, because an exception is normally propagated, so there are no duplicate calls anyway. But keep it in mind, just in case.
Field metadata
web-poet allows you to store arbitrary information for each field, using the meta keyword argument:
from web_poet import ItemPage, field


class MyPage(ItemPage):

    @field(meta={"expensive": True})
    async def my_field(self):
        ...
To retrieve this information, use web_poet.fields.get_fields_dict(); it returns a dictionary where keys are field names and values are web_poet.fields.FieldInfo instances:
from web_poet.fields import get_fields_dict
fields_dict = get_fields_dict(MyPage)
field_names = fields_dict.keys()
my_field_meta = fields_dict["my_field"].meta
print(field_names) # dict_keys(['my_field'])
print(my_field_meta) # {'expensive': True}
Input validation
Input validation, if used, happens before field evaluation, and it may override the values of fields, preventing field evaluation from ever happening. For example:
class Page(ItemPage[Item]):
    def validate_input(self):
        return Item(foo="bar")

    @field
    def foo(self):
        raise RuntimeError("This exception is never raised")


assert Page().foo == "bar"
Field evaluation may still happen for a field if the field is used in the implementation of the validate_input method. Note, however, that only synchronous fields can be used from the validate_input method.