Fields
A field is a read-only property in a page object class decorated with @field instead of @property.
Each field is named after a key of the item that the page object class returns. A field uses the inputs of its page object class to return the right value for the matching item key.
For example:
from typing import Optional

import attrs
from web_poet import ItemPage, HttpResponse, field


@attrs.define
class MyPage(ItemPage):
    response: HttpResponse

    @field
    def foo(self) -> Optional[str]:
        return self.response.css(".foo").get()
Synchronous and asynchronous fields
Fields can be either synchronous (def) or asynchronous (async def).
Asynchronous fields make sense, for example, when sending additional requests:
from typing import Optional

import attrs
from web_poet import ItemPage, HttpClient, HttpResponse, field


@attrs.define
class MyPage(ItemPage):
    response: HttpResponse
    http: HttpClient

    @field
    def name(self) -> Optional[str]:
        return self.response.css(".name").get()

    @field
    async def price(self) -> Optional[str]:
        resp = await self.http.get("...")
        return resp.json().get("price")
Unlike the values of synchronous fields, the values of asynchronous fields need to be awaited:
page = MyPage(...)
name = page.name
price = await page.price
Mixing synchronous and asynchronous fields can be messy:
You need to know whether a field is synchronous or asynchronous to write the right code to read its value.
If a field changes from synchronous to asynchronous or vice versa, calls that read the field need to be updated.
Changing from synchronous to asynchronous is sometimes necessary due to website changes (e.g. when additional requests are needed).
To address these issues, use ensure_awaitable() to read both synchronous and asynchronous fields with the same code:
from web_poet.utils import ensure_awaitable
page = MyPage(...)
name = await ensure_awaitable(page.name)
price = await ensure_awaitable(page.price)
Note
Using asynchronous fields only also works, but prevents accessing other fields from field processors.
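To make the idea behind ensure_awaitable() more concrete, here is a minimal standard-library sketch of a helper with similar behavior. The names ensure_awaitable_sketch, FakePage, and read_fields are hypothetical, and this is not web-poet's actual implementation; it only assumes that such a helper lets plain values be awaited in the same way as coroutines.

```python
import asyncio
import inspect


async def ensure_awaitable_sketch(value):
    """Return the value itself, awaiting it first if it is awaitable."""
    if inspect.isawaitable(value):
        return await value
    return value


class FakePage:
    """Hypothetical stand-in for a page object with one sync and one async field."""

    @property
    def name(self):
        return "Chair"

    @property
    def price(self):
        async def fetch_price():
            await asyncio.sleep(0)  # pretend to send an additional request
            return "9.99"

        return fetch_price()


async def read_fields(page):
    # The same calling code works for both kinds of field.
    name = await ensure_awaitable_sketch(page.name)
    price = await ensure_awaitable_sketch(page.price)
    return name, price
```

With this pattern, the calling code does not need to change if name later becomes asynchronous or price becomes synchronous.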
Inheritance
To create a page object class that is very similar to an existing one, subclassing the existing page object class is often a good way to maximize code reuse.
In a subclass of a page object class you can reimplement fields, add fields, remove fields, or rename fields.
Reimplementing a field
Reimplementing a field when subclassing a page object class should be straightforward:
import attrs
from web_poet import field
from web_poet.utils import ensure_awaitable

from my_library import BasePage


@attrs.define
class CustomPage(BasePage):
    @field
    async def foo(self) -> str:
        base_foo = await ensure_awaitable(super().foo)
        return f"{base_foo} (modified)"
Adding a field
To add a new field to a page object class when subclassing:
Define a new item class that includes the new field, for example a subclass of the item class returned by the original page object class.
In your new page object class, subclass both the original page object class and Returns, the latter including the new item class between brackets.
Implement the extraction code for the new field in the new page object class.
For example:
import attrs
from web_poet import field, Returns

from my_library import BasePage, BaseItem


@attrs.define
class CustomItem(BaseItem):
    new_field: str


@attrs.define
class CustomPage(BasePage, Returns[CustomItem]):
    @field
    def new_field(self) -> str:
        ...
Removing a field
To remove a field from a page object class when subclassing:
Define a new item class that defines all fields but the one being removed.
In your new page object class, subclass the original page object class, Returns with the new item class between brackets, and set skip_nonitem_fields=True.
When building an item, page object class fields without a matching item class field will now be ignored, rather than raising an exception.
Your new page object class will still define the field, but the resulting item will not.
For example:
import attrs
from web_poet import Returns

from my_library import BasePage


@attrs.define
class CustomItem:
    kept_field: str


@attrs.define
class CustomPage(BasePage, Returns[CustomItem], skip_nonitem_fields=True):
    pass
Alternatively, you can consider using a page object as input for removing fields. It is more verbose than subclassing, because you need to define every field in your page object class, but it can catch some mismatches between page object class fields and item class fields that would otherwise be hidden by skip_nonitem_fields.
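The effect of skip_nonitem_fields described above can be illustrated with a small standard-library sketch. It uses dataclasses instead of attrs to stay dependency-free, and build_item_sketch is a hypothetical helper, not web-poet's item-building code; it only assumes that extra field values either get dropped or cause an exception when the item is constructed.

```python
from dataclasses import dataclass, fields


@dataclass
class CustomItem:
    kept_field: str


def build_item_sketch(item_cls, field_values, skip_nonitem_fields=False):
    """Build an item from a dict of page object field values."""
    item_field_names = {f.name for f in fields(item_cls)}
    if skip_nonitem_fields:
        # Drop page object fields that the item class does not define.
        field_values = {
            name: value
            for name, value in field_values.items()
            if name in item_field_names
        }
    # Unknown keyword arguments make the dataclass constructor raise TypeError.
    return item_cls(**field_values)


page_field_values = {"kept_field": "kept", "removed_field": "dropped"}
item = build_item_sketch(CustomItem, page_field_values, skip_nonitem_fields=True)
```

Without skip_nonitem_fields=True, the same call raises TypeError in this sketch. This shows why the option can hide genuine mismatches between page object fields and item fields.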
Renaming a field
To rename a field from a page object class when subclassing:
Define a new item class that defines all fields, including the renamed field.
In your new page object class, subclass the original page object class, Returns with the new item class between brackets, and set skip_nonitem_fields=True.
When building an item, page object class fields without a matching item class field will now be ignored, rather than raising an exception.
Define a field for the new field name that returns the value from the old field name.
Your new page object class will still define the old field name, but the resulting item will not.
For example:
import attrs
from web_poet import Returns, field
from web_poet.utils import ensure_awaitable

from my_library import BasePage


@attrs.define
class CustomItem:
    new_field: str


@attrs.define
class CustomPage(BasePage, Returns[CustomItem], skip_nonitem_fields=True):
    @field
    async def new_field(self) -> str:
        # Await the old field so that its value, not a coroutine, is returned.
        return await ensure_awaitable(self.old_field)
Alternatively, you can consider using a page object as input for renaming fields. It is more verbose than subclassing, because you need to define every field in your page object class, but it can catch some mismatches between page object class fields and item class fields that would otherwise be hidden by skip_nonitem_fields.
Composition
There are two forms of composition that you can use when writing a page object: using a page object as input, and using a field mixin.
Using a page object as input
You can reuse a page object class from another page object class through composition instead of inheritance: declare the original page object class as a dependency of a brand new page object class that returns a brand new item class.
This is a good approach when you want to reuse code but the page object classes are very different, or when you want to remove or rename fields without relying on skip_nonitem_fields.
For example:
import attrs
from web_poet import ItemPage, field
from web_poet.utils import ensure_awaitable

from my_library import BasePage


@attrs.define
class CustomItem:
    name: str


@attrs.define
class CustomPage(ItemPage[CustomItem]):
    base: BasePage

    @field
    async def name(self) -> str:
        name = await ensure_awaitable(self.base.name)
        brand = await ensure_awaitable(self.base.brand)
        return f"{brand}: {name}"
Instead of a page object, it is possible to declare the item it returns as a dependency in your new page object class. For example:
import attrs
from web_poet import ItemPage, field

from my_library import BaseItem


@attrs.define
class CustomItem:
    name: str


@attrs.define
class CustomPage(ItemPage[CustomItem]):
    base: BaseItem

    @field
    def name(self) -> str:
        return f"{self.base.brand}: {self.base.name}"
This gives you the flexibility to use rules to set the page object class to use when building the item. Also, item fields can be read from synchronous methods even if the source page object fields were asynchronous.
On the other hand, all fields of the source page object class will always be called to build the entire item, which may be a waste of resources if you only need to access some of the item fields.
Field mixins
You can subclass web_poet.fields.FieldsMixin to create a mixin to reuse field definitions across multiple, otherwise-unrelated classes. For example:
import attrs
from web_poet import ItemPage, field
from web_poet.fields import FieldsMixin

from my_library import BaseItem1, BaseItem2


@attrs.define
class CustomItem:
    name: str


class NameMixin(FieldsMixin):
    @field
    def name(self) -> str:
        return f"{self.base.brand}: {self.base.name}"


@attrs.define
class CustomPage1(NameMixin, ItemPage[CustomItem]):
    base: BaseItem1


@attrs.define
class CustomPage2(NameMixin, ItemPage[CustomItem]):
    base: BaseItem2
Field processors
You often need to clean or process field values using reusable functions. @field takes an optional out argument with a list of such functions. They are applied to the field value before it is returned:
import attrs
from web_poet import ItemPage, HttpResponse, field


def clean_tabs(s: str) -> str:
    return s.replace('\t', ' ')


def add_brand(s: str, page: ItemPage) -> str:
    return f"{page.brand} - {s}"


@attrs.define
class MyPage(ItemPage):
    response: HttpResponse

    @field(out=[clean_tabs, str.strip, add_brand])
    def name(self) -> str:
        return self.response.css(".name ::text").get() or ""

    @field(cached=True)
    def brand(self) -> str:
        return self.response.css(".brand ::text").get() or ""
Accessing other fields from field processors
If a processor takes an argument named page, that argument will contain the page object instance. This allows processing a field differently based on the values of other fields.
Be careful of circular references. Accessing a field runs its processors; if two fields reference each other, RecursionError will be raised.
You should enable caching for fields accessed in processors, to avoid unnecessary recomputation.
Processors can be applied to asynchronous fields, but processor functions must be synchronous. As a result, only values of synchronous fields can be accessed from processors through the page argument.
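As a rough illustration of how a chain of processors with an optional page argument could be applied, consider the following standard-library sketch. The function apply_processors_sketch and the FakePage class are hypothetical, and web-poet's own dispatch logic may differ; the sketch only assumes that processors run in list order and receive page when their signature asks for it.

```python
import inspect


def apply_processors_sketch(value, processors, page):
    """Apply each processor in order, passing ``page`` only if it is requested."""
    for processor in processors:
        try:
            params = inspect.signature(processor).parameters
        except (TypeError, ValueError):
            # Some built-in callables do not expose a signature.
            params = {}
        if "page" in params:
            value = processor(value, page=page)
        else:
            value = processor(value)
    return value


def clean_tabs(s):
    return s.replace("\t", " ")


def add_brand(s, page):
    return f"{page.brand} - {s}"


class FakePage:
    """Hypothetical stand-in for a page object with a synchronous brand field."""

    brand = "Acme"


result = apply_processors_sketch(
    "\tChair\t", [clean_tabs, str.strip, add_brand], FakePage()
)
```

Because the processors here are plain synchronous functions, add_brand can only read values that are available without awaiting, which mirrors the restriction described above.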
Default processors
In addition to the out argument of @field, you can define processors at the page object class level by defining a nested class named Processors:
import attrs
from web_poet import ItemPage, HttpResponse, field


def clean_tabs(s: str) -> str:
    return s.replace('\t', ' ')


@attrs.define
class MyPage(ItemPage):
    response: HttpResponse

    class Processors:
        name = [clean_tabs, str.strip]

    @field
    def name(self) -> str:
        return self.response.css(".name ::text").get() or ""
If Processors contains an attribute with the same name as a field, the value of that attribute is used as a list of default processors for the field, to be used if the out argument of @field is not defined.
You can also reuse and extend the processors defined in a base class by explicitly accessing or subclassing the Processors class:
import attrs
from web_poet import ItemPage, HttpResponse, field


def clean_tabs(s: str) -> str:
    return s.replace('\t', ' ')


@attrs.define
class MyPage(ItemPage):
    response: HttpResponse

    class Processors:
        name = [str.strip]

    @field
    def name(self) -> str:
        return self.response.css(".name ::text").get() or ""


class MyPage2(MyPage):
    class Processors(MyPage.Processors):
        # name uses the processors in MyPage.Processors.name
        # description now also uses them and also clean_tabs
        description = MyPage.Processors.name + [clean_tabs]

    @field
    def description(self) -> str:
        return self.response.css(".description ::text").get() or ""

    # brand uses the same processors as name
    @field(out=MyPage.Processors.name)
    def brand(self) -> str:
        return self.response.css(".brand ::text").get() or ""
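The precedence between an explicit out argument and the nested Processors class, and the way a subclassed Processors class inherits defaults, can be sketched with plain Python attribute lookup. The helper resolve_processors_sketch and the Fake* classes are hypothetical and only model the rules stated above; treating out=None as "not defined" is an assumption of this sketch.

```python
def resolve_processors_sketch(page_cls, field_name, out=None):
    """Pick the processors for a field: explicit ``out`` first, then defaults."""
    if out is not None:
        return list(out)
    processors_cls = getattr(page_cls, "Processors", None)
    # Normal class attribute lookup also finds defaults inherited from a
    # parent Processors class.
    return list(getattr(processors_cls, field_name, []))


class FakePage:
    class Processors:
        name = [str.strip]


class FakeSubPage(FakePage):
    class Processors(FakePage.Processors):
        description = FakePage.Processors.name + [str.lower]


name_processors = resolve_processors_sketch(FakeSubPage, "name")
description_processors = resolve_processors_sketch(FakeSubPage, "description")
```

In this sketch, name keeps the inherited default, description gets the extended list, and any field given an explicit out ignores the defaults entirely.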
Processors for nested fields
Some item fields contain nested items (e.g. a product can contain a list of variants) and it’s useful to have processors for fields of these nested items.
You can use the same logic for them as for normal fields if you define an extractor class that produces these nested items. Such classes should inherit from Extractor.
In the simplest cases you need to pass a selector to them:
from typing import Any, Dict, List

import attrs
from parsel import Selector
from web_poet import Extractor, ItemPage, HttpResponse, field


@attrs.define
class MyPage(ItemPage):
    response: HttpResponse

    @field
    async def variants(self) -> List[Dict[str, Any]]:
        variants = []
        for color_sel in self.response.css(".color"):
            variant = await VariantExtractor(color_sel).to_item()
            variants.append(variant)
        return variants


@attrs.define
class VariantExtractor(Extractor):
    sel: Selector

    @field(out=[str.strip])
    def color(self) -> str:
        return self.sel.css(".name::text").get() or ""
In such cases you can also use SelectorExtractor as a shortcut that provides css() and xpath():
from web_poet import SelectorExtractor, field


class VariantExtractor(SelectorExtractor):
    @field(out=[str.strip])
    def color(self) -> str:
        return self.css(".name::text").get() or ""
You can also pass other data in addition to, or instead of, selectors, such as dictionaries with some data:
@attrs.define
class VariantExtractor(Extractor):
    variant_data: dict

    @field(out=[str.strip])
    def color(self) -> str:
        return self.variant_data.get("color") or ""
Field caching
When writing extraction code for Page Objects, it’s common that several attributes reuse some computation. For example, you might need to do an additional request to get an API response, and then fill several attributes from this response:
from typing import Dict, Optional

import attrs
from web_poet import ItemPage, HttpResponse, HttpClient, validates_input


@attrs.define
class MyPage(ItemPage):
    response: HttpResponse
    http: HttpClient

    @validates_input
    async def to_item(self) -> Dict[str, Optional[str]]:
        api_url = self.response.css("...").get()
        # Await the request first, then parse the response body as JSON.
        api_response = (await self.http.get(api_url)).json()
        return {
            'name': self.response.css(".name ::text").get(),
            'price': api_response.get("price"),
            'sku': api_response.get("sku"),
        }
When converting such Page Objects to use fields, be careful not to make an API call (or some other heavy computation) multiple times. You can do it by extracting the heavy operation to a method, and caching the results:
from typing import Dict

import attrs
from web_poet import ItemPage, HttpResponse, HttpClient, field, cached_method


@attrs.define
class MyPage(ItemPage):
    response: HttpResponse
    http: HttpClient

    @cached_method
    async def api_response(self) -> Dict[str, str]:
        api_url = self.response.css("...").get()
        # Await the request first, then parse the response body as JSON.
        return (await self.http.get(api_url)).json()

    @field
    def name(self) -> str:
        return self.response.css(".name ::text").get() or ""

    @field
    async def price(self) -> str:
        api_response = await self.api_response()
        return api_response.get("price") or ""

    @field
    async def sku(self) -> str:
        api_response = await self.api_response()
        return api_response.get("sku") or ""
As you can see, web-poet provides the cached_method() decorator, which memoizes the results of a method. It supports both synchronous and asynchronous methods, i.e. you can use it on regular methods (def foo(self)) as well as on async methods (async def foo(self)).
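To show what memoizing an asynchronous method per instance can look like, here is a minimal standard-library sketch. The decorator cached_async_method_sketch and the FakePage class are hypothetical and much simpler than cached_method(); for instance, this sketch only handles methods without arguments, and concurrent callers may each run the computation once before the first result is stored.

```python
import asyncio
import functools


def cached_async_method_sketch(method):
    """Memoize an argument-less async method on the instance itself."""
    cache_attr = f"_cached_{method.__name__}"

    @functools.wraps(method)
    async def wrapper(self):
        if cache_attr not in self.__dict__:
            # Only successful results are stored; exceptions propagate uncached.
            self.__dict__[cache_attr] = await method(self)
        return self.__dict__[cache_attr]

    return wrapper


class FakePage:
    """Hypothetical page object that counts its simulated API calls."""

    def __init__(self):
        self.api_calls = 0

    @cached_async_method_sketch
    async def api_response(self):
        self.api_calls += 1
        await asyncio.sleep(0)  # stands in for an HTTP request
        return {"price": "9.99", "sku": "A1"}

    async def price(self):
        return (await self.api_response())["price"]

    async def sku(self):
        return (await self.api_response())["sku"]


async def read_price_and_sku(page):
    return await page.price(), await page.sku()
```

Reading both price and sku triggers a single simulated API call. Since the cached value lives in the instance's own dictionary, it is released together with the instance.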
The refactored example, with per-attribute fields, is more verbose than the original one, where a single to_item method is used. However, it has an advantage: if only a subset of attributes is needed, the Page Object can be used without doing unnecessary work. For example, if the user only needs the name field in the example above, no additional requests (API calls) will be made.
Sometimes you might want to cache a @field, i.e. a property which computes an attribute of the final item. In such cases, use the @field(cached=True) decorator instead of @field.
cached_method vs lru_cache vs cached_property
If you’re an experienced Python developer, you might wonder why the cached_method() decorator is needed if Python already provides functools.lru_cache(). For example, one can write this:
from functools import lru_cache

from web_poet import ItemPage


class MyPage(ItemPage):
    # ...

    @lru_cache
    def heavy_method(self):
        ...
Don’t do it! There are two issues with functools.lru_cache(), which make it unsuitable here:
It doesn’t work properly on methods, because self is used as a part of the cache key. It means a reference to an instance is kept in the cache, and so created page objects are never deallocated, causing a memory leak.
functools.lru_cache() doesn’t work on async def methods, so you can’t cache e.g. results of API calls using functools.lru_cache().
cached_method() solves both of these issues. You may also use functools.cached_property(), or an external package like async_property with async versions of @property and @cached_property decorators; unlike functools.lru_cache(), they all work fine for this use case.
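The memory leak caused by functools.lru_cache() on methods can be observed with weak references. The following sketch uses only the standard library; the class and function names are hypothetical, and the result relies on CPython's reference counting.

```python
import gc
import weakref
from functools import cached_property, lru_cache


class LruCachedPage:
    @lru_cache
    def heavy_method(self):
        return 42


class CachedPropertyPage:
    @cached_property
    def heavy_value(self):
        return 42


def make_lru_page():
    page = LruCachedPage()
    page.heavy_method()  # stores ``page`` in the shared cache as part of the key
    return page


def make_cached_property_page():
    page = CachedPropertyPage()
    page.heavy_value  # stores the result in the instance's own __dict__
    return page


def instance_survives(factory):
    """Report whether an instance stays alive after its last visible reference is dropped."""
    page = factory()
    ref = weakref.ref(page)
    del page
    gc.collect()
    return ref() is not None
```

Here instance_survives(make_lru_page) reports that the instance is still alive, because the class-level cache holds a reference to it, while the cached_property variant is released as expected.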
Exception caching
Note that exceptions are not cached: neither by cached_method(), nor by @field(cached=True), nor by functools.lru_cache(), nor by functools.cached_property().
Usually this is not an issue, because an exception is usually propagated, so there are no duplicate calls anyway. But, just in case, keep this in mind.
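The behavior is easy to verify for functools.lru_cache() with a small standard-library sketch; count_calls is a hypothetical helper that counts how often the wrapped function body actually runs.

```python
from functools import lru_cache


def count_calls(attempts, fail):
    """Call a cached function ``attempts`` times and count real executions."""
    calls = []

    @lru_cache
    def compute():
        calls.append(1)
        if fail:
            raise ValueError("simulated failure")
        return "ok"

    for _ in range(attempts):
        try:
            compute()
        except ValueError:
            pass  # swallow the error so that the next attempt can run
    return len(calls)
```

When the function succeeds, the body runs once no matter how many times it is called; when it raises, the body runs again on every call, because no result was ever stored.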
Field metadata
web-poet lets you store arbitrary information for each field using the meta keyword argument:
from web_poet import ItemPage, field


class MyPage(ItemPage):
    @field(meta={"expensive": True})
    async def my_field(self):
        ...
To retrieve this information, use web_poet.fields.get_fields_dict(); it returns a dictionary, where keys are field names, and values are web_poet.fields.FieldInfo instances:
from web_poet.fields import get_fields_dict
fields_dict = get_fields_dict(MyPage)
field_names = fields_dict.keys()
my_field_meta = fields_dict["my_field"].meta
print(field_names) # dict_keys(['my_field'])
print(my_field_meta) # {'expensive': True}
Input validation
Input validation, if used, happens before field evaluation, and it may override the values of fields, preventing field evaluation from ever happening. For example:
import attrs
from web_poet import ItemPage, field


@attrs.define
class Item:
    foo: str


class Page(ItemPage[Item]):
    def validate_input(self) -> Item:
        return Item(foo="bar")

    @field
    def foo(self):
        raise RuntimeError("This exception is never raised")


assert Page().foo == "bar"
Field evaluation may still happen for a field if the field is used in the implementation of the validate_input method. Note, however, that only synchronous fields can be used from the validate_input method.