Fields

Background

It is common for Page Objects not to put all the extraction code into the to_item() method, but instead to define a property or method for each attribute:

import attrs
from web_poet import ItemPage, HttpResponse


@attrs.define
class MyPage(ItemPage):
    response: HttpResponse

    @property
    def name(self):
        return self.response.css(".name").get()

    @property
    def price(self):
        return self.response.css(".price").get()

    def to_item(self) -> dict:
        return {
            'name': self.name,
            'price': self.price
        }

This approach has 2 main advantages:

  1. The code often looks cleaner this way and is easier to follow.

  2. The resulting page object becomes more flexible and reusable: if not all the data extracted by the to_item() method is needed, the user can access the properties for individual attributes. That is more efficient than running to_item() and using only part of the result.

However, writing and maintaining the to_item() method can get tedious, especially if there are a lot of properties.

@field decorator

To aid writing Page Objects in this style, web-poet provides the @web_poet.field decorator:

import attrs
from web_poet import ItemPage, HttpResponse, field


@attrs.define
class MyPage(ItemPage):
    response: HttpResponse

    @field
    def name(self):
        return self.response.css(".name").get()

    @field
    def price(self):
        return self.response.css(".price").get()

ItemPage has a default to_item() implementation: it uses all the properties created with the @field decorator, and returns a dict with the result, where keys are method names, and values are property values. In the example above, to_item() returns a {"name": ..., "price": ...} dict with the extracted data.

Methods annotated with the @field decorator become properties; for a page = MyPage(...) instance you can access them as page.name.

It’s important to note that the default ItemPage.to_item() implementation is an async def function - make sure to await its result: item = await page.to_item()

Asynchronous fields

The reason ItemPage provides an async to_item method by default is that both regular and async def fields are supported.

For example, you might need to send additional requests to extract some of the attributes:

import attrs
from web_poet import ItemPage, HttpResponse, HttpClient, field


@attrs.define
class MyPage(ItemPage):
    response: HttpResponse
    http: HttpClient

    @field
    def name(self):
        return self.response.css(".name").get()

    @field
    async def price(self):
        resp = await self.http.get("...")
        return resp.json()['price']
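
How a single async to_item() can serve both sync and async fields can be sketched with the standard library alone. This is a simplified illustration, not web-poet’s implementation: sketch_field and SketchItemPage are hypothetical names, and the hard-coded return values stand in for real extraction code.

```python
import asyncio
import inspect


class sketch_field(property):
    """Hypothetical marker: a property that to_item() can discover."""


class SketchItemPage:
    """Hypothetical base class; web-poet's ItemPage is more elaborate."""

    async def to_item(self) -> dict:
        item = {}
        # Inspect the class, so descriptors are returned without being called.
        for name, attr in inspect.getmembers(type(self)):
            if isinstance(attr, sketch_field):
                value = getattr(self, name)
                # An async field yields a coroutine; a sync field a plain value.
                if inspect.isawaitable(value):
                    value = await value
                item[name] = value
        return item


class MyPage(SketchItemPage):
    @sketch_field
    def name(self):
        return "Widget"

    @sketch_field
    async def price(self):
        await asyncio.sleep(0)  # stands in for an additional request
        return "9.99"


item = asyncio.run(MyPage().to_item())
```

Because to_item() itself is a coroutine, it can await the async fields and use the sync ones directly, which is why the caller always has to await it.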

Using Page Objects with async fields

If you want to use a Page Object with async fields without calling its to_item method, make sure to await the field when needed, and not await it when that’s not needed:

page = MyPage(...)
name = page.name
price = await page.price

This is not ideal, because the code that uses a page object must now know whether each field is sync or async. If a field needs to change from sync to async def (or the other way around), e.g. because of a website change, all the code that uses this page object must be updated.

One way to solve this is to always define all fields as async def. It works, but it makes the page objects harder to use in non-async environments.

Alternatively, you can use the ensure_awaitable() utility function when accessing the fields:

from web_poet.utils import ensure_awaitable

page = MyPage(...)
name = await ensure_awaitable(page.name)
price = await ensure_awaitable(page.price)

Now any field can be converted from sync to async, or the other way around, and the code would keep working.
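
To make the idea concrete, here is a minimal standard-library sketch of what such a helper could look like. ensure_awaitable_sketch and the Page class are hypothetical names used only for illustration; the real utility lives in web_poet.utils and may differ in detail.

```python
import asyncio
import inspect


async def ensure_awaitable_sketch(obj):
    """Await obj if it is awaitable; otherwise return it unchanged."""
    if inspect.isawaitable(obj):
        return await obj
    return obj


class Page:
    @property
    def name(self):  # sync field: access returns the value
        return "Widget"

    @property
    async def price(self):  # async field: access returns a coroutine
        return "9.99"


async def main():
    page = Page()
    # The calling code looks the same for both kinds of field.
    name = await ensure_awaitable_sketch(page.name)
    price = await ensure_awaitable_sketch(page.price)
    return name, price


result = asyncio.run(main())
```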

Field processors

Field values often need to be cleaned or processed with reusable functions. @field takes an optional out argument with a list of such functions. They are applied to the field value, in order, before it is returned:

import attrs
from web_poet import ItemPage, HttpResponse, field

def clean_tabs(s):
    return s.replace('\t', ' ')

def add_brand(s, page):
    return f"{page.brand} - {s}"

@attrs.define
class MyPage(ItemPage):
    response: HttpResponse

    @field(out=[clean_tabs, str.strip, add_brand])
    def name(self):
        return self.response.css(".name ::text").get()

    @field(cached=True)
    def brand(self):
        return self.response.css(".brand ::text").get()

If a processor takes an argument named page, it receives the page object instance through that argument, so that values of other fields can be used. You should enable caching for fields accessed in processors to avoid unnecessary recomputation. Be careful with circular references: accessing a field runs its processors, so if two fields reference each other, a RecursionError is raised.

Note that while processors can be applied to async fields, they need to be sync functions themselves. This also means that only values of sync fields can be accessed in processors.

It’s also possible to implement field cleaning and processing in to_item but in that case accessing a field directly will return the value without processing, so it’s preferable to use field processors instead.
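
The processor chain described above can be illustrated with a small standard-library sketch. It is an assumption about the general mechanism, not web-poet’s code: apply_processors_sketch and FakePage are hypothetical, and the page argument is detected by inspecting each function’s signature.

```python
import inspect


def apply_processors_sketch(value, processors, page):
    """Run value through each processor in order."""
    for processor in processors:
        try:
            params = inspect.signature(processor).parameters
        except (TypeError, ValueError):
            params = {}  # some built-ins expose no signature
        if "page" in params:
            # Processors that ask for it also receive the page object.
            value = processor(value, page=page)
        else:
            value = processor(value)
    return value


def clean_tabs(s):
    return s.replace("\t", " ")


def add_brand(s, page):
    return f"{page.brand} - {s}"


class FakePage:
    brand = "Acme"


name = apply_processors_sketch(
    "\tWidget \t", [clean_tabs, str.strip, add_brand], FakePage()
)
```

The order matters: clean_tabs turns tabs into spaces, str.strip then removes them from both ends, and only then is the brand prepended.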

Default processors

You can also define processors on the page level by defining a nested class named Processors:

import attrs
from web_poet import ItemPage, HttpResponse, field

def clean_tabs(s):
    return s.replace('\t', ' ')

@attrs.define
class MyPage(ItemPage):
    response: HttpResponse

    class Processors:
        name = [clean_tabs, str.strip]

    @field
    def name(self):
        return self.response.css(".name ::text").get()

If Processors contains an attribute with the same name as a field, its value will be used as processors for that field unless the processors are specified in the out argument for it.

You can also reuse and extend the processors defined in a base class by explicitly accessing or subclassing the Processors class:

import attrs
from web_poet import ItemPage, HttpResponse, field

def clean_tabs(s):
    return s.replace('\t', ' ')

@attrs.define
class MyPage(ItemPage):
    response: HttpResponse

    class Processors:
        name = [str.strip]

    @field
    def name(self):
        return self.response.css(".name ::text").get()

class MyPage2(MyPage):
    class Processors(MyPage.Processors):
        # name uses the processors in MyPage.Processors.name
        # description now also uses them and also clean_tabs
        description = MyPage.Processors.name + [clean_tabs]

    @field
    def description(self):
        return self.response.css(".description ::text").get()

    # brand uses the same processors as name
    @field(out=MyPage.Processors.name)
    def brand(self):
        return self.response.css(".brand ::text").get()
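
Reusing processors this way relies only on ordinary Python class attributes. The following sketch uses hypothetical BaseProcessors and ChildProcessors classes, with no web-poet involved, to show what a subclass inherits and why the + operator is a safe way to extend a list.

```python
def clean_tabs(s):
    return s.replace("\t", " ")


class BaseProcessors:
    name = [str.strip]


class ChildProcessors(BaseProcessors):
    # "+" builds a new list, so BaseProcessors.name is left untouched.
    description = BaseProcessors.name + [clean_tabs]


inherited = ChildProcessors.name         # looked up on the base class
extended = ChildProcessors.description   # strip first, then clean_tabs
base_after = BaseProcessors.name         # still only str.strip
```

Extending the list in place instead (for example with append) would also change the processors of every page that uses the base class.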

Processors for nested fields

Some item fields contain nested items (e.g. a product can contain a list of variants) and it’s useful to have processors for fields of these nested items. You can use the same logic for them as for normal fields if you define an extractor class that produces these nested items. Such classes should inherit from Extractor. In the simplest cases you need to pass a selector to them:

import attrs
from parsel import Selector
from web_poet import Extractor, ItemPage, HttpResponse, field

@attrs.define
class MyPage(ItemPage):
    response: HttpResponse

    @field
    async def variants(self):
        variants = []
        for color_sel in self.response.css(".color"):
            variant = await VariantExtractor(color_sel).to_item()
            variants.append(variant)
        return variants

@attrs.define
class VariantExtractor(Extractor):
    sel: Selector

    @field(out=[str.strip])
    def color(self):
        # .get() returns the first match as a string, which str.strip can process
        return self.sel.css(".name::text").get()

In such cases you can also use SelectorExtractor as a shortcut that provides css() and xpath():

from web_poet import SelectorExtractor, field

class VariantExtractor(SelectorExtractor):
    @field(out=[str.strip])
    def color(self):
        return self.css(".name::text").get()

You can also pass other data in addition to, or instead of, selectors, such as dictionaries with some data:

import attrs
from web_poet import Extractor, field

@attrs.define
class VariantExtractor(Extractor):
    variant_data: dict

    @field(out=[str.strip])
    def color(self):
        return self.variant_data["color"]

Item Classes

In all the previous examples, to_item methods return dict instances. It is common to use item classes (e.g. dataclasses or attrs instances) instead of unstructured dicts to hold the data:

import attrs
from web_poet import ItemPage, HttpResponse, validates_input

@attrs.define
class Product:
    name: str
    price: str


@attrs.define
class ProductPage(ItemPage):
    # ...
    @validates_input
    def to_item(self) -> Product:
        return Product(
            name=self.name,
            price=self.price
        )

web_poet.fields supports this by letting you parametrize ItemPage with an item class:

@attrs.define
class ProductPage(ItemPage[Product]):
    # ...

When ItemPage is parametrized with an item class, its to_item() method starts to return item instances, instead of dict instances. In the example above ProductPage.to_item method returns Product instances.

Defining an item class may be overkill if you only have a single Page Object, but item classes are a great help when

  • you need to extract data in the same format from multiple websites, or

  • you want to define the schema upfront.

Error prevention

Item classes play particularly well with the @field decorator, preventing some errors that can occur when results are plain dicts.

Consider the following badly written page object:

import attrs
from web_poet import ItemPage, HttpResponse, field

@attrs.define
class Product:
    name: str
    price: str


@attrs.define
class ProductPage(ItemPage[Product]):
    response: HttpResponse

    @field
    def nane(self):
        return self.response.css(".name").get()

Because the Product item class is used, a typo (“nane” instead of “name”) is detected at runtime: the creation of a Product instance would fail with a TypeError, because of the unexpected keyword argument “nane”.

After fixing it (renaming “nane” method to “name”), another error is going to be detected: the price argument is required, but there is no extraction method for this attribute, so Product.__init__ will raise another TypeError, indicating that a required argument is missing.

Without an item class, none of these errors are detected.
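
Both failures come from the item class constructor itself, so they can be reproduced without web-poet. The sketch below uses a standard-library dataclass in place of the attrs class; build is a hypothetical helper that returns the exception instead of raising it.

```python
from dataclasses import dataclass


@dataclass
class Product:
    name: str
    price: str


def build(item_cls, values):
    """Try to create an item from field values; return the error on failure."""
    try:
        return item_cls(**values)
    except TypeError as exc:
        return exc


# A misspelled field name is rejected as an unexpected keyword argument.
typo = build(Product, {"nane": "Widget", "price": "9.99"})

# A field with no extraction method is reported as a missing argument.
missing = build(Product, {"name": "Widget"})

# A plain dict accepts the misspelled key without complaint.
as_dict = build(dict, {"nane": "Widget"})
```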

Changing Item Class

Suppose there is an existing Page Object which outputs some standard item; maybe there is a whole library of such Page Objects. For a particular project, however, we might want to output an item of a different type:

  • some attributes of the standard item might not be needed;

  • there might be a need to implement extra attributes, which are not available in the standard item;

  • names of attributes might be different.

There are a few ways to approach it. If items are very different, using the original Page Object as a dependency is a good approach:

import attrs
from my_library import FooPage, StandardItem
from web_poet import ItemPage, HttpResponse, field
from web_poet.utils import ensure_awaitable

@attrs.define
class CustomItem:
    new_name: str
    new_price: str

@attrs.define
class CustomFooPage(ItemPage[CustomItem]):
    response: HttpResponse
    standard: FooPage

    @field
    async def new_name(self):
        orig_name = await ensure_awaitable(self.standard.name)
        orig_brand = await ensure_awaitable(self.standard.brand)
        return f"{orig_brand}: {orig_name}"

    @field
    async def new_price(self):
        ...

However, if items are similar, and share many attributes, this approach could lead to boilerplate code. For example, you might be extending an item with a new field, and it’d be required to duplicate definitions for all other fields.

Instead of using dependency injection you can make your Page Object a subclass of the original Page Object; that’s a nice way to add a new field to the item:

import attrs
from my_library import FooPage, StandardItem
from web_poet import field, Returns

@attrs.define
class CustomItem(StandardItem):
    new_field: str

@attrs.define
class CustomFooPage(FooPage, Returns[CustomItem]):

    @field
    def new_field(self) -> str:
        # ...

Note how Returns is used as one of the base classes of CustomFooPage; it lets you change the item class returned by a page object.

Removing fields (as well as renaming) is a bit more tricky.

The caveat is that by default ItemPage uses all fields defined with @field to produce an item, passing all these values to the item’s __init__ method. So, if you follow the previous example and inherit from the “base”, “standard” Page Object, there could be a @field from the base class which is not present in CustomItem. It would still be passed to CustomItem.__init__, causing an exception.

One way to solve it is to make the original Page Object a dependency instead of inheriting from it, as explained in the beginning.

Alternatively, you can use the skip_nonitem_fields=True class argument, which tells to_item() to skip @fields that are not defined in the item:

@attrs.define
class CustomItem:
    # let's pick only 1 attribute from StandardItem, nothing more
    name: str

class CustomFooPage(FooPage, Returns[CustomItem], skip_nonitem_fields=True):
    pass

Here, CustomFooPage.to_item only uses the name field of FooPage, ignoring all other fields defined in FooPage, because skip_nonitem_fields=True is passed and name is the only field CustomItem supports.

To recap:

  • Use Returns[NewItemType] to change the item class in a subclass.

  • Don’t use skip_nonitem_fields=True when your Page Object corresponds to an item exactly, or when you’re only adding fields. This is the safe approach, which makes it possible to detect typos in field names, even for optional fields.

  • Use skip_nonitem_fields=True when it’s possible for the Page Object to contain more @fields than are defined in the item class, e.g. because the Page Object inherits from some other base Page Object.
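
The trade-off in this recap can be illustrated with a standard-library sketch. build_item_sketch is a hypothetical helper that mimics the described behaviour using a dataclass; it is not how web-poet implements skip_nonitem_fields.

```python
from dataclasses import dataclass, fields


@dataclass
class CustomItem:
    name: str


def build_item_sketch(item_cls, field_values, skip_nonitem_fields=False):
    """Create an item, optionally dropping values the item class lacks."""
    if skip_nonitem_fields:
        allowed = {f.name for f in fields(item_cls)}
        field_values = {k: v for k, v in field_values.items() if k in allowed}
    return item_cls(**field_values)


# Values produced by all the fields of a richer base page object.
page_fields = {"name": "Widget", "price": "9.99", "sku": "A1"}

# Lenient mode keeps only what CustomItem declares.
item = build_item_sketch(CustomItem, page_fields, skip_nonitem_fields=True)

# Strict mode (the default) fails on the extra fields.
try:
    build_item_sketch(CustomItem, page_fields)
    strict_error = None
except TypeError as exc:
    strict_error = exc
```

Strict mode is what catches typos in field names; lenient mode trades that check for the ability to reuse a page object that extracts more than the item needs.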

Caching

When writing extraction code for Page Objects, it’s common that several attributes reuse some computation. For example, you might need to do an additional request to get an API response, and then fill several attributes from this response:

import attrs
from web_poet import ItemPage, HttpResponse, HttpClient, validates_input

@attrs.define
class MyPage(ItemPage):
    response: HttpResponse
    http: HttpClient

    @validates_input
    async def to_item(self):
        api_url = self.response.css("...").get()
        # await the request first, then parse the response body
        api_response = (await self.http.get(api_url)).json()
        return {
            'name': self.response.css(".name ::text").get(),
            'price': api_response["price"],
            'sku': api_response["sku"],
        }

When converting such Page Objects to use fields, be careful not to make an API call (or some other heavy computation) multiple times. You can do it by extracting the heavy operation to a method, and caching the results:

import attrs
from web_poet import ItemPage, HttpResponse, HttpClient, field, cached_method

@attrs.define
class MyPage(ItemPage):
    response: HttpResponse
    http: HttpClient

    @cached_method
    async def api_response(self):
        api_url = self.response.css("...").get()
        # await the request first, then parse the response body
        return (await self.http.get(api_url)).json()

    @field
    def name(self):
        return self.response.css(".name ::text").get()

    @field
    async def price(self):
        api_response = await self.api_response()
        return api_response["price"]

    @field
    async def sku(self):
        api_response = await self.api_response()
        return api_response["sku"]

As you can see, web-poet provides the cached_method() decorator, which memoizes the results of a method. It supports both sync and async methods, i.e. you can use it on regular methods (def foo(self)) as well as on async methods (async def foo(self)).

The refactored example, with per-attribute fields, is more verbose than the original one, where a single to_item method is used. However, it has an advantage: if only a subset of attributes is needed, the Page Object can be used without doing unnecessary work. For example, if the user only needs the name field in the example above, no additional requests (API calls) are made.

Sometimes you might want to cache a @field, i.e. a property which computes an attribute of the final item. In such cases, use the @field(cached=True) decorator instead of @field.
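
The effect of caching a shared async computation can be demonstrated with the standard library. cached_async_sketch below is a hypothetical, deliberately simple per-instance cache, not web-poet’s cached_method(); for instance, it does not deduplicate calls that run concurrently.

```python
import asyncio
import functools


def cached_async_sketch(method):
    """Cache the result of an async method on the instance itself."""
    key = f"_cached_{method.__name__}"

    @functools.wraps(method)
    async def wrapper(self):
        if key not in self.__dict__:
            # Stored only on success, so a failed call is retried next time.
            self.__dict__[key] = await method(self)
        return self.__dict__[key]

    return wrapper


class Page:
    def __init__(self):
        self.api_calls = 0

    @cached_async_sketch
    async def api_response(self):
        self.api_calls += 1
        await asyncio.sleep(0)  # stands in for an additional request
        return {"price": "9.99", "sku": "A1"}

    async def price(self):
        return (await self.api_response())["price"]

    async def sku(self):
        return (await self.api_response())["sku"]


async def main():
    page = Page()
    return await page.price(), await page.sku(), page.api_calls


result = asyncio.run(main())
```

Both attributes are filled from one simulated API call, and the value lives on the instance, so it is released together with the page object.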

cached_method vs lru_cache vs cached_property

If you’re an experienced Python developer, you might wonder why the cached_method() decorator is needed when Python already provides functools.lru_cache(). For example, one can write this:

from functools import lru_cache
from web_poet import ItemPage

class MyPage(ItemPage):
    # ...
    @lru_cache
    def heavy_method(self):
        # ...

Don’t do it! There are two issues with functools.lru_cache(), which make it unsuitable here:

  1. It doesn’t work properly on methods, because self is used as a part of the cache key. It means a reference to an instance is kept in the cache, and so created page objects are never deallocated, causing a memory leak.

  2. functools.lru_cache() doesn’t work on async def methods: it caches the coroutine object rather than its result, and a coroutine can only be awaited once, so you can’t use it to cache e.g. the results of API calls.

cached_method() solves both of these issues. You may also use functools.cached_property(), or an external package like async_property with async versions of @property and @cached_property decorators; unlike functools.lru_cache(), they all work fine for this use case.
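
Both problems with functools.lru_cache() can be reproduced with the standard library alone. The Page and AsyncPage classes below are hypothetical stand-ins for page objects.

```python
import asyncio
import functools
import gc
import weakref


class Page:
    @functools.lru_cache(maxsize=None)
    def heavy_method(self):
        return 42


page = Page()
page.heavy_method()
page_ref = weakref.ref(page)
del page
gc.collect()
# The cache key still references the instance, so it was not deallocated.
leaked = page_ref() is not None


class AsyncPage:
    @functools.lru_cache(maxsize=None)
    async def fetch(self):
        return 1


async def await_twice():
    async_page = AsyncPage()
    first = await async_page.fetch()
    try:
        # The cache hands back the same, already awaited coroutine object.
        await async_page.fetch()
    except RuntimeError as exc:
        return first, type(exc).__name__
    return first, None


second_await = asyncio.run(await_twice())
```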

Exceptions caching

Note that exceptions are not cached - neither by cached_method(), nor by @field(cached=True), nor by functools.lru_cache(), nor by functools.cached_property().

Usually this is not an issue, because the exception propagates and so there are no duplicate calls anyway. But keep it in mind, just in case.
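
For functools.cached_property() this behaviour is easy to observe with the standard library. In the sketch below, the hypothetical flaky property fails on its first access and succeeds afterwards.

```python
import functools


class Page:
    def __init__(self):
        self.attempts = 0

    @functools.cached_property
    def flaky(self):
        self.attempts += 1
        if self.attempts == 1:
            raise ValueError("first attempt fails")
        return "ok"


page = Page()
try:
    page.flaky  # raises; nothing is stored in the cache
except ValueError:
    pass

value = page.flaky   # the method runs again and succeeds
again = page.flaky   # served from the cache, the method is not called
attempts = page.attempts
```

The method body runs twice in total: once for the failed access and once for the first successful one.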

Field metadata

web-poet lets you store arbitrary information for each field, using the meta keyword argument:

from web_poet import ItemPage, field

class MyPage(ItemPage):

    @field(meta={"expensive": True})
    async def my_field(self):
        ...

To retrieve this information, use web_poet.fields.get_fields_dict(); it returns a dictionary, where keys are field names, and values are web_poet.fields.FieldInfo instances.

from web_poet.fields import get_fields_dict

fields_dict = get_fields_dict(MyPage)
field_names = fields_dict.keys()
my_field_meta = fields_dict["my_field"].meta

print(field_names)  # dict_keys(['my_field'])
print(my_field_meta)  # {'expensive': True}

Input validation

Input validation, if used, happens before field evaluation, and it may override the values of fields, preventing field evaluation from ever happening. For example:

import attrs
from web_poet import ItemPage, field

@attrs.define
class Item:
    foo: str

class Page(ItemPage[Item]):
    def validate_input(self):
        return Item(foo="bar")

    @field
    def foo(self):
        raise RuntimeError("This exception is never raised")

assert Page().foo == "bar"

Field evaluation may still happen for a field if the field is used in the implementation of the validate_input method. Note, however, that only synchronous fields can be used from the validate_input method.