API reference

Page Inputs

class web_poet.page_inputs.browser.BrowserHtml[source]

Bases: SelectableMixin, str

HTML returned by a web browser, i.e. snapshot of the DOM tree in HTML format.

css(query) → SelectorList: A shortcut to .selector.css().

jmespath(query: str, **kwargs) → SelectorList: A shortcut to .selector.jmespath().

property selector: Selector: Cached instance of parsel.selector.Selector.

xpath(query, **kwargs) → SelectorList: A shortcut to .selector.xpath().

class web_poet.page_inputs.browser.BrowserResponse(url: str | _Url, html, *, status: int | None = None)[source]

Bases: SelectableMixin, UrlShortcutsMixin

Browser response: url, HTML and status code.

url should be the browser’s window.location, not the URL of the request, if possible.

html contains the HTML returned by the browser, i.e. a snapshot of DOM tree in HTML format.

The following is optional, since its availability depends on the source of the BrowserResponse:

status should represent the int status code of the HTTP response.

url: ResponseUrl

html: BrowserHtml

status: int | None

property text: str

HTML returned by the browser, identical to self.html.

Provided for compatibility with HttpResponse.

css(query) → SelectorList: A shortcut to .selector.css().

jmespath(query: str, **kwargs) → SelectorList: A shortcut to .selector.jmespath().

property selector: Selector: Cached instance of parsel.selector.Selector.

urljoin(url: str | RequestUrl | ResponseUrl) → RequestUrl

Return url as an absolute URL.

If url is relative, it is made absolute relative to the base URL of self.

xpath(query, **kwargs) → SelectorList: A shortcut to .selector.xpath().

class web_poet.page_inputs.client.HttpClient(request_downloader: RequestDownloaderT | None = None, *, save_responses: bool = False, return_only_saved_responses: bool = False, responses: Iterable[_SavedResponseData] | None = None)[source]

Async HTTP client to be used in Page Objects.

See Additional requests for the usage information.

HttpClient doesn’t make HTTP requests by itself. It uses either the request function assigned to the web_poet.request_downloader_var contextvar, or a function passed via the request_downloader argument of the __init__() method.

Either way, this function should be an async def function which receives an HttpRequest instance, and either returns an HttpResponse instance or raises a subclass of HttpError. You can read more in the Providing the Downloader documentation.
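
The lookup order described above can be sketched in plain Python. This is a minimal illustration that does not import web-poet: SketchClient, downloader_var and fake_downloader are hypothetical stand-ins for HttpClient, web_poet.request_downloader_var and a framework-provided downloader, and plain strings and dicts stand in for HttpRequest and HttpResponse.

```python
import asyncio
import contextvars

# Hypothetical stand-in for web_poet.request_downloader_var.
downloader_var = contextvars.ContextVar("downloader_var")


class SketchClient:
    """Hypothetical stand-in for HttpClient; it never performs I/O itself."""

    def __init__(self, request_downloader=None):
        self._explicit = request_downloader

    async def execute(self, request):
        # A downloader passed to __init__ is used if given;
        # otherwise the one stored in the contextvar is used.
        downloader = self._explicit or downloader_var.get()
        return await downloader(request)


async def fake_downloader(request):
    # A real downloader would perform the HTTP request here.
    return {"url": request, "status": 200}


async def main():
    downloader_var.set(fake_downloader)
    return await SketchClient().execute("https://example.com")


result = asyncio.run(main())
```

In this sketch the client only delegates, which is why a framework has to provide the downloader before any additional request is made.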

request() is a shortcut for creating an HttpRequest instance and executing that request.

HttpRequestError is raised for connection errors, connection and read timeouts, etc.

An HttpResponse instance is returned for successful responses in the 100-3xx status code range.

Otherwise, an exception of type HttpResponseError is raised.

Raising HttpResponseError can be suppressed for certain status codes using the allow_status parameter: a list of status code values for which HttpResponse should be returned instead of raising HttpResponseError.

There is a special “*” allow_status value which allows any status code.

There is no need to include 100-3xx status codes in allow_status, because HttpResponseError is not raised for them.

async get(url: str | _Url, *, headers: dict[str, str] | HttpRequestHeaders | None = None, allow_status: str | int | list[str | int] | None = None) → HttpResponse[source]: Similar to request() but performing a GET request.

async post(url: str | _Url, *, headers: dict[str, str] | HttpRequestHeaders | None = None, body: bytes | HttpRequestBody | None = None, allow_status: str | int | list[str | int] | None = None) → HttpResponse[source]: Similar to request() but performing a POST request.

async execute(request: HttpRequest, *, allow_status: str | int | list[str | int] | None = None) → HttpResponse[source]

Execute the specified HttpRequest instance using the request implementation configured in the HttpClient instance.

HttpRequestError is raised for connection errors, connection and read timeouts, etc.

An HttpResponse instance is returned for successful responses in the 100-3xx status code range.

Otherwise, an exception of type HttpResponseError is raised.

Raising HttpResponseError can be suppressed for certain status codes using the allow_status parameter: a list of status code values for which HttpResponse should be returned instead of raising HttpResponseError.

There is a special “*” allow_status value which allows any status code.

There is no need to include 100-3xx status codes in allow_status, because HttpResponseError is not raised for them.
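
The status-code rules above can be summarized with a small decision function. This is a sketch written from the description only, not web-poet's implementation; should_raise is a hypothetical helper, and it assumes that status codes given as strings are treated the same as integers.

```python
from typing import List, Optional, Union

AllowStatus = Optional[Union[str, int, List[Union[str, int]]]]


def should_raise(status: int, allow_status: AllowStatus = None) -> bool:
    """Return True if a response with this status would count as an error."""
    if status < 400:
        # 100-3xx responses are always returned, never raised.
        return False
    if allow_status is None:
        return True
    # Normalize a single value into a list, and every entry into a string.
    values = allow_status if isinstance(allow_status, list) else [allow_status]
    allowed = {str(value) for value in values}
    return "*" not in allowed and str(status) not in allowed
```

Under these assumptions, should_raise(404) is True, while should_raise(404, [404]) and should_raise(500, "*") are both False.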

batch_execute() is similar to execute(), but accepts a collection of HttpRequest instances that are executed as a batch.

The order of the returned HttpResponse instances corresponds to the order of the HttpRequest instances passed.

If executing any of the HttpRequest instances raises an exception, that exception is raised.

To prevent this, set the return_exceptions parameter to True. The actual exceptions are then returned alongside any successful HttpResponse instances, which makes it possible to salvage usable responses despite failures.

Like execute(), HttpResponseError will be raised for responses with status codes in the 400-5xx range. The allow_status parameter could be used the same way here to prevent these exceptions from being raised.

You can omit allow_status="*" if you’re passing return_exceptions=True. However, HttpResponseError instances are then returned in place of HttpResponse instances for the affected responses.

Lastly, an HttpRequestError may be raised in cases such as connection errors, connection and read timeouts, etc.
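
The effect of return_exceptions can be illustrated with asyncio.gather from the standard library, which offers a parameter of the same name and keeps results in argument order. This is a standalone sketch that does not use web-poet: fake_execute, batch and FakeResponseError are hypothetical stand-ins, and plain strings stand in for requests and responses.

```python
import asyncio


class FakeResponseError(Exception):
    """Hypothetical stand-in for HttpResponseError."""


async def fake_execute(url: str) -> str:
    # Pretend that URLs containing "missing" produce an error response.
    if "missing" in url:
        raise FakeResponseError(url)
    return f"response for {url}"


async def batch(*urls: str, return_exceptions: bool = False) -> list:
    # asyncio.gather returns results in the same order as its arguments.
    coros = [fake_execute(url) for url in urls]
    return await asyncio.gather(*coros, return_exceptions=return_exceptions)


results = asyncio.run(
    batch(
        "https://example.com/a",
        "https://example.com/missing",
        "https://example.com/b",
        return_exceptions=True,
    )
)
```

With return_exceptions=True, the second entry of results is the exception instance itself, while the first and third entries are the successful results.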

get_saved_responses() → Iterable[_SavedResponseData][source]: Return saved requests and responses.

class web_poet.page_inputs.http.HttpRequestBody[source]

Bases: bytes

A container for holding the raw HTTP request body in bytes format.

class web_poet.page_inputs.http.HttpResponseBody[source]

Bases: bytes

A container for holding the raw HTTP response body in bytes format.

bom_encoding() → str | None[source]: Returns the encoding from the byte order mark if present.

declared_encoding() → str | None[source]: Return the encoding specified in meta tags in the HTML body, or None if no suitable encoding was found.

json() → Any[source]: Deserialize a JSON document to a Python object.

class web_poet.page_inputs.http.HttpRequestHeaders[source]

Bases: _HttpHeaders

A container for holding the HTTP request headers.

It’s able to accept instantiation via an Iterable of Tuples:

>>> pairs = [("Content-Encoding", "gzip"), ("content-length", "648")]
>>> HttpRequestHeaders(pairs)
<HttpRequestHeaders('Content-Encoding': 'gzip', 'content-length': '648')>

It also accepts a mapping of key-value pairs:

>>> pairs = {"Content-Encoding": "gzip", "content-length": "648"}
>>> headers = HttpRequestHeaders(pairs)
>>> headers
<HttpRequestHeaders('Content-Encoding': 'gzip', 'content-length': '648')>

Note that this also supports case insensitive header-key lookups:

>>> headers.get("content-encoding")
'gzip'
>>> headers.get("Content-Length")
'648'

These are just a few of the functionalities it inherits from multidict.CIMultiDict. For more info on its other features, read the API spec of multidict.CIMultiDict.

classmethod from_bytes_dict(arg: _AnyStrDict, encoding: str = 'utf-8') → Self

An alternative constructor for instantiation where the header-value pairs could be in raw bytes form.

This supports multiple header values in the form of List[bytes] and Tuple[bytes], alongside a plain bytes value. A str value also works and does not break the decoding process.

By default, it converts the bytes value using “utf-8”. However, this can easily be overridden using the encoding parameter.

>>> raw_values = {
...     b"Content-Encoding": [b"gzip", b"br"],
...     b"Content-Type": [b"text/html"],
...     b"content-length": b"648",
... }
>>> headers = _HttpHeaders.from_bytes_dict(raw_values)
>>> headers
<_HttpHeaders('Content-Encoding': 'gzip', 'Content-Encoding': 'br', 'Content-Type': 'text/html', 'content-length': '648')>
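
The decoding step can be pictured with a plain-Python sketch. It is written from the description above and is not web-poet's implementation; decode_header_pairs is a hypothetical helper that returns a list of (name, value) string pairs instead of a headers object.

```python
from typing import Dict, List, Tuple, Union

RawValue = Union[bytes, str, List[Union[bytes, str]], Tuple[Union[bytes, str], ...]]


def decode_header_pairs(
    raw: Dict[Union[bytes, str], RawValue], encoding: str = "utf-8"
) -> List[Tuple[str, str]]:
    def to_str(item: Union[bytes, str]) -> str:
        # str values pass through unchanged; bytes are decoded.
        return item.decode(encoding) if isinstance(item, bytes) else item

    pairs = []
    for name, value in raw.items():
        # A single value is treated as a one-element list.
        values = value if isinstance(value, (list, tuple)) else [value]
        for item in values:
            pairs.append((to_str(name), to_str(item)))
    return pairs


pairs = decode_header_pairs(
    {
        b"Content-Encoding": [b"gzip", b"br"],
        b"Content-Type": [b"text/html"],
        b"content-length": b"648",
    }
)
```

Each value of a multi-valued header becomes its own pair, which matches the repeated 'Content-Encoding' entries in the output above.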

classmethod from_name_value_pairs(arg: list[dict]) → Self

An alternative constructor for instantiation using a List[Dict], where each dict stores the header name under the ‘name’ key and the header value under the ‘value’ key.

>>> pairs = [
...     {"name": "Content-Encoding", "value": "gzip"},
...     {"name": "content-length", "value": "648"}
... ]
>>> headers = _HttpHeaders.from_name_value_pairs(pairs)
>>> headers
<_HttpHeaders('Content-Encoding': 'gzip', 'content-length': '648')>

class web_poet.page_inputs.http.HttpResponseHeaders[source]

Bases: _HttpHeaders

A container for holding the HTTP response headers.

It’s able to accept instantiation via an Iterable of Tuples:

>>> pairs = [("Content-Encoding", "gzip"), ("content-length", "648")]
>>> HttpResponseHeaders(pairs)
<HttpResponseHeaders('Content-Encoding': 'gzip', 'content-length': '648')>

It also accepts a mapping of key-value pairs:

>>> pairs = {"Content-Encoding": "gzip", "content-length": "648"}
>>> headers = HttpResponseHeaders(pairs)
>>> headers
<HttpResponseHeaders('Content-Encoding': 'gzip', 'content-length': '648')>

Note that this also supports case insensitive header-key lookups:

>>> headers.get("content-encoding")
'gzip'
>>> headers.get("Content-Length")
'648'

These are just a few of the functionalities it inherits from multidict.CIMultiDict. For more info on its other features, read the API spec of multidict.CIMultiDict.

declared_encoding() → str | None[source]: Return the encoding detected from the Content-Type header, or None if no encoding is found.

classmethod from_bytes_dict(arg: _AnyStrDict, encoding: str = 'utf-8') → Self

An alternative constructor for instantiation where the header-value pairs could be in raw bytes form.

This supports multiple header values in the form of List[bytes] and Tuple[bytes], alongside a plain bytes value. A str value also works and does not break the decoding process.

By default, it converts the bytes value using “utf-8”. However, this can easily be overridden using the encoding parameter.

>>> raw_values = {
...     b"Content-Encoding": [b"gzip", b"br"],
...     b"Content-Type": [b"text/html"],
...     b"content-length": b"648",
... }
>>> headers = _HttpHeaders.from_bytes_dict(raw_values)
>>> headers
<_HttpHeaders('Content-Encoding': 'gzip', 'Content-Encoding': 'br', 'Content-Type': 'text/html', 'content-length': '648')>

classmethod from_name_value_pairs(arg: list[dict]) → Self

An alternative constructor for instantiation using a List[Dict], where each dict stores the header name under the ‘name’ key and the header value under the ‘value’ key.

>>> pairs = [
...     {"name": "Content-Encoding", "value": "gzip"},
...     {"name": "content-length", "value": "648"}
... ]
>>> headers = _HttpHeaders.from_name_value_pairs(pairs)
>>> headers
<_HttpHeaders('Content-Encoding': 'gzip', 'content-length': '648')>

class web_poet.page_inputs.http.HttpRequest(url: str | _Url, *, method: str = 'GET', headers=NOTHING, body=NOTHING)[source]

Bases: object

Represents a generic HTTP request used by other functionalities in web-poet like HttpClient.

Tip

To build a request to submit an HTML form, use the form2request library, which provides integration with web-poet.

url: RequestUrl

method: str

headers: HttpRequestHeaders

body: HttpRequestBody

urljoin(url: str | RequestUrl | ResponseUrl) → RequestUrl[source]

Return url as an absolute URL.

If url is relative, it is made absolute relative to url.
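
As a rough illustration of what making a URL absolute means, the standard library's urllib.parse.urljoin resolves relative references in the same general way. This sketch does not call web-poet, whose urljoin() methods return RequestUrl objects and may differ in detail; the example URLs are made up.

```python
from urllib.parse import urljoin

# Made-up URL playing the role of the request or response URL.
base = "https://example.com/catalog/page1.html"

# A relative path is resolved against the directory of the base URL.
relative = urljoin(base, "item/42")

# A path starting with "/" is resolved against the host.
rooted = urljoin(base, "/about")

# An absolute URL is returned unchanged.
absolute = urljoin(base, "https://other.example/x")
```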

class web_poet.page_inputs.http.HttpResponse(url: str | _Url, body, *, status: int | None = None, headers=NOTHING, encoding: str | None = None)[source]

Bases: SelectableMixin, UrlShortcutsMixin

A container for the contents of a response, downloaded directly using an HTTP client.

url should be a URL of the response (after all redirects), not a URL of the request, if possible.

body contains the raw HTTP response body.

The following are optional, since their availability depends on the source of the HttpResponse. For example, the responses could simply come from a local HTML file, which doesn’t contain headers or a status.

status should represent the int status code of the HTTP response.

headers should contain the HTTP response headers.

encoding is the encoding of the response. If None (default), the encoding is auto-detected from headers and body content.

url: ResponseUrl

body: HttpResponseBody

status: int | None

headers: HttpResponseHeaders

property text: str: Content of the HTTP body, converted to unicode using the detected encoding of the response, according to the web browser rules (respecting Content-Type header, etc.)

property encoding: str | None: Encoding of the response

json() → Any[source]: Deserialize a JSON document to a Python object.

css(query) → SelectorList: A shortcut to .selector.css().

jmespath(query: str, **kwargs) → SelectorList: A shortcut to .selector.jmespath().

property selector: Selector: Cached instance of parsel.selector.Selector.

urljoin(url: str | RequestUrl | ResponseUrl) → RequestUrl

Return url as an absolute URL.

If url is relative, it is made absolute relative to the base URL of self.

xpath(query, **kwargs) → SelectorList: A shortcut to .selector.xpath().

web_poet.page_inputs.http.request_fingerprint(req: HttpRequest) → str[source]: Return the fingerprint of the request.

class web_poet.page_inputs.response.AnyResponse(response: BrowserResponse | HttpResponse)[source]

Bases: SelectableMixin, UrlShortcutsMixin

A container that holds either BrowserResponse or HttpResponse.

response: BrowserResponse | HttpResponse

property url: ResponseUrl: URL of the response.

property text: str: Text or HTML contents of the response.

property status: int | None: The int status code of the HTTP response, if available.

css(query) → SelectorList: A shortcut to .selector.css().

jmespath(query: str, **kwargs) → SelectorList: A shortcut to .selector.jmespath().

property selector: Selector: Cached instance of parsel.selector.Selector.

urljoin(url: str | RequestUrl | ResponseUrl) → RequestUrl

Return url as an absolute URL.

If url is relative, it is made absolute relative to the base URL of self.

xpath(query, **kwargs) → SelectorList: A shortcut to .selector.xpath().

class web_poet.page_inputs.page_params.PageParams[source]

Bases: dict[_KT, _VT]

Container class that could contain any arbitrary data to be passed into a Page Object.

Note that this is simply a subclass of Python’s dict.

class web_poet.page_inputs.stats.StatCollector[source]

Bases: ABC

Base class for web-poet to implement the storing of data written through Stats.

abstractmethod set(key: str, value: Any) → None[source]: Set the value of stat key to value.

abstractmethod inc(key: str, value: int | float = 1) → None[source]: Increment the value of stat key by value, or set it to value if key has no value.

class web_poet.page_inputs.stats.DummyStatCollector[source]

Bases: StatCollector

StatCollector implementation that does not persist stats. It is used when running automatic tests, where stat storage is not necessary.

set(key: str, value: Any) → None[source]: Set the value of stat key to value.

inc(key: str, value: int | float = 1) → None[source]: Increment the value of stat key by value, or set it to value if key has no value.

class web_poet.page_inputs.stats.DictStatCollector[source]

Bases: DummyStatCollector

Simple StatCollector implementation that stores stats in a dict accessible through the data property.

property data: dict[str, Any]: Dictionary data.
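
The set() and inc() contract shared by the collectors above can be sketched with the standard library alone. The classes below are hypothetical stand-ins written from the descriptions, not web-poet's own code.

```python
from abc import ABC, abstractmethod
from typing import Any, Dict, Union


class SketchStatCollector(ABC):
    """Hypothetical stand-in for StatCollector."""

    @abstractmethod
    def set(self, key: str, value: Any) -> None: ...

    @abstractmethod
    def inc(self, key: str, value: Union[int, float] = 1) -> None: ...


class SketchDictStatCollector(SketchStatCollector):
    """Hypothetical stand-in for DictStatCollector."""

    def __init__(self) -> None:
        self.data: Dict[str, Any] = {}

    def set(self, key: str, value: Any) -> None:
        self.data[key] = value

    def inc(self, key: str, value: Union[int, float] = 1) -> None:
        # When the key has no value yet, the increment becomes the value.
        if key in self.data:
            self.data[key] += value
        else:
            self.data[key] = value


collector = SketchDictStatCollector()
collector.inc("pages_parsed")
collector.inc("pages_parsed", 2)
collector.set("last_url", "https://example.com")
```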

class web_poet.page_inputs.stats.Stats(stat_collector: StatCollector | None = None)[source]

Bases: object

Page input class to write key-value data pairs during parsing that you can inspect later. See Stats.

Stats can be set to a fixed value or, if numeric, incremented.

Stats are write-only.

Storage and read access of stats depends on the web-poet framework that you are using. Check the documentation of your web-poet framework to find out if it supports stats, and if so, how to read stored stats.

set(key: str, value: Any) → None[source]: Set the value of stat key to value.

inc(key: str, value: int | float = 1) → None[source]: Increment the value of stat key by value, or set it to value if key has no value.

Pages

class web_poet.pages.Injectable[source]

Bases: ABC, FieldsMixin

Base Page Object class, which all Page Objects should inherit from (probably through Injectable subclasses).

Frameworks which use web-poet Page Objects should use the is_injectable() function to detect whether an object is an Injectable and, if it is, allow building it automatically through dependency injection, using the andi library (https://github.com/scrapinghub/andi).

Instead of inheriting you can also use Injectable.register(MyWebPage). Injectable.register can also be used as a decorator.

web_poet.pages.is_injectable(cls: Any) → bool[source]: Return True if cls is a class which inherits from Injectable.
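
Because Injectable derives from ABC, registration relies on the standard abc machinery. The sketch below reproduces the idea with hypothetical stand-ins (SketchInjectable, sketch_is_injectable) rather than importing web-poet; it assumes that the check accepts classes only, as the description of is_injectable() suggests.

```python
from abc import ABC
from typing import Any


class SketchInjectable(ABC):
    """Hypothetical stand-in for web_poet.pages.Injectable."""


def sketch_is_injectable(cls: Any) -> bool:
    # Only classes qualify; instances and other objects do not.
    return isinstance(cls, type) and issubclass(cls, SketchInjectable)


class InheritedPage(SketchInjectable):
    """Injectable through regular inheritance."""


@SketchInjectable.register
class RegisteredPage:
    """Injectable through registration, without inheriting."""
```

ABC.register returns the class it receives, which is what allows it to be used as a decorator.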

class web_poet.pages.ItemPage[source]

Bases: Extractor[ItemT], Injectable

Base class for page objects.

async to_item() → ItemT[source]: Extract an item from a web page

class web_poet.pages.WebPage(response: HttpResponse)[source]

Bases: ItemPage[ItemT], ResponseShortcutsMixin

Base Page Object which requires HttpResponse and provides XPath / CSS shortcuts.

response: HttpResponse

property base_url: str: Return the base url of the given response

css(query) → SelectorList: A shortcut to .selector.css().

property html: str: Shortcut to HTML Response’s content.

property item_cls: type: Item class

jmespath(query: str, **kwargs) → SelectorList: A shortcut to .selector.jmespath().

property selector: Selector: Cached instance of parsel.selector.Selector.

async to_item() → ItemT: Extract an item from a web page

property url: str: Shortcut to HTML Response’s URL, as a string.

urljoin(url: str) → str: Convert url to absolute, taking into account the URL and base URL of the response.

xpath(query, **kwargs) → SelectorList: A shortcut to .selector.xpath().

class web_poet.pages.BrowserPage(response: BrowserResponse)[source]

Bases: ItemPage[ItemT], ResponseShortcutsMixin

Base Page Object which requires BrowserResponse and provides XPath / CSS shortcuts.

response: BrowserResponse

property base_url: str: Return the base url of the given response

css(query) → SelectorList: A shortcut to .selector.css().

property html: str: Shortcut to HTML Response’s content.

property item_cls: type: Item class

jmespath(query: str, **kwargs) → SelectorList: A shortcut to .selector.jmespath().

property selector: Selector: Cached instance of parsel.selector.Selector.

async to_item() → ItemT: Extract an item from a web page

property url: str: Shortcut to HTML Response’s URL, as a string.

urljoin(url: str) → str: Convert url to absolute, taking into account the URL and base URL of the response.

xpath(query, **kwargs) → SelectorList: A shortcut to .selector.xpath().

class web_poet.pages.Returns[source]

Bases: Generic[ItemT]

Inherit from this generic mixin to change the item class used by ItemPage

property item_cls: type: Item class

class web_poet.pages.Extractor[source]

Bases: Returns[ItemT], FieldsMixin

Base class for field support.

async to_item() → ItemT[source]: Extract an item

class web_poet.pages.SelectorExtractor(selector: Selector)[source]

Bases: Extractor[ItemT], SelectorShortcutsMixin

Extractor that takes a parsel.Selector and provides shortcuts for its methods.

Mixins

class web_poet.mixins.ResponseShortcutsMixin[source]

Common shortcut methods for working with HTML responses. This mixin could be used with Page Object base classes.

It requires “response” attribute to be present.

property url: str: Shortcut to HTML Response’s URL, as a string.

property html: str: Shortcut to HTML Response’s content.

property base_url: str: Return the base url of the given response

urljoin(url: str) → str[source]: Convert url to absolute, taking into account the URL and base URL of the response.

css(query) → SelectorList: A shortcut to .selector.css().

jmespath(query: str, **kwargs) → SelectorList: A shortcut to .selector.jmespath().

property selector: Selector: Cached instance of parsel.selector.Selector.

xpath(query, **kwargs) → SelectorList: A shortcut to .selector.xpath().

class web_poet.mixins.SelectableMixin[source]

Inherit from this mixin, implement ._selector_input method, get .selector property and .xpath / .css / .jmespath methods.

property selector: Selector: Cached instance of parsel.selector.Selector.

css(query) → SelectorList: A shortcut to .selector.css().

jmespath(query: str, **kwargs) → SelectorList: A shortcut to .selector.jmespath().

xpath(query, **kwargs) → SelectorList: A shortcut to .selector.xpath().

Requests

web_poet.requests.RequestDownloaderT

Frameworks that want to support additional requests in web-poet should set the appropriate implementation of request_downloader_var for requesting data.

alias of Callable[[HttpRequest], Awaitable[HttpResponse]]

Exceptions

Core Exceptions

These exceptions are tied to how web-poet operates.

exception web_poet.exceptions.core.NoSavedHttpResponse(msg: str | None = None, request: HttpRequest | None = None)[source]

Indicates that there is no saved response for this request.

Can only be raised when an HttpClient instance is used to get saved responses.

Parameters: request (HttpRequest) – The HttpRequest instance that was used.

exception web_poet.exceptions.core.PageObjectAction[source]: Base class for exceptions that can be raised from a page object to indicate something to be done about that page object.

exception web_poet.exceptions.core.RequestDownloaderVarError[source]

The web_poet.request_downloader_var had its contents accessed but there wasn’t any value set during the time requests are executed.

See the documentation section about setting up the contextvars to learn more about this.

exception web_poet.exceptions.core.Retry(message: str | None = None, max_retries: int | None = None)[source]

The page object found that the input data is partial or empty, and a request retry may provide better input.

message is the reason for the retry.

max_retries is the desired maximum retries. If not specified, the framework defaults are used instead.
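
How a framework honors max_retries is up to the framework. The following sketch shows one possible policy and makes several assumptions: SketchRetry is a hypothetical stand-in for Retry, FRAMEWORK_DEFAULT_MAX_RETRIES is an invented setting, and the parse callables simulate a page object that fails on partial input.

```python
from typing import Callable, Optional


class SketchRetry(Exception):
    """Hypothetical stand-in for web_poet.exceptions.core.Retry."""

    def __init__(self, message: Optional[str] = None, max_retries: Optional[int] = None):
        super().__init__(message)
        self.max_retries = max_retries


FRAMEWORK_DEFAULT_MAX_RETRIES = 2  # invented framework setting


def run_with_retries(parse: Callable[[int], str]) -> str:
    attempt = 0
    while True:
        try:
            return parse(attempt)
        except SketchRetry as exc:
            # Prefer the limit requested by the page object, if any.
            limit = exc.max_retries
            if limit is None:
                limit = FRAMEWORK_DEFAULT_MAX_RETRIES
            if attempt >= limit:
                raise
            attempt += 1


def flaky_parse(attempt: int) -> str:
    # Simulates partial input on the first two attempts.
    if attempt < 2:
        raise SketchRetry("empty page")
    return "item"


item = run_with_retries(flaky_parse)
```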

exception web_poet.exceptions.core.UseFallback[source]: The page object cannot extract data from the input, but the input seems valid, so an alternative data extraction implementation for the same item type may succeed.

HTTP Exceptions

These are exceptions pertaining to common issues faced when executing HTTP operations.

exception web_poet.exceptions.http.HttpError(msg: str | None = None, request: HttpRequest | None = None)[source]

Bases: OSError

Indicates that an exception has occurred when handling an HTTP operation.

This is used as a base class for more specific errors and could be vague since it could denote problems either in the HTTP Request or Response.

For more specific errors, it would be better to use HttpRequestError and HttpResponseError.

Parameters: request (HttpRequest) – Request that triggered the exception.

request: HttpRequest | None: Request that triggered the exception.

exception web_poet.exceptions.http.HttpRequestError(msg: str | None = None, request: HttpRequest | None = None)[source]

Bases: HttpError

Indicates that an exception has occurred when the HTTP Request was being handled.

Parameters: request (HttpRequest) – The HttpRequest instance that was used.

exception web_poet.exceptions.http.HttpResponseError(msg: str | None = None, response: HttpResponse | None = None, request: HttpRequest | None = None)[source]

Bases: HttpError

Indicates that an exception has occurred when the HTTP Response was received.

For responses in the 100-3xx status code range, this exception shouldn’t be raised at all. However, for responses in the 400-5xx range, it will be raised by web-poet.

Note

Frameworks implementing web-poet should NOT raise this exception.

This exception is raised by web-poet itself, based on allow_status parameter found in the methods of HttpClient.

Parameters:

request (HttpRequest) – Request that got the response that triggered the exception.
response (HttpResponse) – Response that triggered the exception.

response: HttpResponse | None: Response that triggered the exception.

Apply Rules

See Rules for more context about its use cases and some examples.

web_poet.default_registry: Default RulesRegistry.

web_poet.handle_urls(): handle_urls() of the default_registry.

class web_poet.rules.ApplyRule(for_patterns: str | Patterns, *, use: type[ItemPage], instead_of: type[ItemPage] | None = None, to_return: type[Any] | None = None, meta: dict[str, Any] = NOTHING)[source]

A rule that primarily applies Page Object and Item overrides for a given URL pattern.

This is instantiated when using the web_poet.handle_urls() decorator. Instances are also returned as a List[ApplyRule] when calling the get_rules() method of web_poet.default_registry.

You can access any of its attributes:

for_patterns - contains the list of URL patterns associated with this rule. You can read the API documentation of the url-matcher package for more information about the patterns.

use - The Page Object that will be used in cases where the URL pattern represented by the for_patterns attribute is matched.

instead_of - (optional) The Page Object that will be replaced with the Page Object specified via the use parameter.

to_return - (optional) The item class that the Page Object specified in use is capable of returning.

meta - (optional) Any other information you may want to store. This doesn’t do anything for now but may be useful for future API updates.

The main functionality of this class lies in the instead_of and to_return parameters. Should both of these be omitted, then ApplyRule simply tags which URL patterns the given Page Object defined in use is expected to be used on.

When to_return is not None (e.g. to_return=MyItem), the Page Object in use is declared as capable of returning a certain item class (i.e. MyItem).

When instead_of is not None (e.g. instead_of=ReplacedPageObject), the rule adds an expectation that the ReplacedPageObject wouldn’t be used for the URLs matching for_patterns, since the Page Object in use will replace it.

If there are multiple rules which match a certain URL, the rule to apply is picked based on the priorities set in for_patterns.

See Rules for more information regarding its usage.

Tip

The ApplyRule is also hashable. This makes it easy to store unique rules and identify any duplicates.

class web_poet.rules.RulesRegistry(*, rules: Iterable[ApplyRule] | None = None)[source]

RulesRegistry provides features for storing, retrieving, and searching for the ApplyRule instances.

web-poet provides a default Registry named default_registry for convenience. It can be accessed this way:

from web_poet import handle_urls, default_registry, WebPage
from my_items import Product

@handle_urls("example.com")
class ExampleComProductPage(WebPage[Product]): ...

rules = default_registry.get_rules()

The @handle_urls decorator exposed as web_poet.handle_urls is a shortcut for default_registry.handle_urls.

Note

It is encouraged to use the web_poet.default_registry instead of creating your own RulesRegistry instance. Using multiple registries would be unwieldy in most cases.

However, a separate registry might be applicable in certain scenarios, such as storing custom rules separately from the default_registry.

add_rule(rule: ApplyRule) → None[source]: Registers a web_poet.rules.ApplyRule instance.

handle_urls(): Class decorator that indicates that the decorated Page Object should work for the given URL patterns.

The URL patterns are matched using the include and exclude parameters while priority breaks any ties. See the documentation of the url-matcher package for more information about them.

This decorator is able to derive the item class returned by the Page Object. This is important since it marks what type of item the Page Object is capable of returning for the given URL patterns. For certain advanced cases, you can pass a to_return parameter which replaces any derived values (though this isn’t generally recommended).

Passing another Page Object into the instead_of parameter indicates that the decorated Page Object will be used instead of that for the given set of URL patterns. See Rule precedence.

Any extra parameters are stored as meta information that can be later used.

Parameters:

include – The URLs that should be handled by the decorated Page Object.
instead_of – The Page Object that should be replaced.
to_return – The item class holding the data returned by the Page Object. This could be omitted as it could be derived from the Returns[ItemClass] or ItemPage[ItemClass] declaration of the Page Object. See Items section.
exclude – The URLs for which the Page Object should not be applied.
priority – The resolution priority in case of conflicting rules. A conflict happens when the include, instead_of, and exclude parameters are the same. If so, the highest priority will be chosen.

get_rules() → list[ApplyRule][source]: Return all the ApplyRule that were declared using the @handle_urls decorator.

Note

Remember to consider calling consume_modules() beforehand to recursively import all submodules which contain the @handle_urls decorators from external Page Objects.

search(**kwargs: Any) → list[ApplyRule][source]

Return any ApplyRule from the registry that matches with all the provided attributes.

Sample usage:

rules = registry.search(use=ProductPO, instead_of=GenericPO)
print(len(rules))  # 1
print(rules[0].use)  # ProductPO
print(rules[0].instead_of)  # GenericPO
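
The matching performed by search() can be pictured as filtering rules on attribute equality. The sketch below is written from the description and does not use web-poet: SketchRule is a hypothetical, simplified stand-in for ApplyRule, sketch_search is a hypothetical helper, and the page object classes are empty placeholders.

```python
from dataclasses import dataclass
from typing import Any, List, Optional


@dataclass(frozen=True)
class SketchRule:
    """Hypothetical, simplified stand-in for ApplyRule."""

    for_patterns: str
    use: type
    instead_of: Optional[type] = None
    to_return: Optional[type] = None


def sketch_search(rules: List[SketchRule], **kwargs: Any) -> List[SketchRule]:
    # Keep only the rules whose attributes equal every given keyword.
    return [
        rule
        for rule in rules
        if all(getattr(rule, name) == value for name, value in kwargs.items())
    ]


class GenericPO: ...
class ProductPO: ...
class OtherPO: ...


all_rules = [
    SketchRule("example.com", use=ProductPO, instead_of=GenericPO),
    SketchRule("example.org", use=OtherPO, instead_of=GenericPO),
    SketchRule("example.net", use=ProductPO),
]
found = sketch_search(all_rules, use=ProductPO, instead_of=GenericPO)
```

The frozen dataclass is hashable, so instances of this stand-in can be stored in a set to spot duplicates.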

overrides_for(url: _Url | str) → Mapping[type[ItemPage], type[ItemPage]][source]: Finds all of the page objects associated with the given URL and returns a Mapping where the ‘key’ represents the page object that is overridden by the page object in ‘value’.

page_cls_for_item(url: _Url | str, item_cls: type) → type | None[source]: Return the page object class associated with the given URL that’s able to produce the given item_cls.

top_rules_for_item(url: _Url | str, item_cls: type) → Generator[ApplyRule][source]

Iterates the top rules that apply for url and item_cls.

If multiple rules score the same, multiple rules are iterated. This may be useful, for example, if you want to apply some custom logic to choose between rules that otherwise have the same score. For example:

from web_poet import default_registry


def browser_page_cls_for_item(url, item_cls):
    fallback = None
    for rule in default_registry.top_rules_for_item(url, item_cls):
        if rule.meta.get("browser", False):
            return rule.use
        if not fallback:
            fallback = rule.use
    if not fallback:
        raise ValueError(f"No rule found for URL {url!r} and item class {item_cls}")
    return fallback

web_poet.rules.consume_modules(*modules: str) → None[source]

This recursively imports all packages/modules so that the @handle_urls decorators are properly discovered and imported.

Let’s take a look at an example:

# FILE: my_page_obj_project/load_rules.py

from web_poet import default_registry, consume_modules

consume_modules("other_external_pkg.po", "another_pkg.lib")
rules = default_registry.get_rules()

In this case, the ApplyRule instances come from:

my_page_obj_project (since it’s the same module as the file above)

other_external_pkg.po

another_pkg.lib

any other modules that were imported in the same process inside the packages/modules above.

If the default_registry had other @handle_urls decorators outside of the packages/modules listed above, then the corresponding ApplyRule instances won’t be returned, unless they were recursively imported in some way similar to consume_modules().
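
Recursive importing matters because a decorator only runs when its module is imported. The standard library can illustrate the general technique; the sketch below is not web-poet's implementation, import_recursively is a hypothetical helper, and the standard json package is used only because it is always available.

```python
import importlib
import pkgutil
import sys
from typing import List


def import_recursively(name: str) -> List[str]:
    """Import a module or package and every submodule found below it."""
    module = importlib.import_module(name)
    imported = [module.__name__]
    path = getattr(module, "__path__", None)
    if path is not None:  # only packages have submodules to walk
        for info in pkgutil.walk_packages(path, prefix=name + "."):
            importlib.import_module(info.name)
            imported.append(info.name)
    return imported


names = import_recursively("json")
```

After the call, every module named in names is present in sys.modules, so any class decorators defined in those modules would have been executed.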

Fields

web_poet.fields is a module with helpers for putting extraction logic into separate Page Object methods / properties.

class web_poet.fields.FieldInfo(name: str, meta: dict | None = None, out: list[Callable] | None = None)[source]

Information about a field.

name: str: name of the field

meta: dict | None: field metadata

out: list[Callable] | None: field processors

class web_poet.fields.FieldsMixin[source]: A mixin which is required for a class to support fields

web_poet.fields.field(method=None, *, cached: bool = False, meta: dict | None = None, out: list[Callable] | None = None)[source]

A Page Object method decorated with the @field decorator becomes a property, which is then used by ItemPage’s to_item() method to populate the corresponding item attribute.

By default, the value is computed on each property access. Use @field(cached=True) to cache the property value.

The meta parameter allows storing arbitrary information for the field, e.g. @field(meta={"expensive": True}). This information can later be retrieved for all fields using the get_fields_dict() function.

The out parameter is an optional list of field processors, which are functions applied to the value of the field before returning it.
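The behaviour described above (a method exposed as a property, recomputed on each access, with processors applied to the value) can be illustrated with a small descriptor. This is a rough sketch, not web-poet's implementation: simple_field and ProductPage are hypothetical names, caching and metadata are left out, and it assumes processors run in list order.

```python
from typing import Any, Callable, Optional


class simple_field:
    """Hypothetical, simplified stand-in for a field-like property.

    Not web-poet's implementation: no caching, no metadata, sync only.
    """

    def __init__(self, method: Callable, *, out: Optional[list] = None):
        self.method = method
        self.out = out or []

    def __get__(self, instance: Any, owner: Optional[type] = None) -> Any:
        if instance is None:
            return self
        # Computed on every access, then passed through each processor
        # in list order (an assumption of this sketch).
        value = self.method(instance)
        for processor in self.out:
            value = processor(value)
        return value


class ProductPage:
    def __init__(self, raw_name: str):
        self.raw_name = raw_name

    def _extract_name(self) -> str:
        return self.raw_name

    # The raw value is stripped first, then title-cased.
    name = simple_field(_extract_name, out=[str.strip, str.title])


page = ProductPage("  blue widget ")
print(page.name)
```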

web_poet.fields.get_fields_dict(cls_or_instance) → dict[str, FieldInfo][source]: Return a dictionary with information about the fields defined for the class: keys are field names, and values are web_poet.fields.FieldInfo instances.

async web_poet.fields.item_from_fields(obj, item_cls: type[T] = <class 'dict'>, *, skip_nonitem_fields: bool = False) → T[source]

Return an item of item_cls type, with its attributes populated from the obj methods decorated with the @field decorator.

If skip_nonitem_fields is True, @fields whose names are not among item_cls field names are not passed to item_cls.__init__.

When skip_nonitem_fields is False (default), all @fields are passed to item_cls.__init__, possibly causing exceptions if item_cls.__init__ doesn’t support them.
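The effect of skip_nonitem_fields can be illustrated without web-poet, using a dataclass as the item class. This is a sketch under assumptions: build_item is a hypothetical helper that only mimics the filtering described above, and the extracted dict stands in for the values produced by a page object's fields.

```python
from dataclasses import dataclass, fields


@dataclass
class Product:
    name: str
    price: float


def build_item(item_cls, values: dict, *, skip_nonitem_fields: bool = False):
    """Hypothetical helper mimicking the skip_nonitem_fields behaviour."""
    if skip_nonitem_fields:
        # Keep only the values whose names match a field of item_cls.
        allowed = {f.name for f in fields(item_cls)}
        values = {key: value for key, value in values.items() if key in allowed}
    return item_cls(**values)


# "debug_html" is not a Product field.
extracted = {"name": "Widget", "price": 9.5, "debug_html": "<p>x</p>"}

item = build_item(Product, extracted, skip_nonitem_fields=True)

try:
    build_item(Product, extracted)  # default: everything is passed through
    raised = False
except TypeError:
    # Product.__init__ does not accept the unexpected keyword argument.
    raised = True
```

With the default setting, the extra value reaches Product.__init__ and causes a TypeError, which is the kind of exception the description above warns about.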

web_poet.fields.item_from_fields_sync(obj, item_cls: type[T] = <class 'dict'>, *, skip_nonitem_fields: bool = False) → T[source]: Synchronous version of item_from_fields().

Layouts

web_poet.layout_switch(cls: type[ItemPage] | None = None, *, switch_method: str = 'get_layout', layouts: Iterable[type[ItemPage]] | None = None)[source]

Decorate a page object class to expose fields from its selected layout.

The decorated class must define a method named by switch_method ("get_layout" by default). The method can be synchronous or asynchronous and must return an ItemPage instance.

By default, forwarded fields are inferred from the output item type field names. This keeps forwarding aligned with the declared item schema.

For output item types that do not expose field names (for example, plain dict), pass layouts explicitly. In that case, forwarded fields are the union of fields defined across the provided layout page object classes.

If the decorated class already defines a field with the same name, layout_switch() gives priority to the selected layout field and falls back to the decorated class field when the selected layout does not define that field.

Annotation support

web_poet.annotation_encode(obj: Any) → Any[source]

Encodes obj for Annotated.

Annotated params must be hashable. This function converts dicts and lists into hashable alternatives (tuples and frozensets).

For example:

foo = Annotated[Bar, annotation_encode({"a": [1, 2, 3]})]

obj must not contain tuples or frozensets, or unhashable data besides dicts and lists.
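To make the conversion concrete, here is a minimal sketch of turning nested dicts and lists into hashable equivalents. It is an assumption about the general approach, not the actual annotation_encode() code; encode_hashable is a hypothetical name, and the exact encoded shape produced by web-poet may differ.

```python
from typing import Annotated, Any, get_args


def encode_hashable(obj: Any) -> Any:
    """Hypothetical sketch: lists become tuples, dicts become frozensets."""
    if isinstance(obj, list):
        return tuple(encode_hashable(item) for item in obj)
    if isinstance(obj, dict):
        # Each dict entry becomes a (key, encoded value) pair.
        return frozenset(
            (key, encode_hashable(value)) for key, value in obj.items()
        )
    return obj


raw = {"a": [1, 2, 3]}
encoded = encode_hashable(raw)

# The encoded value is hashable, so the annotated alias is hashable too.
Alias = Annotated[int, encoded]
alias_hash = hash(Alias)

try:
    hash(raw)
    raw_is_hashable = True
except TypeError:
    raw_is_hashable = False
```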

web_poet.annotation_decode(obj: Any) → Any[source]: Converts a result of annotation_encode() back to original form.

class web_poet.AnnotatedInstance(result: Any, metadata: tuple[Any, ...])[source]

Wrapper for instances of annotated dependencies.

It is used when both the dependency value and the dependency annotation are needed.

Parameters:

result (Any) – The wrapped dependency instance.
metadata (Tuple[Any, ...]) – The copy of the annotation.

Utils

web_poet.utils.get_fq_class_name(cls: type) → str[source]

Return the fully qualified name for a type.

>>> from web_poet import Injectable
>>> get_fq_class_name(Injectable)
'web_poet.pages.Injectable'
>>> from decimal import Decimal
>>> get_fq_class_name(Decimal)
'decimal.Decimal'

web_poet.utils.memoizemethod_noargs(method: CallableT) → CallableT[source]

Decorator to cache the result of a method (without arguments) using a weak reference to its object.

It is faster than cached_method(), and doesn’t add new attributes to the instance, but it doesn’t work if objects are unhashable.
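A weak-reference cache of this kind can be sketched as follows. This is an illustration of the technique only, not the library's code: memoize_noargs and Page are hypothetical names. The cache lives outside the instance, keyed by a weak reference to it, which is why no attribute is added and why the instance must be hashable.

```python
import functools
import weakref


def memoize_noargs(method):
    """Hypothetical sketch of caching a no-argument method per instance."""
    # Entries disappear automatically once the instance is garbage collected.
    cache = weakref.WeakKeyDictionary()

    @functools.wraps(method)
    def wrapper(self):
        if self not in cache:
            cache[self] = method(self)
        return cache[self]

    return wrapper


class Page:
    def __init__(self):
        self.calls = 0

    @memoize_noargs
    def title(self):
        self.calls += 1
        return "Example"


page = Page()
first = page.title()
second = page.title()  # served from the cache; the method body is not rerun
```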

web_poet.utils.cached_method(method: CallableT) → CallableT[source]

A decorator to cache method or coroutine method results, so that if it’s called multiple times for the same instance, computation is only done once.

The cache is unbounded, but it’s tied to the instance lifetime.

Note

cached_method() is needed because functools.lru_cache() doesn’t work well on methods: self is used as a cache key, so a reference to an instance is kept in the cache, and this prevents deallocation of instances.

This decorator adds a new private attribute to the instance named _cached_method_{decorated_method_name}; make sure the class doesn’t define an attribute of the same name.
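The difference described in the note can be observed with weak references. The sketch below is an assumption-laden illustration, not the cached_method() implementation: cache_on_instance is a hypothetical decorator that stores the result in a private attribute on the instance, and the attribute name it uses is chosen for this example only.

```python
import functools
import gc
import weakref


def cache_on_instance(method):
    """Hypothetical sketch: keep the result in a private instance attribute."""
    attr = f"_cached_{method.__name__}"

    @functools.wraps(method)
    def wrapper(self):
        if not hasattr(self, attr):
            setattr(self, attr, method(self))
        return getattr(self, attr)

    return wrapper


class WithLru:
    @functools.lru_cache(maxsize=None)
    def compute(self):
        return 42


class WithInstanceCache:
    @cache_on_instance
    def compute(self):
        return 42


def is_released_after_use(cls) -> bool:
    """Call compute() once, drop the instance, and report whether it was freed."""
    obj = cls()
    obj.compute()
    ref = weakref.ref(obj)
    del obj
    gc.collect()
    return ref() is None


# False: the lru_cache key still references the instance.
lru_released = is_released_after_use(WithLru)
# True: the cached value lived on the instance and went away with it.
instance_released = is_released_after_use(WithInstanceCache)
```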

web_poet.utils.as_list(value: Any) → list[Any][source]

Normalizes the value input as a list.

>>> as_list(None)
[]
>>> as_list("foo")
['foo']
>>> as_list(123)
[123]
>>> as_list(["foo", "bar", 123])
['foo', 'bar', 123]
>>> as_list(("foo", "bar", 123))
['foo', 'bar', 123]
>>> as_list(range(5))
[0, 1, 2, 3, 4]
>>> def gen():
...     yield 1
...     yield 2
>>> as_list(gen())
[1, 2]

async web_poet.utils.ensure_awaitable(obj)[source]: Return the value of obj, awaiting it if needed.
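The idea of awaiting a value only when it needs awaiting can be sketched with inspect.isawaitable(). This is a minimal illustration rather than the library's code; resolve, fetch_title and main are hypothetical names.

```python
import asyncio
import inspect


async def resolve(obj):
    """Hypothetical sketch: await obj if it is awaitable, else return it as is."""
    if inspect.isawaitable(obj):
        return await obj
    return obj


async def fetch_title():
    return "Example"


async def main():
    plain = await resolve("Example")        # already a plain value
    awaited = await resolve(fetch_title())  # a coroutine that must be awaited
    return plain, awaited


results = asyncio.run(main())
```

A helper like this lets calling code treat synchronous and asynchronous methods uniformly.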

web_poet.utils.get_generic_param(cls: type, expected: type | tuple[type, ...]) → type | None[source]

Search the base classes recursively breadth-first for a generic class and return its param.

Returns the param of the first found class that is a subclass of expected.
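A breadth-first search of this kind can be sketched with typing.get_origin() and typing.get_args(). This is an illustration under assumptions, not the actual get_generic_param() code: find_generic_param, Returns, Base and Child are hypothetical names, and edge cases may be handled differently by the library.

```python
from collections import deque
from typing import Generic, Optional, TypeVar, get_args, get_origin

T = TypeVar("T")


class Returns(Generic[T]):
    """Hypothetical generic base class used only for this example."""


class Base(Returns[int]):
    pass


class Child(Base):
    pass


def find_generic_param(cls: type, expected: type) -> Optional[type]:
    """Hypothetical sketch: breadth-first search for a parametrized base."""
    queue = deque([cls])
    visited = set()
    while queue:
        current = queue.popleft()
        if current in visited:
            continue
        visited.add(current)
        # __orig_bases__ keeps the parametrized form, e.g. Returns[int].
        for base in vars(current).get("__orig_bases__", ()):
            origin = get_origin(base)
            if isinstance(origin, type) and issubclass(origin, expected):
                args = get_args(base)
                if args and not isinstance(args[0], TypeVar):
                    return args[0]
        queue.extend(current.__bases__)
    return None


param = find_generic_param(Child, Returns)  # found on Base, one level up
```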

Built-in framework

Built-in web-poet framework for simple use cases.

class web_poet.framework.Framework(*, registry: RulesRegistry | None = None, default_playwright_engine: str | None = None, stats: StatCollector | None = None)[source]

Manager of the built-in framework.

registry is the RulesRegistry from which page objects resolve their dependencies. If None, default_registry is used.

default_playwright_engine is the Playwright browser engine to use when browser inputs do not specify one. Examples: "chromium", "firefox", "webkit".

stats is a StatCollector instance to collect stats written by the page object through the Stats dependency. If not specified, a DictStatCollector is used. You can access the collector through the stats attribute, e.g. to read its data.

async get_page(request: HttpRequest | RequestUrl | ResponseUrl | str, page_cls: type[ItemPage], *, page_params: dict[Any, Any] | None = None) → ItemPage[source]

Return a page object built from request and page_cls.

page_params is a dict that the page object may access through the PageParams dependency.

async get_item(request: HttpRequest | RequestUrl | ResponseUrl | str, item_or_page_cls: type, *, page_params: dict[Any, Any] | None = None) → Any[source]

Return an item built from request.

item_or_page_cls is either an item class or a page object class. If it is an item class, the page class to use is determined by the RulesRegistry passed to Framework.

page_params is a dict that the page object may access through the PageParams dependency.

web_poet.framework.playwright_engine(name: str) → str[source]

Helper to create a hashable metadata value for Annotated Playwright engine names.

Example usage:

Annotated[BrowserResponse, playwright_engine("firefox")]