Fixed an issue with HttpClient which happens when a response with a non-standard status code is received.
A new dependency, BrowserResponse, has been added. It contains a browser-rendered page URL, status code and HTML.
The Rules documentation section has been rewritten.
The testing framework now allows defining a custom item adapter.
We have made a backward-incompatible change on test fixture serialization: the type_name field of exceptions has been renamed to
Fixed built-in Python types, e.g. int, not working as field processors.
JMESPath support is now available: you can use HttpResponse.jmespath() to run queries on JSON responses.
The testing framework now supports page objects that raise exceptions from the
Extractor can be used for easier extraction of nested fields (see Processors for nested fields).
Exceptions raised while getting a response for an additional request are now saved in test fixtures.
Multiple documentation improvements and fixes.
twine check CI check.
Standardized input validation.
Field processors can now also be defined through a nested Processors class, so that field redefinitions in subclasses can inherit them. See Default processors.
Field processors can now opt in to receive the page object whose field is being read.
web_poet.fields.FieldsMixin now keeps fields from all base classes when using multiple inheritance.
Fixed the documentation build.
Fix the error when calling item_from_fields() on page objects defined as slotted attrs classes, while setting
This release contains many improvements to the web-poet testing framework, as well as some other improvements and bug fixes.
cached_method() no longer caches exceptions for async def methods. This makes the behavior the same for sync and async methods, and also makes it consistent with Python’s stdlib caching (i.e.
The testing framework now uses the HttpResponse-info.json file name instead of HttpResponse-other.json to store information about HttpResponse instances. To make tests generated with older web-poet work, rename these files on disk.
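The cached_method() entry above says that exceptions are not cached, in line with Python's stdlib caching. As a rough illustration of that stdlib behavior only (not of web-poet internals), the sketch below uses functools.lru_cache as one example of stdlib caching; the function name flaky and the call counter are hypothetical.

```python
import functools

calls = 0  # counts how many times the function body actually runs


@functools.lru_cache(maxsize=None)
def flaky(x):
    """Hypothetical function: fails on the first call, succeeds afterwards."""
    global calls
    calls += 1
    if calls == 1:
        raise ValueError("transient failure")
    return x * 2


try:
    flaky(21)  # first call raises; the exception is not stored in the cache
except ValueError:
    pass

result = flaky(21)  # the body runs again because nothing was cached
again = flaky(21)   # this call is served from the cache
```

Because the failed call leaves no cache entry, a later call with the same argument gets a fresh chance to succeed.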
Testing framework improvements:
Improved test reporting: better diffs and error messages.
By default, the pytest plugin now generates a test per item attribute (see Running tests). There is also an option (--web-poet-test-per-item) to run a test per item instead.
Page objects with the HttpClient dependency are now supported (see Additional requests support).
Page objects with the PageParams dependency are now supported.
Added a new python -m web_poet.testing rerun command (see Test-Driven Development).
Fixed support for nested (indirect) dependencies in page objects. Previously they were not handled properly by the testing framework.
Non-ASCII output is now stored without escaping in the test fixtures, for better readability.
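To show what storing non-ASCII output without escaping means in practice, here is a minimal sketch using the stdlib json module. That web-poet serializes fixtures exactly this way is an assumption; the item dictionary is made up for illustration.

```python
import json

# Hypothetical item data containing non-ASCII characters.
item = {"name": "Café crème", "price": "3 €"}

# Default behavior: non-ASCII characters are written as \uXXXX escapes.
escaped = json.dumps(item)

# With ensure_ascii=False the characters are kept as-is; indent makes
# the output easier to read and to diff.
readable = json.dumps(item, ensure_ascii=False, indent=2)

print(escaped)
print(readable)
```

Both strings decode to the same data; only the on-disk readability differs.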
Testing and CI fixes.
Fixed a packaging issue: tests_extra packages were installed, not just
Restore the minimum version of itemadapter from 0.7.1 to 0.7.0, and prevent a similar issue from happening again in the future.
Updated the tutorial to cover recent features and focus on best practices. Also, a new module was added, web_poet.example, that allows using page objects while following the tutorial.
Tests for page objects now covers Git LFS and scrapy-poet, and recommends python -m pytest instead of
Improved the warning message when duplicate ApplyRule objects are found.
HttpResponse-other.json content is now indented for better readability.
Improved test coverage for fields.
Add a framework for creating tests and running them with pytest.
Support implementing fields in mixin classes.
Introduce new methods for
Improved the performance of web_poet.rules.RulesRegistry.search() where passing a single parameter of either to_return results in O(1) look-up time instead of O(N). Additionally, having either to_return present in multi-parameter search calls filters the initial candidate results, which makes the search faster.
Support page object dependency serialization.
Add new dependencies used in testing and serialization code:
backports.zoneinfo on non-Windows platforms when the Python version is older than 3.9.
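The RulesRegistry.search() entry above describes O(1) look-ups for a single indexed parameter and pre-filtering for multi-parameter searches. The sketch below shows that general idea with a plain dictionary index. It is not web-poet's implementation: Rule, TinyRegistry and their fields are hypothetical stand-ins.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    """Hypothetical, simplified rule with two searchable fields."""
    use: str
    to_return: str


class TinyRegistry:
    """Hypothetical registry that indexes rules by their to_return value."""

    def __init__(self, rules):
        self._rules = list(rules)
        self._by_to_return = defaultdict(list)
        for rule in self._rules:
            self._by_to_return[rule.to_return].append(rule)

    def search(self, **criteria):
        if "to_return" in criteria:
            # Dictionary look-up: average O(1) instead of scanning all rules.
            candidates = self._by_to_return.get(criteria.pop("to_return"), [])
        else:
            candidates = self._rules
        # Remaining criteria only need to scan the (smaller) candidate list.
        return [
            rule for rule in candidates
            if all(getattr(rule, key) == value for key, value in criteria.items())
        ]


registry = TinyRegistry([
    Rule(use="ProductPage", to_return="Product"),
    Rule(use="ListPage", to_return="ProductList"),
    Rule(use="AltProductPage", to_return="Product"),
])
by_item = registry.search(to_return="Product")
narrowed = registry.search(to_return="Product", use="AltProductPage")
```

The trade-off is a little extra memory for the index in exchange for not scanning every rule on each search.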
In this release, the @handle_urls decorator gets an overhaul; it’s not required anymore to pass another Page Object class to
@web_poet.field decorator gets support for output processing functions, via the
Full list of changes:
PageObjectRegistry no longer supports dict-like access.
Official support for Python 3.11.
@web_poet.field(out=[...]) argument, which allows setting output processing functions for web-poet fields.
web_poet.overrides module is deprecated and replaced with
@handle_urls decorator is now creating
ApplyRule instances instead of
ApplyRule is similar to OverrideRule, but has the following differences:
to_return parameter, which should be the data container (item) class that the Page Object returns.
Passing a string to for_patterns would auto-convert it into
All arguments are now keyword-only except for
New signature and behavior of
overrides parameter is made optional and renamed to
If defined, the item class declared in a subclass of web_poet.ItemPage is used as the
handle_urls annotations are allowed.
PageObjectRegistry is replaced with RulesRegistry; its API is changed:
backwards incompatible dict-like API is removed;
backwards incompatible O(1) lookups using .search(use=PageObject) has become O(N);
search_overrides method is renamed to
get_overrides method is renamed to
from_override_rules method is deprecated; use
Documentation, test, and warning message improvements.
web_poet.overrides module is deprecated. Use
@handle_urls is now deprecated. Use the
OverrideRule class is now deprecated. Use
PageObjectRegistry is now deprecated. Use
PageObjectRegistry is now deprecated. Use
PageObjectRegistry.get_overrides method is deprecated. Use
PageObjectRegistry.search_overrides method is deprecated. Use
The BOM encoding from the response body is now read before the response headers when deriving the response encoding.
Minor typing improvements.
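The BOM entry above changes the order in which encoding hints are consulted. As a hedged sketch of that precedence (BOM first, then headers, then a default), the code below uses only the stdlib codecs module. The helper names and the simplified Content-Type parsing are assumptions for illustration and do not reproduce web-poet's actual detection code.

```python
import codecs

# UTF-32 BOMs are listed before UTF-16 ones because the UTF-32 LE BOM
# starts with the same two bytes as the UTF-16 LE BOM.
_BOMS = [
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def encoding_from_bom(body):
    """Return the encoding implied by a leading BOM, or None."""
    for bom, name in _BOMS:
        if body.startswith(bom):
            return name
    return None


def encoding_from_headers(content_type):
    """Naively extract the charset parameter from a Content-Type value."""
    for part in content_type.split(";")[1:]:
        key, _, value = part.strip().partition("=")
        if key.lower() == "charset" and value:
            return value.strip('"').lower()
    return None


def pick_encoding(body, content_type, default="utf-8"):
    """Hypothetical helper: the BOM wins over headers, headers over the default."""
    return (
        encoding_from_bom(body)
        or encoding_from_headers(content_type)
        or default
    )


with_bom = codecs.BOM_UTF8 + "é".encode("utf-8")
print(pick_encoding(with_bom, "text/html; charset=latin-1"))
print(pick_encoding(b"plain", "text/html; charset=latin-1"))
```

Giving the BOM priority means a body that explicitly marks its own encoding is not misread because of an inaccurate header.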
Web-poet now includes a mini-framework for organizing extraction code as Page Object properties:
    import attrs
    from web_poet import field, ItemPage

    @attrs.define
    class MyItem:
        foo: str
        bar: list[str]

    class MyPage(ItemPage[MyItem]):
        @field
        def foo(self):
            return "..."

        @field
        def bar(self):
            return ["...", "..."]
Backwards incompatible changes:
web_poet.ItemPage is no longer an abstract base class which requires the to_item method to be implemented. Instead, it provides a default async def to_item method implementation which uses fields marked as web_poet.field to create an item. This change shouldn’t affect the user code in a backwards incompatible way, but it might affect typing.
web_poet.ItemWebPage is deprecated. Use
web-poet is declared as a PEP 561 package which provides typing information; mypy is going to use it by default.
Documentation, test, typing and CI improvements.
HttpResponse.urljoin method, which takes the page’s base URL into account.
web_poet.exceptions.Retry exception, which allows initiating a retry from the Page Object, e.g. based on page content.
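For the HttpResponse.urljoin entry above, the sketch below illustrates what taking a page's base URL into account can mean: relative links are resolved against a base element's href when one is present, and against the page URL otherwise. It uses only the stdlib; join_with_base and the example URLs are hypothetical, and this is not web-poet's own code.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class _BaseHrefFinder(HTMLParser):
    """Records the href of the first <base> tag found in the document."""

    def __init__(self):
        super().__init__()
        self.base_href = None

    def handle_starttag(self, tag, attrs):
        if tag == "base" and self.base_href is None:
            self.base_href = dict(attrs).get("href")


def join_with_base(page_url, html, link):
    """Hypothetical helper: resolve link against <base href> if there is one."""
    finder = _BaseHrefFinder()
    finder.feed(html)
    # The base href may itself be relative, so resolve it against the page URL.
    base = urljoin(page_url, finder.base_href) if finder.base_href else page_url
    return urljoin(base, link)


page_url = "https://example.com/a/b.html"
html_with_base = '<html><head><base href="https://cdn.example.com/shop/"></head></html>'
print(join_with_base(page_url, html_with_base, "item/1"))
print(join_with_base(page_url, "<html></html>", "item/1"))
```

Without this step, a plain urljoin against the response URL would produce wrong absolute links on pages that declare a different base.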
Backwards Incompatible Change:
web_poet.requests.request_backend_var is renamed to
Documentation and CI improvements.
Backward Incompatible Change:
ResponseData is replaced with
HttpResponse exposes methods useful for web scraping (such as XPath and CSS selectors, JSON loading), and handles web page encoding detection. There are also new types like
Added support for performing additional requests using
web_poet.PageParams to pass arbitrary information inside a Page Object.
web_poet.handle_urls decorator, which allows declaring which websites should be handled by the page objects. Lower-level PageObjectRegistry class is also available.
removed support for Python 3.6
added support for Python 3.10
WebPage, ItemPage, ItemWebPage, Injectable and ResponseData are available as top-level imports (e.g.