web-poet¶
web-poet is a Python 3.8+ implementation of the page object pattern for web scraping. It enables writing portable, reusable web parsing code.
Warning
web-poet is in early stages of development; backward-incompatible changes are possible.
Overview¶
A good web scraping framework helps to keep your code maintainable by, among other things, enabling and encouraging separation of concerns.
For example, Scrapy lets you implement different aspects of web scraping, like ban avoidance or data delivery, into separate components.
However, there are 2 core aspects of web scraping that can be hard to decouple: crawling, i.e. visiting URLs, and parsing, i.e. extracting data.
web-poet lets you write data extraction code that:
- Makes your web scraping code easier to maintain, since your data extraction and crawling code are no longer intertwined and can be maintained separately.
- Can be reused with different versions of your crawling code, i.e. with different crawling strategies.
- Can be executed independently of your crawling code, enabling easier debugging and easier automated testing.
- Can be used with any Python web scraping framework or library that implements the web-poet specification, either directly or through a third-party plugin. See Frameworks.
To learn more about why and how web-poet came to be, see From the ground up.
Installation¶
To be able to write page objects and test them, install web-poet from PyPI:
pip install web-poet
To use page objects in production, however, you will need a web-poet framework.
Tutorial¶
In this tutorial you will learn to use web-poet as you write web scraping code for book detail pages from books.toscrape.com.
To follow this tutorial you must first be familiar with Python and have installed web-poet.
Create a project directory¶
web-poet does not limit how you structure your web-poet web scraping code, beyond the limitations of Python itself.
However, in this tutorial you will use a specific project directory structure designed with web-poet best practices in mind. Consider using a similar project directory structure in all your web-poet projects.
First create your project directory: tutorial-project/.

Within the tutorial-project directory, create:

- A run.py file, a file specific to this tutorial where you will put code to test the execution of your web scraping code.
- A tutorial directory, where you will place your web scraping code.

Within the tutorial-project/tutorial directory, create:

- An __init__.py file, so that the tutorial directory becomes an importable Python module.
- An items.py file, where you will define item classes to store extracted data.
- A pages directory, where you will define your page object classes.

Within the tutorial-project/tutorial/pages directory, create:

- An __init__.py file.
- A books_toscrape_com.py file, for page object class code targeting books.toscrape.com.

Your project directory should look as follows:

tutorial-project
├── run.py
└── tutorial
    ├── __init__.py
    ├── items.py
    └── pages
        ├── __init__.py
        └── books_toscrape_com.py
Create an item class¶
While it is possible to store the extracted data in a Python dictionary, it is a good practice to create an item class that:
- Defines the specific attributes that you aim to extract, triggering an exception if you extract unintended attributes or fail to extract expected attributes.
- Allows defining default values for some attributes.
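As a quick illustration (a standard-library dataclass is used here, and Example is a hypothetical class for this snippet only; the attrs class you define below behaves the same way), an item class catches both kinds of mistakes at construction time:

```python
from dataclasses import dataclass


@dataclass
class Example:
    # Hypothetical item class, for illustration only.
    title: str
    price: str = "0.00"  # attribute with a default value


book = Example(title="Sample")  # price falls back to its default

try:
    Example()  # omitting a required attribute raises TypeError
except TypeError:
    missing_detected = True

try:
    Example(title="Sample", foo="bar")  # an unintended attribute raises TypeError
except TypeError:
    unexpected_detected = True
```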
web-poet uses itemadapter for item class support, which means that any kind of item class can be used. In this tutorial, you will use attrs to define your item class.
Copy the following code into tutorial-project/tutorial/items.py:

from attrs import define


@define
class Book:
    title: str

This code defines a Book item class, with a single required title string attribute to store the book title.

Book is a minimal class designed specifically for this tutorial. In real web-poet projects, you will usually define item classes with many more attributes.
Tip
For an example of real item classes, see the zyte-common-items library.
Also note that, while in this tutorial you use Book only for data from a single website, books.toscrape.com, item classes are usually meant to be usable for many different websites that provide data with a similar data schema.
Create a page object class¶
To write web parsing code with web-poet, you write page object classes, Python classes that define how to extract data from a given type of input, usually some type of webpage from a specific website.
In this tutorial you will write a page object class for webpages of books.toscrape.com that show details about a book, such as these:
http://books.toscrape.com/catalogue/the-exiled_247/index.html
http://books.toscrape.com/catalogue/when-we-collided_955/index.html
http://books.toscrape.com/catalogue/set-me-free_988/index.html
Copy the following code into tutorial-project/tutorial/pages/books_toscrape_com.py:

from web_poet import field, handle_urls, WebPage

from ..items import Book


@handle_urls("books.toscrape.com")
class BookPage(WebPage[Book]):
    @field
    async def title(self):
        return self.css("h1::text").get()
In the code above:

- You define a page object class named BookPage by subclassing WebPage.

  It is possible to create a page object class by subclassing the simpler ItemPage class instead. However, WebPage:

  - Indicates that your page object class requires an HTTP response as input, which gets stored in the response attribute of your page object class as an HttpResponse object.
  - Provides attributes like html and url, and methods like css(), urljoin(), and xpath(), that make it easier to write parsing code.

- BookPage declares Book as its return type.

  WebPage, like its parent class ItemPage, is a generic class that accepts a type parameter. Unlike most generic classes, however, the specified type parameter is used for more than type hinting: it determines the item class that is used to store the data that fields return.

- BookPage is decorated with handle_urls(), which indicates for which domain BookPage is intended to work.

  It is possible to specify more specific URL patterns, instead of only the target URL domain. However, the URL domain and the output type (Book) are usually all the data needed to determine which page object class to use, which is the goal of the handle_urls() decorator.

- BookPage defines a field named title.

  Fields are methods of page object classes, preferably async methods, decorated with field(). Fields define the logic to extract a specific piece of information from the input of your page object class.

  BookPage.title extracts the title of a book from a book details webpage. Specifically, it extracts the text from the first h1 element of the input HTTP response.

  Here, title is not an arbitrary name. It was chosen specifically to match Book.title, so that during parsing the value that BookPage.title returns gets mapped to Book.title.
Use your page object class¶
Now that you have a page object class defined, it is time to use it.
First, install requests, which is required by web_poet.example.

Then copy the following code into tutorial-project/run.py:
from web_poet import consume_modules
from web_poet.example import get_item

from tutorial.items import Book

consume_modules("tutorial.pages")
item = get_item(
    "http://books.toscrape.com/catalogue/the-exiled_247/index.html",
    Book,
)
print(item)
Execute that code:
python tutorial-project/run.py
And the print(item) statement should output the following:
Book(title='The Exiled')
In this tutorial you use web_poet.example.get_item, which is a simple, incomplete implementation of the web-poet specification, built specifically for this tutorial, for demonstration purposes. In real projects, use an actual web-poet framework instead.

web_poet.example.get_item serves to illustrate the power of web-poet: once you have defined your page object class, a web-poet framework only needs 2 inputs from you:

- the URL from which you want to extract data, and
- the desired output, either a page object class or, in this case, an item class.
Notice that you must also call consume_modules() once before your first call to get_item. consume_modules ensures that the specified Python modules are loaded. You pass consume_modules the import paths of the modules where your page object classes are defined. After loading those modules, handle_urls() decorators register the page object classes that they decorate into web_poet.default_registry, which get_item uses to determine which page object class to use based on its input parameters (URL and item class).
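Conceptually, the registry behaves like a mapping from (domain, item class) pairs to page object classes. The sketch below is an illustration only, not web-poet's actual implementation: the real handle_urls() takes just the domain and reads the item class from the page object class, and web_poet.default_registry also handles URL patterns and rule precedence.

```python
# Illustrative stand-in for web_poet.default_registry.
registry = {}


def handle_urls(domain, item_class):
    """Simplified decorator: register a page object class under a key."""

    def decorator(page_object_class):
        registry[(domain, item_class)] = page_object_class
        return page_object_class

    return decorator


class Book:  # stand-in item class
    pass


@handle_urls("books.toscrape.com", Book)
class BookPage:  # stand-in page object class
    pass


# A get_item-style lookup: URL domain + item class -> page object class.
page_cls = registry[("books.toscrape.com", Book)]
```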
Your web-poet framework can take care of everything else:

- It matches the input URL and item class to BookPage, based on the URL pattern that you defined with the handle_urls() decorator, and the return type that you declared in the page object class (Book).
- It inspects the inputs declared by BookPage, and builds an instance of BookPage with the required inputs.

  BookPage is a WebPage subclass, and WebPage declares an attribute named response of type HttpResponse. Your web-poet framework sees this and, as a result, creates an HttpResponse object from the input URL by downloading the URL response, and assigns that object to the response attribute of a new BookPage object.

- It builds the output item, Book(title='The Exiled'), using the to_item() method of BookPage, inherited from ItemPage, which in turn uses all fields of BookPage to create an instance of Book, which you declared as the return type of BookPage.
Extend and override your code¶
To continue this tutorial, you will need extended versions of Book and BookPage, with additional fields. However, rather than editing the existing Book and BookPage classes, you will see how you can instead create new classes that inherit from them.
Append the following code to tutorial-project/tutorial/items.py:

from typing import Optional


@define
class CategorizedBook(Book):
    category: str
    category_rank: Optional[int] = None
The code above defines a new item class, CategorizedBook, that inherits the title attribute from Book and defines 2 more attributes: category and category_rank.
Append the following code to tutorial-project/tutorial/pages/books_toscrape_com.py:

from web_poet import Returns

from ..items import CategorizedBook


@handle_urls("books.toscrape.com")
class CategorizedBookPage(BookPage, Returns[CategorizedBook]):
    @field
    async def category(self):
        return self.css(".breadcrumb a::text").getall()[-1]
In the code above:

- You define a new page object class: CategorizedBookPage.
- CategorizedBookPage subclasses BookPage, inheriting its title field, and defines a new one: category.

  CategorizedBookPage does not define a category_rank field yet; you will add it later on. For now, the default value defined in CategorizedBook for category_rank, None, will be used.

- CategorizedBookPage indicates that it returns a CategorizedBook object.

  WebPage is a generic class, which is why you could use WebPage[Book] in the definition of BookPage to indicate Book as the output type of BookPage. However, BookPage is not a generic class, so something like BookPage[CategorizedBook] would not work.

  So instead you use Returns, a special, generic class that you can inherit to redefine the output type of your page object subclasses.
After you update your tutorial-project/run.py script to request a CategorizedBook item:

from web_poet import consume_modules
from web_poet.example import get_item

from tutorial.items import CategorizedBook

consume_modules("tutorial.pages")
item = get_item(
    "http://books.toscrape.com/catalogue/the-exiled_247/index.html",
    CategorizedBook,
)
print(item)

And execute it again:

python tutorial-project/run.py

You can see in the new output that your new classes have been used:

CategorizedBook(title='The Exiled', category='Mystery', category_rank=None)
Use additional requests¶
To extract data about an item, sometimes the HTTP response of a single URL is not enough; you need additional HTTP responses to get all the data that you want. That is the case with the category_rank attribute.

The category_rank attribute indicates the position in which a book appears in the list of books of its category. For example, The Exiled is 24th in the Mystery category, so the value of category_rank should be 24 for that book.
However, there is no indication of this value in the book details page. To get this value, you need to visit the URL of the category of the book whose data you are extracting, find the entry of that book within the grid of books of the category, and record in which position you found it. And categories with more than 20 books are split into multiple pages, so you may need more than 1 additional request for some books.
Extend CategorizedBookPage in tutorial-project/tutorial/pages/books_toscrape_com.py as follows:

from attrs import define
from web_poet import HttpClient, Returns

from ..items import CategorizedBook


@handle_urls("books.toscrape.com")
@define
class CategorizedBookPage(BookPage, Returns[CategorizedBook]):
    http: HttpClient
    _books_per_page = 20

    @field
    async def category(self):
        return self.css(".breadcrumb a::text").getall()[-1]

    @field
    async def category_rank(self):
        response, book_url, page = self.response, self.url, 0
        category_page_url = self.css(".breadcrumb a::attr(href)").getall()[-1]
        while category_page_url:
            category_page_url = response.urljoin(category_page_url)
            response = await self.http.get(category_page_url)
            urls = response.css("h3 a::attr(href)").getall()
            for position, url in enumerate(urls, start=1):
                url = str(response.urljoin(url))
                if url == book_url:
                    return page * self._books_per_page + position
            category_page_url = response.css(".next a::attr(href)").get()
            if not category_page_url:
                return None
            page += 1
In the code above:

- You declare a new input in CategorizedBookPage, http, of type HttpClient.

  You also add the @attrs.define decorator to CategorizedBookPage, as it is required when adding new required attributes to subclasses of attrs classes.

- You define the category_rank field so that it uses the http input object to send additional requests to find the position of the current book within its category. Specifically:

  - You extract the category URL from the book details page.
  - You visit that category URL, and iterate over the listed books until you find one with the same URL as the current book.
  - If you find a match, you return the position at which you found the book.
  - If there is no match, and there is a next page, you repeat the previous step with the URL of that next page as the category URL.
  - If at some point there are no more “next” pages and you have not yet found the book, you return None.
When you execute tutorial-project/run.py now, category_rank has the expected value:
CategorizedBook(title='The Exiled', category='Mystery', category_rank=24)
Use parameters¶
You may notice that the execution takes longer now. That is because CategorizedBookPage now requires 2 or more requests to find the value of the category_rank attribute.

If you use CategorizedBookPage as part of a web scraping project that targets a single book URL, this cannot be helped. If you want to extract the category_rank attribute, you need those additional requests. Your only option to avoid them is to stop extracting the category_rank attribute.
However, if your web scraping project targets all book URLs from one or more categories by visiting those category URLs, extracting book URLs from them, and then using CategorizedBookPage with those book URLs as input, there is something you can change to save many requests: keep track of the positions where you find books as you visit their categories, and pass each position to CategorizedBookPage as additional input.
Extend CategorizedBookPage in tutorial-project/tutorial/pages/books_toscrape_com.py as follows:

from attrs import define
from web_poet import HttpClient, PageParams, Returns

from ..items import CategorizedBook


@handle_urls("books.toscrape.com")
@define
class CategorizedBookPage(BookPage, Returns[CategorizedBook]):
    http: HttpClient
    page_params: PageParams
    _books_per_page = 20

    @field
    async def category(self):
        return self.css(".breadcrumb a::text").getall()[-1]

    @field
    async def category_rank(self):
        category_rank = self.page_params.get("category_rank")
        if category_rank is not None:
            return category_rank
        response, book_url, page = self.response, self.url, 0
        category_page_url = self.css(".breadcrumb a::attr(href)").getall()[-1]
        while category_page_url:
            category_page_url = response.urljoin(category_page_url)
            response = await self.http.get(category_page_url)
            urls = response.css("h3 a::attr(href)").getall()
            for position, url in enumerate(urls, start=1):
                url = str(response.urljoin(url))
                if url == book_url:
                    return page * self._books_per_page + position
            category_page_url = response.css(".next a::attr(href)").get()
            if not category_page_url:
                return None
            page += 1
In the code above, you declare a new input in CategorizedBookPage, page_params, of type PageParams. It is a dictionary of parameters that you may receive from the code using your page object class.

In the category_rank field, you check if you have received a parameter also called category_rank, and if so, you return that value instead of using additional requests to find it.
You can now update your tutorial-project/run.py script to pass that parameter to get_item:

item = get_item(
    "http://books.toscrape.com/catalogue/the-exiled_247/index.html",
    CategorizedBook,
    page_params={"category_rank": 24},
)
When you execute tutorial-project/run.py now, execution should take less time, but the result should be the same as before:

CategorizedBook(title='The Exiled', category='Mystery', category_rank=24)

The difference is that now the value of category_rank comes from tutorial-project/run.py, and not from additional requests sent by CategorizedBookPage.
From the ground up¶
Learn why and how web-poet came to be as you transform a simple, rigid initial web scraping snippet into maintainable, reusable web-poet code.
Writing reusable parsing code¶
Imagine you are writing code to scrape a book webpage from books.toscrape.com, and you implement a scrape function like this:
import requests
from parsel import Selector


def scrape(url: str) -> dict:
    response = requests.get(url)
    selector = Selector(response.text)
    return {
        "url": response.url,
        "title": selector.css("h1").get(),
    }


item = scrape("http://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html")
This scrape function is simple, but it has a big issue: it only supports downloading the specified URL with the requests library. What if you want to use aiohttp, for concurrency support? What if you want to run scrape with a local snapshot of a URL response, to write an automated test for scrape that does not rely on a network connection?
The first step towards addressing this issue is to split your scrape function into 2 separate functions, download and parse:
import requests
from parsel import Selector


def parse(response: requests.Response) -> dict:
    selector = Selector(response.text)
    return {
        "url": response.url,
        "title": selector.css("h1").get(),
    }


def download(url: str) -> requests.Response:
    return requests.get(url)


url = "http://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html"
response = download(url)
item = parse(response)
Now that download and parse are separate functions, you can replace download with an alternative implementation that uses aiohttp, or that reads from local files.

There is still an issue, though: parse expects an instance of requests.Response. Any alternative implementation of download would need to create a response object of the same type, forcing a dependency on requests even if downloads are handled with a different library.

So you need to change the input of the parse function into something that does not tie you to a specific download library. One option is to create your own, download-independent Response class, to store the response data that any download function should be able to provide:
import requests
from dataclasses import dataclass
from parsel import Selector


@dataclass
class Response:
    url: str
    text: str


def parse(response: Response) -> dict:
    selector = Selector(response.text)
    return {
        "url": response.url,
        "title": selector.css("h1").get(),
    }


def download(url: str) -> Response:
    response = requests.get(url)
    return Response(url=response.url, text=response.text)


url = "http://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html"
response = download(url)
item = parse(response)
The parse function is no longer tied to any specific download library, and alternative versions of the download function can be implemented with other libraries.
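The payoff of this split is testability. The sketch below shows a test double for download that feeds parse a canned response without any network access; it is self-contained, so it substitutes a simplified re-based extraction for the parsel-based parse above, and download_fake is a name invented for this snippet:

```python
import re
from dataclasses import dataclass


@dataclass
class Response:
    url: str
    text: str


def parse(response: Response) -> dict:
    # Simplified extraction with re instead of parsel, for brevity.
    match = re.search(r"<h1>(.*?)</h1>", response.text)
    return {
        "url": response.url,
        "title": match.group(1) if match else None,
    }


def download_fake(url: str) -> Response:
    # Test double: returns a canned response instead of hitting the network.
    return Response(url=url, text="<h1>A Light in the Attic</h1>")


item = parse(download_fake("http://example.com/"))
```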
Parsing with web-poet¶
web-poet asks you to organize your code in a very similar way. Let’s convert the parse function into a web-poet page object class:
import requests

from web_poet import Injectable, HttpResponse


class BookPage(Injectable):
    def __init__(self, response: HttpResponse):
        self.response = response

    def to_item(self) -> dict:
        return {
            "url": self.response.url,
            "title": self.response.css("h1").get(),
        }


def download(url: str) -> HttpResponse:
    response = requests.get(url)
    return HttpResponse(
        url=response.url,
        body=response.content,
        headers=response.headers,
    )


url = "http://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html"
response = download(url)
book_page = BookPage(response=response)
item = book_page.to_item()
Differences from the previous example:

- web-poet provides a standard HttpResponse class, with helper methods like css().

  Note how headers are passed when creating an HttpResponse instance. They are needed to properly decode the body (which is bytes) as text following web browser rules, which involves checking the Content-Type header, HTML meta tags, BOM markers in the body, etc.

- Instead of the parse function, there is now a BookPage class, which inherits from the Injectable base class, receives response data in its __init__ method, and returns the extracted item from its to_item() method. to_item is a standard method name used by web-poet.
Receiving a response argument in __init__ is very common for page objects, so web-poet provides a shortcut for it: inherit from WebPage, which provides this __init__ method implementation. You can then refactor your BookPage class as follows:
from web_poet import WebPage


class BookPage(WebPage):
    def to_item(self) -> dict:
        return {
            "url": self.response.url,
            "title": self.response.css("h1").get(),
        }
WebPage even provides shortcuts for some response attributes and methods:

from web_poet import WebPage


class BookPage(WebPage):
    def to_item(self) -> dict:
        return {
            "url": self.url,
            "title": self.css("h1").get(),
        }
At this point you may be wondering why web-poet requires you to write a class with a to_item method rather than a function. The answer is flexibility.

For example, the use of a class instead of a function makes fields possible, which make parsing code easier to read:
from web_poet import WebPage, field


class BookPage(WebPage):
    @field
    def url(self):
        return self.response.url

    @field
    def title(self):
        return self.css("h1").get()
Using fields also makes it unnecessary to define to_item() manually, and allows reading individual fields when you don’t need the complete to_item() output.

Note
The BookPage.to_item() method is async in the example above. See Fields for more information.
Using classes also makes it easy, for example, to implement dependency injection, which is how web-poet builds inputs.
Downloading with web-poet¶
What about the implementation of the download function? How would you implement that in web-poet? Well, ideally, you wouldn’t.

To parse data from a web page using web-poet, you only need to write the parsing part, e.g. the BookPage page object class above.

Then, you let a web-poet framework handle the download part for you. You pass that framework the URL of a web page to parse, and either a page object class (the BookPage class here) or an item class, and that’s it:

item = some_framework.get(url, BookPage)
web-poet does not provide any framework, beyond an example one featured in the tutorial and not intended for production. The role of web-poet is to define a specification on how to write parsing logic so that it can be reused with different frameworks.
Page object classes should be flexible enough to be used with very different frameworks, including:
- synchronous or asynchronous frameworks
- asynchronous frameworks based on callbacks or on coroutines (async def / await syntax)
- single-node and distributed systems
- different underlying HTTP implementations, or even implementations with no HTTP support at all
Page objects¶
A page object is a code wrapper for a webpage, or for a part of a webpage, that implements the logic to parse the raw webpage data into structured data.
To use web-poet, define page object classes for your target websites, and get the output item using a web-poet framework.
Defining a page object class¶
A page object class is a Python class that:

- Subclasses ItemPage.
- Declares typed input parameters in its __init__ method.
- Uses fields. Alternatively, it can implement a to_item method, which can be synchronous or asynchronous, and returns the webpage content as an item.
For example:
from web_poet import HttpResponse, ItemPage, field


class FooPage(ItemPage[MyItem]):
    def __init__(self, response: HttpResponse):
        self.response = response

    @field
    def foo(self) -> str:
        return self.response.css(".foo").get()
Note
MyItem in the code examples of this page is a placeholder for an item class.
Minimizing boilerplate¶
There are a few ways for you to minimize boilerplate when defining a page object class.
For example, you can use attrs to remove the need for a custom __init__ method:

from attrs import define

from web_poet import HttpResponse, ItemPage, field


@define
class FooPage(ItemPage[MyItem]):
    response: HttpResponse

    @field
    def foo(self) -> str:
        return self.response.css(".foo").get()
If your page object class needs HttpResponse as input, there is also WebPage, an ItemPage subclass that declares an HttpResponse input and provides helper methods to use it:

from web_poet import WebPage, field


class FooPage(WebPage[MyItem]):
    @field
    def foo(self) -> str:
        return self.css(".foo").get()
Getting the output item¶
You should include your page object classes into a page object registry, e.g. decorate them with handle_urls():

from web_poet import WebPage, field, handle_urls


@handle_urls("example.com")
class FooPage(WebPage[MyItem]):
    @field
    def foo(self) -> str:
        return self.css(".foo").get()
Then, provided your page object class code is imported (see consume_modules()), your framework can build the output item after you provide the target URL and the desired output item class, as shown in the tutorial.
Your framework chooses the right page object class based on your input parameters, downloads the required data, builds a page object, and calls the to_item method of that page object.
Note that, while the examples above use dict as an output item for simplicity, using less generic item classes is recommended. That way, you can use different page object classes, with different output items, for the same website.
Getting a page object¶
Alternatively, frameworks can return a page object instead of an item, and you can call to_item yourself.

However, there are drawbacks to this approach:

- to_item can be synchronous or asynchronous, so you need to use ensure_awaitable():

  from web_poet.utils import ensure_awaitable

  item = await ensure_awaitable(foo_page.to_item())

- to_item may raise certain exceptions, like Retry or UseFallback, which, depending on your framework, may not be handled automatically when getting a page object instead of an item.
Building a page object manually¶
It is possible to create a page object from a page object class by passing its inputs as parameters. For example, to manually create an instance of the FooPage page object class defined above:

foo_page = FooPage(
    response=HttpResponse(
        "https://example.com",
        b"<!DOCTYPE html>\n<title>Foo</title>",
    ),
)

However, your code will break if the page object class changes its inputs. Building page objects through a framework prevents that.
Inputs¶
Page object classes must define, in their __init__ method, input parameters with type hints pointing to input classes.

Those input classes may be:

- Other page object classes.
- Item classes, when using a framework that can provide item classes.
- Any other class that subclasses Injectable, or that is registered or decorated with Injectable.register.

Based on the target URL and the parameter type hints, frameworks automatically build the required objects at run time and pass them to the __init__ method of the corresponding page object class.

For example, if a page object class has an __init__ parameter of type HttpResponse, and the target URL is https://example.com, your framework would send an HTTP request to https://example.com, download the response, build an HttpResponse object with the response data, and pass it to the __init__ method of the page object class being used.
Built-in input classes¶
Warning
Not all frameworks support all web-poet built-in input classes.
The web_poet.page_inputs module defines multiple classes that you can use as inputs for a page object class, including:

- HttpResponse, a complete HTTP response, including URL, headers, and body. This is the most common input for a page object class. See Working with HttpResponse.
- HttpClient, to send additional requests.
- RequestUrl, the target URL before following redirects. Useful, for example, to skip the target URL download, and instead use HttpClient to send a custom request based on parts of the target URL.
- PageParams, to receive data from the crawling code.
- Stats, to write key-value data pairs during parsing that you can inspect later, e.g. for debugging purposes.
- BrowserResponse, which includes the URL, status code, and BrowserHtml of a rendered web page.
- AnyResponse, which holds either BrowserResponse or HttpResponse as its .response instance, depending on which one is available or more appropriate.
Working with HttpResponse¶
HttpResponse has many attributes and methods.

To get the entire response body, you can use body for the raw bytes, text for the str (decoded with the detected encoding), or json() to load a JSON response as a Python data structure:
>>> response.body
b'{"foo": "bar"}'
>>> response.text
'{"foo": "bar"}'
>>> response.json()
{'foo': 'bar'}
There are also methods to select content from responses: jmespath() for JSON, and css() and xpath() for HTML and XML:
>>> response.jmespath("foo")
[<Selector query='foo' data='bar'>]
>>> response.css("h1::text")
[<Selector query='descendant-or-self::h1/text()' data='Title'>]
>>> response.xpath("//h1/text()")
[<Selector query='//h1/text()' data='Title'>]
Custom input classes¶
You may define your own input classes if you are using a framework that supports it.
However, note that custom input classes may make your page object classes less portable across frameworks.
Items¶
The to_item method of a page object class must return an item.

An item is a data container object supported by the itemadapter library, such as a dict, an attrs class, or a dataclass. For example:

import attrs


@attrs.define
class MyItem:
    foo: int
    bar: str
Because itemadapter allows implementing support for arbitrary classes, any kind of Python object can potentially work as an item.
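For instance, the same data can live in a dict or in a dataclass. This small stdlib-only illustration uses dataclasses.asdict to show the equivalence; itemadapter's ItemAdapter provides this kind of uniform, dict-like access for all of its supported container types:

```python
from dataclasses import asdict, dataclass


@dataclass
class MyItem:
    foo: int
    bar: str


# The same item data, in two different container types.
item_as_dataclass = MyItem(foo=1, bar="a")
item_as_dict = {"foo": 1, "bar": "a"}

# asdict() converts the dataclass item to its dict equivalent.
assert asdict(item_as_dataclass) == item_as_dict
```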
Defining the item class of a page object class¶
When inheriting from ItemPage, indicate the item class to return between brackets:

@attrs.define
class MyPage(ItemPage[MyItem]):
    ...

to_item builds an instance of the specified item class based on the page object class fields:

page = MyPage(...)
item = await page.to_item()
assert isinstance(item, MyItem)
You can also define ItemPage subclasses that are not meant to be used directly, only subclassed, and not annotate ItemPage in them. You can then annotate those classes when subclassing them:

@attrs.define
class MyBasePage(ItemPage):
    ...


@attrs.define
class MyPage(MyBasePage[MyItem]):
    ...
To change the item class of a subclass that has already defined its item class,
use Returns
:
@attrs.define
class MyOtherPage(MyPage, Returns[MyOtherItem]):
    ...
Best practices for item classes¶
To keep your code maintainable, we recommend that you:
Instead of dict, use proper item classes based on dataclasses or attrs, to make it easier to detect issues like field name typos or missing required fields.
Reuse item classes.
For example, if you want to extract product details data from 2 e-commerce websites, try to use the same item class for both of them. Or at least try to define a base item class with shared fields, and only keep website-specific fields in website-specific items.
Keep item classes as logic-free as possible.
For example, any parsing and field cleanup logic is better handled through page object classes, e.g. using field processors.
Having code that makes item field values different from their counterpart page object field values can subvert the expectations of users of your code, which might need to access page object fields directly, for example for field subset selection.
If you are looking for ready-made item classes, check out zyte-common-items.
Rules¶
Rules are ApplyRule
objects that tell web-poet which page
object class to use based on user input, i.e. the target
URL and the requested output class (a page object class or an item class).
Rules are necessary if you want to request an item instance, because rules tell web-poet which page object class to use to generate your item instance. Rules can also be useful as documentation or to get information about page object classes programmatically.
Rule precedence can also be useful. For example, to implement generic page object classes that you can override for specific websites.
Defining rules¶
The handle_urls()
decorator is the simplest way to define a rule for
a page object. For example:
from web_poet import ItemPage, handle_urls
from my_items import MyItem
@handle_urls("example.com")
class MyPage(ItemPage[MyItem]):
    ...
The code above tells web-poet to use the MyPage
page object class when given a URL pointing to the example.com
domain
name and being asked for MyPage
or MyItem
as output class.
Alternatively, you can manually create and register ApplyRule
objects:
from url_matcher import Patterns
from web_poet import ApplyRule, ItemPage, default_registry
from my_items import MyItem
class MyPage(ItemPage[MyItem]):
    ...

rule = ApplyRule(
    for_patterns=Patterns(include=['example.com']),
    use=MyPage,
    to_return=MyItem,
)
default_registry.add_rule(rule)
URL patterns¶
Every rule defines a url_matcher.Patterns
object that determines if
any given URL is a match for the rule.
Patterns
objects offer a simple but powerful syntax for
URL matching. For example:
| Pattern | Behavior |
| --- | --- |
| (empty string) | Matches any URL |
| example.com | Matches any URL on the example.com domain and subdomains |
| example.com/products/ | Matches example.com URLs under the /products/ path |
| example.com?productId=* | Matches example.com URLs with productId=… in their query string |
For details and more examples, see the url-matcher documentation.
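The domain and path behaviors from the table above can be sketched with a toy matcher using only the standard library. This is an illustration of the matching semantics, not url-matcher's actual implementation, and it deliberately omits query-string patterns like example.com?productId=*:

```python
from urllib.parse import urlparse

def matches(pattern: str, url: str) -> bool:
    # Toy approximation of the table above; the real url-matcher
    # syntax is richer. Query-string patterns are not handled.
    if not pattern:
        return True  # an empty pattern matches any URL
    parsed = urlparse(url)
    domain, _, path = pattern.partition("/")
    # The domain part matches the domain itself and its subdomains.
    host_ok = parsed.netloc == domain or parsed.netloc.endswith("." + domain)
    # The path part, if present, must be a prefix of the URL path.
    path_ok = not path or parsed.path.startswith("/" + path)
    return host_ok and path_ok

assert matches("", "https://anything.example/")
assert matches("example.com", "https://shop.example.com/about")
assert matches("example.com/products/", "https://example.com/products/123")
assert not matches("example.com/products/", "https://example.com/about")
```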
When using the handle_urls()
decorator, its include
, exclude
,
and priority
parameters are used to create a Patterns
object. When creating an ApplyRule
object manually, you must create
a Patterns
object yourself and pass it to the
for_patterns
parameter of ApplyRule
.
Rule precedence¶
Often you define rules so that a given user input, i.e. a combination of a target URL and an output class, can only match 1 rule. However, there are scenarios where it can be useful to define 2 or more rules that can all match a given user input.
For example, you might want to define a “generic” page object class with some default implementation of field extraction, e.g. based on semantic markup or machine learning, and be able to override it based on the input URL, e.g. for specific websites or URL patterns, with a more specific page object class.
For a given user input, when 2 or more rules are a match, web-poet breaks the tie as follows:
One rule can indicate that its page object class overrides another page object class.
This is specified by ApplyRule.instead_of. When using the handle_urls() decorator, the value comes from the instead_of parameter of the decorator.
For example, the following page object class would override MyPage from above:
@handle_urls("example.com", instead_of=MyPage)
class OverridingPage(ItemPage[MyItem]):
    ...
That is:
If the requested output class is MyPage, an instance of OverridingPage is returned instead.
If the requested output class is MyItem, an instance of OverridingPage is created, and used to build an instance of MyItem, which is returned.
One rule can declare a higher priority than another rule, taking precedence.
Rule priority is determined by the value of ApplyRule.for_patterns.priority. When using the handle_urls() decorator, the value comes from the priority parameter of the decorator. Rule priority is 500 by default.
For example, given the following page object class:
@handle_urls("example.com", priority=510)
class PriorityPage(ItemPage[MyItem]):
    ...
The following would happen:
If the requested output class is MyItem, an instance of PriorityPage is created, and used to build an instance of MyItem, which is returned.
If the requested output class is MyPage, an instance of MyPage is returned, since PriorityPage is not defined as an override for MyPage.
instead_of trumps priority: if a rule overrides another rule using instead_of, it does not matter if the overridden rule had a higher priority.
When multiple rules override the same page object class, though, priority can break the tie.
If none of those tie breakers are in place, the first rule added to the
registry takes precedence. However, relying on registration order is
discouraged, and you will get a warning if you register 2 or more rules with
the same URL patterns, same output item class, same priority, and no
instead_of
value. See also Rule conflicts.
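The tie-breaking order described above can be sketched as a small, self-contained function. This is a toy model of the described precedence rules, not web-poet's actual resolution code, and the Rule class here is a simplified stand-in for ApplyRule:

```python
from dataclasses import dataclass
from typing import List, Optional

# Simplified stand-in for ApplyRule: instead_of beats priority,
# and priority beats registration order.
@dataclass
class Rule:
    use: str
    priority: int = 500
    instead_of: Optional[str] = None

def select(rules: List[Rule]) -> Rule:
    # Rules whose page object class is overridden by another rule lose,
    # regardless of their priority.
    overridden = {r.instead_of for r in rules if r.instead_of}
    candidates = [r for r in rules if r.use not in overridden]
    # Highest priority wins; max() keeps the first (earliest-registered)
    # rule among equal priorities.
    return max(candidates, key=lambda r: r.priority)

rules = [Rule("MyPage"), Rule("OverridingPage", instead_of="MyPage")]
assert select(rules).use == "OverridingPage"
assert select([Rule("A"), Rule("B", priority=510)]).use == "B"
```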
Rule registries¶
Rules should be stored in a RulesRegistry
object.
web-poet defines a default, global RulesRegistry
object at
web_poet.default_registry
. Rules defined with the handle_urls()
decorator are added to this registry.
Loading rules¶
For a framework to apply your rules, you need to make sure
that your code that adds those rules to web_poet.default_registry
is
executed.
When using the handle_urls()
decorator, that usually means that
you need to make sure that Python imports the files where the decorator is
used.
You can use the consume_modules()
function in some entry
point of your code for that:
from web_poet import consume_modules
consume_modules("my_package.pages", "external_package.pages")
The ideal location for this function depends on your framework. Check the documentation of your framework for more information.
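The idea behind consume_modules() can be sketched with the standard library: importing a module executes its top-level code, which is what makes the @handle_urls decorators in it register their rules. This is an illustrative sketch, not web-poet's implementation:

```python
import sys
from importlib import import_module

def consume_modules_sketch(*names: str) -> None:
    # Importing a module runs its top-level code, so any @handle_urls
    # decorators in it execute and add their rules to the registry.
    for name in names:
        import_module(name)

# Using a stdlib module here purely for demonstration.
consume_modules_sketch("json")
assert "json" in sys.modules
```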
Rule conflicts¶
A rule conflict occurs when multiple rules have the same instead_of
and
priority
values and can match the same URL.
When it affects rules defined in your code base, solve the conflict adjusting
those instead_of
and priority
values as needed.
When it affects rules from an external package, you have the following options to solve the conflict:
Subclass one of the conflicting page object classes in your code base, using a similar rule except for a tie-breaking change to its instead_of or priority value.
For example, if package1.A and package2.B are page object classes with conflicting rules, with a default priority (500), and you want package1.A to take precedence, declare a new page object class as follows:
from package1 import A
from web_poet import handle_urls

@handle_urls(..., priority=510)
class NewA(A):
    pass
If your framework allows defining a custom list of rules, you could use web_poet.default_registry methods like get_rules() or search() to build such a list, including only rules that have no conflicts.
Fields¶
A field is a read-only property in a page object class
decorated with @field
instead of
@property
.
Each field is named after a key of the item that the page object class returns. A field uses the inputs of its page object class to return the right value for the matching item key.
For example:
from typing import Optional
import attrs
from web_poet import ItemPage, HttpResponse, field
@attrs.define
class MyPage(ItemPage):
    response: HttpResponse

    @field
    def foo(self) -> Optional[str]:
        return self.response.css(".foo").get()
Synchronous and asynchronous fields¶
Fields can be either synchronous (def
) or asynchronous (async def
).
Asynchronous fields make sense, for example, when sending additional requests:
from typing import Optional
import attrs
from web_poet import ItemPage, HttpClient, HttpResponse, field
@attrs.define
class MyPage(ItemPage):
    response: HttpResponse
    http: HttpClient

    @field
    def name(self) -> Optional[str]:
        return self.response.css(".name").get()

    @field
    async def price(self) -> Optional[str]:
        resp = await self.http.get("...")
        return resp.json().get("price")
Unlike the values of synchronous fields, the values of asynchronous fields need to be awaited:
page = MyPage(...)
name = page.name
price = await page.price
Mixing synchronous and asynchronous fields can be messy:
You need to know whether a field is synchronous or asynchronous to write the right code to read its value.
If a field changes from synchronous to asynchronous or vice versa, calls that read the field need to be updated.
Changing from synchronous to asynchronous might be sometimes necessary due to website changes (e.g. needing additional requests).
To address these issues, use ensure_awaitable()
to read both
synchronous and asynchronous fields with the same code:
from web_poet.utils import ensure_awaitable
page = MyPage(...)
name = await ensure_awaitable(page.name)
price = await ensure_awaitable(page.price)
Note
Using only asynchronous fields also works, but it prevents accessing other fields from field processors.
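The idea behind ensure_awaitable() can be sketched with the standard library: await the value if it is awaitable, otherwise return it as-is. This is an illustrative sketch of the helper's behavior, not web-poet's implementation:

```python
import asyncio
import inspect

async def ensure_awaitable_sketch(value):
    # If the value is awaitable (as returned by an async field),
    # await it; otherwise return the plain value of a sync field.
    if inspect.isawaitable(value):
        return await value
    return value

async def main():
    async def price():  # stand-in for an asynchronous field
        return "10.00"
    assert await ensure_awaitable_sketch("Chair") == "Chair"
    assert await ensure_awaitable_sketch(price()) == "10.00"

asyncio.run(main())
```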
Inheritance¶
To create a page object class that is very similar to another, subclassing the former page object class is often a good approach to maximize code reuse.
In a subclass of a page object class you can reimplement fields, add fields, remove fields, or rename fields.
Reimplementing a field¶
Reimplementing a field when subclassing a page object class should be straightforward:
import attrs
from web_poet import field, ensure_awaitable
from my_library import BasePage
@attrs.define
class CustomPage(BasePage):
    @field
    async def foo(self) -> str:
        base_foo = await ensure_awaitable(super().foo)
        return f"{base_foo} (modified)"
Adding a field¶
To add a new field to a page object class when subclassing:
Define a new item class that includes the new field, for example a subclass of the item class returned by the original page object class.
In your new page object class, subclass both the original page object class and Returns, the latter including the new item class between brackets.
Implement the extraction code for the new field in the new page object class.
For example:
import attrs
from web_poet import field, Returns
from my_library import BasePage, BaseItem
@attrs.define
class CustomItem(BaseItem):
    new_field: str

@attrs.define
class CustomPage(BasePage, Returns[CustomItem]):
    @field
    def new_field(self) -> str:
        ...
Removing a field¶
To remove a field from a page object class when subclassing:
Define a new item class that defines all fields but the one being removed.
In your new page object class, subclass the original page object class and Returns with the new item class between brackets, and set skip_nonitem_fields=True.
When building an item, page object class fields without a matching item class field will now be ignored, rather than raising an exception.
Your new page object class will still define the field, but the resulting item will not.
For example:
import attrs
from web_poet import Returns
from my_library import BasePage
@attrs.define
class CustomItem:
    kept_field: str

@attrs.define
class CustomPage(BasePage, Returns[CustomItem], skip_nonitem_fields=True):
    pass
Alternatively, you can consider using a page object as input for removing fields. It is more verbose than subclassing,
because you need to define every field in your page object class, but it can
catch some mismatches between page object class fields and item class fields
that would otherwise be hidden by skip_nonitem_fields
.
Renaming a field¶
To rename a field from a page object class when subclassing:
Define a new item class that defines all fields, including the renamed field.
In your new page object class, subclass the original page object class and Returns with the new item class between brackets, and set skip_nonitem_fields=True.
When building an item, page object class fields without a matching item class field will now be ignored, rather than raising an exception.
Define a field for the new field name that returns the value from the old field name.
Your new page object class will still define the old field name, but the resulting item will not.
For example:
import attrs
from web_poet import Returns, field, ensure_awaitable
from my_library import BasePage

@attrs.define
class CustomItem:
    new_field: str

@attrs.define
class CustomPage(BasePage, Returns[CustomItem], skip_nonitem_fields=True):
    @field
    async def new_field(self) -> str:
        return await ensure_awaitable(self.old_field)
Alternatively, you can consider using a page object as input for renaming fields. It is more verbose than subclassing,
because you need to define every field in your page object class, but it can
catch some mismatches between page object class fields and item class fields
that would otherwise be hidden by skip_nonitem_fields
.
Composition¶
There are 2 forms of composition that you can use when writing a page object: using a page object as input, and using a field mixin.
Using a page object as input¶
You can reuse a page object class from another page object class using composition instead of inheritance by using the original page object class as a dependency in a brand new page object class returning a brand new item class.
This is a good approach when you want to reuse code but the page object classes
are very different, or when you want to remove or rename fields without relying
on skip_nonitem_fields
.
For example:
import attrs
from web_poet import ItemPage, field, ensure_awaitable
from my_library import BasePage

@attrs.define
class CustomItem:
    name: str

@attrs.define
class CustomPage(ItemPage[CustomItem]):
    base: BasePage

    @field
    async def name(self) -> str:
        name = await ensure_awaitable(self.base.name)
        brand = await ensure_awaitable(self.base.brand)
        return f"{brand}: {name}"
Instead of a page object, it is possible to declare the item it returns as a dependency in your new page object class. For example:
import attrs
from web_poet import ItemPage, field
from my_library import BaseItem

@attrs.define
class CustomItem:
    name: str

@attrs.define
class CustomPage(ItemPage[CustomItem]):
    base: BaseItem

    @field
    def name(self) -> str:
        return f"{self.base.brand}: {self.base.name}"
This gives you the flexibility to use rules to set the page object class to use when building the item. Also, item fields can be read from synchronous methods even if the source page object fields were asynchronous.
On the other hand, all fields of the source page object class will always be called to build the entire item, which may be a waste of resources if you only need to access some of the item fields.
Field mixins¶
You can subclass web_poet.fields.FieldsMixin
to create a mixin to
reuse field definitions across multiple, otherwise-unrelated classes. For
example:
import attrs
from web_poet import ItemPage, field
from web_poet.fields import FieldsMixin
from my_library import BaseItem1, BaseItem2

@attrs.define
class CustomItem:
    name: str

class NameMixin(FieldsMixin):
    @field
    def name(self) -> str:
        return f"{self.base.brand}: {self.base.name}"

@attrs.define
class CustomPage1(NameMixin, ItemPage[CustomItem]):
    base: BaseItem1

@attrs.define
class CustomPage2(NameMixin, ItemPage[CustomItem]):
    base: BaseItem2
Field processors¶
Field values often need to be cleaned or processed using reusable functions.
@field
takes an optional out
argument with
a list of such functions. They will be applied to the field value before
returning it:
from web_poet import ItemPage, HttpResponse, field
def clean_tabs(s: str) -> str:
    return s.replace('\t', ' ')

def add_brand(s: str, page: ItemPage) -> str:
    return f"{page.brand} - {s}"

class MyPage(ItemPage):
    response: HttpResponse

    @field(out=[clean_tabs, str.strip, add_brand])
    def name(self) -> str:
        return self.response.css(".name ::text").get() or ""

    @field(cached=True)
    def brand(self) -> str:
        return self.response.css(".brand ::text").get() or ""
Accessing other fields from field processors¶
If a processor takes an argument named page
, that argument will contain the
page object instance. This allows processing a field differently based on the
values of other fields.
Be careful of circular references. Accessing a field runs its processors; if
two fields reference each other, RecursionError
will be raised.
You should enable caching for fields accessed in processors, to avoid unnecessary recomputation.
Processors can be applied to asynchronous fields, but processor functions must
be synchronous. As a result, only values of synchronous fields can be accessed
from processors through the page
argument.
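The page-argument dispatch described above can be sketched with inspect.signature(): processors that declare a page parameter receive the page object instance, others receive only the value. This is a toy illustration of the described behavior; FakePage and the helper names are hypothetical:

```python
import inspect

class FakePage:
    # Hypothetical stand-in for a page object with a synchronous field.
    brand = "ACME"

def strip(s):
    return s.strip()

def add_brand(s, page):
    # A processor that uses another (synchronous) field of the page.
    return f"{page.brand} - {s}"

def run_processors(value, processors, page):
    # Pass the page object only to processors that ask for it.
    for processor in processors:
        if "page" in inspect.signature(processor).parameters:
            value = processor(value, page=page)
        else:
            value = processor(value)
    return value

assert run_processors(" Widget ", [strip, add_brand], FakePage()) == "ACME - Widget"
```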
Default processors¶
In addition to the out
argument of @field
,
you can define processors at the page object class level by defining a nested
class named Processors
:
import attrs
from web_poet import ItemPage, HttpResponse, field
def clean_tabs(s: str) -> str:
    return s.replace('\t', ' ')

@attrs.define
class MyPage(ItemPage):
    response: HttpResponse

    class Processors:
        name = [clean_tabs, str.strip]

    @field
    def name(self) -> str:
        return self.response.css(".name ::text").get() or ""
If Processors
contains an attribute with the same name as a field, the
value of that attribute is used as a list of default processors for the field,
to be used if the out
argument of @field
is
not defined.
You can also reuse and extend the processors defined in a base class by
explicitly accessing or subclassing the Processors
class:
import attrs
from web_poet import ItemPage, HttpResponse, field
def clean_tabs(s: str) -> str:
    return s.replace('\t', ' ')

@attrs.define
class MyPage(ItemPage):
    response: HttpResponse

    class Processors:
        name = [str.strip]

    @field
    def name(self) -> str:
        return self.response.css(".name ::text").get() or ""

class MyPage2(MyPage):
    class Processors(MyPage.Processors):
        # name uses the processors in MyPage.Processors.name
        # description now also uses them and also clean_tabs
        description = MyPage.Processors.name + [clean_tabs]

    @field
    def description(self) -> str:
        return self.response.css(".description ::text").get() or ""

    # brand uses the same processors as name
    @field(out=MyPage.Processors.name)
    def brand(self) -> str:
        return self.response.css(".brand ::text").get() or ""
Processors for nested fields¶
Some item fields contain nested items (e.g. a product can contain a list of variants) and it’s useful to have processors for fields of these nested items.
You can use the same logic for them as for normal fields if you define an
extractor class that produces these nested items. Such classes should inherit
from Extractor
.
In the simplest cases you need to pass a selector to them:
from typing import Any, Dict, List
import attrs
from parsel import Selector
from web_poet import Extractor, ItemPage, HttpResponse, field
@attrs.define
class MyPage(ItemPage):
    response: HttpResponse

    @field
    async def variants(self) -> List[Dict[str, Any]]:
        variants = []
        for color_sel in self.response.css(".color"):
            variant = await VariantExtractor(color_sel).to_item()
            variants.append(variant)
        return variants

@attrs.define
class VariantExtractor(Extractor):
    sel: Selector

    @field(out=[str.strip])
    def color(self) -> str:
        return self.sel.css(".name::text").get() or ""
In such cases you can also use SelectorExtractor
as a shortcut that
provides css()
and xpath()
:
class VariantExtractor(SelectorExtractor):
    @field(out=[str.strip])
    def color(self) -> str:
        return self.css(".name::text").get() or ""
You can also pass other data in addition to, or instead of, selectors, such as dictionaries with some data:
@attrs.define
class VariantExtractor(Extractor):
    variant_data: dict

    @field(out=[str.strip])
    def color(self) -> str:
        return self.variant_data.get("color") or ""
Field caching¶
When writing extraction code for Page Objects, it’s common that several attributes reuse some computation. For example, you might need to do an additional request to get an API response, and then fill several attributes from this response:
from typing import Dict, Optional
from web_poet import ItemPage, HttpResponse, HttpClient, validates_input
class MyPage(ItemPage):
    response: HttpResponse
    http: HttpClient

    @validates_input
    async def to_item(self) -> Dict[str, Optional[str]]:
        api_url = self.response.css("...").get()
        api_response = (await self.http.get(api_url)).json()
        return {
            'name': self.response.css(".name ::text").get(),
            'price': api_response.get("price"),
            'sku': api_response.get("sku"),
        }
When converting such Page Objects to use fields, be careful not to make an API call (or some other heavy computation) multiple times. You can do it by extracting the heavy operation to a method, and caching the results:
from typing import Dict
from web_poet import ItemPage, HttpResponse, HttpClient, field, cached_method
class MyPage(ItemPage):
    response: HttpResponse
    http: HttpClient

    @cached_method
    async def api_response(self) -> Dict[str, str]:
        api_url = self.response.css("...").get()
        return (await self.http.get(api_url)).json()

    @field
    def name(self) -> str:
        return self.response.css(".name ::text").get() or ""

    @field
    async def price(self) -> str:
        api_response = await self.api_response()
        return api_response.get("price") or ""

    @field
    async def sku(self) -> str:
        api_response = await self.api_response()
        return api_response.get("sku") or ""
As you can see, web-poet provides the cached_method() decorator, which memoizes function results. It supports both sync and async methods, i.e. you can use it on regular methods (def foo(self)) as well as on async methods (async def foo(self)).
The refactored example, with per-attribute fields, is more verbose than
the original one, where a single to_item
method is used. However, it
provides some advantages — if only a subset of attributes is needed, then
it’s possible to use the Page Object without doing unnecessary work.
For example, if a user only needs the name field in the example above, no additional requests (API calls) will be made.
Sometimes you might want to cache a @field, i.e. a property which computes an attribute of the final item. In such cases, use the @field(cached=True) decorator instead of @field.
cached_method
vs lru_cache
vs cached_property
¶
If you’re an experienced Python developer, you might wonder why the cached_method() decorator is needed if Python already provides functools.lru_cache(). For example, one could write this:
from functools import lru_cache
from web_poet import ItemPage
class MyPage(ItemPage):
    # ...
    @lru_cache
    def heavy_method(self):
        # ...
Don’t do it! There are two issues with functools.lru_cache()
, which make
it unsuitable here:
It doesn’t work properly on methods, because self is used as a part of the cache key. It means a reference to an instance is kept in the cache, and so created page objects are never deallocated, causing a memory leak.
functools.lru_cache() doesn’t work on async def methods, so you can’t cache e.g. results of API calls using functools.lru_cache().
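The memory-retention issue can be demonstrated with the standard library: after the page object goes out of scope, the lru_cache on the class still holds a reference to it, so it is never freed:

```python
import gc
import weakref
from functools import lru_cache

class Page:
    @lru_cache
    def heavy_method(self):
        return 42

page = Page()
assert page.heavy_method() == 42

ref = weakref.ref(page)
del page
gc.collect()
# self is part of the cache key, so the cache on the class keeps
# the instance alive even though no other reference remains.
assert ref() is not None
```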
cached_method()
solves both of these issues. You may also use
functools.cached_property()
, or an external package like async_property
with async versions of @property
and @cached_property
decorators; unlike
functools.lru_cache()
, they all work fine for this use case.
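For synchronous methods, functools.cached_property() avoids the retention problem because the result is stored on the instance itself, not in a class-level cache:

```python
from functools import cached_property

class Page:
    computations = 0

    @cached_property
    def heavy(self):
        # Computed once per instance; the result is stored in the
        # instance __dict__, so no class-level cache retains self.
        Page.computations += 1
        return 42

page = Page()
assert page.heavy == 42
assert page.heavy == 42  # second access hits the cached value
assert Page.computations == 1
```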
Exception caching¶
Note that exceptions are not cached: neither by cached_method(), nor by @field(cached=True), nor by functools.lru_cache(), nor by functools.cached_property().
Usually it’s not an issue, because an exception is usually propagated, and so there are no duplicate calls anyway. But, just in case, keep this in mind.
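This can be verified with functools.lru_cache() directly: a call that raises is not cached, so a repeated call runs the function again:

```python
from functools import lru_cache

calls = []

@lru_cache
def flaky(x):
    calls.append(x)
    raise ValueError("boom")

for _ in range(2):
    try:
        flaky(1)
    except ValueError:
        pass

# Both calls executed: the raised exception was never cached.
assert len(calls) == 2
```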
Field metadata¶
web-poet lets you store arbitrary information for each field using the meta keyword argument:
from web_poet import ItemPage, field
class MyPage(ItemPage):
    @field(meta={"expensive": True})
    async def my_field(self):
        ...
To retrieve this information, use web_poet.fields.get_fields_dict()
; it
returns a dictionary, where keys are field names, and values are
web_poet.fields.FieldInfo
instances.
from web_poet.fields import get_fields_dict
fields_dict = get_fields_dict(MyPage)
field_names = fields_dict.keys()
my_field_meta = fields_dict["my_field"].meta
print(field_names) # dict_keys(['my_field'])
print(my_field_meta) # {'expensive': True}
Input validation¶
Input validation, if used, happens before field evaluation, and it may override the values of fields, preventing field evaluation from ever happening. For example:
class Page(ItemPage[Item]):
    def validate_input(self) -> Item:
        return Item(foo="bar")

    @field
    def foo(self):
        raise RuntimeError("This exception is never raised")

assert Page().foo == "bar"
Field evaluation may still happen for a field if the field is used in the
implementation of the validate_input
method. Note, however, that only
synchronous fields can be used from the validate_input
method.
Additional requests¶
Some websites require page interactions to load some information, such as clicking a button, scrolling down or hovering on some element. These interactions usually trigger background requests that are then loaded using JavaScript.
To extract such data, reproduce those requests using HttpClient
.
Include HttpClient
among the inputs of your
page object, and use an asynchronous field or method to call one of its methods.
For example, simulating a click on a button that loads product images could look like:
import attrs
from web_poet import HttpClient, HttpError, field
from zyte_common_items import Image, ProductPage
@attrs.define
class MyProductPage(ProductPage):
    http: HttpClient

    @field
    def productId(self):
        return self.css("::attr(product-id)").get()

    @field
    async def images(self):
        url = f"https://api.example.com/v2/images?id={self.productId}"
        try:
            response = await self.http.get(url)
        except HttpError:
            return []
        else:
            urls = response.css(".product-images img::attr(src)").getall()
            return [Image(url=url) for url in urls]
Warning
HttpClient
should only be used to handle the type of scenarios
mentioned above. Using HttpClient
for crawling logic would
defeat the purpose of web-poet.
Making a request¶
HttpClient
provides multiple asynchronous request methods, such as:
http = HttpClient()
response = await http.get(url)
response = await http.post(url, body=b"...")
response = await http.request(url, method="...")
response = await http.execute(HttpRequest(url, method="..."))
Request methods also accept custom headers and body, for example:
http.post(
    url,
    headers={"Content-Type": "application/json;charset=UTF-8"},
    body=json.dumps({"foo": "bar"}).encode("utf-8"),
)
Request methods may either raise an HttpError
or return an
HttpResponse
. See Working with HttpResponse.
Note
HttpClient
methods are expected to follow any redirection except
when the request method is HEAD
. This means that the
HttpResponse
that you get is already the end of any redirection
trail.
Concurrent requests¶
To send multiple requests concurrently, use HttpClient.batch_execute
, which accepts any number of
HttpRequest
instances as input, and returns HttpResponse
instances (and HttpError
instances when using
return_exceptions=True
) in the input order. For example:
import attrs
from web_poet import HttpClient, HttpError, HttpRequest, field
from zyte_common_items import Image, ProductPage, ProductVariant
@attrs.define
class MyProductPage(ProductPage):
    http: HttpClient
    max_variants = 10

    @field
    def productId(self):
        return self.css("::attr(product-id)").get()

    @field
    async def variants(self):
        requests = [
            HttpRequest(f"https://example.com/api/variant/{self.productId}/{index}")
            for index in range(self.max_variants)
        ]
        responses = await self.http.batch_execute(*requests, return_exceptions=True)
        return [
            ProductVariant(color=response.css("::attr(color)").get())
            for response in responses
            if not isinstance(response, HttpError)
        ]
You can alternatively use asyncio
together with HttpClient
to
handle multiple requests. For example, you can use asyncio.as_completed()
to process the first response from a group of requests as early as possible.
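The asyncio.as_completed() approach can be sketched with plain coroutines standing in for HttpClient requests; responses are handled in completion order, not submission order:

```python
import asyncio

async def fake_get(delay: float, payload: str) -> str:
    # Stand-in for an HttpClient request; the delay simulates latency.
    await asyncio.sleep(delay)
    return payload

async def main():
    pending = [fake_get(0.03, "a"), fake_get(0.01, "b"), fake_get(0.02, "c")]
    # as_completed yields futures as each request finishes, letting
    # you process the fastest response first.
    return [await fut for fut in asyncio.as_completed(pending)]

assert asyncio.run(main()) == ["b", "c", "a"]
```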
Error handling¶
HttpClient
methods may raise an exception of type
HttpError
or a subclass.
If the response HTTP status code (response.status
) is 400 or higher, HttpResponseError
is
raised. In case of connection errors, TLS errors and similar,
HttpRequestError
is raised.
HttpError
provides access to the offending
request
, and HttpResponseError
also provides
access to the offending response
.
Retrying additional requests¶
Input validation allows retrying all inputs from a page object. To retry only additional requests, you must handle retries on your own.
Your code is responsible for retrying additional requests until good response data is received, or until some maximum number of retries is exceeded.
It is up to you to decide what the maximum number of retries should be for a given additional request, based on your experience with the target website.
It is also up to you to decide how to implement retries of additional requests.
One option would be tenacity. For example, to try an additional request 3 times before giving up:
import attrs
from tenacity import retry, stop_after_attempt
from web_poet import HttpClient, HttpError, field
from zyte_common_items import Image, ProductPage

@attrs.define
class MyProductPage(ProductPage):
    http: HttpClient

    @field
    def productId(self):
        return self.css("::attr(product-id)").get()

    @retry(stop=stop_after_attempt(3))
    async def get_images(self):
        return await self.http.get(f"https://api.example.com/v2/images?id={self.productId}")

    @field
    async def images(self):
        try:
            response = await self.get_images()
        except HttpError:
            return []
        else:
            urls = response.css(".product-images img::attr(src)").getall()
            return [Image(url=url) for url in urls]
If the reason your additional request fails is outdated or missing data from page object input, do not try to reproduce the request for that input as an additional request. Request fresh input instead.
Input validation¶
Sometimes the data that your page object receives as input may be invalid.
You can define a validate_input
method in a page object class to check its
input data and determine how to handle invalid input.
validate_input
is called on the first execution of ItemPage.to_item()
or the first access to a field. In both cases validation
happens early; in the case of fields, it happens before field evaluation.
validate_input
is a synchronous method that expects no parameters, and its
outcome may be any of the following:
Return
None
, indicating that the input is valid.
Raise
Retry
, indicating that the input looks like the result of a temporary issue, and that trying to fetch similar input again may result in valid input.See also Retrying additional requests.
Raise
UseFallback
, indicating that the page object does not support the input, and that an alternative parsing implementation should be tried instead.For example, imagine you have a page object for website commerce.example, and that commerce.example is built with a popular e-commerce web framework. You could have a generic page object for products of websites using that framework,
FrameworkProductPage
, and a more specific page object for commerce.example,EcommerceExampleProductPage
. IfEcommerceExampleProductPage
cannot parse a product page, but it looks like it might be a valid product page, you would raiseUseFallback
to try to parse the same product page withFrameworkProductPage
, in case it works.Note
web-poet does not dictate how to define or use an alternative parsing implementation as fallback. It is up to web-poet frameworks to choose how they implement fallback handling.
Return an item to override the output of the to_item method and of fields.
For input not matching the expected type of data, returning an item that indicates so is recommended.
For example, if your page object parses an e-commerce product, and the input data corresponds to a list of products rather than a single product, you could return a product item that somehow indicates that it is not a valid product item, such as
Product(is_valid=False)
.
For example:
def validate_input(self):
    # a SelectorList is never None; check the extracted value instead
    if self.css('.product-id::text').get() is not None:
        return
    if self.css('.http-503-error'):
        raise Retry()
    if self.css('.product'):
        raise UseFallback()
    if self.css('.product-list'):
        return Product(is_valid=False)
You may use fields in your implementation of the validate_input
method, but
only synchronous fields are supported. For example:
class Page(WebPage[Item]):
    def validate_input(self):
        if not self.name:
            raise UseFallback()

    @field(cached=True)
    def name(self):
        return self.css(".product-name ::text").get()
Tip
Cache fields used in the validate_input
method, so that when they are used from to_item
they are not
evaluated again.
If you implement a custom to_item method, as long as you are inheriting from ItemPage, you can enable input validation by decorating your custom to_item method with validates_input():
from web_poet import validates_input

class Page(ItemPage[Item]):
    @validates_input
    async def to_item(self):
        ...
Retry
and UseFallback
may also be raised from the to_item
method. This could come in handy, for
example, if after you execute some asynchronous code, such as an
additional request, you find out that you need to
retry the original request or use a fallback.
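The control flow can be sketched with stand-in classes. Here Retry is a local stand-in for web_poet.exceptions.Retry, FakeClient mimics an HttpClient that happens to return a 503, and the URL is illustrative:

```python
import asyncio

class Retry(Exception):
    """Stand-in for web_poet.exceptions.Retry."""

class FakeClient:
    """Stand-in for web_poet.HttpClient; always returns a 503."""
    async def get(self, url):
        return {"status": 503}

class ProductPage:
    """Minimal page-object-like class, not a real web-poet WebPage."""
    def __init__(self, http):
        self.http = http

    async def to_item(self):
        # Only after awaiting the additional request do we learn that the
        # original input was probably the result of a temporary issue.
        response = await self.http.get("https://example.com/stock")
        if response["status"] == 503:
            raise Retry()
        return {"in_stock": True}

outcome = None
try:
    asyncio.run(ProductPage(FakeClient()).to_item())
except Retry:
    outcome = "retry"
assert outcome == "retry"
```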
Input Validation Exceptions¶
- exception web_poet.exceptions.PageObjectAction[source]¶
Base class for exceptions that can be raised from a page object to indicate something to be done about that page object.
Using page params¶
In some cases, page object classes might require or allow parameters from the calling code, e.g. to change their behavior or make optimizations.
To support parameters, add PageParams
to your inputs:
import attrs
from web_poet import PageParams, WebPage

@attrs.define
class MyPage(WebPage):
    page_params: PageParams
In your page object class, you can read parameters from a PageParams
object as you would from a dict
:
foo = self.page_params["foo"]
bar = self.page_params.get("bar", "default")
The way the calling code sets those parameters depends on your web-poet framework.
Example: Controlling item values¶
import attrs
import web_poet
from web_poet import validates_input

@attrs.define
class ProductPage(web_poet.WebPage):
    page_params: web_poet.PageParams

    default_tax_rate = 0.10

    @validates_input
    def to_item(self):
        item = {
            "url": self.url,
            # convert the extracted price text to a number so the
            # tax calculation below works
            "price": float(self.css("#main .price ::text").get()),
            "name": self.css("#main h3.name ::text").get(),
        }
        self.calculate_price_with_tax(item)
        return item

    def calculate_price_with_tax(self, item):
        tax_rate = self.page_params.get("tax_rate", self.default_tax_rate)
        item["price_with_tax"] = item["price"] * (1 + tax_rate)
In the example above, we provide optional information about the tax rate of the product. This can be useful when supporting different tax rates for each state or territory. However, since we treat tax_rate as optional, notice that we also keep default_tax_rate as a backup value in case it's not provided.
Example: Controlling page object behavior¶
Let’s try an example wherein PageParams
is able to control how
additional requests are being used. Specifically,
we are going to use PageParams
to control the number of pages
visited.
from typing import List

import attrs
import web_poet
from web_poet import validates_input

@attrs.define
class ProductPage(web_poet.WebPage):
    http: web_poet.HttpClient
    page_params: web_poet.PageParams

    default_max_pages = 5

    @validates_input
    async def to_item(self):
        return {"product_urls": await self.get_product_urls()}

    async def get_product_urls(self) -> List[str]:
        # Simulates scrolling to the bottom of the page to load the next
        # set of items in an "Infinite Scrolling" category list page.
        max_pages = self.page_params.get("max_pages", self.default_max_pages)
        requests = [
            self.create_next_page_request(page_num)
            for page_num in range(2, max_pages + 1)
        ]
        responses = await self.http.batch_execute(*requests)
        return [
            url
            for response in responses
            for url in self.parse_product_urls(response)
        ]

    @staticmethod
    def create_next_page_request(page_num):
        next_page_url = f"https://example.com/category/products?page={page_num}"
        return web_poet.HttpRequest(url=next_page_url)

    @staticmethod
    def parse_product_urls(response: web_poet.HttpResponse):
        return response.css("#main .products a.link ::attr(href)").getall()
In the example above, PageParams limits the pagination behavior through an optional max_pages value. Note that a default_max_pages value is also present in the page object class, in case the PageParams instance does not provide it.
Stats¶
During parsing, storing some data about the parsing itself can be useful for
debugging, monitoring, and reporting. The Stats
page input allows
storing such data.
For example, you can use stats to track which parsing code is actually used, so that you can remove code once it is no longer necessary due to upstream changes:
import attrs
from web_poet import field, Stats, WebPage

@attrs.define
class MyPage(WebPage):
    stats: Stats

    @field
    def title(self):
        if title := self.css("h1::text").get():
            self.stats.inc("MyPage/field-src/title/h1")
        elif title := self.css("h2::text").get():
            self.stats.inc("MyPage/field-src/title/h2")
        return title
Tests for page objects¶
Page Objects that inherit from ItemPage
can be tested by saving the
dependencies needed to create one and the result of
to_item()
, recreating the Page Object from the
dependencies, running its to_item()
and
comparing the result to the saved one. web-poet
provides the following
tools for this:
dependency serialization into a Python object and into a set of files;
recreating Page Objects from the serialized dependencies;
a high-level function to save a test fixture;
a
pytest
plugin that discovers fixtures and runs tests for them.
Serialization¶
web_poet.serialization.serialize()
can be used to serialize an iterable
of Page Object dependencies to a Python object.
web_poet.serialization.deserialize()
can be used to recreate a Page
Object from this serialized data.
An instance of web_poet.serialization.SerializedDataFileStorage
can be
used to write the serialized data to a set of files in a given directory and to
read it back.
Note
We only support serializing dependencies, not Page Object instances, because the only universal way to recreate a Page Object is from its dependencies, not from some saved internal state.
Each dependency is serialized to one or several bytes
objects, each of
which is saved as a single file. web_poet.serialization.serialize_leaf()
and web_poet.serialization.deserialize_leaf()
are used to convert between
a dependency and this set of bytes
objects. They are implemented using
functools.singledispatch()
and while the types provided by web-poet
are supported out of the box, user-defined types need a pair of implementation
functions that need to be registered using
web_poet.serialization.register_serialization()
.
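The singledispatch-based mechanism can be illustrated with stdlib code only. This is a sketch of the idea, not web-poet's API: FakeResponse, to_leaves and from_leaves are hypothetical stand-ins for a dependency type and for web_poet.serialization.serialize_leaf()/deserialize_leaf().

```python
from dataclasses import dataclass
from functools import singledispatch
from typing import Dict

@dataclass
class FakeResponse:
    """Hypothetical dependency type, standing in for e.g. HttpResponse."""
    url: str
    body: bytes

@singledispatch
def to_leaves(obj) -> Dict[str, bytes]:
    # unknown types need a registered implementation, just like
    # register_serialization() requires in web-poet
    raise NotImplementedError(f"no serializer for {type(obj)}")

@to_leaves.register
def _(obj: FakeResponse) -> Dict[str, bytes]:
    # each bytes leaf would be saved as one file, named by its suffix
    return {"url.txt": obj.url.encode(), "body.html": obj.body}

def from_leaves(leaves: Dict[str, bytes]) -> FakeResponse:
    return FakeResponse(url=leaves["url.txt"].decode(), body=leaves["body.html"])

resp = FakeResponse(url="https://example.com", body=b"<html></html>")
assert from_leaves(to_leaves(resp)) == resp
```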
Fixtures¶
The provided pytest
plugin expects fixtures in a certain layout. A set of
fixtures for a single Page Object should be contained in a directory named after
that Page Object's fully qualified class name. Each fixture is a directory inside
it that contains data for the Page Object inputs and output:
fixtures
└── my_project.pages.MyItemPage
├── test-1
│ ├── inputs
│ ├── HttpClient.exists
│ │ ├── HttpResponse-body.html
│ │ ├── HttpResponse-info.json
│ │ └── ResponseUrl.txt
│ ├── meta.json
│ └── output.json
└── test-2
├── inputs
│ ├── HttpClient.exists
│ ├── HttpClient-0-HttpRequest.info.json
│ ├── HttpClient-0-HttpResponse.body.html
│ ├── HttpClient-0-HttpResponse.info.json
│ ├── HttpClient-1-HttpRequest.body.txt
│ ├── HttpClient-1-HttpRequest.info.json
│ ├── HttpClient-1-HttpResponse.body.html
│ ├── HttpClient-1-HttpResponse.info.json
│ ├── HttpResponse-body.html
│ ├── HttpResponse-info.json
│ └── ResponseUrl.txt
├── meta.json
└── output.json
web_poet.testing.Fixture.save()
can be used to create a fixture inside a
Page Object directory from an iterable of dependencies, an output item and an
optional metadata dictionary. It can optionally take a name for the fixture
directory. By default it uses incrementing names “test-1”, “test-2” etc.
Note
output.json
contains a result of page_object.to_item()
converted to
a dict using the itemadapter library and saved as JSON.
After generating a fixture you can edit output.json
to modify expected
field values and add new fields, which is useful when creating tests for code
that isn’t written yet or before modifying its behavior.
scrapy-poet integration¶
Projects that use the scrapy-poet library can use the Scrapy command provided by it to generate fixtures in a convenient way. It’s available starting with scrapy-poet 0.8.0.
Running tests¶
The provided pytest
plugin is automatically registered when web-poet
is
installed, and running python -m pytest
in a directory containing fixtures
will discover them and run tests for them.
By default, the plugin generates:
a test which checks that to_item() doesn't raise an exception (i.e. it can be executed),
a test per each output attribute of the item,
an additional test to check that there are no extra attributes in the output.
For example, if your item has 5 attributes, and you created 2 fixtures, pytest will run (5+1+1)*2 = 14 tests. This makes it possible to report failures for individual fields separately.
If to_item
raises an error, there is no point in running other tests,
so they’re skipped in this case.
If you prefer less granular test failure reporting, you can use pytest with
the --web-poet-test-per-item
option:
python -m pytest --web-poet-test-per-item
In this case there is going to be a single test per fixture: if the result is not fully correct, the test fails. So, following the previous example, it’d be 2 tests instead of 14.
Test-Driven Development¶
You can follow the TDD (Test-Driven Development) approach to develop your page objects. To do so:
Generate a fixture (see scrapy-poet integration).
Populate output.json with the correct expected output.
Run the tests (see Running tests) and update the code until all tests pass. It's convenient to use web-poet fields and implement extraction field-by-field, because you'll get an additional test passing after each field is implemented.
This approach allows a fast feedback loop: there is no need to download page multiple times, and you have a clear progress indication for your work (number of failing tests remaining). Also, in the end you get a regression test, which can be helpful later.
Sometimes it may be awkward to set the correct value in JSON before starting the development, especially if a value is large or has a complex structure. For example, this could be the case for e-commerce product description field, which can be hard to copy-paste from the website, and which may have various whitespace normalization rules which you need to apply.
In this case, it may be more convenient to implement the extraction first,
and only then populate the output.json
file with the correct value.
You can use the python -m web_poet.testing rerun <fixture_path> command
in this case, to re-run the page object using the inputs saved in a fixture.
This command prints output of the page object, as JSON; you can then copy-paste
relevant parts to the output.json
file. It’s also possible to make
the command print only some of the fields. For example, you might run the
following command after implementing extraction for “description” and
“descriptionHtml” fields in my_project.pages.MyItemPage
:
python -m web_poet.testing rerun \
fixtures/my_project.pages.MyItemPage/test-1 \
--fields description,descriptionHtml
It may output something like this:
{
    "description": "..description of the product..",
    "descriptionHtml": "<p>...</p>"
}
If these values look good, you can update
fixtures/my_project.pages.MyItemPage/test-1/output.json
file
with these values.
Handling time fields¶
Sometimes output of a page object might depend on the current time. For
example, the item may contain the scraping datetime, or a current timestamp may
be used to build some URLs. When a test runs at a different time it will break.
To avoid this the metadata dictionary can contain a
frozen_time
field set to the time value used when generating the test. This
will instruct the test runner to use the same time value so that field
comparisons are still correct.
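For example, a fixture's metadata could contain the following. The frozen_time key is the documented field; treating meta.json as a flat JSON object like this is an assumption about the file layout:

```json
{
    "frozen_time": "2001-01-01 11:00:00 +01:00"
}
```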
The value can be any string understood by dateutil. If it doesn't include timezone information, the local time of the machine will be assumed. If it includes timezone information, on non-Windows systems the test process will be executed in that timezone, so that output fields that contain local time are correct. On Windows systems (where changing the process timezone is not possible) the time value will be converted to the local time of the machine, and such fields will contain wrong data if these timezones don't match. Consider an example item:
import datetime

from web_poet import WebPage, validates_input

class DateItemPage(WebPage):
    @validates_input
    async def to_item(self) -> dict:
        # e.g. 2001-01-01 11:00:00 +00
        now = datetime.datetime.now(datetime.timezone.utc)
        return {
            # '2001-01-01T11:00:00Z'
            "time_utc": now.strftime("%Y-%m-%dT%H:%M:%SZ"),
            # if the current timezone is CET, then '2001-01-01T12:00:00+0100'
            "time_local": now.astimezone().strftime("%Y-%m-%dT%H:%M:%S%z"),
        }
We will assume that the fixture was generated in CET (UTC+1).
If the fixture doesn't have the frozen_time metadata field, the item will simply contain the current time and the test will always fail.
If frozen_time doesn't contain the timezone data (e.g. it is 2001-01-01T11:00:00), the item will depend on the machine timezone: in CET it will contain the expected values; in timezones with a different offset time_local will be different.
If frozen_time contains the timezone data and the system is not Windows, the time_local field will contain the date in that timezone, so if the timezone in frozen_time is not UTC+1, the test will fail.
If the system is Windows, the frozen_time value will be converted to the machine timezone, so the item will depend on that timezone, just like when frozen_time doesn't contain the timezone data, and time_local will similarly only be correct if the machine timezone has the same offset as CET.
This means that most combinations of setups will work if frozen_time
contains the timezone data, except for running tests on Windows, in which case
the machine timezone should match the timezone in frozen_time
. Also, if
items do not depend on the machine timezone (e.g. if all datetime-derived data
they contain is in UTC), the tests for them should work everywhere.
There is an additional limitation which we plan to fix in future versions. The
time is set to the frozen_time
value when the test generation (if using the
scrapy-poet
command) or the test run starts, but it ticks during the
generation/run itself, so if it takes more than 1 second (which is quite
possible even in simple cases) the time fields will have values several seconds
later than frozen_time
. For now we recommend working around this problem by manually editing the output.json file to put a value equal to frozen_time in these fields, as running the test shouldn't take more than 1 second.
Storing fixtures in Git¶
Fixtures can take a lot of disk space, as they usually include page responses and may include other large files, so we recommend using Git LFS when storing them in Git repos to reduce the repo space and get other performance benefits. Even if your fixtures are currently small, it may be useful to do this from the beginning, as migrating files to LFS is not easy and requires rewriting the repo history.
To use Git LFS you need a Git hosting provider that supports it, and major providers and software (e.g. GitHub, Bitbucket, GitLab) support it. There are also implementations for standalone Git servers.
Assuming you store the fixtures in the directory named "fixtures" in the repo root, the workflow should be as follows. Enable normal diffs for LFS files in this repo:
git config diff.lfs.textconv cat
Enable LFS for the fixtures directory before committing anything in it:
git lfs track "fixtures/**"
Commit the .gitattributes
file (which stores the tracking information):
git add .gitattributes
git commit
After generating the fixtures just commit them as usual:
git add fixtures/test-1
git commit
After this all usual commands including push
, pull
or checkout
should work as expected on these files.
Please also check the official Git LFS documentation for more information.
Additional requests support¶
If the page object uses the HttpClient
dependency to make
additional requests, the generated fixtures will
contain these requests and their responses (or exceptions raised when the
response is not received). When the test runs, HttpClient
will
return the saved responses without doing actual requests.
Currently requests are compared by their URL, method, headers and body, so if a page object makes requests that differ between runs, the test won’t be able to find a saved response and will fail.
Test coverage¶
The coverage for page object code is reported correctly if tools such as coverage are used when running web-poet tests.
Item adapters¶
The testing framework uses the itemadapter library to convert items to dicts when storing them in fixtures and when comparing the expected and the actual output. As adapters may influence the resulting dicts, it’s important to use the same adapter when generating and running the tests.
It may also be useful to use different adapters in tests and in production. For example, you may want to omit empty fields in production, but be able to distinguish between empty and absent fields in tests.
For this you can set the adapter
field in the metadata dictionary to the class that inherits from
itemadapter.ItemAdapter
and has the adapter(s) you want to use in
tests in its ADAPTER_CLASSES
attribute (see the relevant itemadapter
docs for more information). An example:
from collections import deque

from itemadapter import ItemAdapter
from itemadapter.adapter import DictAdapter

class MyAdapter(DictAdapter):
    # any needed customization
    ...

class MyItemAdapter(ItemAdapter):
    ADAPTER_CLASSES = deque([MyAdapter])
You can then put the MyItemAdapter
class object into adapter
and it
will be used by the testing framework.
If adapter
is not set,
WebPoetTestItemAdapter
will be used.
It works like itemadapter.ItemAdapter
but doesn’t change behavior when
itemadapter.ItemAdapter.ADAPTER_CLASSES
is modified.
Frameworks¶
Page objects are not meant to be used in isolation with web-poet. They are meant to be used with a web-poet framework.
A web-poet framework is a Python web scraping framework, library, or plugin that implements the web-poet specification.
At the moment, the only production-ready web-poet framework that exists is scrapy-poet, which brings web-poet support to Scrapy.
As web-poet matures and sees wider adoption, we hope to see more frameworks add support for it.
Framework specification¶
Learn how to build a web-poet framework.
Design principles¶
Page objects should be flexible enough to be used with:
synchronous or asynchronous code, callback-based and async def / await based,
single-node and distributed systems,
different underlying HTTP implementations - or without HTTP support at all, etc.
Minimum requirements¶
A web-poet framework must support building a page object given a page object class.
It must be able to build input objects for a page object based on type hints on the page object class, i.e. dependency injection, and additional input data required by those input objects, such as a target URL or a dictionary of page parameters.
You can implement dependency injection with the andi library, which handles
signature inspection, Optional
and Union
annotations, as well as indirect dependencies. For practical examples, see the
source code of scrapy-poet and of the web_poet.example
module.
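The type-hint-driven injection that andi generalizes can be sketched with the stdlib alone. This is an illustration of the mechanism under simplified assumptions (no Optional/Union handling, no indirect dependencies), not andi's API; all class and function names here are hypothetical:

```python
from dataclasses import dataclass
from typing import get_type_hints

@dataclass
class Response:
    """Hypothetical input type, standing in for e.g. HttpResponse."""
    body: str

@dataclass
class PageParams:
    """Hypothetical input type."""
    params: dict

@dataclass
class MyPage:
    """A page-object-like class whose inputs are declared as type hints."""
    response: Response
    page_params: PageParams

def build(cls, available):
    # Resolve each annotated attribute from a pool of already-built inputs,
    # keyed by type. A real framework would also build missing inputs.
    kwargs = {
        name: available[hint]
        for name, hint in get_type_hints(cls).items()
    }
    return cls(**kwargs)

pool = {Response: Response(body="<html></html>"), PageParams: PageParams({})}
page = build(MyPage, pool)
assert isinstance(page.response, Response)
```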
Additional features¶
To provide a better experience to your users, consider extending your web-poet framework further to:
Support as many input classes from the web_poet.page_inputs module as possible.
Support returning a page object given a target URL and a desired output item class, determining the right page object class to use based on rules.
Allow users to request an output item directly, instead of requesting a page object just to call its to_item method. If you do, consider supporting both synchronous and asynchronous definitions of the to_item method, e.g. using ensure_awaitable().
Support additional requests.
Support retries.
Let users set their own rules, e.g. to solve conflicts.
Supporting rules¶
Ideally, a framework should support returning the right page object or output item given a target URL and a desired output item class when rules are used.
To provide basic support for rules in your framework, use the
RulesRegistry
object at web_poet.default_registry
to choose
a page object based on rules:
from web_poet import default_registry
page_cls = default_registry.page_cls_for_item("https://example.com", MyItem)
You should also let your users know what is the best approach to load
rules when using your framework. For example, let them know the
best location for their calls to the consume_modules()
function.
Supporting additional requests¶
To support additional requests, your framework must
provide the request download implementation of HttpClient
.
Providing the Downloader¶
On its own, HttpClient doesn't do anything: it doesn't know how to execute requests. Thus, frameworks or projects wanting to use additional requests in Page Objects need to set the implementation of how to execute an HttpRequest.
For more info on this, read the API specification for HttpClient.
In any case, frameworks that wish to support web-poet could provide the HTTP downloader implementation in two ways:
1. Context Variable¶
contextvars
is natively supported in asyncio
in order to set and
access context-aware values. This means that the framework using web-poet
can assign the request downloader implementation using the contextvars
instance named web_poet.request_downloader_var
.
This can be set using:
import attrs
import web_poet
from web_poet import validates_input

async def request_implementation(req: web_poet.HttpRequest) -> web_poet.HttpResponse:
    ...

def create_http_client():
    return web_poet.HttpClient()

@attrs.define
class SomePage(web_poet.WebPage):
    http: web_poet.HttpClient

    @validates_input
    async def to_item(self):
        ...

# Once this is set, the ``request_implementation`` becomes available to
# all instances of HttpClient, unless HttpClient is created with
# the ``request_downloader`` argument (see the #2 Dependency Injection
# example below).
web_poet.request_downloader_var.set(request_implementation)

# Assume that it's constructed with the necessary arguments taken somewhere.
response = web_poet.HttpResponse(...)

page = SomePage(response=response, http=create_http_client())
item = await page.to_item()
When the web_poet.request_downloader_var
contextvar is set,
HttpClient
instances use it by default.
Warning
If no value for web_poet.request_downloader_var
is set, then
RequestDownloaderVarError
is raised. However, no exception is
raised if option 2 below is used.
2. Dependency Injection¶
The framework using web-poet may be using libraries that don't fully support contextvars (e.g. Twisted). In that case, an alternative approach is to supply the request downloader implementation when creating an HttpClient instance:
import attrs
import web_poet
from web_poet import validates_input

async def request_implementation(req: web_poet.HttpRequest) -> web_poet.HttpResponse:
    ...

def create_http_client():
    return web_poet.HttpClient(request_downloader=request_implementation)

@attrs.define
class SomePage(web_poet.WebPage):
    http: web_poet.HttpClient

    @validates_input
    async def to_item(self):
        ...

# Assume that it's constructed with the necessary arguments taken somewhere.
response = web_poet.HttpResponse(...)

page = SomePage(response=response, http=create_http_client())
item = await page.to_item()
From the code sample above, we can see that every time an HttpClient
instance is created for Page Objects needing it, the framework
must create HttpClient
with a framework-specific request
downloader implementation, using the request_downloader
argument.
Downloader Behavior¶
The request downloader MUST accept an instance of HttpRequest
as the input and return an instance of HttpResponse
. This is important
in order to handle and represent generic HTTP operations. The only time that
it won’t be returning HttpResponse
would be when it’s raising exceptions
(see Exception Handling).
The request downloader MUST resolve Location-based redirections when the HTTP
method is not HEAD
. In other words, for non-HEAD
requests the
returned HttpResponse
must be the final response, after all redirects.
For HEAD
requests redirects MUST NOT be resolved.
Lastly, the request downloader function MUST support the async/await
syntax.
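The downloader contract above can be sketched with stand-in classes. FakeHttpRequest and FakeHttpResponse are hypothetical stand-ins for web_poet.HttpRequest and web_poet.HttpResponse; a real downloader would perform the actual HTTP exchange:

```python
import asyncio
from dataclasses import dataclass

@dataclass
class FakeHttpRequest:
    """Stand-in for web_poet.HttpRequest."""
    url: str
    method: str = "GET"

@dataclass
class FakeHttpResponse:
    """Stand-in for web_poet.HttpResponse."""
    url: str
    status: int
    body: bytes = b""

async def download(request: FakeHttpRequest) -> FakeHttpResponse:
    # A real implementation would perform the HTTP exchange here,
    # following redirects for non-HEAD requests and raising HttpError
    # subclasses on failure.
    return FakeHttpResponse(url=request.url, status=200, body=b"ok")

response = asyncio.run(download(FakeHttpRequest(url="https://example.com")))
assert response.status == 200
```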
Exception Handling¶
Page Object developers can use the exception classes built into web-poet to handle the various ways additional requests MAY fail. In this section, we'll see the rationale and the ways the framework MUST support this.
Rationale¶
Frameworks that handle web-poet MUST be able to ensure that Page Objects
having additional requests using HttpClient
are able to work with
any type of HTTP downloader implementation.
For example, in Python, common HTTP libraries have different base exception types for failures: aiohttp.ClientError, requests.RequestException, urllib.error.HTTPError, and so on.
If Page Objects were expected to handle each of these backend implementations directly, the code would look like:
import urllib.error

import aiohttp
import attrs
import requests
import web_poet
from web_poet import validates_input

@attrs.define
class SomePage(web_poet.WebPage):
    http: web_poet.HttpClient

    @validates_input
    async def to_item(self):
        try:
            response = await self.http.get("...")
        except (aiohttp.ClientError, requests.RequestException, urllib.error.HTTPError):
            # handle the error here
            ...
Such code could turn messy quickly, especially as the number of HTTP backends that Page Objects have to support increases, not to mention the plethora of exception types that HTTP libraries define. This means that Page Objects aren't truly portable across frameworks or environments; rather, they're limited to working in the specific framework in which they're supported.
In order for Page Objects to work in different Downloader Implementations,
the framework that implements the HTTP Downloader backend MUST raise
exceptions from the web_poet.exceptions.http
module in lieu of the backend
specific ones (e.g. aiohttp, requests, urllib, etc.).
This makes the code simpler:
import attrs
import web_poet
from web_poet import validates_input

@attrs.define
class SomePage(web_poet.WebPage):
    http: web_poet.HttpClient

    @validates_input
    async def to_item(self):
        try:
            response = await self.http.get("...")
        except web_poet.exceptions.HttpError:
            # handle the error here
            ...
Expected behavior for Exceptions¶
All exceptions that the HTTP Downloader Implementation (see the Providing the Downloader section) explicitly raises when implementing it for web-poet MUST be web_poet.exceptions.http.HttpError (or a subclass of it).
For frameworks that implement and use web-poet, exceptions that occurred when
handling the additional requests like connection errors, TLS errors, etc MUST
be replaced by web_poet.exceptions.http.HttpRequestError
by raising it
explicitly.
For responses that are not really errors, like those in the 100-3xx status code range, an exception MUST NOT be raised at all. For responses with status codes in the 400-5xx range, web-poet raises the web_poet.exceptions.http.HttpResponseError exception.
From this distinction, the framework MUST NOT raise web_poet.exceptions.http.HttpResponseError
on its own at all, since the HttpClient
already handles that.
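The required exception translation can be sketched as follows. The exception classes here are local stand-ins for web_poet.exceptions.http.HttpError/HttpRequestError, and BackendTimeout stands in for a backend-specific error such as an aiohttp or requests timeout:

```python
class HttpError(Exception):
    """Stand-in for web_poet.exceptions.http.HttpError."""

class HttpRequestError(HttpError):
    """Stand-in for web_poet.exceptions.http.HttpRequestError."""

class BackendTimeout(Exception):
    """Stand-in for a backend-specific error, e.g. a read timeout."""

def run_with_translation(call):
    # Backend-specific exceptions MUST surface as HttpRequestError so
    # that page objects never depend on aiohttp/requests/urllib types.
    try:
        return call()
    except BackendTimeout as exc:
        raise HttpRequestError(str(exc)) from exc

def flaky_request():
    raise BackendTimeout("read timed out")

handled = False
try:
    run_with_translation(flaky_request)
except HttpRequestError:
    handled = True
assert handled
```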
Supporting Retries¶
Web-poet frameworks must catch Retry
exceptions raised from the to_item()
method of a
page object.
When Retry
is caught:
The original request whose response was fed into the page object must be retried.
A new page object must be created, of the same type as the original page object, and with the same input, except for the response data, which must be the new response.
The to_item() method of the new page object may raise Retry again. Web-poet frameworks must allow multiple retries of page objects, repeating the Retry-capturing logic.
However, web-poet frameworks are also encouraged to limit the amount of retries per page object. When retries are exceeded for a given page object, the page object output is ignored. At the moment, web-poet does not enforce any specific maximum number of retries on web-poet frameworks.
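The framework-side loop described above can be sketched like this. Retry stands in for web_poet.exceptions.Retry, and scrape, make_page, fetch and max_retries are hypothetical names, not web-poet API:

```python
class Retry(Exception):
    """Stand-in for web_poet.exceptions.Retry."""

def scrape(make_page, fetch, url, max_retries=3):
    for _ in range(max_retries + 1):
        response = fetch(url)          # retry the original request
        page = make_page(response)     # same page object type, fresh response
        try:
            return page.to_item()
        except Retry:
            continue                   # to_item() may raise Retry repeatedly
    return None                        # retries exceeded: output is ignored

# Usage with fakes: the first response is bad, the second succeeds.
responses = iter([{"ok": False}, {"ok": True}])

class FakePage:
    def __init__(self, response):
        self.response = response

    def to_item(self):
        if not self.response["ok"]:
            raise Retry()
        return {"done": True}

item = scrape(FakePage, lambda url: next(responses), "https://example.com")
assert item == {"done": True}
```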
Supporting stats¶
To support stats, your framework must provide the
StatCollector
implementation of
Stats
.
It is up to you to decide how to store the stats, and how your users can access them at run time (outside page objects) or afterwards.
API reference¶
Page Inputs¶
- class web_poet.page_inputs.browser.BrowserHtml[source]¶
Bases: SelectableMixin, str
HTML returned by a web browser, i.e. snapshot of the DOM tree in HTML format.
- css(query) SelectorList ¶
A shortcut to
.selector.css()
.
- jmespath(query: str, **kwargs) SelectorList ¶
A shortcut to
.selector.jmespath()
.
- property selector: Selector¶
Cached instance of
parsel.selector.Selector
.
- xpath(query, **kwargs) SelectorList ¶
A shortcut to
.selector.xpath()
.
- class web_poet.page_inputs.browser.BrowserResponse(url: Union[str, _Url], html, *, status: Optional[int] = None)[source]¶
Bases: SelectableMixin, UrlShortcutsMixin
Browser response: url, HTML and status code.
url should be the browser’s window.location, not the URL of the request, if possible.
html contains the HTML returned by the browser, i.e. a snapshot of the DOM tree in HTML format.
The following are optional, since their availability depends on the source of the BrowserResponse:
status should represent the int status code of the HTTP response.
- html: BrowserHtml¶
- css(query) SelectorList ¶
A shortcut to
.selector.css()
.
- jmespath(query: str, **kwargs) SelectorList ¶
A shortcut to
.selector.jmespath()
.
- property selector: Selector¶
Cached instance of
parsel.selector.Selector
.
- urljoin(url: Union[str, RequestUrl, ResponseUrl]) RequestUrl ¶
Return url as an absolute URL.
If url is relative, it is made absolute relative to the base URL of self.
- xpath(query, **kwargs) SelectorList ¶
A shortcut to
.selector.xpath()
.
- class web_poet.page_inputs.client.HttpClient(request_downloader: Optional[Callable] = None, *, save_responses: bool = False, return_only_saved_responses: bool = False, responses: Optional[Iterable[_SavedResponseData]] = None)[source]¶
Async HTTP client to be used in Page Objects.
See Additional requests for the usage information.
HttpClient doesn’t make HTTP requests by itself. It uses either the request function assigned to the web_poet.request_downloader_var contextvar, or a function passed via the request_downloader argument of the __init__() method.
Either way, this function should be an async def function which receives an HttpRequest instance, and either returns an HttpResponse instance or raises a subclass of HttpError. You can read more in the Providing the Downloader documentation.
- async request(url: Union[str, _Url], *, method: str = 'GET', headers: Optional[Union[Dict[str, str], HttpRequestHeaders]] = None, body: Optional[Union[bytes, HttpRequestBody]] = None, allow_status: Optional[Union[str, int, List[Union[str, int]]]] = None) HttpResponse [source]¶
This is a shortcut for creating an
HttpRequest
instance and executing that request.
HttpRequestError
is raised for connection errors, connection and read timeouts, etc.
An
HttpResponse
instance is returned for successful responses in the
100-3xx
status code range. Otherwise, an exception of type
HttpResponseError
is raised.
Raising
HttpResponseError
can be suppressed for certain status codes using the
allow_status
parameter: a list of status code values for which an
HttpResponse
should be returned instead of raising
HttpResponseError
. The special value "*" allows any status code. There is no need to include
100-3xx
status codes in
allow_status
, because
HttpResponseError
is not raised for them.
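The `allow_status` decision described above can be sketched as a small helper. This is a stdlib-only illustration of the documented semantics; `should_raise` is a hypothetical name, not part of the web-poet API:

```python
from typing import List, Optional, Union

def should_raise(status: int,
                 allow_status: Optional[Union[str, int, List[Union[str, int]]]] = None) -> bool:
    """Sketch: decide whether a status code should raise HttpResponseError."""
    if status < 400:
        # 100-3xx responses are considered successful and never raise.
        return False
    if allow_status is None:
        return True
    allowed = [allow_status] if isinstance(allow_status, (str, int)) else list(allow_status)
    if "*" in allowed:
        # The special "*" value allows any status code.
        return False
    return status not in [int(code) for code in allowed]
```

For example, `should_raise(404, allow_status=404)` is false (the response is returned), while `should_raise(503, allow_status=[404, 500])` is true (the error is raised).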
- async get(url: Union[str, _Url], *, headers: Optional[Union[Dict[str, str], HttpRequestHeaders]] = None, allow_status: Optional[Union[str, int, List[Union[str, int]]]] = None) HttpResponse [source]¶
Similar to
request()
but performing a
GET
request.
- async post(url: Union[str, _Url], *, headers: Optional[Union[Dict[str, str], HttpRequestHeaders]] = None, body: Optional[Union[bytes, HttpRequestBody]] = None, allow_status: Optional[Union[str, int, List[Union[str, int]]]] = None) HttpResponse [source]¶
Similar to
request()
but performing a
POST
request.
- async execute(request: HttpRequest, *, allow_status: Optional[Union[str, int, List[Union[str, int]]]] = None) HttpResponse [source]¶
Execute the specified
HttpRequest
instance using the request implementation configured in the
HttpClient
instance.
HttpRequestError
is raised for connection errors, connection and read timeouts, etc.
An
HttpResponse
instance is returned for successful responses in the
100-3xx
status code range. Otherwise, an exception of type
HttpResponseError
is raised.
Raising
HttpResponseError
can be suppressed for certain status codes using the
allow_status
parameter: a list of status code values for which an
HttpResponse
should be returned instead of raising
HttpResponseError
. The special value "*" allows any status code. There is no need to include
100-3xx
status codes in
allow_status
, because
HttpResponseError
is not raised for them.
- async batch_execute(*requests: HttpRequest, return_exceptions: bool = False, allow_status: Optional[Union[str, int, List[Union[str, int]]]] = None) List[Union[HttpResponse, HttpResponseError]] [source]¶
Similar to
execute()
but accepts a collection of
HttpRequest
instances to be executed as a batch.
The order of the returned
HttpResponse
instances corresponds to the order of the
HttpRequest
instances passed.
If any of the requests raises an exception upon execution, that exception is raised. To prevent this, set the
return_exceptions
parameter to
True
: exceptions are then returned in place alongside any successful
HttpResponse
, which lets you salvage the usable responses despite individual failures.
Like
execute()
,
HttpResponseError
is raised for responses with status codes in the
400-5xx
range. The
allow_status
parameter can be used here in the same way to prevent these exceptions from being raised. You can omit
allow_status="*"
if you’re passing
return_exceptions=True
; in that case an
HttpResponseError
is returned in place of the
HttpResponse
.
Lastly, an
HttpRequestError
may be raised in cases like connection errors, connection and read timeouts, etc.
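The ordering and `return_exceptions` semantics described above mirror those of `asyncio.gather`; a stdlib-only sketch (no web-poet involved, `fetch` is a stand-in for executing an `HttpRequest`):

```python
import asyncio

async def fetch(i: int) -> str:
    # Stand-in for executing one HttpRequest; request 1 fails.
    if i == 1:
        raise ValueError(f"request {i} failed")
    return f"response {i}"

async def main():
    # With return_exceptions=True, failures are returned in place,
    # so successful responses are not lost; order matches the input.
    return await asyncio.gather(*(fetch(i) for i in range(3)),
                                return_exceptions=True)

results = asyncio.run(main())
```

Here `results` keeps input order: the first and third entries are response strings, while the second is the `ValueError` instance itself rather than a raised exception.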
- class web_poet.page_inputs.http.RequestUrl(*args, **kwargs)¶
Bases:
RequestUrl
- class web_poet.page_inputs.http.ResponseUrl(*args, **kwargs)¶
Bases:
ResponseUrl
- class web_poet.page_inputs.http.HttpRequestBody[source]¶
Bases:
bytes
A container for holding the raw HTTP request body in bytes format.
- class web_poet.page_inputs.http.HttpResponseBody[source]¶
Bases:
bytes
A container for holding the raw HTTP response body in bytes format.
- class web_poet.page_inputs.http.HttpRequestHeaders[source]¶
Bases:
_HttpHeaders
A container for holding the HTTP request headers.
It can be instantiated from an iterable of tuples:
>>> pairs = [("Content-Encoding", "gzip"), ("content-length", "648")] >>> HttpRequestHeaders(pairs) <HttpRequestHeaders('Content-Encoding': 'gzip', 'content-length': '648')>
It also accepts a mapping of key-value pairs:
>>> pairs = {"Content-Encoding": "gzip", "content-length": "648"} >>> headers = HttpRequestHeaders(pairs) >>> headers <HttpRequestHeaders('Content-Encoding': 'gzip', 'content-length': '648')>
Note that this also supports case insensitive header-key lookups:
>>> headers.get("content-encoding") 'gzip' >>> headers.get("Content-Length") '648'
These are just a few of the functionalities it inherits from
multidict.CIMultiDict
. For more info on its other features, read the API spec of
multidict.CIMultiDict
.
- copy()¶
Return a copy of itself.
- classmethod from_bytes_dict(arg: Dict[AnyStr, Union[AnyStr, List, Tuple[AnyStr, ...]]], encoding: str = 'utf-8') T_headers ¶
An alternative constructor for instantiation where the header-value pairs could be in raw bytes form.
This supports multiple header values in the form of
List[bytes]
and
Tuple[bytes, ...]
alongside a plain
bytes
value. A value in
str
also works and wouldn’t break the decoding process.
By default, it decodes the
bytes
values using “utf-8”. However, this can easily be overridden using the
encoding
parameter.
>>> raw_values = { ... b"Content-Encoding": [b"gzip", b"br"], ... b"Content-Type": [b"text/html"], ... b"content-length": b"648", ... } >>> headers = _HttpHeaders.from_bytes_dict(raw_values) >>> headers <_HttpHeaders('Content-Encoding': 'gzip', 'Content-Encoding': 'br', 'Content-Type': 'text/html', 'content-length': '648')>
- classmethod from_name_value_pairs(arg: List[Dict]) T_headers ¶
An alternative constructor for instantiation using a
List[Dict]
where the ‘key’ is the header name while the ‘value’ is the header value.
>>> pairs = [ ... {"name": "Content-Encoding", "value": "gzip"}, ... {"name": "content-length", "value": "648"} ... ] >>> headers = _HttpHeaders.from_name_value_pairs(pairs) >>> headers <_HttpHeaders('Content-Encoding': 'gzip', 'content-length': '648')>
- class web_poet.page_inputs.http.HttpResponseHeaders[source]¶
Bases:
_HttpHeaders
A container for holding the HTTP response headers.
It can be instantiated from an iterable of tuples:
>>> pairs = [("Content-Encoding", "gzip"), ("content-length", "648")] >>> HttpResponseHeaders(pairs) <HttpResponseHeaders('Content-Encoding': 'gzip', 'content-length': '648')>
It also accepts a mapping of key-value pairs:
>>> pairs = {"Content-Encoding": "gzip", "content-length": "648"} >>> headers = HttpResponseHeaders(pairs) >>> headers <HttpResponseHeaders('Content-Encoding': 'gzip', 'content-length': '648')>
Note that this also supports case insensitive header-key lookups:
>>> headers.get("content-encoding") 'gzip' >>> headers.get("Content-Length") '648'
These are just a few of the functionalities it inherits from
multidict.CIMultiDict
. For more info on its other features, read the API spec of
multidict.CIMultiDict
.
- declared_encoding() Optional[str] [source]¶
Return the encoding detected from the Content-Type header, or None if the encoding is not found.
- copy()¶
Return a copy of itself.
- classmethod from_bytes_dict(arg: Dict[AnyStr, Union[AnyStr, List, Tuple[AnyStr, ...]]], encoding: str = 'utf-8') T_headers ¶
An alternative constructor for instantiation where the header-value pairs could be in raw bytes form.
This supports multiple header values in the form of
List[bytes]
and
Tuple[bytes, ...]
alongside a plain
bytes
value. A value in
str
also works and wouldn’t break the decoding process.
By default, it decodes the
bytes
values using “utf-8”. However, this can easily be overridden using the
encoding
parameter.
>>> raw_values = { ... b"Content-Encoding": [b"gzip", b"br"], ... b"Content-Type": [b"text/html"], ... b"content-length": b"648", ... } >>> headers = _HttpHeaders.from_bytes_dict(raw_values) >>> headers <_HttpHeaders('Content-Encoding': 'gzip', 'Content-Encoding': 'br', 'Content-Type': 'text/html', 'content-length': '648')>
- classmethod from_name_value_pairs(arg: List[Dict]) T_headers ¶
An alternative constructor for instantiation using a
List[Dict]
where the ‘key’ is the header name while the ‘value’ is the header value.
>>> pairs = [ ... {"name": "Content-Encoding", "value": "gzip"}, ... {"name": "content-length", "value": "648"} ... ] >>> headers = _HttpHeaders.from_name_value_pairs(pairs) >>> headers <_HttpHeaders('Content-Encoding': 'gzip', 'content-length': '648')>
- class web_poet.page_inputs.http.HttpRequest(url: Union[str, _Url], *, method: str = 'GET', headers=_Nothing.NOTHING, body=_Nothing.NOTHING)[source]¶
Bases:
object
Represents a generic HTTP request used by other functionalities in web-poet like
HttpClient
.- url: RequestUrl¶
- headers: HttpRequestHeaders¶
- body: HttpRequestBody¶
- class web_poet.page_inputs.http.HttpResponse(url: Union[str, _Url], body, *, status: Optional[int] = None, headers=_Nothing.NOTHING, encoding: Optional[str] = None)[source]¶
Bases:
SelectableMixin
, UrlShortcutsMixin
A container for the contents of a response, downloaded directly using an HTTP client.
url
should be the URL of the response (after all redirects), not the URL of the request, if possible.
body
contains the raw HTTP response body.
The following attributes are optional, since whether they are available depends on the source of the
HttpResponse
. For example, the response could come from a local HTML file, which doesn’t contain
headers
and
status
.
status
should represent the int status code of the HTTP response.
headers
should contain the HTTP response headers.
encoding
is the encoding of the response. If None (default), the encoding is auto-detected from headers and body content.
- url: ResponseUrl¶
- body: HttpResponseBody¶
- headers: HttpResponseHeaders¶
- property text: str¶
Content of the HTTP body, converted to unicode using the detected encoding of the response, according to the web browser rules (respecting Content-Type header, etc.)
- css(query) SelectorList ¶
A shortcut to
.selector.css()
.
- jmespath(query: str, **kwargs) SelectorList ¶
A shortcut to
.selector.jmespath()
.
- property selector: Selector¶
Cached instance of
parsel.selector.Selector
.
- urljoin(url: Union[str, RequestUrl, ResponseUrl]) RequestUrl ¶
Return url as an absolute URL.
If url is relative, it is made absolute relative to the base URL of self.
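`urljoin` follows the standard URL-joining rules; its behavior can be illustrated with the stdlib equivalent (this example uses stdlib `urllib.parse`, not web-poet itself):

```python
from urllib.parse import urljoin

base = "https://books.toscrape.com/catalogue/page-2.html"

# A relative URL is resolved against the base URL of the response...
next_page = urljoin(base, "page-3.html")

# ...while an absolute URL is returned unchanged.
external = urljoin(base, "https://example.com/")
```

This is how relative links extracted from a page can be turned into absolute URLs for follow-up requests.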
- xpath(query, **kwargs) SelectorList ¶
A shortcut to
.selector.xpath()
.
- web_poet.page_inputs.http.request_fingerprint(req: HttpRequest) str [source]¶
Return the fingerprint of the request.
- class web_poet.page_inputs.response.AnyResponse(response: Union[BrowserResponse, HttpResponse])[source]¶
Bases:
SelectableMixin
, UrlShortcutsMixin
A container that holds either
BrowserResponse
or
HttpResponse
.
- response: Union[BrowserResponse, HttpResponse]¶
- property url: ResponseUrl¶
URL of the response.
- css(query) SelectorList ¶
A shortcut to
.selector.css()
.
- jmespath(query: str, **kwargs) SelectorList ¶
A shortcut to
.selector.jmespath()
.
- property selector: Selector¶
Cached instance of
parsel.selector.Selector
.
- urljoin(url: Union[str, RequestUrl, ResponseUrl]) RequestUrl ¶
Return url as an absolute URL.
If url is relative, it is made absolute relative to the base URL of self.
- xpath(query, **kwargs) SelectorList ¶
A shortcut to
.selector.xpath()
.
- class web_poet.page_inputs.page_params.PageParams[source]¶
Bases:
dict
Container class that can hold arbitrary data to be passed into a Page Object.
Note that this is simply a subclass of Python’s
dict
.
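Since the class is a plain dict subclass, the usual dict API applies; a minimal sketch of the idea, using a stand-in class instead of importing web-poet:

```python
class PageParams(dict):
    """Stand-in sketch: arbitrary key-value data passed into a page object."""

# A crawler could pass per-page parameters like these:
params = PageParams({"currency": "EUR", "max_pages": 5})

currency = params.get("currency", "USD")  # regular dict access
locale = params.get("locale", "en")       # defaults work as usual
```

Inside a page object, such parameters are typically read with `get()` so that sensible defaults apply when a parameter was not provided.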
- class web_poet.page_inputs.stats.StatCollector[source]¶
Bases:
ABC
Base class for web-poet to implement the storing of data written through
Stats
.
- class web_poet.page_inputs.stats.DummyStatCollector[source]¶
Bases:
StatCollector
StatCollector
implementation that does not persist stats. It is used when running automatic tests, where stat storage is not necessary.
- class web_poet.page_inputs.stats.Stats(stat_collector=None)[source]¶
Bases:
object
Page input class to write key-value data pairs during parsing that you can inspect later. See Stats.
Stats can be set to a fixed value or, if numeric, incremented.
Stats are write-only.
Storage and read access of stats depends on the web-poet framework that you are using. Check the documentation of your web-poet framework to find out if it supports stats, and if so, how to read stored stats.
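A hypothetical in-memory stat collector illustrating the set/increment semantics described above (this is an illustrative sketch, not web-poet's own implementation):

```python
class MemoryStatCollector:
    """Stores stats in a plain dict (illustrative only)."""

    def __init__(self):
        self._stats = {}

    def set(self, key, value):
        # Stats can be set to a fixed value...
        self._stats[key] = value

    def inc(self, key, value=1):
        # ...or, if numeric, incremented; missing keys start at 0.
        self._stats[key] = self._stats.get(key, 0) + value

stats = MemoryStatCollector()
stats.set("spider/name", "books")
stats.inc("items/scraped")
stats.inc("items/scraped", 2)
```

After the calls above, `items/scraped` holds 3; in a real framework, reading such values back happens outside the page object.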
Pages¶
- class web_poet.pages.Injectable[source]¶
Bases:
ABC
, FieldsMixin
Base Page Object class, which all Page Objects should inherit from (probably through Injectable subclasses).
Frameworks which use
web-poet
Page Objects should use the
is_injectable()
function to detect if an object is an Injectable, and if it is, allow building it automatically through dependency injection, using the https://github.com/scrapinghub/andi library.
Instead of inheriting, you can also use
Injectable.register(MyWebPage)
.
Injectable.register
can also be used as a decorator.
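The `register` mechanism is standard `abc.ABC` virtual subclassing; a stdlib demonstration with a stand-in `Injectable` class (not the real web-poet one):

```python
from abc import ABC

class Injectable(ABC):  # stand-in for web_poet.pages.Injectable
    pass

@Injectable.register          # usable as a decorator...
class MyWebPage:
    pass

# ...MyWebPage is now a virtual subclass, without inheriting:
is_virtual_subclass = issubclass(MyWebPage, Injectable)
```

`issubclass()` and `isinstance()` now treat `MyWebPage` as an `Injectable`, even though `Injectable` does not appear in its MRO.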
- web_poet.pages.is_injectable(cls: Any) bool [source]¶
Return True if
cls
is a class which inherits fromInjectable
.
- class web_poet.pages.ItemPage[source]¶
Bases:
Extractor
[ItemT
], Injectable
Base class for page objects.
- class web_poet.pages.WebPage(response: HttpResponse)[source]¶
Bases:
ItemPage
[ItemT
], ResponseShortcutsMixin
Base Page Object which requires
HttpResponse
and provides XPath / CSS shortcuts.- response: HttpResponse¶
- css(query) SelectorList ¶
A shortcut to
.selector.css()
.
- jmespath(query: str, **kwargs) SelectorList ¶
A shortcut to
.selector.jmespath()
.
- property selector: Selector¶
Cached instance of
parsel.selector.Selector
.
- async to_item() ItemT ¶
Extract an item from a web page
- xpath(query, **kwargs) SelectorList ¶
A shortcut to
.selector.xpath()
.
- class web_poet.pages.Returns[source]¶
Bases:
Generic
[ItemT
]
Inherit from this generic mixin to change the item class used by
ItemPage
.
- class web_poet.pages.Extractor[source]¶
Bases:
Returns
[ItemT
], FieldsMixin
Base class for field support.
Mixins¶
- class web_poet.mixins.ResponseShortcutsMixin(*args, **kwargs)[source]¶
Common shortcut methods for working with HTML responses. This mixin could be used with Page Object base classes.
It requires a “response” attribute to be present.
- urljoin(url: str) str [source]¶
Convert url to absolute, taking into account the url and base url of the response.
- css(query) SelectorList ¶
A shortcut to
.selector.css()
.
- jmespath(query: str, **kwargs) SelectorList ¶
A shortcut to
.selector.jmespath()
.
- property selector: Selector¶
Cached instance of
parsel.selector.Selector
.
- xpath(query, **kwargs) SelectorList ¶
A shortcut to
.selector.xpath()
.
Requests¶
- web_poet.requests.request_downloader_var: ContextVar = <ContextVar name='request_downloader'>¶
Frameworks that want to support additional requests in
web-poet
should set the appropriate implementation ofrequest_downloader_var
for requesting data.
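The wiring can be sketched with the stdlib contextvars module. The names below are illustrative stand-ins, not the real web-poet objects:

```python
from contextvars import ContextVar

request_downloader_var: ContextVar = ContextVar("request_downloader")

async def my_downloader(request):
    # A framework-provided coroutine that turns a request into a response.
    return f"response for {request}"

# The framework sets the implementation before page objects run:
token = request_downloader_var.set(my_downloader)
downloader = request_downloader_var.get()

# Resetting restores the previous (unset) state; reading an unset
# ContextVar without a default raises LookupError.
request_downloader_var.reset(token)
try:
    request_downloader_var.get()
    unset_raises = False
except LookupError:
    unset_raises = True
```

Accessing the contextvar without a value set is exactly the situation that `RequestDownloaderVarError` (below) reports.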
Exceptions¶
Core Exceptions¶
These exceptions are tied to how web-poet operates.
- exception web_poet.exceptions.core.RequestDownloaderVarError[source]¶
The
web_poet.request_downloader_var
had its contents accessed, but no value was set at the time the requests were executed.
See the documentation section about setting up the contextvars to learn more about this.
- exception web_poet.exceptions.core.PageObjectAction[source]¶
Base class for exceptions that can be raised from a page object to indicate something to be done about that page object.
- exception web_poet.exceptions.core.Retry[source]¶
The page object found that the input data is partial or empty, and a request retry may provide better input.
- exception web_poet.exceptions.core.UseFallback[source]¶
The page object cannot extract data from the input, but the input seems valid, so an alternative data extraction implementation for the same item type may succeed.
- exception web_poet.exceptions.core.NoSavedHttpResponse(msg: Optional[str] = None, request: Optional[HttpRequest] = None)[source]¶
Indicates that there is no saved response for this request.
Can only be raised when a
HttpClient
instance is used to get saved responses.- Parameters
request (HttpRequest) – The
HttpRequest
instance that was used.
HTTP Exceptions¶
These are exceptions pertaining to common issues faced when executing HTTP operations.
- exception web_poet.exceptions.http.HttpError(msg: Optional[str] = None, request: Optional[HttpRequest] = None)[source]¶
Bases:
OSError
Indicates that an exception has occurred when handling an HTTP operation.
This is used as a base class for more specific errors, and is intentionally vague, since it could denote problems in either the HTTP request or the response.
For more specific errors, it is better to use
For more specific errors, it would be better to use
HttpRequestError
andHttpResponseError
.- Parameters
request (HttpRequest) – Request that triggered the exception.
- request: Optional[HttpRequest]¶
Request that triggered the exception.
- exception web_poet.exceptions.http.HttpRequestError(msg: Optional[str] = None, request: Optional[HttpRequest] = None)[source]¶
Bases:
HttpError
Indicates that an exception has occurred when the HTTP Request was being handled.
- Parameters
request (HttpRequest) – The
HttpRequest
instance that was used.
- exception web_poet.exceptions.http.HttpResponseError(msg: Optional[str] = None, response: Optional[HttpResponse] = None, request: Optional[HttpRequest] = None)[source]¶
Bases:
HttpError
Indicates that an exception has occurred when the HTTP Response was received.
For responses that are in the status code
100-3xx range
, this exception shouldn’t be raised at all. However, for responses in the
400-5xx
range, it will be raised by web-poet.
Note
Frameworks implementing web-poet should NOT raise this exception.
This exception is raised by web-poet itself, based on the
allow_status
parameter found in the methods of
HttpClient
.
- Parameters
request (HttpRequest) – Request that got the response that triggered the exception.
response (HttpResponse) – Response that triggered the exception.
- response: Optional[HttpResponse]¶
Response that triggered the exception.
Apply Rules¶
See Rules for more context about its use cases and some examples.
- web_poet.handle_urls(include: Union[str, Iterable[str]], *, overrides: Optional[Type[ItemPage]] = None, instead_of: Optional[Type[ItemPage]] = None, to_return: Optional[Type] = None, exclude: Optional[Union[str, Iterable[str]]] = None, priority: int = 500, **kwargs)¶
Class decorator that indicates that the decorated Page Object should work for the given URL patterns.
The URL patterns are matched using the
include
andexclude
parameters whilepriority
breaks any ties. See the documentation of the url-matcher package for more information about them.
This decorator is able to derive the item class returned by the Page Object. This is important since it marks what type of item the Page Object is capable of returning for the given URL patterns. For certain advanced cases, you can pass a
to_return
parameter which replaces any derived values (though this isn’t generally recommended).
Passing another Page Object into the
instead_of
parameter indicates that the decorated Page Object will be used instead of that one for the given set of URL patterns. See Rule precedence.
Any extra parameters are stored as meta information that can be used later.
- Parameters
include – The URLs that should be handled by the decorated Page Object.
instead_of – The Page Object that should be replaced.
to_return – The item class holding the data returned by the Page Object. This can be omitted, as it can be derived from the
Returns[ItemClass]
or
ItemPage[ItemClass]
declaration of the Page Object. See the Items section.
exclude – The URLs for which the Page Object should not be applied.
priority – The resolution priority in case of conflicting rules. A conflict happens when the
include
,override
, andexclude
parameters are the same. If so, the highest priority will be chosen.
- class web_poet.rules.ApplyRule(for_patterns: Union[str, Patterns], *, use: Type[ItemPage], instead_of: Optional[Type[ItemPage]] = None, to_return: Optional[Type[Any]] = None, meta: Dict[str, Any] = _Nothing.NOTHING)[source]¶
A rule that primarily applies Page Object and Item overrides for a given URL pattern.
This is instantiated when using the
web_poet.handle_urls()
decorator. It’s also returned as a
List[ApplyRule]
when calling the
web_poet.default_registry
’s
get_rules()
method.
You can access any of its attributes:
for_patterns
- contains the list of URL patterns associated with this rule. You can read the API documentation of the url-matcher package for more information about the patterns.
use
- The Page Object that will be used when the URL pattern represented by the
for_patterns
attribute is matched.
instead_of
- (optional) The Page Object that will be replaced with the Page Object specified via the
use
parameter.
to_return
- (optional) The item class that the Page Object specified in
use
is capable of returning.
meta
- (optional) Any other information you may want to store. This doesn’t do anything for now but may be useful for future API updates.
The main functionality of this class lies in the
instead_of
and
to_return
parameters. Should both of these be omitted, then
ApplyRule
simply tags which URL patterns the given Page Object defined in
use
is expected to be used on.
When
to_return
is not None (e.g.
to_return=MyItem
), the Page Object in
use
is declared as capable of returning a certain item class (i.e.
MyItem
).
When
instead_of
is not None (e.g.
instead_of=ReplacedPageObject
), the rule adds an expectation that the
ReplacedPageObject
won’t be used for the URLs matching
for_patterns
, since the Page Object in
use
will replace it.
If there are multiple rules which match a certain URL, the rule to apply is picked based on the priorities set in
for_patterns
.
More information regarding its usage can be found in Rules.
Tip
The
ApplyRule
is also hashable. This makes it easy to store unique rules and identify any duplicates.
- class web_poet.rules.RulesRegistry(*, rules: Optional[Iterable[ApplyRule]] = None)[source]¶
RulesRegistry provides features for storing, retrieving, and searching for
ApplyRule
instances.
web-poet
provides a default registry named
default_registry
for convenience. It can be accessed this way:
from web_poet import handle_urls, default_registry, WebPage from my_items import Product @handle_urls("example.com") class ExampleComProductPage(WebPage[Product]): ... rules = default_registry.get_rules()
The
@handle_urls
decorator exposed as
web_poet.handle_urls
is a shortcut for
default_registry.handle_urls
.
Note
It is encouraged to use the
web_poet.default_registry
instead of creating your own
RulesRegistry
instance. Using multiple registries would be unwieldy in most cases.
However, it might be applicable in certain scenarios, like storing custom rules to separate them from the
default_registry
.- add_rule(rule: ApplyRule) None [source]¶
Registers a
web_poet.rules.ApplyRule
instance.
- classmethod from_override_rules(rules: List[ApplyRule]) RulesRegistryTV [source]¶
Deprecated. Use
RulesRegistry(rules=...)
instead.
- get_rules() List[ApplyRule] [source]¶
Return all the
ApplyRule
instances that were declared using the
@handle_urls
decorator.
Note
Remember to consider calling
consume_modules()
beforehand, to recursively import all submodules which contain the
@handle_urls
decorators from external Page Objects.
- get_overrides() List[ApplyRule] [source]¶
Deprecated, use
get_rules()
instead.
- search(**kwargs) List[ApplyRule] [source]¶
Return any
ApplyRule
from the registry that matches all the provided attributes.
Sample usage:
rules = registry.search(use=ProductPO, instead_of=GenericPO) print(len(rules)) # 1 print(rules[0].use) # ProductPO print(rules[0].instead_of) # GenericPO
- web_poet.rules.consume_modules(*modules: str) None [source]¶
This recursively imports all packages/modules so that the
@handle_urls
decorators are properly discovered and imported.
Let’s take a look at an example:
# FILE: my_page_obj_project/load_rules.py from web_poet import default_registry, consume_modules consume_modules("other_external_pkg.po", "another_pkg.lib") rules = default_registry.get_rules()
For this case, the
ApplyRule
instances come from:
my_page_obj_project
(since it’s the same module as the file above)
other_external_pkg.po
another_pkg.lib
any other modules that were imported in the same process inside the packages/modules above.
If the
default_registry
had other
@handle_urls
decorators outside of the packages/modules listed above, then the corresponding
ApplyRule
won’t be returned, unless they were recursively imported in a way similar to
consume_modules()
.
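The recursive import itself can be sketched with the stdlib pkgutil module. `consume_package` is a hypothetical helper (not the web-poet implementation), shown here walking the stdlib `json` package:

```python
import importlib
import pkgutil
import json  # any package works; stdlib json has several submodules

def consume_package(package) -> list:
    """Sketch of recursive submodule import, similar in spirit to
    web_poet.rules.consume_modules()."""
    imported = []
    for info in pkgutil.walk_packages(package.__path__, package.__name__ + "."):
        importlib.import_module(info.name)  # import triggers any decorators
        imported.append(info.name)
    return imported

modules = consume_package(json)
```

Because decorators like `@handle_urls` run at import time, importing every submodule this way is what makes the corresponding rules visible in the registry.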
- class web_poet.rules.OverrideRule(*args, **kwargs)¶
- class web_poet.rules.PageObjectRegistry(*args, **kwargs)¶
Fields¶
web_poet.fields
is a module with helpers for putting extraction logic
into separate Page Object methods / properties.
- class web_poet.fields.FieldInfo(name: str, meta: Optional[dict] = None, out: Optional[List[Callable]] = None)[source]¶
Information about a field
- web_poet.fields.field(method=None, *, cached: bool = False, meta: Optional[dict] = None, out: Optional[List[Callable]] = None)[source]¶
A Page Object method decorated with the
@field
decorator becomes a property, which is then used by
ItemPage
’s to_item() method to populate a corresponding item attribute.
By default, the value is computed on each property access. Use
@field(cached=True)
to cache the property value.
The
meta
parameter allows storing arbitrary information for the field, e.g.
@field(meta={"expensive": True})
. This information can later be retrieved for all fields using the
get_fields_dict()
function.
The
out
parameter is an optional list of field processors: functions applied to the value of the field before returning it.
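The property-based mechanics can be sketched without web-poet. Below is a simplified, synchronous stand-in for `@field` and `to_item()`; the real web-poet versions differ (e.g. they support async and caching):

```python
def field(method):
    """Stand-in sketch: mark the method and turn it into a property."""
    prop = property(method)
    prop.fget._is_field = True  # tag so to_item() can find it
    return prop

class BookPage:
    def __init__(self, html_title: str):
        self._title = html_title

    @field
    def name(self):
        return self._title.strip()

    def to_item(self) -> dict:
        # Collect every @field-decorated property into a dict item.
        cls = type(self)
        return {
            attr: getattr(self, attr)
            for attr in dir(cls)
            if isinstance(getattr(cls, attr), property)
            and getattr(getattr(cls, attr).fget, "_is_field", False)
        }

item = BookPage("  A Light in the Attic  ").to_item()
```

Each decorated method reads like a plain attribute (`page.name`), while `to_item()` assembles all such fields into the item.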
- web_poet.fields.get_fields_dict(cls_or_instance) Dict[str, FieldInfo] [source]¶
Return a dictionary with information about the fields defined for the class: keys are field names, and values are
web_poet.fields.FieldInfo
instances.
- async web_poet.fields.item_from_fields(obj, item_cls: ~typing.Type[~web_poet.fields.T] = <class 'dict'>, *, skip_nonitem_fields: bool = False) T [source]¶
Return an item of
item_cls
type, with its attributes populated from the
obj
methods decorated with the
field
decorator.
If
skip_nonitem_fields
is True,
@fields
whose names are not among
item_cls
field names are not passed to
item_cls.__init__
.
When
skip_nonitem_fields
is False (default), all
@fields
are passed to
item_cls.__init__
, possibly causing exceptions if
item_cls.__init__
doesn’t support them.
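The `skip_nonitem_fields` behavior can be sketched with a dataclass-based stand-in. `build_item` is a hypothetical helper, not the real implementation:

```python
from dataclasses import dataclass, fields as dc_fields

@dataclass
class Product:
    name: str

def build_item(item_cls, data: dict, *, skip_nonitem_fields: bool = False):
    """Sketch of the skip_nonitem_fields behavior of item_from_fields()."""
    if skip_nonitem_fields:
        # Drop extracted values that item_cls.__init__ doesn't accept.
        allowed = {f.name for f in dc_fields(item_cls)}
        data = {k: v for k, v in data.items() if k in allowed}
    return item_cls(**data)

extracted = {"name": "Box", "debug_url": "https://example.com"}
item = build_item(Product, extracted, skip_nonitem_fields=True)

try:
    build_item(Product, extracted)  # extra field -> __init__ fails
    init_failed = False
except TypeError:
    init_failed = True
```

With skipping enabled, the unknown `debug_url` key is silently dropped; with the default, `Product.__init__` rejects it.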
- web_poet.fields.item_from_fields_sync(obj, item_cls: ~typing.Type[~web_poet.fields.T] = <class 'dict'>, *, skip_nonitem_fields: bool = False) T [source]¶
Synchronous version of
item_from_fields()
.
typing.Annotated support¶
- class web_poet.annotated.AnnotatedInstance(result: Any, metadata: Tuple[Any, ...])[source]¶
Wrapper for instances of annotated dependencies.
It is used when both the dependency value and the dependency annotation are needed.
- Parameters
result (Any) – The wrapped dependency instance.
metadata (Tuple[Any, ...]) – The copy of the annotation.
Utils¶
- web_poet.utils.get_fq_class_name(cls: type) str [source]¶
Return the fully qualified name for a type.
>>> from web_poet import Injectable >>> get_fq_class_name(Injectable) 'web_poet.pages.Injectable' >>> from decimal import Decimal >>> get_fq_class_name(Decimal) 'decimal.Decimal'
- web_poet.utils.memoizemethod_noargs(method: CallableT) CallableT [source]¶
Decorator to cache the result of a method (without arguments) using a weak reference to its object.
It is faster than
cached_method()
, and doesn’t add new attributes to the instance, but it doesn’t work if objects are unhashable.
- web_poet.utils.cached_method(method: CallableT) CallableT [source]¶
A decorator to cache method or coroutine method results, so that if it’s called multiple times for the same instance, computation is only done once.
The cache is unbound, but it’s tied to the instance lifetime.
Note
cached_method()
is needed because
functools.lru_cache()
doesn’t work well on methods: self is used as a cache key, so a reference to the instance is kept in the cache, which prevents deallocation of instances.
This decorator adds a new private attribute to the instance, named
_cached_method_{decorated_method_name}
; make sure the class doesn’t define an attribute of the same name.
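A minimal sketch of per-instance caching that stores the result under a private instance attribute as described above (an illustration of the design, not the real implementation, which also handles coroutine methods and arguments):

```python
import functools

def cached_method(method):
    """Cache the result on the instance, so the cache dies with it."""
    attr = f"_cached_method_{method.__name__}"

    @functools.wraps(method)
    def wrapper(self):
        if not hasattr(self, attr):
            setattr(self, attr, method(self))
        return getattr(self, attr)
    return wrapper

class Page:
    calls = 0

    @cached_method
    def expensive(self):
        type(self).calls += 1
        return "result"

page = Page()
first, second = page.expensive(), page.expensive()
```

Because the cache lives on `page` itself, no global structure keeps a reference to the instance, so the instance (and its cache) can be garbage-collected normally.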
- web_poet.utils.as_list(value: Optional[Any]) List[Any] [source]¶
Normalizes the value input as a list.
>>> as_list(None) [] >>> as_list("foo") ['foo'] >>> as_list(123) [123] >>> as_list(["foo", "bar", 123]) ['foo', 'bar', 123] >>> as_list(("foo", "bar", 123)) ['foo', 'bar', 123] >>> as_list(range(5)) [0, 1, 2, 3, 4] >>> def gen(): ... yield 1 ... yield 2 >>> as_list(gen()) [1, 2]
Example framework¶
The web_poet.example
module is a simplified, incomplete example of a
web-poet framework, written as support material for the tutorial.
No part of the web_poet.example
module is intended for production use,
and it may change in a backward-incompatible way at any point in the future.
- web_poet.example.get_item(url: str, item_cls: Type, *, page_params: Optional[Dict[Any, Any]] = None) Any [source]¶
Returns an item built from the specified URL using a page object class from the default registry.
This function is an example of a minimal, incomplete web-poet framework implementation, intended for use in the web-poet tutorial.
Contributing¶
web-poet is an open-source project. Your contribution is very welcome!
Issue Tracker¶
If you have a bug report, a new feature proposal, or would simply like to ask a question, please check our issue tracker on GitHub: https://github.com/scrapinghub/web-poet/issues
Source code¶
Our source code is hosted on Github: https://github.com/scrapinghub/web-poet
Before opening a pull request, it might be worth checking current and previous issues. Some code changes might also require discussion before being accepted, so consider opening a new issue before implementing large or breaking changes.
Testing¶
We use tox to run tests with different Python versions:
tox
The command above also runs type checks; we use mypy.
Changelog¶
0.17.0 (2024-03-04)¶
Now requires
andi >= 0.5.0
.Package requirements that were unversioned now have minimum versions specified.
Added support for Python 3.12.
Added support for
typing.Annotated
dependencies to the serialization and testing code.Documentation improvements.
CI improvements.
0.16.0 (2024-01-23)¶
Added new
AnyResponse
which holds either
BrowserResponse
or
HttpResponse
.
Documentation improvements.
0.15.1 (2023-11-21)¶
HttpRequestHeaders
now has afrom_bytes_dict
class method, likeHttpResponseHeaders
.
0.15.0 (2023-09-11)¶
0.14.0 (2023-08-03)¶
Dropped Python 3.7 support.
Now requires
packaging >= 20.0
.Fixed detection of the
Returns
base class.Improved docs.
Updated type hints.
Updated CI tools.
0.13.1 (2023-05-30)¶
Fixed an issue with
HttpClient
which happens when a response with a non-standard status code is received.
0.13.0 (2023-05-30)¶
A new dependency
BrowserResponse
has been added. It contains a browser-rendered page URL, status code and HTML.The Rules documentation section has been rewritten.
0.12.0 (2023-05-05)¶
The testing framework now allows defining a custom item adapter.
We have made a backward-incompatible change on test fixture serialization: the
type_name
field of exceptions has been renamed toimport_path
.Fixed built-in Python types, e.g.
int
, not working as field processors.
0.11.0 (2023-04-24)¶
JMESPath support is now available: you can use
WebPage.jmespath()
andHttpResponse.jmespath()
to run queries on JSON responses.The testing framework now supports page objects that raise exceptions from the
to_item
method.
0.10.0 (2023-04-19)¶
New class
Extractor
can be used for easier extraction of nested fields (see Processors for nested fields).Exceptions raised while getting a response for an additional request are now saved in test fixtures.
Multiple documentation improvements and fixes.
Added a twine check step to CI.
0.9.0 (2023-03-30)¶
Standardized input validation.
Field processors can now also be defined through a nested Processors class, so that field redefinitions in subclasses can inherit them. See Default processors.
Field processors can now opt in to receive the page object whose field is being read.
web_poet.fields.FieldsMixin now keeps fields from all base classes when using multiple inheritance.
Fixed the documentation build.
0.8.1 (2023-03-03)¶
Fixed the error when calling .to_item(), item_from_fields_sync(), or item_from_fields() on page objects defined as slotted attrs classes while setting skip_nonitem_fields=True.
0.8.0 (2023-02-23)¶
This release contains many improvements to the web-poet testing framework, as well as some other improvements and bug fixes.
Backward-incompatible changes:
cached_method() no longer caches exceptions for async def methods. This makes the behavior the same for sync and async methods, and also makes it consistent with Python’s stdlib caching (i.e. functools.lru_cache(), functools.cached_property()).
The testing framework now uses the HttpResponse-info.json file name instead of HttpResponse-other.json to store information about HttpResponse instances. To make tests generated with older versions of web-poet work, rename these files on disk.
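The stdlib behavior referenced here can be seen with functools.lru_cache, which does not store a call that raised, so a later call with the same arguments re-executes the function:

```python
import functools

calls = {"count": 0}


@functools.lru_cache(maxsize=None)
def flaky(x):
    # Raises on the first call only. lru_cache does not cache the
    # exception, so the next call with the same argument re-runs
    # the function instead of replaying the failure.
    calls["count"] += 1
    if calls["count"] == 1:
        raise ValueError("transient failure")
    return x * 2


try:
    flaky(3)
except ValueError:
    pass

result = flaky(3)  # re-executed, not replayed from the cache
```

Only successful results are cached; a transient failure on the first call does not poison subsequent calls.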
Testing framework improvements:
Improved test reporting: better diffs and error messages.
By default, the pytest plugin now generates a test per item attribute (see Running tests). There is also an option (--web-poet-test-per-item) to run a test per item instead.
Page objects with the HttpClient dependency are now supported (see Additional requests support).
Page objects with the PageParams dependency are now supported.
Added a new python -m web_poet.testing rerun command (see Test-Driven Development).
Fixed support for nested (indirect) dependencies in page objects. Previously they were not handled properly by the testing framework.
Non-ASCII output is now stored without escaping in the test fixtures, for better readability.
Other changes:
Testing and CI fixes.
Fixed a packaging issue: the tests and tests_extra packages were installed, not just web_poet.
0.7.2 (2023-02-01)¶
Restored the minimum version of itemadapter from 0.7.1 back to 0.7.0, and added safeguards to prevent a similar issue from happening again in the future.
0.7.1 (2023-02-01)¶
Updated the tutorial to cover recent features and focus on best practices. Also added a new module, web_poet.example, that allows using page objects while following the tutorial.
Tests for page objects now cover Git LFS and scrapy-poet, and the documentation recommends python -m pytest instead of pytest.
Improved the warning message shown when duplicate ApplyRule objects are found.
HttpResponse-other.json content is now indented for better readability.
Improved test coverage for fields.
0.7.0 (2023-01-18)¶
Add a framework for creating tests and running them with pytest.
Support implementing fields in mixin classes.
Introduced new methods for web_poet.rules.RulesRegistry.
Improved the performance of web_poet.rules.RulesRegistry.search(): passing a single parameter of either instead_of or to_return now results in O(1) look-up time instead of O(N). Additionally, having either instead_of or to_return present in multi-parameter search calls filters the initial candidate results, resulting in a faster search.
Support page object dependency serialization.
Added new dependencies used in the testing and serialization code: andi, python-dateutil, and time-machine. Also backports.zoneinfo on non-Windows platforms when the Python version is older than 3.9.
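The O(1)-vs-O(N) search improvement comes down to indexing rules by key rather than scanning them. A plain-Python sketch of the idea (illustrative only, not web-poet's actual internals; the rule fields and values here are invented):

```python
from collections import defaultdict

# Hypothetical rules, each keyed by what it replaces and what it returns
rules = [
    {"instead_of": "OldPage", "to_return": "Product"},
    {"instead_of": "OldPage", "to_return": "Article"},
    {"instead_of": "LegacyPage", "to_return": "Product"},
]

# Build an index once: single-key searches become dict lookups
by_instead_of = defaultdict(list)
for rule in rules:
    by_instead_of[rule["instead_of"]].append(rule)

# O(1): a dict lookup instead of scanning every rule
old_page_rules = by_instead_of["OldPage"]

# Multi-parameter search: start from the indexed candidates, then filter,
# so the linear scan only runs over a (usually small) candidate subset
matches = [r for r in old_page_rules if r["to_return"] == "Product"]
```

Building the index costs one pass up front, which pays off when searches are frequent relative to rule registration.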
0.6.0 (2022-11-08)¶
In this release, the @handle_urls decorator gets an overhaul; it is no longer required to pass another page object class to @handle_urls("...", overrides=...).
Also, the @web_poet.field decorator gets support for output processing functions, via the out argument.
Full list of changes:
Backwards incompatible: PageObjectRegistry no longer supports dict-like access.
Official support for Python 3.11.
New @web_poet.field(out=[...]) argument, which allows setting output processing functions for web-poet fields.
The web_poet.overrides module is deprecated and replaced with web_poet.rules.
The @handle_urls decorator now creates ApplyRule instances instead of OverrideRule instances; OverrideRule is deprecated. ApplyRule is similar to OverrideRule, but has the following differences:
ApplyRule accepts a to_return parameter, which should be the data container (item) class that the page object returns.
Passing a string to for_patterns auto-converts it into url_matcher.Patterns.
All arguments are now keyword-only, except for for_patterns.
New signature and behavior of handle_urls:
The overrides parameter is made optional and renamed to instead_of.
If defined, the item class declared in a subclass of web_poet.ItemPage is used as the to_return parameter of ApplyRule.
Multiple handle_urls annotations are allowed.
PageObjectRegistry is replaced with RulesRegistry; its API has changed:
backwards incompatible: the dict-like API is removed;
backwards incompatible: O(1) lookups using .search(use=PageObject) have become O(N);
the search_overrides method is renamed to search;
the get_overrides method is renamed to get_rules;
the from_override_rules method is deprecated; use RulesRegistry(rules=...) instead.
Typing improvements.
Documentation, test, and warning message improvements.
Deprecations:
The web_poet.overrides module is deprecated. Use web_poet.rules instead.
The overrides parameter of @handle_urls is now deprecated. Use the instead_of parameter instead.
The OverrideRule class is now deprecated. Use ApplyRule instead.
PageObjectRegistry is now deprecated. Use RulesRegistry instead.
The from_override_rules method of PageObjectRegistry is now deprecated. Use RulesRegistry(rules=...) instead.
The PageObjectRegistry.get_overrides method is deprecated. Use PageObjectRegistry.get_rules instead.
The PageObjectRegistry.search_overrides method is deprecated. Use PageObjectRegistry.search instead.
0.5.1 (2022-09-23)¶
The BOM encoding from the response body is now read before the response headers when deriving the response encoding.
Minor typing improvements.
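The precedence described above can be illustrated with the standard library; this is a simplified sketch of the idea only, not web-poet's actual encoding-detection code, and the header-declared encoding here is hypothetical:

```python
import codecs

# The body starts with a UTF-8 BOM, while the response headers
# (hypothetically) declare a different encoding.
body = codecs.BOM_UTF8 + "café".encode("utf-8")
header_encoding = "latin-1"

# The BOM, when present, wins over the header-declared encoding.
if body.startswith(codecs.BOM_UTF8):
    encoding = "utf-8"
    body = body[len(codecs.BOM_UTF8):]
else:
    encoding = header_encoding

text = body.decode(encoding)
```

Checking the BOM first avoids mojibake when servers send a wrong or stale charset header for a UTF-8 body.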
0.5.0 (2022-09-21)¶
Web-poet now includes a mini-framework for organizing extraction code as Page Object properties:
import attrs
from web_poet import field, ItemPage

@attrs.define
class MyItem:
    foo: str
    bar: list[str]

class MyPage(ItemPage[MyItem]):
    @field
    def foo(self):
        return "..."

    @field
    def bar(self):
        return ["...", "..."]
Backwards incompatible changes:
web_poet.ItemPage is no longer an abstract base class that requires a to_item method to be implemented. Instead, it provides a default async def to_item method implementation which uses fields marked as web_poet.field to create an item. This change shouldn’t affect user code in a backwards-incompatible way, but it might affect typing.
Deprecations:
web_poet.ItemWebPage is deprecated. Use web_poet.WebPage instead.
Other changes:
web-poet is declared as PEP 561 package which provides typing information; mypy is going to use it by default.
Documentation, test, typing and CI improvements.
0.4.0 (2022-07-26)¶
New HttpResponse.urljoin method, which takes the page’s base URL into account.
New HttpRequest.urljoin method.
Standardized web_poet.exceptions.Retry exception, which allows initiating a retry from the page object, e.g. based on page content.
Documentation improvements.
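The value of base-URL-aware joining can be illustrated with the standard library's urljoin; the URLs below are invented, and this sketch only mirrors the idea (web-poet derives the effective base URL from the page, e.g. when the HTML declares one):

```python
from urllib.parse import urljoin

# URL the response was fetched from
response_url = "https://example.com/products/page1.html"
# Effective base URL for the page, which may differ from the
# response URL (for example, via an HTML <base> element)
base_url = "https://example.com/catalog/"

# Joining a relative link against the effective base vs. the raw URL
base_aware = urljoin(base_url, "item2.html")
naive = urljoin(response_url, "item2.html")
```

When the two differ, joining against the response URL alone produces broken links, which is why a base-aware urljoin matters.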
0.3.0 (2022-06-14)¶
Backwards Incompatible Change:
web_poet.requests.request_backend_var is renamed to web_poet.requests.request_downloader_var.
Documentation and CI improvements.
0.2.0 (2022-06-10)¶
Backward Incompatible Change:
ResponseData is replaced with HttpResponse. HttpResponse exposes methods useful for web scraping (such as XPath and CSS selectors and JSON loading), and handles web page encoding detection. There are also new types, like HttpResponseBody and HttpResponseHeaders.
Added support for performing additional requests using web_poet.HttpClient.
Introduced the web_poet.BrowserHtml dependency.
Introduced web_poet.PageParams to pass arbitrary information inside a page object.
Added the web_poet.handle_urls decorator, which allows declaring which websites should be handled by the page objects. The lower-level PageObjectRegistry class is also available.
Removed support for Python 3.6.
Added support for Python 3.10.
0.1.1 (2021-06-02)¶
Added base_url and urljoin shortcuts.
0.1.0 (2020-07-18)¶
Documentation
WebPage, ItemPage, ItemWebPage, Injectable and ResponseData are available as top-level imports (e.g. web_poet.ItemPage).
0.0.1 (2020-04-27)¶
Initial release.
License¶
Copyright (c) Zyte Group Ltd All rights reserved.
Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met:
Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer.
Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution.
Neither the name of Zyte nor the names of its contributors may be used to endorse or promote products derived from this software without specific prior written permission.
THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS “AS IS” AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.