Tutorial
In this tutorial you will learn to use web-poet as you write web scraping code for book detail pages from books.toscrape.com.
To follow this tutorial you must first be familiar with Python and have installed web-poet.
Create a project directory
web-poet does not limit how you structure your web-poet web scraping code, beyond the limitations of Python itself.
However, in this tutorial you will use a specific project directory structure designed with web-poet best practices in mind. Consider using a similar project directory structure in all your web-poet projects.
First create your project directory: tutorial-project/.
Within the tutorial-project directory, create:
A run.py file, a file specific to this tutorial where you will put code to test the execution of your web scraping code.
A tutorial directory, where you will place your web scraping code.
Within the tutorial-project/tutorial directory, create:
An __init__.py file, so that the tutorial directory becomes an importable Python module.
An items.py file, where you will define item classes to store extracted data.
A pages directory, where you will define your page object classes.
Within the tutorial-project/tutorial/pages directory, create:
An __init__.py file.
A books_toscrape_com.py file, for page object class code targeting books.toscrape.com.
Your project directory should look as follows:
tutorial-project
├── run.py
└── tutorial
    ├── __init__.py
    ├── items.py
    └── pages
        ├── __init__.py
        └── books_toscrape_com.py
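If you prefer to create this layout with a script instead of by hand, the following is a minimal sketch that uses only the Python standard library. The FILES list and the create_layout helper are names made up for this illustration; they are not part of web-poet.

```python
from pathlib import Path

# Files of the tutorial layout, relative to the project directory.
FILES = [
    "run.py",
    "tutorial/__init__.py",
    "tutorial/items.py",
    "tutorial/pages/__init__.py",
    "tutorial/pages/books_toscrape_com.py",
]


def create_layout(project_dir):
    """Create the empty files of the tutorial layout under project_dir."""
    project_dir = Path(project_dir)
    created = []
    for relative in FILES:
        path = project_dir / relative
        # Create intermediate directories such as tutorial/pages as needed.
        path.parent.mkdir(parents=True, exist_ok=True)
        # touch() leaves the content of an already existing file untouched.
        path.touch(exist_ok=True)
        created.append(path)
    return created
```

Calling create_layout("tutorial-project") from the directory where you want the project to live should produce the tree shown above, with every file empty.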
Create an item class
While it is possible to store the extracted data in a Python dictionary, it is a good practice to create an item class that:
Defines the specific attributes that you aim to extract, triggering an exception if you extract unintended attributes or fail to extract expected attributes.
Allows defining default values for some attributes.
web-poet uses itemadapter for item class support, which means that any kind of item class can be used. In this tutorial, you will use attrs to define your item class.
Copy the following code into tutorial-project/tutorial/items.py:
from attrs import define


@define
class Book:
    title: str
This code defines a Book item class, with a single required title string attribute to store the book title.
Book is a minimal class designed specifically for this tutorial. In real web-poet projects, you will usually define item classes with many more attributes.
Tip
For an example of real item classes, see the zyte-common-items library.
Also keep in mind that, while in this tutorial you use Book only for data from one website, books.toscrape.com, item classes are usually meant to be usable for many different websites that provide data with a similar data schema.
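To make the two benefits of item classes listed earlier more concrete, here is a small sketch. It deliberately uses the standard-library dataclasses module instead of attrs, so that it runs without extra dependencies; the behavior shown (rejecting missing or unexpected attributes, and supporting defaults) is the general idea, not a description of attrs internals. The subtitle attribute is hypothetical and exists only to show a default value.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Book:
    title: str  # required attribute
    subtitle: Optional[str] = None  # hypothetical attribute with a default


# A valid item: the missing optional attribute falls back to its default.
book = Book(title="The Exiled")
print(book)

# A missing required attribute and an unexpected attribute are both rejected.
for bad_kwargs in ({}, {"title": "The Exiled", "price": 10}):
    try:
        Book(**bad_kwargs)
    except TypeError as error:
        print(f"rejected {bad_kwargs}: {error}")
```

A plain dictionary would silently accept both of the rejected inputs, which is the kind of mistake an item class helps you catch early.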
Create a page object class
To write web parsing code with web-poet, you write page object classes, Python classes that define how to extract data from a given type of input, usually some type of webpage from a specific website.
In this tutorial you will write a page object class for webpages of books.toscrape.com that show details about a book, such as these:
http://books.toscrape.com/catalogue/the-exiled_247/index.html
http://books.toscrape.com/catalogue/when-we-collided_955/index.html
http://books.toscrape.com/catalogue/set-me-free_988/index.html
Copy the following code into tutorial-project/tutorial/pages/books_toscrape_com.py:
from web_poet import field, handle_urls, WebPage

from ..items import Book


@handle_urls("books.toscrape.com")
class BookPage(WebPage[Book]):
    @field
    async def title(self):
        return self.css("h1::text").get()
In the code above:
You define a page object class named BookPage by subclassing WebPage.
    It is also possible to create a page object class by subclassing the simpler ItemPage class instead. However, WebPage:
        Indicates that your page object class requires an HTTP response as input, which gets stored in the response attribute of your page object class as an HttpResponse object.
        Provides attributes like html and url, and methods like css(), urljoin(), and xpath(), that make it easier to write parsing code.
BookPage declares Book as its return type.
    WebPage, like its parent class ItemPage, is a generic class that accepts a type parameter. Unlike most generic classes, however, the specified type parameter is used for more than type hinting: it determines the item class that is used to store the data that fields return.
BookPage is decorated with handle_urls(), which indicates for which domain BookPage is intended to work.
    It is possible to specify more specific URL patterns, instead of only the target URL domain. However, the URL domain and the output type (Book) are usually all the data needed to determine which page object class to use, which is the goal of the handle_urls() decorator.
BookPage defines a field named title.
    Fields are methods of page object classes, preferably async methods, decorated with field(). Fields define the logic to extract a specific piece of information from the input of your page object class.
    BookPage.title extracts the title of a book from a book details webpage. Specifically, it extracts the text from the first h1 element in the input HTTP response.
    Here, title is not an arbitrary name. It was chosen specifically to match Book.title, so that during parsing the value that BookPage.title returns gets mapped to Book.title.
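The name-based mapping between fields and item attributes can be pictured with the following toy sketch. It is not how web-poet is implemented: toy_field, ToyBookPage and the naive string-based h1 extraction are all hypothetical stand-ins, written with the standard library only, to show why a field called title ends up in the title attribute of the item.

```python
import asyncio
from dataclasses import dataclass


@dataclass
class Book:
    title: str


def toy_field(method):
    """Mark a method as a field (hypothetical stand-in for field())."""
    method._is_toy_field = True
    return method


class ToyBookPage:
    item_cls = Book  # plays the role of the declared return type

    def __init__(self, html):
        self.html = html

    @toy_field
    async def title(self):
        # Naive extraction of the first <h1>; real code would use css().
        start = self.html.index("<h1>") + len("<h1>")
        end = self.html.index("</h1>", start)
        return self.html[start:end].strip()

    async def to_item(self):
        # Call every marked method and pass the results as keyword arguments,
        # so that the field name selects the item attribute to fill.
        values = {}
        for name in dir(type(self)):
            attribute = getattr(type(self), name)
            if getattr(attribute, "_is_toy_field", False):
                values[name] = await getattr(self, name)()
        return self.item_cls(**values)


page = ToyBookPage("<html><body><h1>The Exiled</h1></body></html>")
item = asyncio.run(page.to_item())
print(item)  # Book(title='The Exiled')
```

If the method were renamed to, say, name, the toy to_item() would call Book(name=...) and fail, which illustrates why the field name must match the item attribute.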
Use your page object class
Now that you have a page object class defined, it is time to use it.
First, install requests, which is required by web_poet.example.
Then copy the following code into tutorial-project/run.py:
from web_poet import consume_modules
from web_poet.example import get_item

from tutorial.items import Book

consume_modules("tutorial.pages")

item = get_item(
    "http://books.toscrape.com/catalogue/the-exiled_247/index.html",
    Book,
)
print(item)
Execute that code:
python tutorial-project/run.py
And the print(item) statement should output the following:
Book(title='The Exiled')
In this tutorial you use web_poet.example.get_item, which is a simple, incomplete implementation of the web-poet specification, built specifically for this tutorial, for demonstration purposes. In real projects, use an actual web-poet framework instead.
web_poet.example.get_item serves to illustrate the power of web-poet: once you have defined your page object class, a web-poet framework only needs 2 inputs from you:
the URL from which you want to extract data, and
the desired output, either a page object class or, in this case, an item class.
Notice that you must also call consume_modules() once before your first call to get_item. consume_modules ensures that the specified Python modules are loaded. You pass consume_modules the import paths of the modules where your page object classes are defined. After loading those modules, handle_urls() decorators register the page object classes that they decorate into web_poet.default_registry, which get_item uses to determine which page object class to use based on its input parameters (URL and item class).
Your web-poet framework can take care of everything else:
It matches the input URL and item class to BookPage, based on the URL pattern that you defined with the handle_urls() decorator, and the return type that you declared in the page object class (Book).
It inspects the inputs declared by BookPage, and builds an instance of BookPage with the required inputs.
    BookPage is a WebPage subclass, and WebPage declares an attribute named response of type HttpResponse. Your web-poet framework sees this, creates an HttpResponse object by downloading the input URL, and assigns that object to the response attribute of a new BookPage object.
It builds the output item, Book(title='The Exiled'), using the to_item() method of BookPage, inherited from ItemPage, which in turn uses all fields of BookPage to create an instance of Book, which you declared as the return type of BookPage.
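The registration and lookup described above can be pictured with a toy registry. This is only a sketch under simplifying assumptions: TOY_REGISTRY, toy_handle_urls and find_page_cls are hypothetical names, matching is done on the exact domain only, and none of it reflects the actual data structures of web_poet.default_registry.

```python
from dataclasses import dataclass
from urllib.parse import urlparse


@dataclass
class Book:
    title: str


# Maps (domain, item class) to the page object class registered for them.
TOY_REGISTRY = {}


def toy_handle_urls(domain):
    """Hypothetical stand-in for handle_urls(): register the decorated class."""

    def decorator(page_cls):
        TOY_REGISTRY[(domain, page_cls.item_cls)] = page_cls
        return page_cls

    return decorator


# Registration happens as a side effect of executing the class definition,
# that is, when the module that contains it is imported.
@toy_handle_urls("books.toscrape.com")
class ToyBookPage:
    item_cls = Book


def find_page_cls(url, item_cls):
    """Pick the page object class for a URL and a desired item class."""
    domain = urlparse(url).netloc
    return TOY_REGISTRY.get((domain, item_cls))


url = "http://books.toscrape.com/catalogue/the-exiled_247/index.html"
print(find_page_cls(url, Book))
```

Because the decorator only runs when its module is imported, a registry like this stays empty until something loads the modules with page object classes, which is the role that consume_modules() plays in run.py.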
Extend and override your code
To continue this tutorial, you will need extended versions of Book and BookPage, with additional fields. However, rather than editing the existing Book and BookPage classes, you will see how you can instead create new classes that inherit them.
Append the following code to tutorial-project/tutorial/items.py:
from typing import Optional


@define
class CategorizedBook(Book):
    category: str
    category_rank: Optional[int] = None
The code above defines a new item class, CategorizedBook, that inherits the title attribute from Book and defines 2 more attributes: category and category_rank.
Append the following code to tutorial-project/tutorial/pages/books_toscrape_com.py:
from web_poet import Returns

from ..items import CategorizedBook


@handle_urls("books.toscrape.com")
class CategorizedBookPage(BookPage, Returns[CategorizedBook]):
    @field
    async def category(self):
        return self.css(".breadcrumb a::text").getall()[-1]
In the code above:
You define a new page object class: CategorizedBookPage.
CategorizedBookPage subclasses BookPage, inheriting its title field, and defining a new one: category.
    CategorizedBookPage does not define a category_rank field yet; you will add it later on. For now, category_rank takes the default value defined in CategorizedBook, None.
CategorizedBookPage indicates that it returns a CategorizedBook object.
    WebPage is a generic class, which is why you could use WebPage[Book] in the definition of BookPage to indicate Book as the output type of BookPage. However, BookPage is not a generic class, so something like BookPage[CategorizedBook] would not work.
    So instead you use Returns, a special, generic class that you can inherit to re-define the output type of your page object subclasses.
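The following standard-library sketch illustrates the two points above: why subscripting a class that is no longer generic fails, and how an extra generic base class can carry a new output type. ToyReturns, ToyWebPage and output_type are hypothetical stand-ins written for this illustration; web-poet's own Returns may be implemented differently.

```python
from typing import Generic, TypeVar, get_args, get_origin

ItemT = TypeVar("ItemT")


class ToyReturns(Generic[ItemT]):
    """Hypothetical stand-in for Returns: only carries the output type."""


class ToyWebPage(ToyReturns[ItemT]):
    """Hypothetical generic base page, parametrized like WebPage[Book]."""


class Book:
    pass


class CategorizedBook(Book):
    pass


class BookPage(ToyWebPage[Book]):  # the type parameter is now fixed to Book
    pass


class CategorizedBookPage(BookPage, ToyReturns[CategorizedBook]):
    pass


def output_type(page_cls):
    """Return the first concrete type parameter found along the MRO."""
    for klass in page_cls.__mro__:
        for base in klass.__dict__.get("__orig_bases__", ()):
            origin = get_origin(base)
            if isinstance(origin, type) and issubclass(origin, ToyReturns):
                (argument,) = get_args(base)
                if not isinstance(argument, TypeVar):
                    return argument
    return None


print(output_type(BookPage).__name__)  # Book
print(output_type(CategorizedBookPage).__name__)  # CategorizedBook

# BookPage has no free type parameter left, so it cannot be subscripted.
try:
    BookPage[CategorizedBook]
except TypeError as error:
    print(f"BookPage[CategorizedBook] fails: {error}")
```

In this sketch the subclass wins because its own bases are inspected before those of BookPage, so the type given to ToyReturns overrides the one inherited from ToyWebPage[Book].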
After you update your tutorial-project/run.py script to request a CategorizedBook item:
from web_poet import consume_modules
from web_poet.example import get_item

from tutorial.items import CategorizedBook

consume_modules("tutorial.pages")

item = get_item(
    "http://books.toscrape.com/catalogue/the-exiled_247/index.html",
    CategorizedBook,
)
print(item)
And you execute it again:
python tutorial-project/run.py
You can see in the new output that your new classes have been used:
CategorizedBook(title='The Exiled', category='Mystery', category_rank=None)
Use additional requests
To extract data about an item, sometimes the HTTP response to a single URL is not enough. Sometimes, you need additional HTTP responses to get all the data that you want. That is the case with the category_rank attribute.
The category_rank attribute indicates the position in which a book appears in the list of books of the category of that book. For example, The Exiled is 24th in the Mystery category, so the value of category_rank should be 24 for that book.
However, there is no indication of this value in the book details page. To get this value, you need to visit the URL of the category of the book whose data you are extracting, find the entry of that book within the grid of books of the category, and record in which position you found it. And categories with more than 20 books are split into multiple pages, so you may need more than 1 additional request for some books.
Extend CategorizedBookPage in tutorial-project/tutorial/pages/books_toscrape_com.py as follows:
from attrs import define
from web_poet import HttpClient, Returns

from ..items import CategorizedBook


@handle_urls("books.toscrape.com")
@define
class CategorizedBookPage(BookPage, Returns[CategorizedBook]):
    http: HttpClient
    _books_per_page = 20

    @field
    async def category(self):
        return self.css(".breadcrumb a::text").getall()[-1]

    @field
    async def category_rank(self):
        response, book_url, page = self.response, self.url, 0
        category_page_url = self.css(".breadcrumb a::attr(href)").getall()[-1]
        while category_page_url:
            category_page_url = response.urljoin(category_page_url)
            response = await self.http.get(category_page_url)
            urls = response.css("h3 a::attr(href)").getall()
            for position, url in enumerate(urls, start=1):
                url = str(response.urljoin(url))
                if url == book_url:
                    return page * self._books_per_page + position
            category_page_url = response.css(".next a::attr(href)").get()
            if not category_page_url:
                return None
            page += 1
In the code above:
You declare a new input in CategorizedBookPage, http, of type HttpClient.
You also add the @attrs.define decorator to CategorizedBookPage, as it is required when adding new required attributes to subclasses of attrs classes.
You define the category_rank field so that it uses the http input object to send additional requests to find the position of the current book within its category.
    Specifically:
        You extract the category URL from the book details page.
        You visit that category URL, and you iterate over the listed books until you find one with the same URL as the current book.
        If you find a match, you return the position at which you found the book.
        If there is no match, and there is a next page, you repeat the previous step with the URL of that next page as the category URL.
        If at some point there are no more “next” pages and you have not yet found the book, you return None.
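The search steps above can be tried without any network access. The sketch below replaces the downloaded category pages with in-memory lists of URLs; find_rank and the example.com URLs are hypothetical, and only the rank arithmetic (page index times books per page, plus position within the page) mirrors the page object code.

```python
BOOKS_PER_PAGE = 20


def find_rank(category_pages, book_url, books_per_page=BOOKS_PER_PAGE):
    """Return the 1-based rank of book_url within a paginated category.

    category_pages is a list of pages, each a list of book URLs, standing in
    for the listing pages that the real code downloads one at a time.
    """
    for page, urls in enumerate(category_pages):
        for position, url in enumerate(urls, start=1):
            if url == book_url:
                return page * books_per_page + position
    return None  # no more "next" pages and the book was not found


# Hypothetical category with 2 listing pages: 20 books, then 12 books.
pages = [
    [f"http://example.com/book-{n}" for n in range(1, 21)],
    [f"http://example.com/book-{n}" for n in range(21, 33)],
]
print(find_rank(pages, "http://example.com/book-24"))  # 24
print(find_rank(pages, "http://example.com/missing"))  # None
```

A book found in 4th position on the second page (page index 1) gets rank 1 * 20 + 4 = 24, which matches the kind of value expected for The Exiled.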
When you execute tutorial-project/run.py now, category_rank has the expected value:
CategorizedBook(title='The Exiled', category='Mystery', category_rank=24)
Use parameters
You may notice that the execution takes longer now. That is because CategorizedBookPage now requires 2 or more requests, to find the value of the category_rank attribute.
If you use CategorizedBookPage as part of a web scraping project that targets a single book URL, it cannot be helped. If you want to extract the category_rank attribute, you need those additional requests. Your only option to avoid additional requests is to stop extracting the category_rank attribute.
However, if your web scraping project is targeting all book URLs from one or more categories by visiting those category URLs, extracting book URLs from them, and then using CategorizedBookPage with those book URLs as input, there is something you can change to save many requests: keep track of the positions where you find books as you visit their categories, and pass that position to CategorizedBookPage as additional input.
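As a rough sketch of the bookkeeping side of this idea, the function below walks over category listing pages that have already been parsed into lists of book URLs, and records the position of each book as a set of parameters to send along with its URL. The page_params_by_book name and the example.com URLs are hypothetical; how you obtain the listing pages depends on your web scraping project.

```python
def page_params_by_book(category_pages):
    """Map each book URL to the parameters to pass along with it.

    category_pages is a list of listing pages of a single category, in
    order, each given as the list of book URLs found on that page.
    """
    params = {}
    rank = 0
    for urls in category_pages:
        for url in urls:
            rank += 1  # running position across all pages of the category
            params[url] = {"category_rank": rank}
    return params


# Hypothetical category with 3 books spread over 2 listing pages.
listing = [
    ["http://example.com/book-a", "http://example.com/book-b"],
    ["http://example.com/book-c"],
]
for url, params in page_params_by_book(listing).items():
    print(url, params)
```

Each URL can then be requested together with its parameters, so that the page object does not need to rediscover a position that your crawling code already knows.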
Extend CategorizedBookPage in tutorial-project/tutorial/pages/books_toscrape_com.py as follows:
from attrs import define
from web_poet import HttpClient, PageParams, Returns

from ..items import CategorizedBook


@handle_urls("books.toscrape.com")
@define
class CategorizedBookPage(BookPage, Returns[CategorizedBook]):
    http: HttpClient
    page_params: PageParams
    _books_per_page = 20

    @field
    async def category(self):
        return self.css(".breadcrumb a::text").getall()[-1]

    @field
    async def category_rank(self):
        category_rank = self.page_params.get("category_rank")
        if category_rank is not None:
            return category_rank
        response, book_url, page = self.response, self.url, 0
        category_page_url = self.css(".breadcrumb a::attr(href)").getall()[-1]
        while category_page_url:
            category_page_url = response.urljoin(category_page_url)
            response = await self.http.get(category_page_url)
            urls = response.css("h3 a::attr(href)").getall()
            for position, url in enumerate(urls, start=1):
                url = str(response.urljoin(url))
                if url == book_url:
                    return page * self._books_per_page + position
            category_page_url = response.css(".next a::attr(href)").get()
            if not category_page_url:
                return None
            page += 1
In the code above, you declare a new input in CategorizedBookPage, page_params, of type PageParams. It is a dictionary of parameters that you may receive from the code using your page object class.
In the category_rank field, you check if you have received a parameter also called category_rank, and if so, you return that value instead of using additional requests to find the value.
You can now update your tutorial-project/run.py script to pass that parameter to get_item:
item = get_item(
    "http://books.toscrape.com/catalogue/the-exiled_247/index.html",
    CategorizedBook,
    page_params={"category_rank": 24},
)
When you execute tutorial-project/run.py now, execution should take less time, but the result should be the same as before:
CategorizedBook(title='The Exiled', category='Mystery', category_rank=24)
The difference is that the value of category_rank now comes from tutorial-project/run.py, and not from additional requests sent by CategorizedBookPage.