From the ground up

Learn why and how web-poet came to be as you transform a simple, rigid web scraping snippet into maintainable, reusable web-poet code.

Writing reusable parsing code

Imagine you are writing code to scrape a book web page from books.toscrape.com, and you implement a scrape function like this:

import requests
from parsel import Selector


def scrape(url: str) -> dict:
    response = requests.get(url)
    selector = Selector(response.text)
    return {
        "url": response.url,
        "title": selector.css("h1").get(),
    }

item = scrape("http://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html")

This scrape function is simple, but it has a big issue: it only supports downloading the specified URL using the requests library. What if you want to use aiohttp, for concurrency support? What if you want to run scrape with a local snapshot of a URL response, to write an automated test for scrape that does not rely on a network connection?

The first step towards addressing this issue is to split your scrape function into 2 separate functions, download and parse:

import requests
from parsel import Selector


def parse(response: requests.Response) -> dict:
    selector = Selector(response.text)
    return {
        "url": response.url,
        "title": selector.css("h1").get(),
    }

def download(url: str) -> requests.Response:
    return requests.get(url)

url = "http://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html"
response = download(url)
item = parse(response)

Now that download and parse are separate functions, you can replace download with an alternative implementation that uses aiohttp, or that reads from local files.

There is still an issue, though: parse expects an instance of requests.Response. Any alternative implementation of download would need to create a response object of the same type, forcing a dependency on requests even if downloads are handled with a different library.

So you need to change the input of the parse function into something that will not tie you to a specific download library. One option is to create your own, download-independent Response class, to store the response data that any download function should be able to provide:

import requests
from dataclasses import dataclass
from parsel import Selector


@dataclass
class Response:
    url: str
    text: str


def parse(response: Response) -> dict:
    selector = Selector(response.text)
    return {
        "url": response.url,
        "title": selector.css("h1").get(),
    }


def download(url: str) -> Response:
    response = requests.get(url)
    return Response(url=response.url, text=response.text)


url = "http://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html"
response = download(url)
item = parse(response)

The parse function is no longer tied to any specific download library, and alternative versions of the download function can be implemented with other libraries.
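For example, a download alternative that reads a saved snapshot from disk needs nothing from requests. The sketch below reuses the Response class and parse function above; the helper name, file path and UTF-8 encoding are illustrative assumptions:

from pathlib import Path


def download_from_file(url: str, path: str) -> Response:
    # Build the same download-independent Response from a local snapshot,
    # assuming the file is UTF-8 encoded.
    return Response(url=url, text=Path(path).read_text(encoding="utf-8"))


response = download_from_file(url, "snapshots/a-light-in-the-attic.html")
item = parse(response)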

Parsing with web-poet

web-poet asks you to organize your code in a very similar way. Let’s convert the parse function into a web-poet page object class:

import requests
from web_poet import Injectable, HttpResponse


class BookPage(Injectable):
    def __init__(self, response: HttpResponse):
        self.response = response

    def to_item(self) -> dict:
        return {
            "url": self.response.url,
            "title": self.response.css("h1").get(),
        }


def download(url: str) -> HttpResponse:
    response = requests.get(url)
    return HttpResponse(
        url=response.url,
        body=response.content,
        headers=response.headers,
    )


url = "http://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html"
response = download(url)
book_page = BookPage(response=response)
item = book_page.to_item()

Differences from the previous example:

  • web-poet provides a standard HttpResponse class, with helper methods like css().

    Note how headers are passed when creating an HttpResponse instance. They are needed to properly decode the body (which is bytes) as text using web browser rules, which involve checking the Content-Type header, HTML meta tags, BOM markers in the body, etc. (see the sketch after this list).

  • Instead of the parse function we’ve got a BookPage class, which inherits from the Injectable base class, receives response data in its __init__ method, and returns the extracted item in the to_item() method. to_item is a standard method name used by web-poet.
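As a small, illustrative sketch (the URL and markup below are made up), both points can be seen at work: HttpResponse keeps the raw body as bytes and uses the headers, among other signals, to decode it as text for helper methods like css():

from web_poet import HttpResponse

response = HttpResponse(
    url="http://example.com/",
    body="<h1>Café</h1>".encode("utf-8"),
    headers={"Content-Type": "text/html; charset=utf-8"},
)

print(response.text)                   # body decoded using the declared charset
print(response.css("h1::text").get())  # "<h1>Café</h1>"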

Receiving a response argument in __init__ is very common for page objects, so web-poet provides a shortcut for it: inherit from WebPage, which provides this __init__ method implementation. You can then refactor your BookPage class as follows:

from web_poet import WebPage

class BookPage(WebPage):
    def to_item(self) -> dict:
        return {
            "url": self.response.url,
            "title": self.response.css("h1").get(),
        }

WebPage even provides shortcuts for some response attributes and methods:

from web_poet import WebPage

class BookPage(WebPage):
    def to_item(self) -> dict:
        return {
            "url": self.url,
            "title": self.css("h1").get(),
        }

At this point you may be wondering why web-poet requires you to write a class with a to_item method rather than a function. The answer is flexibility.

For example, the use of a class instead of a function makes fields possible, which make parsing code easier to read:

from web_poet import WebPage, field


class BookPage(WebPage):
    @field
    def url(self):
        return self.response.url

    @field
    def title(self):
        return self.css("h1").get()

Using fields also makes it unnecessary to define to_item() manually, and allows reading individual fields when you don’t need the complete to_item() output.

Note

The BookPage.to_item() method is async in the example above. See Fields for more information.
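For instance, with the fields-based BookPage above and an HttpResponse built as in the earlier download function, usage could look roughly like this (a sketch, run outside an event loop):

import asyncio

book_page = BookPage(response=response)

# Read a single field without building the full item:
title = book_page.title

# Or build the complete item; with fields, the default to_item() is a coroutine:
item = asyncio.run(book_page.to_item())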

Using classes also makes it easy, for example, to implement dependency injection, which is how web-poet frameworks build page object inputs.

Downloading with web-poet

What about the implementation of the download function? How would you implement that in web-poet? Well, ideally, you wouldn’t.

To parse data from a web page using web-poet, you would only need to write the parsing part, e.g. the BookPage page object class above.

Then, you let a web-poet framework handle the download part for you. You pass that framework the URL of a web page to parse, and either a page object class (the BookPage class here) or an item class, and that’s it:

item = some_framework.get(url, BookPage)

web-poet itself does not provide such a framework, beyond the example one featured in the tutorial, which is not intended for production use. The role of web-poet is to define a specification for writing parsing logic so that it can be reused across different frameworks.
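To make the division of labor concrete, here is a deliberately naive, hypothetical stand-in for such a framework: it only knows how to provide an HttpResponse, downloads it with requests, and runs the page object. A real framework resolves whatever inputs a page object declares, rather than hard-coding HttpResponse and requests:

import asyncio

import requests
from web_poet import HttpResponse


def get(url: str, page_cls) -> dict:
    # Download the page and wrap it in web-poet's standard response class.
    response = requests.get(url)
    http_response = HttpResponse(
        url=response.url,
        body=response.content,
        headers=response.headers,
    )
    # Build the page object and extract the item; with fields, to_item()
    # is a coroutine and needs to be awaited.
    page = page_cls(response=http_response)
    item = page.to_item()
    if asyncio.iscoroutine(item):
        item = asyncio.run(item)
    return item


item = get(
    "http://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html",
    BookPage,
)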

Page object classes should be flexible enough to be used with very different frameworks, including:

  • synchronous or asynchronous frameworks

  • asynchronous frameworks based on callbacks or based on coroutines (async def / await syntax)

  • single-node and distributed systems

  • different underlying HTTP implementations, or even implementations with no HTTP support at all