.. _from-ground-up: ================== From the ground up ================== Learn why and how web-poet came to be as you transform a simple, rigid starting web scraping code snippet into maintainable, reusable web-poet code. Writing reusable parsing code ============================= Imagine you are writing code to scrape a book web page from `books.toscrape.com `_, and you implement a ``scrape`` function like this: .. code-block:: python import requests from parsel import Selector def scrape(url: str) -> dict: response = requests.get(url) selector = Selector(response.text) return { "url": response.url, "title": selector.css("h1").get(), } item = scrape( "http://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html" ) This ``scrape`` function is simple, but it has a big issue: it only supports downloading the specified URL using the requests_ library. What if you want to use aiohttp_, for concurrency support? What if you want to run ``scrape`` with a local snapshot of a URL response, to write an automated test for ``scrape`` that does not rely on a network connection? .. _aiohttp: https://github.com/aio-libs/aiohttp .. _requests: https://requests.readthedocs.io/en/latest/ The first step towards addressing this issue is to split your ``scrape`` function into 2 separate functions, ``download`` and ``parse``: .. code-block:: python import requests from parsel import Selector def parse(response: requests.Response) -> dict: selector = Selector(response.text) return { "url": response.url, "title": selector.css("h1").get(), } def download(url: str) -> requests.Response: return requests.get(url) url = "http://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html" response = download(url) item = parse(response) Now that ``download`` and ``parse`` are separate functions, you can replace ``download`` with an alternative implementation that uses aiohttp_, or that reads from local files. There is still an issue, though: ``parse`` expects an instance of `requests.Response`_. Any alternative implementation of ``download`` would need to create a response object of the same type, forcing a dependency on requests_ even if downloads are handled with a different library. .. _requests.Response: https://requests.readthedocs.io/en/latest/api/#requests.Response So you need to change the input of the ``parse`` function into something that will not tie you to a specific download library. One option is to create your own, download-independent ``Response`` class, to store the response data that any download function should be able to provide: .. code-block:: python import requests from dataclasses import dataclass from parsel import Selector @dataclass class Response: url: str text: str def parse(response: Response) -> dict: selector = Selector(response.text) return { "url": response.url, "title": selector.css("h1").get(), } def download(url: str) -> Response: response = requests.get(url) return Response(url=response.url, text=response.text) url = "http://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html" response = download(url) item = parse(response) The ``parse`` function is no longer tied to any specific download library, and alternative versions of the ``download`` function can be implemented with other libraries. Parsing with web-poet ===================== web-poet asks you to organize your code in a very similar way. Let’s convert the ``parse`` function into a :ref:`web-poet page object class `: .. code-block:: python import requests from web_poet import Injectable, HttpResponse class BookPage(Injectable): def __init__(self, response: HttpResponse): self.response = response def to_item(self) -> dict: return { "url": self.response.url, "title": self.response.css("h1").get(), } def download(url: str) -> Response: response = requests.get(url) return HttpResponse( url=response.url, body=response.content, headers=response.headers, ) url = "http://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html" response = download(url) book_page = BookPage(response=response) item = book_page.to_item() Differences from a previous example: - web-poet provides a standard :class:`~.HttpResponse` class, with helper methods like :meth:`~.HttpResponse.css`. Note how headers are passed when creating an :class:`~.HttpResponse` instance. This is needed to properly decode the body (which is ``bytes``) as text using web browser rules. It involves checking the ``Content-Encoding`` header, HTML meta tags, BOM markers in the body, etc. - Instead of the ``parse`` function we've got a ``BookPage`` class, which inherits from the :class:`~.Injectable` base class, receives response data in its ``__init__`` method, and returns the extracted item in the ``to_item()`` method. ``to_item`` is a standard method name used by ``web-poet``. Receiving a ``response`` argument in ``__init__`` is very common for page objects, so ``web-poet`` provides a shortcut for it: inherit from :class:`~.WebPage`, which provides this ``__init__`` method implementation. You can then refactor your ``BookPage`` class as follows: .. code-block:: python from web_poet import WebPage class BookPage(WebPage): def to_item(self) -> dict: return { "url": self.response.url, "title": self.response.css("h1").get(), } :class:`~.WebPage` even provides shortcuts for some response attributes and methods: .. code-block:: python from web_poet import WebPage class BookPage(WebPage): def to_item(self) -> dict: return { "url": self.url, "title": self.css("h1").get(), } At this point you may be wondering why web-poet requires you to write a class with a ``to_item`` method rather than a function. The answer is flexibility. For example, the use of a class instead of a function makes :ref:`fields ` possible, which make parsing code easier to read: .. code-block:: python from web_poet import WebPage, field class BookPage(WebPage): @field def url(self): return self.url @field def title(self): return self.css("h1").get() Using fields also makes it unnecessary to define ``to_item()`` manually, and allows reading individual fields when you don't need the complete ``to_item()`` output. .. note:: The ``BookPage.to_item()`` method is ``async`` in the example above. See :ref:`fields` for more information. Using classes also makes it easy, for example, to implement dependency injection, which is how web-poet builds :ref:`inputs `. Downloading with web-poet ========================= What about the implementation of the ``download`` function? How would you implement that in web-poet? Well, ideally, you wouldn’t. To parse data from a web page using web-poet, you would only need to write the parsing part, e.g. the ``BookPage`` :ref:`page object class ` above. Then, you let a :ref:`web-poet framework ` handle the download part for you. You pass that framework the URL of a web page to parse, and either a page object class (the ``BookPage`` class here) or an :ref:`item class `, and that's it: .. code-block:: python item = some_framework.get(url, BookPage) web-poet does *not* provide any framework, beyond :ref:`an example one featured in the tutorial ` and not intended for production. The role of web-poet is to define a specification on how to write parsing logic so that it can be reused with different frameworks. :ref:`Page object classes ` should be flexible enough to be used with very different frameworks, including: - synchronous or asynchronous frameworks - asynchronous frameworks based on callbacks or based on coroutines_ (``async def / await`` syntax) .. _coroutines: https://docs.python.org/3/library/asyncio-task.html - single-node and distributed systems - different underlying HTTP implementations, or even implementations with no HTTP support at all