.. _web-poet-testing:

======================
Tests for page objects
======================

Page Objects that inherit from :class:`~.ItemPage` can be tested by saving the
dependencies needed to create one and the result of
:meth:`~web_poet.pages.ItemPage.to_item`, recreating the Page Object from the
dependencies, running its :meth:`~web_poet.pages.ItemPage.to_item` and
comparing the result to the saved one. ``web-poet`` provides the following
tools for this:

* dependency serialization into a Python object and into a set of files;
* recreating Page Objects from the serialized dependencies;
* a high-level function to save a test fixture;
* a plugin for ``pytest 7.0.0`` and higher that discovers fixtures and runs
  tests for them.

.. _dep-serialization:

Serialization
=============

:func:`web_poet.serialization.serialize` can be used to serialize an iterable
of Page Object dependencies to a Python object.
:func:`web_poet.serialization.deserialize` can be used to recreate a Page
Object from this serialized data.

An instance of :class:`web_poet.serialization.SerializedDataFileStorage` can be
used to write the serialized data to a set of files in a given directory and to
read it back.

.. note::
    We only support serializing dependencies, not Page Object instances,
    because the only universal way to recreate a Page Object is from its
    dependencies, not from some saved internal state.

Each dependency is serialized to one or several ``bytes`` objects, each of
which is saved as a single file. :func:`web_poet.serialization.serialize_leaf`
and :func:`web_poet.serialization.deserialize_leaf` are used to convert between
a dependency and this set of ``bytes`` objects. They are implemented using
:func:`functools.singledispatch`. The types provided by ``web-poet`` are
supported out of the box, while user-defined types need a pair of
implementation functions registered with
:func:`web_poet.serialization.register_serialization`.
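
The dispatch mechanism can be illustrated with the standard library alone. The
following is a minimal sketch of the idea, not the ``web-poet`` implementation:
``Price``, ``to_leaf`` and ``price_from_leaf`` are hypothetical names, and the
file-like dictionary keys are an assumption made for illustration::

    from dataclasses import dataclass
    from functools import singledispatch
    from typing import Dict


    @dataclass
    class Price:
        """Hypothetical user-defined dependency type."""

        amount: str
        currency: str


    @singledispatch
    def to_leaf(obj) -> Dict[str, bytes]:
        # Fallback for types that have no registered implementation.
        raise NotImplementedError(f"Cannot serialize {type(obj).__name__}")


    @to_leaf.register
    def _(obj: Price) -> Dict[str, bytes]:
        # In this sketch each key stands for one file and each value for
        # the content of that file.
        return {
            "amount.txt": obj.amount.encode(),
            "currency.txt": obj.currency.encode(),
        }


    def price_from_leaf(data: Dict[str, bytes]) -> Price:
        # Inverse operation: rebuild the object from the named byte strings.
        return Price(
            amount=data["amount.txt"].decode(),
            currency=data["currency.txt"].decode(),
        )


    original = Price(amount="9.99", currency="EUR")
    restored = price_from_leaf(to_leaf(original))

A round trip through both functions should give back an equal object, which is
the property a real serializer and deserializer pair also needs to satisfy.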

.. _fixtures:

Fixtures
========

The provided ``pytest`` plugin expects fixtures in a certain layout. A set of
fixtures for a single Page Object should be contained in a directory named
after the fully qualified class name of that Page Object. Each fixture is a
directory inside it that contains data for the Page Object inputs and output::

    fixtures
    └── my_project.pages.MyItemPage
        ├── test-1
        │   ├── inputs
        │   │   ├── HttpClient.exists
        │   │   ├── HttpResponse-body.html
        │   │   ├── HttpResponse-info.json
        │   │   └── ResponseUrl.txt
        │   ├── meta.json
        │   └── output.json
        └── test-2
            ├── inputs
            │   ├── HttpClient.exists
            │   ├── HttpClient-0-HttpRequest.info.json
            │   ├── HttpClient-0-HttpResponse.body.html
            │   ├── HttpClient-0-HttpResponse.info.json
            │   ├── HttpClient-1-HttpRequest.body.txt
            │   ├── HttpClient-1-HttpRequest.info.json
            │   ├── HttpClient-1-HttpResponse.body.html
            │   ├── HttpClient-1-HttpResponse.info.json
            │   ├── HttpResponse-body.html
            │   ├── HttpResponse-info.json
            │   └── ResponseUrl.txt
            ├── meta.json
            └── output.json

.. _fixture-save:

:func:`web_poet.testing.Fixture.save` can be used to create a fixture inside a
Page Object directory from an iterable of dependencies, an output item and an
optional metadata dictionary. It can optionally take a name for the fixture
directory. By default it uses incrementing names "test-1", "test-2" etc.
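
The default naming scheme can be sketched as follows. This is only an
illustration of "use the first free ``test-N`` name": ``next_fixture_name`` is
a hypothetical helper, and how ``web-poet`` picks the name internally is an
assumption::

    from pathlib import Path


    def next_fixture_name(type_dir: Path, template: str = "test-{}") -> str:
        """Return the first name of the form test-N that is not taken."""
        index = 1
        # Skip over names that already exist in the Page Object directory.
        while (type_dir / template.format(index)).exists():
            index += 1
        return template.format(index)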

.. note::
    ``output.json`` contains the result of ``page_object.to_item()`` converted
    to a dict using the itemadapter_ library and saved as JSON.

After generating a fixture you can edit ``output.json`` to modify expected
field values and add new fields, which is useful when creating tests for code
that isn't written yet or before modifying its behavior.

.. _web-poet-testing-scrapy-poet:

scrapy-poet integration
=======================

Projects that use the `scrapy-poet`_ library can use the :ref:`Scrapy command
<scrapy-poet:testing>` provided by it to generate fixtures in a convenient way.
It's available starting with scrapy-poet 0.8.0.

.. _scrapy-poet: https://github.com/scrapinghub/scrapy-poet

.. _web-poet-testing-pytest:

Running tests
=============

The provided ``pytest`` plugin is automatically registered when ``web-poet`` is
installed, and running ``python -m pytest`` in a directory containing fixtures
will discover them and run tests for them.

By default, the plugin generates:

* a test which checks that ``to_item()`` doesn't raise an exception
  (i.e. it can be executed),
* a test per each output attribute of the item,
* an additional test to check that there are no extra attributes in the output.

For example, if your item has 5 attributes, and you created 2 fixtures, pytest
will run (5+1+1)*2 = 14 tests. This allows failures for individual fields to
be reported separately.

If ``to_item`` raises an error, there is no point in running other tests,
so they're skipped in this case.

If you prefer less granular test failure reporting, you can use pytest with
the ``--web-poet-test-per-item`` option::

    python -m pytest --web-poet-test-per-item

In this case there is going to be a single test per fixture: if the result
is not fully correct, the test fails. So, following the previous example,
it'd be 2 tests instead of 14.
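
The arithmetic for both modes can be written down as a small function. This is
a sketch that assumes every fixture produces an item with the same number of
fields; ``expected_test_count`` is a hypothetical helper, not part of
``web-poet``::

    def expected_test_count(
        fields: int, fixtures: int, per_item: bool = False
    ) -> int:
        """Number of tests pytest runs for a set of fixtures."""
        if per_item:
            # --web-poet-test-per-item: a single test per fixture.
            return fixtures
        # One test per field, plus the to_item() check and the check for
        # extra attributes.
        return (fields + 1 + 1) * fixtures


    default_count = expected_test_count(fields=5, fixtures=2)  # 14
    per_item_count = expected_test_count(5, 2, per_item=True)  # 2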

.. _web-poet-testing-tdd:

Test-Driven Development
=======================

You can follow the TDD (Test-Driven Development) approach to develop your
page objects. To do so:

1. Generate a fixture (see :ref:`web-poet-testing-scrapy-poet`).
2. Populate ``output.json`` with the correct expected output.
3. Run the tests (see :ref:`web-poet-testing-pytest`) and update the code
   until all tests pass. It's convenient to use web-poet :ref:`fields`,
   and implement extraction field-by-field, because you'll be getting
   an additional test passing after each field is implemented.

This approach allows a fast feedback loop: there is no need to download the
page multiple times, and you have a clear progress indication for your work
(number of failing tests remaining). Also, in the end you get
a regression test, which can be helpful later.

Sometimes it may be awkward to set the correct value in JSON before starting
the development, especially if a value is large or has a complex structure.
For example, this could be the case for an e-commerce product description field,
which can be hard to copy-paste from the website, and which may have various
whitespace normalization rules which you need to apply.

In this case, it may be more convenient to implement the extraction first,
and only then populate the ``output.json`` file with the correct value.

In this case you can use the
``python -m web_poet.testing rerun <fixture_path>`` command to re-run the page
object using the inputs saved in a fixture. This command prints the output of
the page object as JSON; you can then copy-paste the relevant parts to the
``output.json`` file. It's also possible to make the command print only some
of the fields. For example, you might run the following command after
implementing extraction for the "description" and "descriptionHtml" fields in
``my_project.pages.MyItemPage``::

    python -m web_poet.testing rerun \
        fixtures/my_project.pages.MyItemPage/test-1 \
        --fields description,descriptionHtml

It may output something like this::

    {
        "description": "..description of the product..",
        "descriptionHtml": "<p>...</p>"
    }

If these values look good, you can copy them to the
``fixtures/my_project.pages.MyItemPage/test-1/output.json`` file.
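
Copying the values can also be scripted. The following is a minimal sketch
using only the standard library; ``update_expected_output`` is a hypothetical
helper, and the JSON formatting it writes (indentation, key order) is an
assumption that may differ from what ``web-poet`` produces::

    import json
    from pathlib import Path
    from typing import Any, Dict


    def update_expected_output(
        output_path: Path, new_values: Dict[str, Any]
    ) -> None:
        """Merge reviewed field values into an existing output.json."""
        expected = json.loads(output_path.read_text(encoding="utf-8"))
        # Overwrite or add only the reviewed fields, keep the rest as is.
        expected.update(new_values)
        output_path.write_text(
            json.dumps(expected, ensure_ascii=False, indent=2, sort_keys=True),
            encoding="utf-8",
        )

You would call it with the path to ``output.json`` and the dictionary printed
by the ``rerun`` command, after checking that the printed values are correct.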

.. _web-poet-testing-frozen_time:

Handling time fields
====================

Sometimes the output of a page object might depend on the current time. For
example, the item may contain the scraping datetime, or a current timestamp may
be used to build some URLs. A test for such a page object will break when it
runs at a different time. To avoid this, :ref:`the metadata dictionary
<fixture-save>` can contain a ``frozen_time`` field set to the time value used
when generating the test. This will instruct the test runner to use the same
time value so that field comparisons are still correct.

The value can be any string understood by `dateutil`_. If it doesn't include
timezone information, the local time of the machine will be assumed. If it
includes timezone information, on non-Windows systems the test process will be
executed in that timezone, so that output fields that contain local time are
correct. On Windows systems (where changing the process timezone is not
possible) the time value will be converted to the local time of the machine,
and such fields will contain wrong data if these timezones don't match.
Consider an example page object::

    import datetime
    from web_poet import WebPage, validates_input

    class DateItemPage(WebPage):
        @validates_input
        async def to_item(self) -> dict:
            # e.g. 2001-01-01 11:00:00 +00
            now = datetime.datetime.now(datetime.timezone.utc)
            return {
                # '2001-01-01T11:00:00Z'
                "time_utc": now.strftime("%Y-%m-%dT%H:%M:%SZ"),
                # if the current timezone is CET, then '2001-01-01T12:00:00+0100'
                "time_local": now.astimezone().strftime("%Y-%m-%dT%H:%M:%S%z"),
            }

We will assume that the fixture was generated in CET (UTC+1).

* If the fixture doesn't have the ``frozen_time`` metadata field, the item will
  simply contain the current time and the test will always fail.
* If ``frozen_time`` doesn't contain the timezone data (e.g. it is
  ``2001-01-01T11:00:00``), the item will depend on the machine timezone: in
  CET it will contain the expected values, in timezones with a different offset
  ``time_local`` will be different.
* If ``frozen_time`` contains the timezone data and the system is not Windows,
  the ``time_local`` field will contain the date in that timezone, so if the
  timezone in ``frozen_time`` is not UTC+1, the test will fail.
* If the system is Windows, the ``frozen_time`` value will be converted to the
  machine timezone, so the item will depend on that timezone, just like when
  ``frozen_time`` doesn't contain the timezone data, and ``time_local`` will
  similarly be only correct if the machine timezone has the same offset as CET.

This means that most combinations of setups will work if ``frozen_time``
contains the timezone data, except for running tests on Windows, in which case
the machine timezone should match the timezone in ``frozen_time``. Also, if
items do not depend on the machine timezone (e.g. if all datetime-derived data
they contain is in UTC), the tests for them should work everywhere.
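
The dependency on the machine timezone can be reproduced with fixed-offset
timezones from the standard library. This is a minimal sketch that reuses the
instant from the example; the UTC-5 machine is a hypothetical test
environment::

    from datetime import datetime, timedelta, timezone

    # The frozen instant from the example: 2001-01-01 11:00:00 UTC.
    frozen = datetime(2001, 1, 1, 11, 0, 0, tzinfo=timezone.utc)

    cet = timezone(timedelta(hours=1))  # where the fixture was generated
    utc_minus_5 = timezone(timedelta(hours=-5))  # hypothetical test machine

    # Independent of the machine timezone: '2001-01-01T11:00:00Z'
    time_utc = frozen.strftime("%Y-%m-%dT%H:%M:%SZ")

    # Depends on the timezone: '2001-01-01T12:00:00+0100'
    local_in_cet = frozen.astimezone(cet).strftime("%Y-%m-%dT%H:%M:%S%z")

    # Same instant, different string: '2001-01-01T06:00:00-0500'
    local_in_utc_minus_5 = frozen.astimezone(utc_minus_5).strftime(
        "%Y-%m-%dT%H:%M:%S%z"
    )

The UTC string is identical everywhere, while the local string only matches
the saved output on machines with the same offset as the generating one.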

There is an additional limitation which we plan to fix in future versions. The
time is set to the ``frozen_time`` value when the test generation (if using the
``scrapy-poet`` command) or the test run starts, but it ticks during the
generation/run itself, so if it takes more than 1 second (which is quite
possible even in simple cases) the time fields will have values several seconds
later than ``frozen_time``. For now we recommend working around this problem by
manually editing the ``output.json`` file to put the value equal to
``frozen_time`` in these fields, as running the test shouldn't take more than 1
second.

.. _dateutil: https://github.com/dateutil/dateutil

.. _git-lfs:

Storing fixtures in Git
=======================

Fixtures can take a lot of disk space, as they usually include page responses
and may include other large files, so we recommend using `Git LFS`_ when
storing them in Git repos to reduce the repo space and get other performance
benefits. Even if your fixtures are currently small, it may be useful to do
this from the beginning, as migrating files to LFS is not easy and requires
rewriting the repo history.

To use Git LFS you need a Git hosting provider that supports it. Major
providers and software (e.g. GitHub, Bitbucket, GitLab) do, and there are also
`implementations`_ for standalone Git servers.

Assuming you store the fixtures in the directory named "fixtures" in the repo
root, the workflow should be as follows. Enable normal diffs for LFS files in
this repo::

  git config diff.lfs.textconv cat

Enable LFS for the fixtures directory before committing anything in it::

  git lfs track "fixtures/**"

Commit the ``.gitattributes`` file (which stores the tracking information)::

  git add .gitattributes
  git commit

After generating the fixtures just commit them as usual::

  git add fixtures/my_project.pages.MyItemPage/test-1
  git commit

After this all usual commands including ``push``, ``pull`` or ``checkout``
should work as expected on these files.

Please also check the official Git LFS documentation for more information.

.. _Git LFS: https://git-lfs.com/
.. _implementations: https://github.com/git-lfs/git-lfs/wiki/Implementations

.. _web-poet-testing-additional-requests:

Additional requests support
===========================

If the page object uses the :class:`~.HttpClient` dependency to make
:ref:`additional requests <additional-requests>`, the generated fixtures will
contain these requests and their responses (or exceptions raised when the
response is not received). When the test runs, :class:`~.HttpClient` will
return the saved responses without doing actual requests.

Currently requests are compared by their URL, method, headers and body, so if a
page object makes requests that differ between runs, the test won't be able to
find a saved response and will fail.
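
The consequence of exact matching can be shown with a small lookup table. This
is a sketch of the general idea, not the ``web-poet`` implementation:
``request_key``, ``find_saved_response`` and the example URLs are hypothetical,
and no normalization of the compared parts is assumed::

    from typing import Dict, Optional, Tuple

    RequestKey = Tuple[str, str, Tuple[Tuple[str, str], ...], bytes]


    def request_key(
        url: str, method: str, headers: Dict[str, str], body: bytes
    ) -> RequestKey:
        """Build a hashable key from the four compared request parts."""
        return (url, method, tuple(sorted(headers.items())), body)


    # Responses recorded when the fixture was generated.
    saved: Dict[RequestKey, bytes] = {
        request_key("https://example.com/api?page=1", "GET", {}, b""): b"[]",
    }


    def find_saved_response(
        url: str, method: str, headers: Dict[str, str], body: bytes
    ) -> Optional[bytes]:
        return saved.get(request_key(url, method, headers, body))


    # The same request as during generation finds its response.
    hit = find_saved_response("https://example.com/api?page=1", "GET", {}, b"")

    # A URL with a changing parameter, e.g. a timestamp, has no match.
    miss = find_saved_response(
        "https://example.com/api?page=1&ts=1700000000", "GET", {}, b""
    )

Page objects that put timestamps or random values into request URLs, headers
or bodies are the typical case where the lookup misses.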

Test coverage
=============

The coverage for page object code is reported correctly if tools such as
`coverage`_ are used when running web-poet tests.

.. _coverage: https://coverage.readthedocs.io/

.. _web-poet-testing-adapters:

Item adapters
=============

The testing framework uses the itemadapter_ library to convert items to dicts
when storing them in fixtures and when comparing the expected and the actual
output. As adapters may influence the resulting dicts, it's important to use
the same adapter when generating and running the tests.

It may also be useful to use different adapters in tests and in production. For
example, you may want to omit empty fields in production, but be able to
distinguish between empty and absent fields in tests.

For this you can set the ``adapter`` field in :ref:`the metadata dictionary
<fixture-save>` to the class that inherits from
:class:`itemadapter.ItemAdapter` and has the adapter(s) you want to use in
tests in its ``ADAPTER_CLASSES`` attribute (see `the relevant itemadapter
docs`_ for more information). An example::

    from collections import deque

    from itemadapter import ItemAdapter
    from itemadapter.adapter import DictAdapter


    class MyAdapter(DictAdapter):
        # any needed customization
        ...

    class MyItemAdapter(ItemAdapter):
        ADAPTER_CLASSES = deque([MyAdapter])

You can then put the ``MyItemAdapter`` class object into ``adapter`` and it
will be used by the testing framework.

If ``adapter`` is not set,
:class:`~web_poet.testing.itemadapter.WebPoetTestItemAdapter` will be used.
It works like :class:`itemadapter.ItemAdapter` but doesn't change behavior when
:attr:`itemadapter.ItemAdapter.ADAPTER_CLASSES` is modified.

.. _itemadapter: https://github.com/scrapy/itemadapter
.. _the relevant itemadapter docs: https://github.com/scrapy/itemadapter/#multiple-adapter-classes

.. _web-poet-testing-user-props:

pytest user properties
======================

After a test run the following `pytest user properties`_ are available:

* on per-field tests, the ``expected_value`` and ``actual_value`` properties
  contain JSON-encoded expected and actual field values;
* on expected exception tests, the ``expected_exception`` and
  ``actual_exception`` properties contain JSON-encoded dicts for expected and
  actual exceptions, with the ``import_path`` field containing the import path
  of the exception class and the ``msg`` field containing the first argument of
  the exception instance.

The main use case for this is generating a `JUnitXML report`_ and getting the
values from the ``/testsuites/testsuite/testcase/properties/property`` nodes.
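
Reading these nodes can be sketched with the standard library. The report
below is a hypothetical, shortened sample (the ``classname`` and ``name``
values are made up), and ``collect_properties`` is an illustrative helper, not
part of ``web-poet`` or ``pytest``::

    import json
    import xml.etree.ElementTree as ET

    REPORT = """\
    <testsuites>
      <testsuite name="pytest">
        <testcase classname="my_project.pages.MyItemPage.test-1" name="name">
          <properties>
            <property name="expected_value" value="&quot;Foo&quot;"/>
            <property name="actual_value" value="&quot;Bar&quot;"/>
          </properties>
        </testcase>
      </testsuite>
    </testsuites>
    """

    WEB_POET_PROPERTIES = {
        "expected_value",
        "actual_value",
        "expected_exception",
        "actual_exception",
    }


    def collect_properties(xml_text: str) -> dict:
        """Map (classname, test name) to the decoded property values."""
        root = ET.fromstring(xml_text)  # the <testsuites> element
        results = {}
        for case in root.findall("./testsuite/testcase"):
            decoded = {
                prop.get("name"): json.loads(prop.get("value"))
                for prop in case.findall("./properties/property")
                # Other plugins may add properties that are not JSON.
                if prop.get("name") in WEB_POET_PROPERTIES
            }
            if decoded:
                results[(case.get("classname"), case.get("name"))] = decoded
        return results


    report_values = collect_properties(REPORT)

With a real report you would pass the content of the file written by
``python -m pytest --junitxml=<path>`` instead of the inline sample.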

.. _pytest user properties: https://docs.pytest.org/en/stable/reference/reference.html#pytest.Item.user_properties
.. _JUnitXML report: https://docs.pytest.org/en/stable/how-to/output.html#creating-junitxml-format-files