I'd like to implement some unit tests in Scrapy (screen scraper/web crawler). Since a project is run through the "scrapy crawl" command, I can run it through something like unittest, right?
Similar to Hadrien's answer but for pytest: pytest-vcr.
import requests
import pytest
from scrapy.http import HtmlResponse

from myproject.spiders import Spider  # import your own spider class here

@pytest.mark.vcr()
def test_parse(url, target):
    response = requests.get(url)
    scrapy_response = HtmlResponse(url, body=response.content)
    assert Spider().parse(scrapy_response) == target
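On the first run, pytest-vcr records the HTTP interaction to a cassette file and replays it on subsequent runs, keeping the test fast and deterministic. The url and target arguments must come from fixtures or parametrization; here is a minimal sketch using pytest's built-in parametrize (the URL and expected output are placeholders):

@pytest.mark.vcr()
@pytest.mark.parametrize("url, target", [
    ("https://example.com/page", [{"name": "Alex"}]),
])
def test_parse(url, target):
    ...  # same body as the test above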
You can follow this snippet from the Scrapy site to run it from a script. Then you can make any kind of assertions you'd like on the returned items.
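For illustration, a minimal sketch of that pattern; MySpider and the settings are placeholders, and items are gathered through Scrapy's item_scraped signal:

from scrapy import signals
from scrapy.crawler import CrawlerProcess

from myproject.spiders import MySpider  # hypothetical spider class

items = []

def collect_item(item, response, spider):
    items.append(item)

process = CrawlerProcess(settings={"LOG_LEVEL": "WARNING"})
crawler = process.create_crawler(MySpider)
crawler.signals.connect(collect_item, signal=signals.item_scraped)
process.crawl(crawler)
process.start()  # blocks until the crawl finishes

assert len(items) > 0  # any assertions on the collected items go here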
This is a very late answer, but I've been annoyed with Scrapy testing, so I wrote scrapy-test, a framework for testing Scrapy crawlers against defined specifications.
It works by defining test specifications rather than static output. For example, if we are crawling this sort of item:
{
    "name": "Alex",
    "age": 21,
    "gender": "Female"
}
We can define a scrapy-test ItemSpec:
from scrapytest.tests import Match, MoreThan, LessThan, Type
from scrapytest.spec import ItemSpec

class MySpec(ItemSpec):
    name_test = Match('.{3,}')  # name should be at least 3 characters long
    age_test = Type(int), MoreThan(18), LessThan(99)
    gender_test = Match('Female|Male')
There are also tests in the same spirit for Scrapy stats, via StatsSpec:
from scrapytest.spec import StatsSpec
from scrapytest.tests import MoreThan

class MyStatsSpec(StatsSpec):
    validate = {
        "item_scraped_count": MoreThan(0),
    }
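(item_scraped_count is one of the standard stats Scrapy's stats collector records on every crawl, so this spec asserts that the spider produced at least one item.)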
Afterwards it can be run against live or cached results:
$ scrapy-test
# or
$ scrapy-test --cache
I've been using cached runs while developing and daily cron jobs of live runs to detect website changes.
I'm using Twisted's trial to run tests, similar to Scrapy's own tests. It already starts a reactor, so I make use of the CrawlerRunner without worrying about starting and stopping one in the tests.
Stealing some ideas from the check and parse Scrapy commands, I ended up with the following base TestCase class to run assertions against live sites:
from twisted.trial import unittest

from scrapy.crawler import CrawlerRunner
from scrapy.http import Request
from scrapy.item import BaseItem
from scrapy.utils.spider import iterate_spider_output

class SpiderTestCase(unittest.TestCase):
    def setUp(self):
        self.runner = CrawlerRunner()

    def make_test_class(self, cls, url):
        """
        Make a class that proxies to the original class,
        sets up a URL to be called, and gathers the items
        and requests returned by the parse function.
        """
        class TestSpider(cls):
            # This is a once-used class, so writing into
            # the class variables is fine. The framework
            # will instantiate it, not us.
            items = []
            requests = []

            def start_requests(self):
                # make_requests_from_url is deprecated since Scrapy 1.4;
                # on newer versions build Request(url, dont_filter=True) directly.
                req = super(TestSpider, self).make_requests_from_url(url)
                req.meta["_callback"] = req.callback or self.parse
                req.callback = self.collect_output
                yield req

            def collect_output(self, response):
                try:
                    cb = response.request.meta["_callback"]
                    for x in iterate_spider_output(cb(response)):
                        if isinstance(x, (BaseItem, dict)):
                            self.items.append(x)
                        elif isinstance(x, Request):
                            self.requests.append(x)
                except Exception as ex:
                    print("ERROR", "Could not execute callback: ", ex)
                    raise ex

                # Returning any requests here would make the crawler follow them.
                return None

        return TestSpider
Example:
from twisted.internet import defer  # needed for inlineCallbacks

@defer.inlineCallbacks
def test_foo(self):
    tester = self.make_test_class(FooSpider, 'https://foo.com')
    yield self.runner.crawl(tester)
    self.assertEqual(len(tester.items), 1)
    self.assertEqual(len(tester.requests), 2)
or perform one request in the setup and run multiple tests against the results:
@defer.inlineCallbacks
def setUp(self):
    super(FooTestCase, self).setUp()
    # assumes FooTestCase defines a class attribute: tester = None
    if FooTestCase.tester is None:
        FooTestCase.tester = self.make_test_class(FooSpider, 'https://foo.com')
        yield self.runner.crawl(self.tester)

def test_foo(self):
    self.assertEqual(len(self.tester.items), 1)
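Since trial calls setUp before every test method, caching the crawl result in the class attribute means the live site is only hit once, no matter how many test methods inspect the collected items and requests.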
I'm using Scrapy 1.3.0, and the function fake_response_from_file raises an error. With:
response = Response(url=url, request=request, body=file_content)
I get:
raise AttributeError("Response content isn't text")
The solution is to use TextResponse instead, and it works OK. For example:
response = TextResponse(url=url, request=request, body=file_content)
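Put together, a sketch of the full helper with that fix applied (the file-handling details and the encoding are assumptions based on the accepted answer's version):

import os
from scrapy.http import Request, TextResponse

def fake_response_from_file(file_name, url='http://www.example.com'):
    """Build a fake Scrapy response from a local HTML file."""
    request = Request(url=url)
    if not file_name.startswith('/'):
        responses_dir = os.path.dirname(os.path.realpath(__file__))
        file_path = os.path.join(responses_dir, file_name)
    else:
        file_path = file_name
    with open(file_path, 'rb') as f:
        file_content = f.read()
    return TextResponse(url=url, request=request, body=file_content,
                        encoding='utf-8')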
Thanks a lot.
Slightly simpler, by removing the fake_response_from_file helper from the chosen answer:
import unittest

from scrapy.selector import Selector

from spiders.my_spider import MySpider

class TestParsers(unittest.TestCase):
    def setUp(self):
        self.spider = MySpider(limit=1)
        with open("some.htm", "r") as f:
            self.html = Selector(text=f.read())

    def test_some_parse(self):
        expected = "some-text"
        result = self.spider.some_parse(self.html)
        self.assertEqual(result, expected)

if __name__ == '__main__':
    unittest.main()
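Note this only works when some_parse uses the selector API (.css()/.xpath()) exclusively: a Selector has no url, meta, or other Response attributes, so parse methods that rely on those still need a fake Response object.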