Scrapy Unit Testing

逝去的感伤 2020-11-30 18:18

I'd like to implement some unit tests in a Scrapy (screen scraper/web crawler) project. Since a project is run through the "scrapy crawl" command, I can run it through something

10 Answers
  • 2020-11-30 18:23

    Similar to Hadrien's answer but for pytest: pytest-vcr.

    import requests
    import pytest
    from scrapy.http import HtmlResponse
    
    @pytest.mark.vcr()
    def test_parse(url, target):
        response = requests.get(url)
        scrapy_response = HtmlResponse(url, body=response.content)
        assert Spider().parse(scrapy_response) == target
    
    
  • 2020-11-30 18:29

    You can follow this snippet from the scrapy site to run it from a script. Then you can make whatever assertions you like on the returned items.

  • 2020-11-30 18:30

    This is a very late answer, but I've been annoyed with scrapy testing, so I wrote scrapy-test, a framework for testing scrapy crawlers against defined specifications.

    It works by defining test specifications rather than static output. For example if we are crawling this sort of item:

    {
        "name": "Alex",
        "age": 21,
        "gender": "Female",
    }
    

    We can define a scrapy-test ItemSpec:

    from scrapytest.tests import Match, MoreThan, LessThan, Type
    from scrapytest.spec import ItemSpec

    class MySpec(ItemSpec):
        name_test = Match('.{3,}')  # name should be at least 3 characters long
        age_test = Type(int), MoreThan(18), LessThan(99)
        gender_test = Match('Female|Male')
    

    The same idea works for scrapy stats, via StatsSpec:

    from scrapytest.spec import StatsSpec
    from scrapytest.tests import MoreThan

    class MyStatsSpec(StatsSpec):
        validate = {
            "item_scraped_count": MoreThan(0),
        }
    

    Afterwards it can be run against live or cached results:

    $ scrapy-test 
    # or
    $ scrapy-test --cache
    

    I've been using cached runs for development changes and daily cronjobs for detecting website changes.

  • 2020-11-30 18:35

    I'm using Twisted's trial to run tests, similar to Scrapy's own tests. It already starts a reactor, so I make use of the CrawlerRunner without worrying about starting and stopping one in the tests.

    Stealing some ideas from the check and parse Scrapy commands, I ended up with the following base TestCase class to run assertions against live sites:

    from twisted.trial import unittest
    
    from scrapy.crawler import CrawlerRunner
    from scrapy.http import Request
    from scrapy.item import BaseItem
    from scrapy.utils.spider import iterate_spider_output
    
    class SpiderTestCase(unittest.TestCase):
        def setUp(self):
            self.runner = CrawlerRunner()
    
        def make_test_class(self, cls, url):
            """
            Make a class that proxies to the original class,
            sets up a URL to be called, and gathers the items
            and requests returned by the parse function.
            """
            class TestSpider(cls):
                # This is a once used class, so writing into
                # the class variables is fine. The framework
                # will instantiate it, not us.
                items = []
                requests = []
    
                def start_requests(self):
                    req = super(TestSpider, self).make_requests_from_url(url)
                    req.meta["_callback"] = req.callback or self.parse
                    req.callback = self.collect_output
                    yield req
    
                def collect_output(self, response):
                    try:
                        cb = response.request.meta["_callback"]
                        for x in iterate_spider_output(cb(response)):
                            if isinstance(x, (BaseItem, dict)):
                                self.items.append(x)
                            elif isinstance(x, Request):
                                self.requests.append(x)
                    except Exception as ex:
                        print("ERROR", "Could not execute callback: ", ex)
                        raise

                    # Returning any requests here would make the crawler follow them.
                    return None
    
            return TestSpider
    

    Example:

    @defer.inlineCallbacks
    def test_foo(self):
        tester = self.make_test_class(FooSpider, 'https://foo.com')
        yield self.runner.crawl(tester)
        self.assertEqual(len(tester.items), 1)
        self.assertEqual(len(tester.requests), 2)
    

    or perform one request in the setup and run multiple tests against the results:

    @defer.inlineCallbacks
    def setUp(self):
        super(FooTestCase, self).setUp()
        # assumes FooTestCase defines a class attribute: tester = None
        if FooTestCase.tester is None:
            FooTestCase.tester = self.make_test_class(FooSpider, 'https://foo.com')
            yield self.runner.crawl(self.tester)
    
    def test_foo(self):
        self.assertEqual(len(self.tester.items), 1)
    
  • 2020-11-30 18:36

    I'm using scrapy 1.3.0, and the function fake_response_from_file raises an error:

    response = Response(url=url, request=request, body=file_content)
    

    I get:

    raise AttributeError("Response content isn't text")
    

    The solution is to use TextResponse instead, which works fine. For example:

    response = TextResponse(url=url, request=request, body=file_content)
    

    Thanks a lot.

  • 2020-11-30 18:37

    Slightly simpler, removing the fake_response_from_file helper from the chosen answer:

    import unittest
    from spiders.my_spider import MySpider
    from scrapy.selector import Selector


    class TestParsers(unittest.TestCase):

        def setUp(self):
            self.spider = MySpider(limit=1)
            # Selector offers the same .css()/.xpath() interface as a response
            with open("some.htm", "r") as f:
                self.html = Selector(text=f.read())

        def test_some_parse(self):
            expected = "some-text"
            result = self.spider.some_parse(self.html)
            self.assertEqual(result, expected)


    if __name__ == '__main__':
        unittest.main()