Scrapy Unit Testing


I'd like to implement some unit tests in a Scrapy project (screen scraper/web crawler). Since a project is run through the "scrapy crawl" command, I can run it through something…

10 Answers
  • 2020-11-30 18:41

    The newly added Spider Contracts are worth trying. They give you a simple way to add tests without requiring a lot of code.
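
    A rough sketch of what using contracts looks like (the spider, URL, and field names below are placeholders, not taken from this answer): the contracts go in the callback's docstring and are run with the built-in "scrapy check" command.

    import scrapy

    class MySpider(scrapy.Spider):
        name = 'my_spider'

        def parse(self, response):
            """ Checked with `scrapy check my_spider`: the page should yield
            between 1 and 25 items, no further requests, and every item
            should have the listed fields.

            @url http://www.example.com/some/listing
            @returns items 1 25
            @returns requests 0 0
            @scrapes title content
            """
            for row in response.css('div.row'):
                yield {
                    'title': row.css('h2::text').get(),
                    'content': row.css('p::text').get(),
                }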

  • 2020-11-30 18:45

    The way I've done it is to create fake responses; this way you can test the parse function offline. But because the fake responses are built from real HTML, you still exercise the real situation.

    A problem with this approach is that your local HTML file may not reflect the latest state of the site. If the HTML changes online you may have a big bug, yet your test cases will still pass, so this may not be the best way to test.

    My current workflow is: whenever there is an error, an email is sent to the admin with the URL. Then, for that specific error, I create an HTML file with the content that caused it and write a unit test for it.

    This is the code I use to create sample Scrapy HTTP responses for testing from a local HTML file:

    # scrapyproject/tests/responses/__init__.py
    
    import os
    
    from scrapy.http import HtmlResponse, Request
    
    def fake_response_from_file(file_name, url=None):
        """
        Create a Scrapy fake HTTP response from an HTML file
        @param file_name: The relative filename from the responses directory,
                          but absolute paths are also accepted.
        @param url: The URL of the response.
        returns: A scrapy HTTP response which can be used for unittesting.
        """
        if not url:
            url = 'http://www.example.com'
    
        request = Request(url=url)
        if not file_name[0] == '/':
            responses_dir = os.path.dirname(os.path.realpath(__file__))
            file_path = os.path.join(responses_dir, file_name)
        else:
            file_path = file_name
    
        with open(file_path, 'r') as f:
            file_content = f.read()
    
        # HtmlResponse lets us pass a unicode body together with its encoding
        response = HtmlResponse(url=url,
            request=request,
            body=file_content,
            encoding='utf-8')
        return response
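
    As a side note, if a callback reads response.meta, a small variation of this helper can forge that too. The helper below is my own addition, not part of the original answer; it relies on Request.replace() and Response.replace(), which return copies with the given attributes overridden:

    def fake_response_with_meta(file_name, url=None, meta=None):
        """
        Same as fake_response_from_file, but also attaches request.meta so
        that callbacks reading response.meta can be tested.
        """
        plain = fake_response_from_file(file_name, url=url)
        # meta lives on the request; response.meta is just a shortcut to it
        request = plain.request.replace(meta=meta or {})
        return plain.replace(request=request)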
    

    The sample HTML file is located at scrapyproject/tests/responses/osdir/sample.html.

    The test case then lives in scrapyproject/tests/test_osdir.py and could look as follows:

    import unittest
    from scrapyproject.spiders import osdir_spider
    from scrapyproject.tests.responses import fake_response_from_file
    
    class OsdirSpiderTest(unittest.TestCase):
    
        def setUp(self):
            self.spider = osdir_spider.DirectorySpider()
    
        def _test_item_results(self, results, expected_length):
            count = 0
            for item in results:
                count += 1
                self.assertIsNotNone(item['content'])
                self.assertIsNotNone(item['title'])
            # the callback must yield exactly the expected number of items
            self.assertEqual(count, expected_length)
    
        def test_parse(self):
            results = self.spider.parse(fake_response_from_file('osdir/sample.html'))
            self._test_item_results(results, 10)
    

    That's basically how I test my parsing methods, but it's not limited to parsing methods. If it gets more complex I suggest looking at Mox.
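
    For callbacks that also yield follow-up Requests, the same fake response works. Here is a minimal sketch of the pattern; the test class and its expectations below are made up for illustration and are not part of the spider above:

    import unittest
    from scrapy.http import Request
    from scrapyproject.spiders import osdir_spider
    from scrapyproject.tests.responses import fake_response_from_file

    class OsdirSpiderRequestTest(unittest.TestCase):

        def test_parse_yields_requests(self):
            spider = osdir_spider.DirectorySpider()
            results = list(spider.parse(fake_response_from_file('osdir/sample.html')))
            # split the callback output into follow-up requests and scraped items
            requests = [r for r in results if isinstance(r, Request)]
            items = [r for r in results if not isinstance(r, Request)]
            self.assertEqual(len(items), 10)
            # placeholder check; assert whatever your spider should do with links
            for request in requests:
                self.assertTrue(request.url)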

  • 2020-11-30 18:45

    I use Betamax to run the test against the real site the first time and keep the HTTP responses locally, so that subsequent tests run super fast:

    Betamax intercepts every request you make and attempts to find a matching request that has already been intercepted and recorded.

    When you need the latest version of the site, just remove what Betamax has recorded and re-run the tests.

    Example:

    from scrapy import Spider, Request
    from scrapy.http import HtmlResponse
    
    
    class Example(Spider):
        name = 'example'
    
        url = 'http://doc.scrapy.org/en/latest/_static/selectors-sample1.html'
    
        def start_requests(self):
            yield Request(self.url, self.parse)
    
        def parse(self, response):
            for href in response.xpath('//a/@href').extract():
                yield {'image_href': href}
    
    
    # Test part
    from betamax import Betamax
    from betamax.fixtures.unittest import BetamaxTestCase
    
    
    with Betamax.configure() as config:
        # where betamax will store cassettes (http responses):
        config.cassette_library_dir = 'cassettes'
        config.preserve_exact_body_bytes = True
    
    
    class TestExample(BetamaxTestCase):  # superclass provides self.session
    
        def test_parse(self):
            example = Example()
    
            # http response is recorded in a betamax cassette:
            response = self.session.get(example.url)
    
            # forge a scrapy response to test
            scrapy_response = HtmlResponse(body=response.content, url=example.url)
    
            result = example.parse(scrapy_response)
    
            self.assertEqual({'image_href': u'image1.html'}, next(result))
            self.assertEqual({'image_href': u'image2.html'}, next(result))
            self.assertEqual({'image_href': u'image3.html'}, next(result))
            self.assertEqual({'image_href': u'image4.html'}, next(result))
            self.assertEqual({'image_href': u'image5.html'}, next(result))

            with self.assertRaises(StopIteration):
                next(result)
    

    FYI, I discovered Betamax at PyCon 2015 thanks to Ian Cordasco's talk.

  • 2020-11-30 18:48

    https://github.com/ThomasAitken/Scrapy-Testmaster

    This is a package I wrote that significantly extends the functionality of the Scrapy Autounit library and takes it in a different direction, allowing for easy dynamic updating of test cases and merging the processes of debugging and test-case generation. It also includes a modified version of the Scrapy parse command (https://docs.scrapy.org/en/latest/topics/commands.html#std-command-parse).
