I'd like to implement some unit tests in a Scrapy project (screen scraper/web crawler). Since a project is run through the "scrapy crawl" command, can I run it through something like a standard unit-testing framework?
The newly added Spider Contracts are worth trying. They give you a simple way to add tests without requiring a lot of code.
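As a rough sketch (the spider name, URL, and selectors here are made up), contracts live in the callback's docstring and are executed with scrapy check <spider>:

from scrapy import Spider


class ListingSpider(Spider):
    name = 'listing'

    def parse(self, response):
        """Contracts below are checked by running: scrapy check listing

        @url http://www.example.com/listing
        @returns items 1 25
        @returns requests 0 0
        @scrapes title content
        """
        for entry in response.css('div.entry'):
            yield {
                'title': entry.css('h2::text').get(),
                'content': entry.css('p::text').get(),
            }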
The way I've done it is to create fake responses; that way you can test the parse functions offline, while still exercising real HTML.
A problem with this approach is that your local HTML file may not reflect the latest state online. If the HTML changes online you may have a big bug, but your test cases will still pass, so it may not be sufficient on its own.
My current workflow is: whenever there is an error, I send an email to the admin with the URL. Then, for that specific error, I create an HTML file with the content that caused the error and write a unit test for it.
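A rough sketch of the "save the offending page" step, assuming a hypothetical spider with a made-up extract_items helper (the email notification is omitted):

import hashlib
import pathlib

from scrapy import Spider


class DirectorySpider(Spider):
    name = 'osdir'

    def parse(self, response):
        try:
            yield from self.extract_items(response)
        except Exception:
            # Keep the HTML that triggered the failure so it can become
            # the fixture for a new regression test.
            digest = hashlib.sha1(response.url.encode()).hexdigest()
            fixture = pathlib.Path('tests/responses/failures') / (digest + '.html')
            fixture.parent.mkdir(parents=True, exist_ok=True)
            fixture.write_bytes(response.body)
            raise

    def extract_items(self, response):
        # Hypothetical extraction logic, standing in for the real parsing code.
        for entry in response.css('div.entry'):
            yield {'title': entry.css('h2::text').get()}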
This is the code I use to create sample Scrapy HTTP responses for testing from a local HTML file:
# scrapyproject/tests/responses/__init__.py
import os

from scrapy.http import HtmlResponse, Request


def fake_response_from_file(file_name, url=None):
    """
    Create a Scrapy fake HTTP response from an HTML file.

    @param file_name: The relative filename from the responses directory,
                      but absolute paths are also accepted.
    @param url: The URL of the response.

    returns: A Scrapy HTTP response which can be used for unit testing.
    """
    if not url:
        url = 'http://www.example.com'

    request = Request(url=url)
    if not file_name[0] == '/':
        responses_dir = os.path.dirname(os.path.realpath(__file__))
        file_path = os.path.join(responses_dir, file_name)
    else:
        file_path = file_name

    # Response bodies must be bytes, so read the file in binary mode
    # (and close it properly).
    with open(file_path, 'rb') as f:
        file_content = f.read()

    # Use HtmlResponse rather than the base Response: it accepts an
    # encoding argument and makes XPath/CSS selectors work in tests.
    return HtmlResponse(url=url, request=request,
                        body=file_content, encoding='utf-8')
The sample HTML file is located at scrapyproject/tests/responses/osdir/sample.html.
The test case could then look as follows (its location is scrapyproject/tests/test_osdir.py):
import unittest

from scrapyproject.spiders import osdir_spider
from responses import fake_response_from_file


class OsdirSpiderTest(unittest.TestCase):

    def setUp(self):
        self.spider = osdir_spider.DirectorySpider()

    def _test_item_results(self, results, expected_length):
        count = 0
        for item in results:
            count += 1
            self.assertIsNotNone(item['content'])
            self.assertIsNotNone(item['title'])
        # Check the total only after all results have been consumed.
        self.assertEqual(count, expected_length)

    def test_parse(self):
        results = self.spider.parse(fake_response_from_file('osdir/sample.html'))
        self._test_item_results(results, 10)
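With this layout the suite can be run with the standard library test runner, for example: python -m unittest discover scrapyproject/tests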
That's basically how I test my parsing methods, though the approach isn't limited to them. If it gets more complex, I suggest looking at Mox.
I use Betamax to run the test against the real site the first time and keep the HTTP responses locally, so that subsequent test runs are super fast:
Betamax intercepts every request you make and attempts to find a matching request that has already been intercepted and recorded.
When you need the latest version of the site, just remove what Betamax has recorded and re-run the test.
Example:
from scrapy import Spider, Request
from scrapy.http import HtmlResponse


class Example(Spider):
    name = 'example'

    url = 'http://doc.scrapy.org/en/latest/_static/selectors-sample1.html'

    def start_requests(self):
        yield Request(self.url, self.parse)

    def parse(self, response):
        for href in response.xpath('//a/@href').extract():
            yield {'image_href': href}
# Test part
from betamax import Betamax
from betamax.fixtures.unittest import BetamaxTestCase


with Betamax.configure() as config:
    # where Betamax will store cassettes (HTTP responses):
    config.cassette_library_dir = 'cassettes'
    config.preserve_exact_body_bytes = True


class TestExample(BetamaxTestCase):  # superclass provides self.session

    def test_parse(self):
        example = Example()

        # HTTP response is recorded in a Betamax cassette:
        response = self.session.get(example.url)

        # forge a Scrapy response to test
        scrapy_response = HtmlResponse(body=response.content, url=example.url)

        result = example.parse(scrapy_response)

        self.assertEqual({'image_href': u'image1.html'}, next(result))
        self.assertEqual({'image_href': u'image2.html'}, next(result))
        self.assertEqual({'image_href': u'image3.html'}, next(result))
        self.assertEqual({'image_href': u'image4.html'}, next(result))
        self.assertEqual({'image_href': u'image5.html'}, next(result))

        with self.assertRaises(StopIteration):
            next(result)
FYI, I discovered Betamax at PyCon 2015 thanks to Ian Cordasco's talk.
https://github.com/ThomasAitken/Scrapy-Testmaster
This is a package I wrote that significantly extends the functionality of the Scrapy Autounit library and takes it in a different direction, allowing for easy dynamic updating of test cases and merging the processes of debugging and test-case generation. It also includes a modified version of the Scrapy parse command (https://docs.scrapy.org/en/latest/topics/commands.html#std-command-parse).
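For context, the stock parse command fetches a live URL and runs it through a chosen callback, which is handy for quick manual checks. For example (the spider name and URL here are illustrative): scrapy parse --spider=example -c parse 'http://www.example.com/listing'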