I'd like to implement some unit tests in Scrapy (screen scraper/web crawler). Since a project is run through the "scrapy crawl" command, I can run it through something like unittest, right?
Similar to Hadrien's answer but for pytest: pytest-vcr.
import requests
import pytest
from scrapy.http import HtmlResponse

from myproject.spiders import Spider  # import your own spider class here

@pytest.mark.vcr()
def test_parse(url, target):
    response = requests.get(url)
    scrapy_response = HtmlResponse(url, body=response.content)
    assert Spider().parse(scrapy_response) == target
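On the first run, pytest-vcr records the HTTP interaction to a cassette file and replays it on subsequent runs, keeping the test fast and deterministic. The url and target arguments must come from fixtures or parametrization; here is a minimal sketch using pytest's built-in parametrize (the URL and expected output are placeholders):

@pytest.mark.vcr()
@pytest.mark.parametrize("url, target", [
    ("https://example.com/page", [{"name": "Alex"}]),
])
def test_parse(url, target):
    ...  # same body as the test above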
You can follow this snippet from the Scrapy site to run it from a script. Then you can make any kind of assertions you'd like on the returned items.
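For illustration, a minimal sketch of that pattern; MySpider and the settings are placeholders, and items are gathered through Scrapy's item_scraped signal:

from scrapy import signals
from scrapy.crawler import CrawlerProcess

from myproject.spiders import MySpider  # hypothetical spider class

items = []

def collect_item(item, response, spider):
    items.append(item)

process = CrawlerProcess(settings={"LOG_LEVEL": "WARNING"})
crawler = process.create_crawler(MySpider)
crawler.signals.connect(collect_item, signal=signals.item_scraped)
process.crawl(crawler)
process.start()  # blocks until the crawl finishes

assert len(items) > 0  # any assertions on the collected items go here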
This is a very late answer, but I've been annoyed with Scrapy testing, so I wrote scrapy-test, a framework for testing Scrapy crawlers against defined specifications.
It works by defining test specifications rather than static output. For example, if we are crawling this sort of item:
{
    "name": "Alex",
    "age": 21,
    "gender": "Female"
}
We can define a scrapy-test ItemSpec:
from scrapytest.tests import Match, MoreThan, LessThan, Type
from scrapytest.spec import ItemSpec

class MySpec(ItemSpec):
    name_test = Match('.{3,}')  # name should be at least 3 characters long
    age_test = Type(int), MoreThan(18), LessThan(99)
    gender_test = Match('Female|Male')
There are also tests in the same spirit for Scrapy stats, via StatsSpec:
from scrapytest.spec import StatsSpec
from scrapytest.tests import MoreThan

class MyStatsSpec(StatsSpec):
    validate = {
        "item_scraped_count": MoreThan(0),
    }
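(item_scraped_count is one of the standard stats Scrapy's stats collector records on every crawl, so this spec asserts that the spider produced at least one item.)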
Afterwards it can be run against live or cached results:
$ scrapy-test
# or
$ scrapy-test --cache
I've been using cached runs while developing and daily cron jobs of live runs to detect website changes.
I'm using Twisted's trial to run tests, similar to Scrapy's own tests. It already starts a reactor, so I make use of the CrawlerRunner without worrying about starting and stopping one in the tests.
Stealing some ideas from the check and parse Scrapy commands, I ended up with the following base TestCase class to run assertions against live sites:
from twisted.trial import unittest

from scrapy.crawler import CrawlerRunner
from scrapy.http import Request
from scrapy.item import BaseItem
from scrapy.utils.spider import iterate_spider_output

class SpiderTestCase(unittest.TestCase):
    def setUp(self):
        self.runner = CrawlerRunner()

    def make_test_class(self, cls, url):
        """
        Make a class that proxies to the original class,
        sets up a URL to be called, and gathers the items
        and requests returned by the parse function.
        """
        class TestSpider(cls):
            # This is a once-used class, so writing into
            # the class variables is fine. The framework
            # will instantiate it, not us.
            items = []
            requests = []

            def start_requests(self):
                # make_requests_from_url is deprecated since Scrapy 1.4;
                # on newer versions build Request(url, dont_filter=True) directly.
                req = super(TestSpider, self).make_requests_from_url(url)
                req.meta["_callback"] = req.callback or self.parse
                req.callback = self.collect_output
                yield req

            def collect_output(self, response):
                try:
                    cb = response.request.meta["_callback"]
                    for x in iterate_spider_output(cb(response)):
                        if isinstance(x, (BaseItem, dict)):
                            self.items.append(x)
                        elif isinstance(x, Request):
                            self.requests.append(x)
                except Exception as ex:
                    print("ERROR", "Could not execute callback: ", ex)
                    raise ex

                # Returning any requests here would make the crawler follow them.
                return None

        return TestSpider
Example:
from twisted.internet import defer  # needed for inlineCallbacks

@defer.inlineCallbacks
def test_foo(self):
    tester = self.make_test_class(FooSpider, 'https://foo.com')
    yield self.runner.crawl(tester)
    self.assertEqual(len(tester.items), 1)
    self.assertEqual(len(tester.requests), 2)
or perform one request in the setup and run multiple tests against the results:
@defer.inlineCallbacks
def setUp(self):
    super(FooTestCase, self).setUp()
    # assumes FooTestCase defines a class attribute: tester = None
    if FooTestCase.tester is None:
        FooTestCase.tester = self.make_test_class(FooSpider, 'https://foo.com')
        yield self.runner.crawl(self.tester)

def test_foo(self):
    self.assertEqual(len(self.tester.items), 1)
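Since trial calls setUp before every test method, caching the crawl result in the class attribute means the live site is only hit once, no matter how many test methods inspect the collected items and requests.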
I'm using Scrapy 1.3.0, and the function fake_response_from_file raises an error. With:
response = Response(url=url, request=request, body=file_content)
I get:
raise AttributeError("Response content isn't text")
The solution is to use TextResponse instead, and it works OK. For example:
response = TextResponse(url=url, request=request, body=file_content)
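Put together, a sketch of the full helper with that fix applied (the file-handling details and the encoding are assumptions based on the accepted answer's version):

import os
from scrapy.http import Request, TextResponse

def fake_response_from_file(file_name, url='http://www.example.com'):
    """Build a fake Scrapy response from a local HTML file."""
    request = Request(url=url)
    if not file_name.startswith('/'):
        responses_dir = os.path.dirname(os.path.realpath(__file__))
        file_path = os.path.join(responses_dir, file_name)
    else:
        file_path = file_name
    with open(file_path, 'rb') as f:
        file_content = f.read()
    return TextResponse(url=url, request=request, body=file_content,
                        encoding='utf-8')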
Thanks a lot.
Slightly simpler, by removing the fake_response_from_file helper from the chosen answer:
import unittest

from scrapy.selector import Selector

from spiders.my_spider import MySpider

class TestParsers(unittest.TestCase):
    def setUp(self):
        self.spider = MySpider(limit=1)
        with open("some.htm", "r") as f:
            self.html = Selector(text=f.read())

    def test_some_parse(self):
        expected = "some-text"
        result = self.spider.some_parse(self.html)
        self.assertEqual(result, expected)

if __name__ == '__main__':
    unittest.main()
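Note this only works when some_parse uses the selector API (.css()/.xpath()) exclusively: a Selector has no url, meta, or other Response attributes, so parse methods that rely on those still need a fake Response object.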