I have a scrapy project where the item that ultimately enters my pipeline is relatively large and stores lots of metadata and content. Everything is working properly in my s
If you want to exclude only some attributes of the output, you can extend the answer given by @dino
import json

from scrapy.item import Item, Field


class MyItem(Item):
    attr1 = Field()
    attr2 = Field()
    attr1ToExclude = Field()
    attr2ToExclude = Field()
    # ...
    attrN = Field()

    def __repr__(self):
        # Build the logged representation from every field except the bulky ones.
        r = {}
        # Scrapy keeps the populated fields in self._values;
        # use .iteritems() instead of .items() on Python 2.
        for attr, value in self._values.items():
            if attr not in ['attr1ToExclude', 'attr2ToExclude']:
                r[attr] = value
        return json.dumps(r, sort_keys=True, indent=4, separators=(',', ': '))
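One caveat with the __repr__ above: json.dumps raises TypeError if a remaining field holds something that is not JSON-serializable (a date, for example). Here is a minimal sketch of the same filtering as a standalone helper, using only the standard library. The name repr_without and the sample values are made up for illustration and are not part of Scrapy; default=str is my addition to fall back to str() for such values.

```python
import datetime
import json


def repr_without(values, excluded):
    """Return a JSON string of `values` without the keys listed in `excluded`.

    Hypothetical helper, not a Scrapy API. `values` is any mapping, for
    example the dict of populated fields of an item.
    """
    visible = {k: v for k, v in values.items() if k not in excluded}
    # default=str converts values json cannot encode (dates, etc.) to strings.
    return json.dumps(visible, sort_keys=True, indent=4,
                      separators=(',', ': '), default=str)


# Illustrative data: one small field, one date, one very large field.
sample = {
    'attr1': 'title',
    'attr2': datetime.date(2020, 1, 1),
    'attr1ToExclude': 'x' * 10000,
}
text = repr_without(sample, {'attr1ToExclude'})
print(text)
```

Inside the item, __repr__ could then presumably be reduced to a single call that passes the item's values and the set of excluded names.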
We use the following sample in production:
import logging

logging.getLogger('scrapy.core.scraper').addFilter(
    lambda x: not x.getMessage().startswith('Scraped from'))
The code is simple and it works. We put it in the __init__.py of the module that contains our spiders, so it runs automatically for every spider whenever a command such as scrapy crawl <spider_name> is executed.
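To see what the filter does without running a crawl, here is a small sketch that uses only the standard logging module. The class names DropScrapedItems and ListHandler and the two log messages are invented for the demonstration; only the logger name scrapy.core.scraper and the 'Scraped from' prefix come from the snippet above. A logging.Filter subclass is used here in place of the lambda because callables are only accepted as filters from Python 3.2 on.

```python
import logging


class DropScrapedItems(logging.Filter):
    """Reject records whose formatted message starts with 'Scraped from'."""

    def filter(self, record):
        return not record.getMessage().startswith('Scraped from')


class ListHandler(logging.Handler):
    """Demo-only handler that collects formatted messages in a list."""

    def __init__(self):
        super().__init__()
        self.messages = []

    def emit(self, record):
        self.messages.append(record.getMessage())


scraper_logger = logging.getLogger('scrapy.core.scraper')
scraper_logger.setLevel(logging.DEBUG)
scraper_logger.addFilter(DropScrapedItems())

handler = ListHandler()
scraper_logger.addHandler(handler)

# The first record is dropped by the filter, the second one gets through.
scraper_logger.debug('Scraped from <200 http://example.com>')
scraper_logger.debug('Crawled 1 page')

print(handler.messages)
```

Note that a filter attached to a logger only applies to records logged on that exact logger, so other Scrapy log output should be unaffected.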