I have a Scrapy project where the item that ultimately enters my pipeline is relatively large and stores lots of metadata and content. Everything is working properly in my spider and pipeline; the problem is that Scrapy prints out the entire item in the DEBUG log as it leaves the pipeline, which makes the log unreadable. Is there a way to suppress or trim that output?
Having read through the documentation and conducted a (brief) search through the source code, I can't see a straightforward way of achieving this aim.
The hammer approach is to raise the logging level to INFO in the settings (i.e. add the following line to settings.py):

LOG_LEVEL = 'INFO'

This will strip out a lot of other information about the URLs/pages being crawled, but it will definitely suppress the data about processed items.
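If you only need the quieter output for a single spider, the same setting can be applied per spider via custom_settings rather than project-wide (a minimal sketch; the spider name is made up):

import scrapy

class MySpider(scrapy.Spider):
    name = 'my_spider'  # hypothetical name
    # Raise the log level for this spider only; settings.py stays untouched.
    custom_settings = {'LOG_LEVEL': 'INFO'}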
Alternatively, if you know the spider is working correctly, you can disable logging entirely:

LOG_ENABLED = False

I disable it once my crawler runs fine.
Another approach is to override the __repr__ method of your Item subclasses to selectively choose which attributes (if any) to print at the end of the pipeline:
from scrapy.item import Item, Field

class MyItem(Item):
    attr1 = Field()
    attr2 = Field()
    # ...
    attrN = Field()

    def __repr__(self):
        """Only print out attr1 after exiting the pipeline."""
        # Item fields are accessed like dict keys, not attributes.
        return repr({"attr1": self["attr1"]})
This way, you can keep the log level at DEBUG and show only the attributes that you want to see coming out of the pipeline (to check attr1, for example).
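If you would rather still get a glimpse of the bulky fields than hide them outright, one variation (a sketch with made-up field names) is to truncate every value in __repr__:

from scrapy.item import Item, Field

class MyItem(Item):
    title = Field()    # hypothetical small field
    content = Field()  # hypothetical large field

    def __repr__(self):
        # Cap each value at 40 characters so the DEBUG line stays
        # readable but still hints at what was scraped.
        return repr({k: str(v)[:40] for k, v in self.items()})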
I think the cleanest way to do this is to add a filter to the scrapy.core.scraper logger that changes the message in question. This allows you to keep your Item's __repr__ intact without having to change Scrapy's logging level:
import logging
import re

class ItemMessageFilter(logging.Filter):
    def filter(self, record):
        # The message that logs the item actually has raw % operators in it,
        # which Scrapy presumably formats later on
        match = re.search(r'(Scraped from %\(src\)s)\n%\(item\)s', record.msg)
        if match:
            # Make the message everything but the item itself
            record.msg = match.group(1)
        # We don't actually want to filter out this record, so always return True
        return True

logging.getLogger('scrapy.core.scraper').addFilter(ItemMessageFilter())
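The filter has to be registered before items start flowing; one place that works is the top of the module where your spider lives, so it is installed at import time (a sketch; the module path and spider name are made up):

import logging
import scrapy
from myproject.logfilters import ItemMessageFilter  # the class defined above; path is hypothetical

# Installed at import time, before the crawl begins.
logging.getLogger('scrapy.core.scraper').addFilter(ItemMessageFilter())

class MySpider(scrapy.Spider):
    name = 'my_spider'  # hypothetical name

    def parse(self, response):
        yield {'url': response.url}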
I tried the __repr__ approach mentioned by @dino, but it didn't work well. Building on his idea, I tried the __str__ method instead, and it works. Here's how I do it, very simple:
def __str__(self):
    return ""
If you found your way here because you had the same question years later, the easiest way to do this is with a LogFormatter:
import scrapy.logformatter

class QuietLogFormatter(scrapy.logformatter.LogFormatter):
    def scraped(self, item, response, spider):
        # Returning None from a LogFormatter method suppresses the message.
        return (
            super().scraped(item, response, spider)
            if spider.settings.getbool("LOG_SCRAPED_ITEMS")
            else None
        )
Just add LOG_FORMATTER = "path.to.QuietLogFormatter" to your settings.py and you will see all your DEBUG messages except for the scraped items. With LOG_SCRAPED_ITEMS = True you can restore the previous behaviour without having to change your LOG_FORMATTER.
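In settings.py that looks like the following (the dotted path is just a placeholder for wherever you define the formatter):

# settings.py
LOG_FORMATTER = 'myproject.logformatters.QuietLogFormatter'  # hypothetical path
LOG_SCRAPED_ITEMS = False  # flip to True to restore the default behaviour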
Similarly you can customise the logging behaviour for crawled pages and dropped items.
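For instance, applying the same pattern to the dropped() hook would silence DropItem messages as well (a sketch; the LOG_DROPPED_ITEMS setting name is my own invention, not a built-in Scrapy setting):

import scrapy.logformatter

class QuieterLogFormatter(scrapy.logformatter.LogFormatter):
    def dropped(self, item, exception, response, spider):
        # Returning None suppresses the log entry entirely.
        return (
            super().dropped(item, exception, response, spider)
            if spider.settings.getbool('LOG_DROPPED_ITEMS')  # hypothetical setting
            else None
        )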
Edit: I wrapped up this formatter and some other Scrapy stuff in this library.