Suppress Scrapy Item printed in logs after pipeline

无人及你 2020-12-25 12:43

I have a scrapy project where the item that ultimately enters my pipeline is relatively large and stores lots of metadata and content. Everything is working properly in my spider and pipelines; the logs, however, print out the entire Item as it leaves the pipeline, and I would like to suppress that output.

8 answers
  • 2020-12-25 12:50

    Having read through the documentation and conducted a (brief) search through the source code, I can't see a straightforward way of achieving this aim.

    The hammer approach is to set the logging level to INFO (i.e., add the following line to settings.py):

    LOG_LEVEL = 'INFO'

    This will strip out a lot of other information about the URLs/pages being crawled, but it will definitely suppress data about processed items.
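    If you drive the crawl from a script instead of the command line, the same setting can be passed programmatically. A minimal sketch (MySpider and its import path are placeholders for your own spider class):

    from scrapy.crawler import CrawlerProcess

    from myproject.spiders import MySpider  # hypothetical import path

    # Raise the log level so per-item DEBUG messages are suppressed
    process = CrawlerProcess(settings={"LOG_LEVEL": "INFO"})
    process.crawl(MySpider)
    process.start()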

  • 2020-12-25 12:54

    Alternatively, if you know the spider is working correctly, you can disable logging entirely:

    LOG_ENABLED = False

    I disable it once my crawler is running fine.
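    If you only want this for a single spider rather than the whole project, a per-spider override via custom_settings should also work. A small sketch (the spider name is a placeholder):

    import scrapy

    class MySpider(scrapy.Spider):
        name = "myspider"
        # Disable logging for this spider only, leaving project settings alone
        custom_settings = {"LOG_ENABLED": False}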

  • 2020-12-25 12:55

    Another approach is to override the __repr__ method of the Item subclasses to selectively choose which attributes (if any) to print at the end of the pipeline:

    from scrapy.item import Item, Field

    class MyItem(Item):
        attr1 = Field()
        attr2 = Field()
        # ...
        attrN = Field()

        def __repr__(self):
            """Only print out attr1 after exiting the pipeline."""
            # Item fields are accessed dict-style, not as attributes
            return repr({"attr1": self.get("attr1")})
    

    This way, you can keep the log level at DEBUG and show only the attributes that you want to see coming out of the pipeline (to check attr1, for example).
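    A quick sanity check of the idea (values here are made up for illustration):

    item = MyItem(attr1="value1", attr2="a very large blob of content")
    print(item)  # prints {'attr1': 'value1'} -- str() falls back to __repr__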

  • 2020-12-25 13:07

    I think the cleanest way to do this is to add a filter to the scrapy.core.scraper logger that changes the message in question. This lets you keep your Item's __repr__ intact and avoids changing Scrapy's log level:

    import logging
    import re

    class ItemMessageFilter(logging.Filter):
        def filter(self, record):
            # The message that logs the item actually has raw % operators in it,
            # which Scrapy formats later on
            match = re.search(r'(Scraped from %\(src\)s)\n%\(item\)s', record.msg)
            if match:
                # Keep everything in the message except the item itself
                record.msg = match.group(1)
            # Don't actually want to filter out this record, so always return 1
            return 1

    logging.getLogger('scrapy.core.scraper').addFilter(ItemMessageFilter())
    
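    Note that the snippet installs the filter at import time, so it has to live in a module Scrapy actually imports. One option (just an assumption about where you might put it, not the only choice) is to attach it when the spider is initialised:

    import logging
    import scrapy

    class MySpider(scrapy.Spider):
        name = "myspider"

        def __init__(self, *args, **kwargs):
            super().__init__(*args, **kwargs)
            # Attach the filter once the spider is created
            logging.getLogger("scrapy.core.scraper").addFilter(ItemMessageFilter())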
  • 2020-12-25 13:08

    I tried the __repr__ approach mentioned by @dino, but it didn't work well for me. Building on his idea, I overrode __str__ instead, and that works (the log message is rendered with %-style formatting, so str() is what gets called on the item).

    Here's how I do it, very simple:

    class MyItem(scrapy.Item):
        def __str__(self):
            return ""  # log an empty string instead of the full item
    
  • 2020-12-25 13:08

    If you found your way here because you had the same question years later, the easiest way to do this is with a LogFormatter:

    import scrapy.logformatter

    class QuietLogFormatter(scrapy.logformatter.LogFormatter):
        def scraped(self, item, response, spider):
            # Returning None tells Scrapy to skip logging the message entirely
            return (
                super().scraped(item, response, spider)
                if spider.settings.getbool("LOG_SCRAPED_ITEMS")
                else None
            )
    

    Just add LOG_FORMATTER = "path.to.QuietLogFormatter" to your settings.py and you will see all your DEBUG messages except for the scraped items. With LOG_SCRAPED_ITEMS = True you can restore the previous behaviour without having to change your LOG_FORMATTER.

    Similarly, you can customise the logging behaviour for crawled pages and dropped items, as in the sketch below.
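    For example, dropped items could be silenced with the same pattern (a sketch following the answer's approach, gated by a made-up LOG_DROPPED_ITEMS setting):

    import scrapy.logformatter

    class QuieterLogFormatter(scrapy.logformatter.LogFormatter):
        def dropped(self, item, exception, response, spider):
            # Same idea as scraped(): returning None suppresses the log message
            return (
                super().dropped(item, exception, response, spider)
                if spider.settings.getbool("LOG_DROPPED_ITEMS")
                else None
            )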

    Edit: I wrapped up this formatter and some other Scrapy stuff in this library.
