How can I get an output in UTF-8 encoded unicode from Scrapy?

南旧 2021-01-19 10:22

Bear with me. I'm writing every detail because so many parts of the toolchain do not handle Unicode gracefully and it's not clear what is failing.

PRELUDE

2 answers
  •  北海茫月
    2021-01-19 11:17

    Please try this on your Attempt 1 and let me know if it works (I've tested it without setting any of those environment variables):

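    # -*- coding: utf-8 -*-
    # Python 2 code: in Python 3, urllib.unquote lives in urllib.parse.
    import urllib

    import scrapy

    # SimpleItem is the item class from the question's project (not shown here).


    def to_write(uni_str):
        # Percent-decode the UTF-8 bytes, then decode back to a unicode string.
        return urllib.unquote(uni_str.encode('utf8')).decode('utf8')


    class CitiesSpider(scrapy.Spider):
        name = "cities"
        allowed_domains = ["sistercity.info"]
        start_urls = (
            'http://en.sistercity.info/sister-cities/Düsseldorf.html',
        )

        def parse(self, response):
            # Yield the item twice so the exporter has more than one entry to write.
            for i in range(2):
                item = SimpleItem()
                item['title'] = to_write(response.xpath('//title').extract_first())
                item['url'] = to_write(response.url)
                yield item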
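
For readers on Python 3, where `urllib.unquote` no longer exists, the same decoding step can be sketched with `urllib.parse.unquote`. This is a minimal sketch and not part of the original answer; the sample input is assumed to be the start URL above with "ü" percent-encoded as UTF-8, which is the form a URL usually takes once it has been escaped.

```python
from urllib.parse import unquote


def to_write(text: str) -> str:
    """Turn percent-escapes such as %C3%BC back into readable characters."""
    # unquote() interprets the escaped bytes as UTF-8 (also its default).
    return unquote(text, encoding="utf-8")


# Assumed example input: the start URL with "ü" percent-encoded as UTF-8.
encoded_url = "http://en.sistercity.info/sister-cities/D%C3%BCsseldorf.html"
decoded_url = to_write(encoded_url)
print(decoded_url)
```

Because Python 3 strings are already Unicode, the explicit `encode`/`decode` round trip from the Python 2 version is not needed here.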
    

    The range(2) loop is only there to test the JSON exporter. To get a list of dicts, you can use this exporter instead:

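    # -*- coding: utf-8 -*-
    # scrapy.contrib.exporter is the old module path; Scrapy 1.0 and later
    # expose the same classes under scrapy.exporters.
    from scrapy.contrib.exporter import JsonItemExporter
    from scrapy.utils.serialize import ScrapyJSONEncoder


    class UnicodeJsonLinesItemExporter(JsonItemExporter):
        def __init__(self, file, **kwargs):
            self._configure(kwargs, dont_fail=True)
            self.file = file
            # ensure_ascii=False keeps non-ASCII characters instead of \uXXXX escapes.
            self.encoder = ScrapyJSONEncoder(ensure_ascii=False, **kwargs)
            self.first_item = True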
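
The exporter above works because `ScrapyJSONEncoder` builds on the standard library's JSON encoder, which escapes non-ASCII characters unless `ensure_ascii=False` is passed. The effect can be seen without Scrapy in a minimal standard-library sketch (Python 3; the sample title is just an illustration):

```python
import json

title = "Düsseldorf"

# Default behaviour: every non-ASCII character becomes a \uXXXX escape.
escaped = json.dumps(title)

# With ensure_ascii=False the character is kept as-is in the JSON text...
readable = json.dumps(title, ensure_ascii=False)

# ...and encoding that text is what produces the actual UTF-8 bytes.
utf8_bytes = readable.encode("utf-8")

print(escaped)
print(readable)
```

To use a custom exporter like the one in the answer, it would normally be registered in the project's `settings.py` under `FEED_EXPORTERS`, mapping a format name such as `'json'` to the dotted path of the class (the path depends on your project layout). Scrapy 1.2 and later also offer a `FEED_EXPORT_ENCODING` setting; setting it to `'utf-8'` should give unescaped output from the built-in JSON exporter without subclassing.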
    
