Cleaning data scraped using Scrapy

Question

I have recently started using Scrapy and am trying to clean some scraped data before exporting it to CSV. There are three cases:

Example 1 – removing certain text
Example 2 – removing/replacing unwanted characters
Example 3 – splitting comma-separated text

Example 1 data looks like:

Text I want,Text I don’t want

Using the following code:

'Scraped 1': response.xpath('//div/div/div/h1/span/text()').extract()

Example 2 data looks like:

Â - but I want to change this to £

Using the following code:

' Scraped 2': response.xpath('//html/body/div/div/section/div/form/div/div/em/text()').extract()

Example 3 data looks like:

Item 1,Item 2,Item 3,Item 4,Item 4,Item5 – ultimately I want to split this into separate columns in a CSV file

Using the following code:

' Scraped 3': response.xpath('//div/div/div/ul/li/p/text()').extract()

I have tried using str.replace(), but can't get it to work, e.g.: 'Scraped 1': response.xpath('//div/div/div/h1/span/text()').extract((str.replace(",Text I don't want",""))

I am looking into this, but would appreciate it if anyone could point me in the right direction!

Code below:

import scrapy
from scrapy.loader import ItemLoader
from tutorial.items import Product


class QuotesSpider(scrapy.Spider):
    name = "quotes_product"
    start_urls = [
        'http://www.unitestudents.com/',
            ]

    # Step 1
    def parse(self, response):
        for city in response.xpath('//select[@id="frm_homeSelect_city"]/option[not(contains(text(),"Select your city"))]/text()').extract(): # Select all cities listed in the select (exclude the "Select your city" option)
            yield scrapy.Request(response.urljoin("/"+city), callback=self.parse_citypage)

    # Step 2
    def parse_citypage(self, response):
        for url in response.xpath('//div[@class="property-header"]/h3/span/a/@href').extract(): #Select for each property the url
            yield scrapy.Request(response.urljoin(url), callback=self.parse_unitpage)


    # Step 3
    def parse_unitpage(self, response):
        for final in response.xpath('//div/div/div[@class="content__btn"]/a/@href').extract(): #Select final page for data scrape
            yield scrapy.Request(response.urljoin(final), callback=self.parse_final)

    #Step 4 
    def parse_final(self, response):
        unitTypes = response.xpath('//html/body/div').extract()
        for unitType in unitTypes: # There can be multiple unit types so we yield an item for each unit type we can find.
            l = ItemLoader(item=Product(), response=response)
            l.add_xpath('area_name', '//div/ul/li/a/span/text()')
            l.add_xpath('type', '//div/div/div/h1/span/text()')
            l.add_xpath('period', '/html/body/div/div/section/div/form/h4/span/text()')
            l.add_xpath('duration_weekly', '//html/body/div/div/section/div/form/div/div/em/text()')
            l.add_xpath('guide_total', '//html/body/div/div/section/div/form/div/div/p/text()')
            l.add_xpath('amenities','//div/div/div/ul/li/p/text()')
            return l.load_item()

However, I'm getting the following error:

value = self.item.fields[field_name].get(key, default)
KeyError: 'type'

Answer 1:

You have the right idea with str.replace, although I would suggest the Python 're' regular expressions library as it is more powerful. The documentation is top notch and you can find some useful code samples there.
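As a rough sketch of the regex route for Example 1, assuming the unwanted part is always everything from the first comma onward (the sample string and variable names are made up for illustration):

```python
import re

# Hypothetical value standing in for one scraped string (Example 1)
raw = "Text I want,Text I don't want"

# Remove the first comma and everything after it on the line
cleaned = re.sub(r",.*", "", raw)

print(cleaned)
```

A pattern like this is only useful if the wanted text never contains a comma itself; otherwise a more specific pattern would be needed.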

I am not familiar with the scrapy library, but it looks like .extract() returns a list of strings. If you want to transform these using str.replace or one of the regex functions, you will need to use a list comprehension:

'Selector 1': [ x.replace('A', 'B') for x in response.xpath('...').extract() ]
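Applied to Examples 1 and 2, that could look something like the following. Plain lists stand in for what .extract() returns, and the sample values and variable names are assumptions, not real output from the site:

```python
# Hypothetical stand-ins for the lists returned by .extract()
scraped_1 = ["Text I want,Text I don't want"]
scraped_2 = ["Â150 per week"]

# Example 1: drop the unwanted suffix from every string in the list
cleaned_1 = [s.replace(",Text I don't want", "") for s in scraped_1]

# Example 2: swap the stray character for a pound sign
cleaned_2 = [s.replace("Â", "£") for s in scraped_2]

print(cleaned_1)
print(cleaned_2)
```

In the spider, the same comprehensions would wrap the response.xpath(...).extract() calls instead of the hard-coded lists.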

Edit: Regarding the separate columns: if the data is already comma-separated, just write it directly to a file! If you want to split the comma-separated data to do some transformations, you can use str.split like this:

"A,B,C".split(",") # returns [ "A", "B", "C" ]

In this case, the data returned from .extract() will be a list of comma-separated strings. If you use a list comprehension as above, you will end up with a list-of-lists.
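A small sketch of that list-of-lists result, again using a made-up list in place of the real .extract() output:

```python
# Hypothetical stand-in for the list returned by .extract() (Example 3)
scraped_3 = ["Item 1,Item 2,Item 3", "Item 4,Item 5"]

# Splitting each string yields one inner list per extracted string
split_rows = [s.split(",") for s in scraped_3]

# Flatten the nested lists if a single list of items is wanted instead
flat_items = [item for row in split_rows for item in row]

print(split_rows)
print(flat_items)
```

Whether to keep the nested shape or flatten it depends on whether each extracted string should become its own CSV row.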

If you want something more sophisticated than splitting on each comma, you can use Python's csv module.
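For instance, the csv module copes with quoted fields that contain commas, which a plain split would break apart. A minimal sketch, writing to an in-memory buffer rather than a real file (the sample line and column names are hypothetical):

```python
import csv
import io

# Hypothetical scraped line in which one field contains a comma
line = 'Item 1,"Item 2, large",Item 3'

# csv.reader respects the quotes, so the line parses into three fields
row = next(csv.reader([line]))

# Write a header and the parsed row; a file opened with newline="" works the same way
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["amenity_1", "amenity_2", "amenity_3"])
writer.writerow(row)
output = buffer.getvalue()

print(row)
print(output)
```

Each element of the row ends up in its own column, and the writer re-quotes the field that contains a comma.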

Answer 2:

It would be much easier to give a more specific answer if you had provided your spider and item definitions. Here are some generic guidelines.

If you want to keep things modular and follow Scrapy's suggested project architecture and separation of concerns, you should clean and prepare your data for export via Item Loaders with input and output processors.

For the first two examples, MapCompose looks like a good fit.
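MapCompose applies plain functions to each extracted value in turn, so the cleaning steps can be written and checked as ordinary Python first. The function names, the field name, and the sample values below are hypothetical, and the Scrapy wiring is shown only in comments:

```python
def keep_before_comma(value):
    """Example 1: keep only the text before the first comma."""
    return value.split(",", 1)[0].strip()


def fix_pound_sign(value):
    """Example 2: replace the stray character with a pound sign."""
    return value.replace("Â", "£")


# In a Scrapy project these would typically be attached to item fields, e.g.
#   from itemloaders.processors import MapCompose
#   (scrapy.loader.processors in older Scrapy releases)
#   unit_type = scrapy.Field(input_processor=MapCompose(keep_before_comma))
#   duration_weekly = scrapy.Field(input_processor=MapCompose(fix_pound_sign))
# The lines below only exercise the functions on hypothetical values.
print(keep_before_comma("Text I want,Text I don't want"))
print(fix_pound_sign("Â99"))
```

With processors declared on the fields, the spider's add_xpath calls stay unchanged and the cleaning logic lives in one place.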

Source: https://stackoverflow.com/questions/43539407/cleaning-data-scraped-using-scrapy

Tags

python

web-scraping

scrapy

data-cleaning