Cleaning data scraped using Scrapy

Posted by 跟風遠走 on 2019-12-04 20:23:12

You have the right idea with str.replace, although I would suggest Python's re regular-expression library, as it is more powerful. Its documentation is excellent and includes useful code samples.

I am not familiar with the scrapy library, but it looks like .extract() returns a list of strings. To transform each of those strings with str.replace or one of the regex functions, use a list comprehension:

'Selector 1': [ x.replace('A', 'B') for x in response.xpath('...').extract() ]
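The same comprehension works with re.sub if you need a pattern rather than a fixed substring. A minimal sketch, where the raw list stands in for whatever response.xpath('...').extract() would return:

```python
import re

# Hypothetical scraped values standing in for response.xpath('...').extract()
raw = ["Price: $1,234", "Price: $56"]

# Keep only the digits in each string, analogous to x.replace(...) above
cleaned = [re.sub(r"[^\d]", "", x) for x in raw]
print(cleaned)  # ['1234', '56']
```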

Edit: regarding the separate columns: if the data is already comma-separated, just write it directly to a file! If you want to split the comma-separated data to do some transformations, you can use str.split like this:

"A,B,C".split(",") # returns [ "A", "B", "C" ]

In this case, the data returned from .extract() will be a list of comma-separated strings. If you use a list comprehension as above, you will end up with a list-of-lists.
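Concretely, splitting every extracted string inside a comprehension produces that list-of-lists shape. A sketch with made-up rows in place of the real .extract() output:

```python
# Stand-ins for the comma-separated strings returned by .extract()
rows = ["a,b,c", "d,e,f"]

# One inner list per extracted row
table = [row.split(",") for row in rows]
print(table)  # [['a', 'b', 'c'], ['d', 'e', 'f']]
```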

If you want something more sophisticated than splitting on each comma, you can use python's csv library.
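For example, a quoted field that itself contains a comma would be broken apart by a plain str.split, but csv handles it correctly. A minimal sketch using an in-memory buffer:

```python
import csv
import io

# A quoted field containing a comma, which str.split(",") would mangle
line = '"Smith, John",42,NY'

row = next(csv.reader(io.StringIO(line)))
print(row)  # ['Smith, John', '42', 'NY']
```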

It would be much easier to give a specific answer if you had provided your spider and item definitions. Here are some generic guidelines.

If you want to keep things modular and follow Scrapy's suggested project architecture and separation of concerns, you should clean and prepare your data for export via Item Loaders with input and output processors.

For the first two examples, MapCompose looks like a good fit.
