Crawl and Concatenate in Scrapy

Posted by 和自甴很熟 on 2021-02-07 20:24:06

Question


I'm trying to crawl a movie list with Scrapy (I take only the director and movie title fields). Sometimes there are two directors, and Scrapy scrapes them as separate entries. So the first director appears alongside the movie title, but the second one has no movie title.

So I created a condition like this:

if director2:
    item['director'] = map(unicode.strip,titres.xpath("tbody/tr/td/div/div[2]/div[3]/div[2]/div/h2/div/a/text()").extract())

The last div[2] exists only if there are two directors.

I wanted to concatenate them like this: director1, director2

Here is my full code:

class movies(scrapy.Spider):
    name = "movielist"
    allowed_domains = ["domain.com"]
    start_urls = ["http://www.domain.com/list"]

    def parse(self, response):
        for titles in response.xpath('//*[contains(concat(" ", normalize-space(@class), " "), " grid")]'):
            item = MovieItem()
            director2 = Selector(text=html_content).xpath("h2/div[2]/a/text()")
            if director2:
                item['director'] = map(unicode.strip,titres.xpath,string-join("h2//concat(div[1]/a/text(), ".", div[2]/a/text())").extract())
            else:
                item['director'] = map(unicode.strip,titres.xpath("h2/div/a/text()").extract())
                item['director'] = map(unicode.strip,titres.xpath,string-join("h2//concat(div[1]/a/text(), ".", div[2]/a/text())").extract())
                item['title'] = map(unicode.strip,titres.xpath("h2/a/text()").extract())
            yield item

Sample HTML with one director:

<h2>
    <a href="#">Movie's title</a>
    <div>Info</div>
    <div><a href="#">Director's name</a></div>
</h2>

Sometimes, when there are two directors:

<h2>
    <a href="#">Movie's title</a>
    <div>Info</div>
    <div><a href="#">Director's name</a></div>
    <div><a href="#">Second director's name</a></div>
</h2>

Can you tell me what's wrong with my syntax?

I tested without the condition and without the concatenation, and it works very well.

This is my first time with Python so please be indulgent.

Thank you very much.


Answer 1:


Get all the directors (1, 2 or more) and join them with join():

item['director'] = ", ".join(titles.xpath("h2/div/a/text()").extract())
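
The one-liner above can be tried outside a spider. The following is a minimal sketch that uses the standard library's xml.etree.ElementTree in place of Scrapy's selectors (an assumption made so that it runs without Scrapy installed). The markup is the two-director sample from the question, and the helper name join_directors is hypothetical.

```python
import xml.etree.ElementTree as ET

# Two-director sample from the question (well-formed, so ElementTree can parse it).
SAMPLE = """
<h2>
    <a href="#">Movie's title</a>
    <div>Info</div>
    <div><a href="#">Director's name</a></div>
    <div><a href="#">Second director's name</a></div>
</h2>
"""


def join_directors(h2_markup, separator=", "):
    """Return every director name under <h2>, stripped and joined into one string."""
    h2 = ET.fromstring(h2_markup.strip())
    # "div/a" matches the <a> inside each child <div>; the "Info" div has no <a>,
    # so it contributes nothing, and one, two or more directors are handled alike.
    names = [(a.text or "").strip() for a in h2.findall("div/a")]
    return separator.join(name for name in names if name)


if __name__ == "__main__":
    print(join_directors(SAMPLE))
```

Because the path selects every matching link, no separate branch for the second director is needed; with a single director the join simply returns that one name.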

A better, Scrapy-specific approach would be to use an ItemLoader with the Join() processor. Define an ItemLoader:

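# Import paths for recent Scrapy releases on Python 3; older releases exposed
# these under scrapy.contrib.loader and used unicode.strip on Python 2.
from scrapy.loader import ItemLoader
from itemloaders.processors import TakeFirst, MapCompose, Join

class MovieLoader(ItemLoader):

    # Fields without their own output processor keep only the first value.
    default_output_processor = TakeFirst()

    # Strip whitespace from every extracted director name...
    director_in = MapCompose(str.strip)
    # ...then join them; Join() alone would separate the names with a space.
    director_out = Join(", ")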

And let it worry about stripping and joining:

loader = MovieLoader(MovieItem(), titles)
...
loader.add_xpath("director", "h2/div/a/text()")
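
To make explicit what the loader does to the director field, here is a rough plain-Python approximation of the two processors. It is a sketch under stated assumptions, not Scrapy's implementation: map_compose and join are hypothetical stand-ins for MapCompose and Join, and the raw list is made-up extraction output with stray whitespace. Join with no argument separates values with a single space, so the sketch passes ", " to get the "director1, director2" format asked for in the question.

```python
def map_compose(*functions):
    """Hypothetical stand-in for MapCompose: apply each function to every value in turn."""
    def process(values):
        for function in functions:
            # Values for which a function returns None are dropped.
            values = [r for r in (function(v) for v in values) if r is not None]
        return values
    return process


def join(separator=" "):
    """Hypothetical stand-in for Join: concatenate the collected values."""
    def process(values):
        return separator.join(values)
    return process


director_in = map_compose(str.strip)  # input processor: cleans each extracted value
director_out = join(", ")             # output processor: builds the final field value

# Made-up raw values, as an XPath text() query might return them.
raw = ["  Director's name ", "\nSecond director's name\n"]
director = director_out(director_in(raw))

if __name__ == "__main__":
    print(director)
```

Keeping the stripping in the input processor and the joining in the output processor means the spider only has to call add_xpath once per field, however many directors a movie has.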


Source: https://stackoverflow.com/questions/29434591/crawl-and-concatenate-in-scrapy
