Crawl and Concatenate in Scrapy

Question

I'm trying to crawl a movie list with Scrapy (I only take the director and movie title fields). Sometimes there are two directors, and Scrapy scrapes them as separate entries. So the first director comes along with the movie title, but the second has no movie title.

So I created a condition like this:

if director2:
            item['director'] = map(unicode.strip,titres.xpath("tbody/tr/td/div/div[2]/div[3]/div[2]/div/h2/div/a/text()").extract())

The last div[2] exists only if there are two directors.

And I wanted to concatenate them like this: director1, director2

Here is my full code:

class movies(scrapy.Spider):
    name = "movielist"
    allowed_domains = ["domain.com"]
    start_urls = ["http://www.domain.com/list"]

    def parse(self, response):
        for titles in response.xpath('//*[contains(concat(" ", normalize-space(@class), " "), " grid")]'):
            item = MovieItem()
            director2 = Selector(text=html_content).xpath("h2/div[2]/a/text()")
            if director2:
                item['director'] = map(unicode.strip,titres.xpath,string-join("h2//concat(div[1]/a/text(), ".", div[2]/a/text())").extract())
            else:
                item['director'] = map(unicode.strip,titres.xpath("h2/div/a/text()").extract())
                item['director'] = map(unicode.strip,titres.xpath,string-join("h2//concat(div[1]/a/text(), ".", div[2]/a/text())").extract())
                item['title'] = map(unicode.strip,titres.xpath("h2/a/text()").extract())
            yield item

Sample HTML with one director:

<h2>
    <a href="#">Movie's title</a>
    <div>Info</div>
    <div><a href="#">Director's name</a></div>
</h2>

Sometimes, when there are two directors:

<h2>
    <a href="#">Movie's title</a>
    <div>Info</div>
    <div><a href="#">Director's name</a></div>
    <div><a href="#">Second director's name</a></div>
</h2>

Can you tell me what's wrong with my syntax?

I tested it without the condition and without the concatenation, and it works very well.

This is my first time using Python, so please bear with me.

Thank you very much.

Answer 1:

Get all the directors (1, 2 or more) and join them with join():

item['director'] = ", ".join(titles.xpath("h2/div/a/text()").extract())
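As a rough illustration of why a single relative path plus join() covers both cases, here is a minimal sketch that runs against the sample markup from the question. It is an assumption-laden stand-in: it uses the standard library's xml.etree.ElementTree instead of Scrapy selectors so it can run without Scrapy installed, and the helper names directors() and title() are hypothetical.

```python
import xml.etree.ElementTree as ET

# Sample markup from the question (two-director case).
SAMPLE = """
<h2>
    <a href="#">Movie's title</a>
    <div>Info</div>
    <div><a href="#">Director's name</a></div>
    <div><a href="#">Second director's name</a></div>
</h2>
"""


def directors(h2_markup):
    """Return every director name under the <h2>, joined with ', '."""
    h2 = ET.fromstring(h2_markup)
    # "div/a" matches the link inside each <div>, however many there are.
    # The "Info" <div> has no <a> child, so it contributes nothing.
    names = [a.text.strip() for a in h2.findall("div/a") if a.text]
    return ", ".join(names)


def title(h2_markup):
    """Return the text of the <a> that is a direct child of the <h2>."""
    h2 = ET.fromstring(h2_markup)
    link = h2.find("a")
    return link.text.strip() if link is not None and link.text else None


print(directors(SAMPLE))  # Director's name, Second director's name
```

With one director the same call returns just that name, so no if/else on a second director is needed.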

A better Scrapy-specific approach though would be to use an ItemLoader and Join() processor. Define an ItemLoader:

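# Current Scrapy releases expose these as scrapy.loader and itemloaders.processors;
# the scrapy.contrib.loader paths used by older releases have been removed.
from scrapy.loader import ItemLoader
from itemloaders.processors import TakeFirst, MapCompose, Join

class MovieLoader(ItemLoader):

    # Fields without their own output processor keep the first extracted value.
    default_output_processor = TakeFirst()

    # Strip each director name (use unicode.strip on Python 2).
    director_in = MapCompose(str.strip)
    # Join() separates with a space by default; pass ", " to get "director1, director2".
    director_out = Join(", ")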

And let it worry about stripping and joining:

loader = MovieLoader(MovieItem(), titles)
...
loader.add_xpath("director", "h2/div/a/text()")
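To make the per-field processing more concrete, the following is a plain-Python approximation of what such a loader does with the extracted values. It is only a sketch and not Scrapy's actual implementation: the functions strip_each(), join_values(), take_first() and load_movie() are hypothetical stand-ins for MapCompose, Join, TakeFirst and the loader itself, and it assumes ", " is the desired separator.

```python
def strip_each(values):
    """Stand-in for MapCompose(str.strip): strip every extracted value."""
    return [value.strip() for value in values]


def join_values(values, separator=", "):
    """Stand-in for Join(separator): collapse the list into one string."""
    return separator.join(values)


def take_first(values):
    """Stand-in for TakeFirst(): first value that is neither None nor empty."""
    for value in values:
        if value is not None and value != "":
            return value
    return None  # nothing usable was extracted


def load_movie(extracted):
    """Apply the per-field processors to lists of extracted strings."""
    return {
        # title has no processors of its own, so only the default output applies
        "title": take_first(extracted["title"]),
        # director: input processor strips, output processor joins
        "director": join_values(strip_each(extracted["director"])),
    }


item = load_movie({
    "title": ["Movie's title"],
    "director": [" Director's name", "Second director's name  "],
})
print(item["director"])  # Director's name, Second director's name
```

The point of the design is that the spider only names the XPath for each field, while stripping and joining live in one place on the loader class.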

Source: https://stackoverflow.com/questions/29434591/crawl-and-concatenate-in-scrapy

Tags

python

xpath

web-crawler

scrapy