Scrapy Tutorial Example

Asked by 深忆病人 on 2021-01-24 07:33

Looking to see if someone can point me in the right direction regarding using Scrapy in Python.

I've been trying to follow the tutorial example for several days and still can't get it to work.

2 Answers
  • 2021-01-24 07:48

    It seems the spider in the tutorial is outdated. The website has changed a bit, so all of the XPaths now capture nothing. This is easily fixable:

    def parse(self, response):
        # every result link now sits under div.title-and-desc
        sites = response.xpath('//div[@class="title-and-desc"]/a')
        for site in sites:
            item = dict()
            item['name'] = site.xpath("text()").extract_first()
            item['url'] = site.xpath("@href").extract_first()
            # the description is in a sibling div; the '' default avoids calling .strip() on None
            item['description'] = site.xpath("following-sibling::div/text()").extract_first('').strip()
            yield item
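
    As a side note, newer Scrapy releases prefer the shorter .get() / .getall() spellings over .extract_first() / .extract(); both still work, so for example:

    item['name'] = site.xpath("text()").get()  # same result as extract_first()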
    

    For future reference, you can always test whether a specific XPath works with the scrapy shell command.
    For example, here is what I did to test this one:

    $ scrapy shell "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/"
    # test sites xpath
    response.xpath('//ul[@class="directory-url"]/li') 
    []
    # ok it doesn't work, check out page in web browser
    view(response)
    # find correct xpath and test that:
    response.xpath('//div[@class="title-and-desc"]/a')
    # 21 result nodes printed
    # it works!
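
    The same approach works for the individual fields: grab one matching node in the shell and test the relative XPaths against it before putting them into the spider. Continuing the session above:

    site = response.xpath('//div[@class="title-and-desc"]/a')[0]
    site.xpath("text()").extract_first()   # should print the first book title
    site.xpath("@href").extract_first()    # should print its link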
    
  • 2021-01-24 08:00

    Here is the corrected Scrapy code to extract details from DMOZ:

    import scrapy

    class MozSpider(scrapy.Spider):
        name = "moz"
        allowed_domains = ["www.dmoz.org"]
        start_urls = ['http://www.dmoz.org/Computers/Programming/Languages/Python/Books/',
                      'http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/']

        def parse(self, response):
            sites = response.xpath('//div[@class="title-and-desc"]')
            for site in sites:
                name = site.xpath('a/div[@class="site-title"]/text()').extract_first()
                url = site.xpath('a/@href').extract_first()
                # the '' default prevents .strip() from being called on None
                description = site.xpath('div[@class="site-descr "]/text()').extract_first('').strip()

                yield {'Name': name, 'URL': url, 'Description': description}
    


    To export the results to CSV, open the spider folder in your Terminal/CMD and run:

    scrapy crawl moz -o result.csv
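
    Note that -o appends to an existing output file, so delete result.csv between runs if you want a fresh export (recent Scrapy versions also support -O, which overwrites instead). The output format follows the file extension, so the same command can produce JSON:

    scrapy crawl moz -o result.json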
    



    Here is another basic Scrapy example, this time extracting company details from YellowPages:

    import scrapy

    class YlpSpider(scrapy.Spider):
        name = "ylp"
        allowed_domains = ["www.yellowpages.com"]
        start_urls = ['http://www.yellowpages.com/search?search_terms=Translation&geo_location_terms=Virginia+Beach%2C+VA']

        def parse(self, response):
            companies = response.xpath('//*[@class="info"]')

            for company in companies:
                name = company.xpath('h3/a/span[@itemprop="name"]/text()').extract_first()
                phone = company.xpath('div/div[@class="phones phone primary"]/text()').extract_first()
                website = company.xpath('div/div[@class="links"]/a/@href').extract_first()

                yield {'Name': name, 'Phone': phone, 'Website': website}
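
    Note that this spider only scrapes the first page of results. To crawl further pages, you could follow the "next" link at the end of parse. A sketch (the next-page XPath is an assumption here and should be verified in scrapy shell first):

    # at the end of parse(), after the for loop
    # hypothetical selector for the next-page link; confirm the class name on the live page
    next_page = response.xpath('//a[contains(@class, "next")]/@href').extract_first()
    if next_page:
        yield response.follow(next_page, callback=self.parse)

    response.follow resolves relative URLs for you (it is available from Scrapy 1.4 onward).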
    


    To export the results to CSV, open the spider folder in your Terminal/CMD and run:

    scrapy crawl ylp -o result.csv
    



    This Scrapy code extracts company details from Yelp:

    import scrapy

    class YelpSpider(scrapy.Spider):
        name = "yelp"
        allowed_domains = ["www.yelp.com"]
        start_urls = ['https://www.yelp.com/search?find_desc=Java+Developer&find_loc=Denver,+CO']

        def parse(self, response):
            companies = response.xpath('//*[@class="biz-listing-large"]')

            for company in companies:
                name = company.xpath('.//span[@class="indexed-biz-name"]/a/span/text()').extract_first()
                # extract_first('') returns '' instead of None when nothing matches,
                # so the .strip() calls below never fail
                address1 = company.xpath('.//address/text()').extract_first('').strip()
                address2 = company.xpath('.//address/text()[2]').extract_first('').strip()
                address = address1 + " - " + address2
                phone = company.xpath('.//*[@class="biz-phone"]/text()').extract_first('').strip()
                website = "https://www.yelp.com" + company.xpath('.//@href').extract_first('')

                yield {'Name': name, 'Address': address, 'Phone': phone, 'Website': website}
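
    If one of the address lines is missing, the hard-coded " - " separator leaves a dangling dash. A small variation that joins only the non-empty parts (a sketch, assuming the same address markup):

    address_parts = [p.strip() for p in company.xpath('.//address/text()').extract()]
    address = " - ".join(p for p in address_parts if p)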
    


    To export the results to CSV, open the spider folder in your Terminal/CMD and run:

    scrapy crawl yelp -o result.csv
    



    • This is a comprehensive online course on Scrapy:

    https://www.udemy.com/scrapy-tutorial-web-scraping-with-python/?couponCode=STACK39243009-SCRAPY


    All the best!
