Extract data from a gsmarena page using scrapy

Submitted by 拥有回忆 on 2019-12-11 03:18:38

Question


I'm trying to download data from a gsmarena page: "http://www.gsmarena.com/htc_one_me-7275.php".

However, the data is organized into tables and table rows, in this format:

table header > td[@class='ttl'] > td[@class='nfo']

Edited code: thanks to the help of community members on Stack Exchange, I've restructured the code as follows.

items.py file:

import scrapy

class gsmArenaDataItem(scrapy.Item):
    phoneName = scrapy.Field()     # section name taken from the table's <th>
    phoneDetails = scrapy.Field()  # "label: value" pairs for that section

Spider file:

from scrapy import Spider
from gsmarena_data.items import gsmArenaDataItem

class testSpider(Spider):
    name = "mobile_test"
    allowed_domains = ["gsmarena.com"]
    start_urls = ('http://www.gsmarena.com/htc_one_me-7275.php',)

    def parse(self, response):
        # Each table inside div#specs-list is one spec section (Network, Battery, ...)
        for table in response.css("div#specs-list table"):
            phone = gsmArenaDataItem()  # fresh item per section
            phone['phoneName'] = table.xpath(".//th/text()").extract_first()
            details = []
            for ttl in table.xpath(".//td[@class='ttl']"):
                ttl_value = " ".join(ttl.xpath(".//text()").extract())
                nfo_value = " ".join(ttl.xpath("following-sibling::td[@class='nfo']//text()").extract())
                details.append(ttl_value + ": " + nfo_value)
            # Join all rows once; assigning inside the inner loop kept only the last row
            phone['phoneDetails'] = ", ".join(details)
            yield phone  # one item per section instead of a single item in total

However, I get banned as soon as I try to load the page in the scrapy shell using:

"http://www.gsmarena.com/htc_one_me-7275.php"

I've even tried setting DOWNLOAD_DELAY = 3 in settings.py.

Please suggest how I should go about this.


Answer 1:


The idea is to iterate over all table elements inside the div with id "specs-list", get the th element for the block name, then get every td element with class="ttl" together with its following td sibling with class="nfo".

Demo from the shell:

In [1]: for scope in response.css("div#specs-list table"):
   ...:     scope_name = scope.xpath(".//th/text()").extract()[0]
   ...:
   ...:     for ttl in scope.xpath(".//td[@class='ttl']"):
   ...:         ttl_value = " ".join(ttl.xpath(".//text()").extract())
   ...:         nfo_value = " ".join(ttl.xpath("following-sibling::td[@class='nfo']//text()").extract())
   ...:
   ...:         print(scope_name, ttl_value, nfo_value)
   ...:
Network Technology GSM / HSPA / LTE
Network 2G bands GSM 850 / 900 / 1800 / 1900 - SIM 1 & SIM 2
...
Battery Stand-by Up to 598 h (2G) / Up to 626 h (3G)
Battery Talk time Up to 23 h (2G) / Up to 13 h (3G)
Misc Colors Meteor Grey, Rose Gold, Gold Sepia
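
To try this grouping without sending requests to the site, the same structure can be exercised on an inline sample. The following is a minimal sketch, not the answer's method: it swaps Scrapy selectors for the standard library's xml.etree.ElementTree, whose limited XPath has no following-sibling axis, so it pairs the ttl and nfo cells found in the same tr instead. SAMPLE and parse_specs are hypothetical names, and the sample markup is an assumption about the page layout rather than a copy of it.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed sample mimicking the layout described above;
# the real page markup may differ.
SAMPLE = """
<div id="specs-list">
  <table>
    <tr><th rowspan="2">Network</th>
        <td class="ttl">Technology</td><td class="nfo">GSM / HSPA / LTE</td></tr>
    <tr><td class="ttl">2G bands</td><td class="nfo">GSM 850 / 900 / 1800 / 1900</td></tr>
  </table>
  <table>
    <tr><th>Misc</th>
        <td class="ttl">Colors</td><td class="nfo">Meteor Grey, Rose Gold, Gold Sepia</td></tr>
  </table>
</div>
"""

def parse_specs(markup):
    """Return {section name: {label: value}} for every table in the markup."""
    root = ET.fromstring(markup)
    specs = {}
    for table in root.iter("table"):
        # The th cell carries the section name (e.g. "Network").
        section = "".join(table.find(".//th").itertext()).strip()
        rows = specs.setdefault(section, {})
        for tr in table.iter("tr"):
            ttl = tr.find("td[@class='ttl']")
            nfo = tr.find("td[@class='nfo']")
            if ttl is None or nfo is None:
                continue  # skip rows that lack either cell
            # Collapse internal whitespace in both cells.
            label = " ".join("".join(ttl.itertext()).split())
            value = " ".join("".join(nfo.itertext()).split())
            rows[label] = value
    return specs

specs = parse_specs(SAMPLE)
for section, rows in specs.items():
    for label, value in rows.items():
        print(section, label, value)
```

Collecting the rows into one dict per section keeps every label/value pair, rather than only the last one processed.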



Answer 2:


I also faced the same problem of getting banned within a few requests. Rotating proxies with scrapy-proxies and enabling AutoThrottle helped significantly, but did not solve the problem completely.

You can find my code at gsmarenacrawler
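
The throttling side of this might look roughly like the settings.py fragment below. The setting names are standard Scrapy settings; the values are illustrative assumptions, not figures from this answer, and the scrapy-proxies configuration is not shown. As noted above, throttling reduced the bans but did not eliminate them.

```python
# settings.py (fragment): values are illustrative assumptions, not tuned for gsmarena.com

# Fixed base delay between requests to the same site, in seconds (value from the question).
DOWNLOAD_DELAY = 3
# Vary each wait between 0.5x and 1.5x of DOWNLOAD_DELAY (Scrapy's default is True).
RANDOMIZE_DOWNLOAD_DELAY = True
# Keep at most one request in flight per domain.
CONCURRENT_REQUESTS_PER_DOMAIN = 1

# AutoThrottle adapts the delay to the observed response latency,
# never going below DOWNLOAD_DELAY.
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 5           # initial delay, seconds
AUTOTHROTTLE_MAX_DELAY = 60            # upper bound under high latency, seconds
AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0  # average parallel requests per remote server
```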



Source: https://stackoverflow.com/questions/30673602/extract-data-from-a-gsmarena-page-using-scrapy
