Parsing the URLs in a sitemap with a different URL format using the sitemap spider in Scrapy (Python)

Submitted by 三世轮回 on 2019-12-07 20:35:16

Question


I am using the sitemap spider in Scrapy (Python). The sitemap seems to have an unusual format, with '//' in front of the URLs:

<url>
    <loc>//www.example.com/10/20-baby-names</loc>
</url>
<url>
    <loc>//www.example.com/elizabeth/christmas</loc>
</url>

myspider.py

from scrapy.contrib.spiders import SitemapSpider
from myspider.items import *

class MySpider(SitemapSpider):
    name = "myspider"
    sitemap_urls = ["http://www.example.com/robots.txt"]

    def parse(self, response):
        item = PostItem()           
        item['url'] = response.url
        item['title'] = response.xpath('//title/text()').extract()

        return item

I am getting this error:

raise ValueError('Missing scheme in request url: %s' % self._url)
    exceptions.ValueError: Missing scheme in request url: //www.example.com/10/20-baby-names

How can I manually parse URLs like these with the sitemap spider?
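(For context: `//host/path` is a scheme-relative URL. Standard URL resolution fills in the scheme from a base URL, which is essentially what has to happen here. A minimal illustration with the standard library, shown in Python 3 syntax even though the spider code below is Python 2:)

```python
from urllib.parse import urljoin  # the 'urlparse' module in Python 2

# resolving a scheme-relative <loc> against the page it came from
base = 'http://www.example.com/robots.txt'
loc = '//www.example.com/10/20-baby-names'
print(urljoin(base, loc))  # http://www.example.com/10/20-baby-names
```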


Answer 1:


I think the nicest and cleanest solution would be to add a downloader middleware that rewrites the malformed URLs without the spider noticing.

import re
import urlparse
from scrapy.http import XmlResponse
from scrapy.utils.gz import gunzip, is_gzipped
from scrapy.contrib.spiders import SitemapSpider

# downloader middleware
class SitemapWithoutSchemeMiddleware(object):
    def process_response(self, request, response, spider):
        if isinstance(spider, SitemapSpider):
            body = self._get_sitemap_body(response)

            if body:
                scheme = urlparse.urlsplit(response.url).scheme
                body = re.sub(r'<loc>\/\/(.+)<\/loc>', r'<loc>%s://\1</loc>' % scheme, body)    
                return response.replace(body=body)

        return response

    # this is from scrapy's Sitemap class, but Sitemap is
    # only for internal use and its API can change without
    # notice
    def _get_sitemap_body(self, response):
        """Return the sitemap body contained in the given response, or None if the
        response is not a sitemap.
        """
        if isinstance(response, XmlResponse):
            return response.body
        elif is_gzipped(response):
            return gunzip(response.body)
        elif response.url.endswith('.xml'):
            return response.body
        elif response.url.endswith('.xml.gz'):
            return gunzip(response.body)
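The regex rewrite at the core of this middleware can be checked in isolation (a standalone sketch; in the middleware the scheme comes from `urlparse.urlsplit(response.url).scheme`):

```python
import re

# a fragment of the offending sitemap body
body = '<loc>//www.example.com/10/20-baby-names</loc>'
scheme = 'http'  # stand-in for urlsplit(response.url).scheme
fixed = re.sub(r'<loc>\/\/(.+)<\/loc>', r'<loc>%s://\1</loc>' % scheme, body)
print(fixed)  # <loc>http://www.example.com/10/20-baby-names</loc>
```

Remember that a downloader middleware only runs if it is enabled in your project's `DOWNLOADER_MIDDLEWARES` setting; otherwise Scrapy never calls it.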



Answer 2:


If I see it correctly, you could (as a quick solution) override the default implementation of `_parse_sitemap` in `SitemapSpider`. It isn't pretty, because you have to copy a fair amount of code, but it should work. You also need to add a helper that generates a URL with a scheme.

"""if the URL starts with // take the current website scheme and make an absolute
URL with the same scheme"""
def _fix_url_bug(url, current_url):
    if url.startswith('//'):
           ':'.join((urlparse.urlsplit(current_url).scheme, url))
       else:
           yield url

def _parse_sitemap(self, response):
    if response.url.endswith('/robots.txt'):
        for url in sitemap_urls_from_robots(response.body):
            yield Request(url, callback=self._parse_sitemap)
    else:
        body = self._get_sitemap_body(response)
        if body is None:
            log.msg(format="Ignoring invalid sitemap: %(response)s",
                    level=log.WARNING, spider=self, response=response)
            return

        s = Sitemap(body)
        if s.type == 'sitemapindex':
            for loc in iterloc(s):
                # fix the URL before the follow-test, so the test can
                # also match on the scheme (not sure whether this is
                # the better place for it)
                loc = _fix_url_bug(loc, response.url)
                if any(x.search(loc) for x in self._follow):
                    yield Request(loc, callback=self._parse_sitemap)
        elif s.type == 'urlset':
            for loc in iterloc(s):
                loc = _fix_url_bug(loc, response.url) # same here
                for r, c in self._cbs:
                    if r.search(loc):
                        yield Request(loc, callback=c)
                        break

This is just a general idea and is untested, so it might not work at all or contain syntax errors. Please respond via the comments so I can improve my answer.
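The scheme-fixing helper can at least be verified in isolation from Scrapy (Python 3 `urllib.parse` is used here; the spider code above imports the Python 2 `urlparse` module):

```python
from urllib.parse import urlsplit  # urlparse.urlsplit in Python 2

def fix_url_bug(url, current_url):
    """If url is scheme-relative (//host/path), borrow the scheme of
    current_url; otherwise return url unchanged."""
    if url.startswith('//'):
        return ':'.join((urlsplit(current_url).scheme, url))
    return url

print(fix_url_bug('//www.example.com/elizabeth/christmas',
                  'https://www.example.com/sitemap.xml'))
# https://www.example.com/elizabeth/christmas
print(fix_url_bug('http://www.example.com/already-fine',
                  'https://www.example.com/sitemap.xml'))
# http://www.example.com/already-fine
```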

The sitemap you are trying to parse seems to be invalid. Per RFC 3986, a missing scheme is perfectly fine in a URL reference, but the sitemap protocol requires URLs to begin with a scheme.




Answer 3:


I used the trick by @alecxe to parse the URLs within the spider. I got it working, but I'm not sure whether it is the best way to do it.

from urlparse import urlparse
import re 
from scrapy.spider import BaseSpider
from scrapy.http import Request
from scrapy.utils.response import body_or_str
from example.items import *

class ExampleSpider(BaseSpider):
    name = "example"
    start_urls = ["http://www.example.com/sitemap.xml"]

    def parse(self, response):
        nodename = 'loc'
        text = body_or_str(response)
        r = re.compile(r"(<%s[\s>])(.*?)(</%s>)" % (nodename, nodename), re.DOTALL)
        for match in r.finditer(text):
            url = match.group(2)
            if url.startswith('//'):
                url = 'http:' + url
            yield Request(url, callback=self.parse_page)

    def parse_page(self, response):
        # print response.url
        item = PostItem()   

        item['url'] = response.url
        item['title'] = response.xpath('//title/text()').extract()
        return item
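The `<loc>` extraction in `parse()` can be exercised outside Scrapy on the sitemap fragment from the question (standalone sketch):

```python
import re

# the sitemap fragment from the question
text = """<url>
    <loc>//www.example.com/10/20-baby-names</loc>
</url>
<url>
    <loc>//www.example.com/elizabeth/christmas</loc>
</url>"""

nodename = 'loc'
r = re.compile(r"(<%s[\s>])(.*?)(</%s>)" % (nodename, nodename), re.DOTALL)
urls = []
for match in r.finditer(text):
    url = match.group(2)
    if url.startswith('//'):
        url = 'http:' + url  # same scheme-prefixing as in the spider
    urls.append(url)
print(urls)
# ['http://www.example.com/10/20-baby-names',
#  'http://www.example.com/elizabeth/christmas']
```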


Source: https://stackoverflow.com/questions/27286927/parsing-the-urls-in-sitemap-with-different-url-format-using-sitemap-spider-in-sc
