How to extract raw html from a Scrapy selector?

孤城傲影 2021-01-12 15:35

I'm extracting js data using response.xpath('//*').re_first() and later converting it to python native data. The problem is extract/re methods don't seem to provide a way

3 Answers
  • 2021-01-12 15:46

    Since parsel 1.2.0 (2017-05-17), you can pass replace_entities=False to both re() and re_first() to turn off the default entity replacement.

  • 2021-01-12 15:53

    Short answer:

    • Scrapy/Parsel selectors' .re() and .re_first() methods replace HTML entities (except &lt; and &amp;)
    • instead, use .extract() or .extract_first() to get the raw HTML (or raw JavaScript instructions) and use Python's re module on the extracted string

    Long answer:

    Let's look at an example input and various ways of extracting JavaScript data from HTML.

    Sample HTML:

    <html lang="en">
    <body>
    <div>
        <script type="text/javascript">
            var i = {a:['O&#39;Connor Park']}
        </script>
    </div>
    </body>
    </html>
    

    Using Scrapy's Selector, which uses the parsel library underneath, you have several ways of extracting the JavaScript snippet:

    >>> import scrapy
    >>> t = """<html lang="en">
    ... <body>
    ... <div>
    ...     <script type="text/javascript">
    ...         var i = {a:['O&#39;Connor Park']}
    ...     </script>
    ...     
    ... </div>
    ... </body>
    ... </html>
    ... """
    >>> selector = scrapy.Selector(text=t, type="html")
    >>> 
    >>> # extracting the <script> element as raw HTML
    >>> selector.xpath('//div/script').extract_first()
    u'<script type="text/javascript">\n        var i = {a:[\'O&#39;Connor Park\']}\n    </script>'
    >>> 
    >>> # only getting the text node inside the <script> element
    >>> selector.xpath('//div/script/text()').extract_first()
    u"\n        var i = {a:['O&#39;Connor Park']}\n    "
    >>> 
    

    Now, using .re() (or .re_first()), you get a different result:

    >>> # I'm using a very simple "catch-all" regex
    >>> # you are probably using a regex to extract
    >>> # that specific "O'Connor Park" string
    >>> selector.xpath('//div/script/text()').re_first('.+')
    u"        var i = {a:['O'Connor Park']}"
    >>> 
    >>> # .re() on the element itself, one needs to handle newlines
    >>> selector.xpath('//div/script').re_first('.+')
    u'<script type="text/javascript">'    # only first line extracted
    >>> import re
    >>> selector.xpath('//div/script').re_first(re.compile('.+', re.DOTALL))
    u'<script type="text/javascript">\n        var i = {a:[\'O\'Connor Park\']}\n    </script>'
    >>> 
    

    The HTML entity &#39; has been replaced by an apostrophe. This is due to a w3lib.html.replace_entities() call in the .re()/.re_first() implementation (see the extract_regex function in the parsel source code). That call is not made when simply calling extract() or extract_first().
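
    The approach from the short answer, extracting first and then applying Python's re module, could be sketched as follows. To keep the snippet runnable without Scrapy, the script_text string is hard-coded from the extract_first() output shown earlier; in a spider it would come from selector.xpath('//div/script/text()').extract_first(). The regex is an assumption made for this sample only.

```python
import re

# Stand-in for selector.xpath('//div/script/text()').extract_first(),
# copied from the REPL output above.
script_text = "\n        var i = {a:['O&#39;Connor Park']}\n    "

# Plain re does no entity replacement, so &#39; survives as-is.
match = re.search(r"var i = (\{.*\})", script_text)
raw_js = match.group(1) if match else None

print(raw_js)
```

    Because the match is made on the extracted string, any entity decoding can be applied later and only where it is wanted.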

  • 2021-01-12 15:53

    You can also utilise the same function that is used by the Selector class' extract method, but with different arguments:

    from lxml import etree
    # selector._root is the underlying lxml element (newer parsel versions expose it as selector.root)
    etree.tostring(selector._root)
    