How to get contents of HTML Script tag

Question

I'm trying to scrape geo data from a URL as scraping practice, but I'm having trouble handling the contents of a script tag.

The following is the content of the script tag:

<script type="application/ld+json">
    {
     "address": {
            "@type": "PostalAddress",
            "streetAddress": "5080 Riverside Drive",
            "addressLocality": "Macon",
            "addressRegion": "GA",
            "postalCode": "31210-1100",
            "addressCountry": "US"
        },
        "telephone": "478-471-0171",
        "geo": {
            "@type": "GeoCoordinates",
            "latitude": "32.9252435",
            "longitude": "-83.7145993"
        }
    }
    </script>

I want to add the data from the script tag (city, state, latitude, longitude and phone number) to my result.

The following is my (incomplete) code:

def parse(self, response):
    items = MyItem()
    tree = Selector(response)

    items['city'] = tree.xpath('//script/text()').extract()[0]
    items['state'] = tree.xpath('//script/text()').extract()[0]
    items['latitude'] = tree.xpath('//script/text()').extract()[0]
    items['longitude'] = tree.xpath('//script/text()').extract()[0]
    items['telephone'] = tree.xpath('//script/text()').extract()[0]
    print(items)
    yield items

Can I get any suggestions on how to achieve this?

Answer 1:

I don't understand what you're trying to do with the repeated XPath query //script/text(): it returns the same string for every field. XPath is useful for selecting HTML elements, but the content of the <script> tag in your question is JSON, not HTML, so XPath cannot query inside it.

As a first step, get the text content of the <script> tag (if the page has several <script> tags, a narrower query such as //script[@type="application/ld+json"]/text() selects only this one):

content = tree.xpath('//script/text()').extract()[0]

Then you can use the standard-library json module (import json) to load the JSON content into a Python dictionary:

d = json.loads(content)
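
Once the JSON is loaded, the fields asked about in the question are ordinary dictionary lookups. The following is a minimal sketch, assuming content holds the script text shown in the question (hard-coded here so the snippet runs without Scrapy) and that the result keys match the item keys in the question's parse method. The helper name extract_fields is hypothetical.

```python
import json

# Stand-in for the text returned by tree.xpath('//script/text()').extract()[0];
# hard-coded here so the example runs without Scrapy.
content = """
{
    "address": {
        "@type": "PostalAddress",
        "streetAddress": "5080 Riverside Drive",
        "addressLocality": "Macon",
        "addressRegion": "GA",
        "postalCode": "31210-1100",
        "addressCountry": "US"
    },
    "telephone": "478-471-0171",
    "geo": {
        "@type": "GeoCoordinates",
        "latitude": "32.9252435",
        "longitude": "-83.7145993"
    }
}
"""


def extract_fields(script_text):
    """Hypothetical helper: map the JSON text to the item keys from the question."""
    d = json.loads(script_text)
    # Use .get() so a missing section yields None instead of raising KeyError.
    address = d.get("address", {})
    geo = d.get("geo", {})
    return {
        "city": address.get("addressLocality"),
        "state": address.get("addressRegion"),
        "latitude": geo.get("latitude"),
        "longitude": geo.get("longitude"),
        "telephone": d.get("telephone"),
    }


fields = extract_fields(content)
print(fields)
```

Inside parse, each value can then be assigned to the item, for example items['city'] = fields['city']. The coordinates arrive as strings; wrap them in float() if numeric values are needed.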

Also note that this method only works when the content of the <script> tag is valid JSON. If it is malformed (for example, a closing brace is missing), json.loads raises an error.
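
To keep a spider from crashing on malformed script content, the json.loads call can be wrapped in a try/except. This is only a sketch: the helper name load_json_or_none and the choice to return None on failure are assumptions, not part of the original answer.

```python
import json


def load_json_or_none(script_text):
    """Hypothetical helper: return the parsed data, or None if the text is not valid JSON."""
    try:
        return json.loads(script_text)
    except json.JSONDecodeError as exc:
        # exc.lineno, exc.colno and exc.msg describe where and why parsing failed.
        print(f"Invalid JSON at line {exc.lineno}, column {exc.colno}: {exc.msg}")
        return None


valid = load_json_or_none('{"telephone": "478-471-0171"}')
broken = load_json_or_none('{"geo": {"latitude": "32.9252435"')  # closing braces missing
```

Returning None lets the caller skip the page or fall back to another extraction method instead of aborting the crawl.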

Source: https://stackoverflow.com/questions/49327937/how-to-get-contents-of-html-script-tag

Tags

python

pandas

scrapy

scrapy-spider