Question
I'm trying to scrape geo data from a URL as scraping practice, but I'm having trouble handling the contents of a script tag.
The script tag contains the following:
<script type="application/ld+json">
{
"address": {
"@type": "PostalAddress",
"streetAddress": "5080 Riverside Drive",
"addressLocality": "Macon",
"addressRegion": "GA",
"postalCode": "31210-1100",
"addressCountry": "US"
},
"telephone": "478-471-0171",
"geo": {
"@type": "GeoCoordinates",
"latitude": "32.9252435",
"longitude": "-83.7145993"
}
}
</script>
I want to add the contents of the script tag (city, state, latitude, longitude and phone number) to my result.
Here is my (incomplete) code:
def parse(self, response):
    items = MyItem()
    tree = Selector(response)
    items['city'] = tree.xpath('//script/text()').extract()[0]
    items['state'] = tree.xpath('//script/text()').extract()[0]
    items['latitude'] = tree.xpath('//script/text()').extract()[0]
    items['longitude'] = tree.xpath('//script/text()').extract()[0]
    items['telephone'] = tree.xpath('//script/text()').extract()[0]
    print(items)
    yield items
Can I get any suggestions on how to achieve this?
Answer 1:
I don't understand what you're trying to do with the repeated XPath queries //script/text(). Note that XPath is useful for extracting HTML content. The content of the <script> tag in your question is not HTML but JSON, so it's not possible to query its fields with XPath.
As a first step, you can get the content of the <script> tag:
content = tree.xpath('//script/text()').extract()[0]
You can then use the json package to load the JSON content into a Python dictionary:
d = json.loads(content)
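As a minimal sketch of the two steps above, the following assumes the script content has already been extracted into a string. The variable `content` is a stand-in for the XPath result, and the helper name `extract_fields` and its output keys are illustrative, chosen to match the item fields in the question.

```python
import json

# Stand-in for: content = tree.xpath('//script/text()').extract()[0]
content = """
{
  "address": {
    "@type": "PostalAddress",
    "streetAddress": "5080 Riverside Drive",
    "addressLocality": "Macon",
    "addressRegion": "GA",
    "postalCode": "31210-1100",
    "addressCountry": "US"
  },
  "telephone": "478-471-0171",
  "geo": {
    "@type": "GeoCoordinates",
    "latitude": "32.9252435",
    "longitude": "-83.7145993"
  }
}
"""


def extract_fields(content):
    """Parse the JSON text and return the fields the question asks for."""
    d = json.loads(content)
    # Fall back to empty dicts so a missing section yields None values
    # instead of raising KeyError.
    address = d.get("address", {})
    geo = d.get("geo", {})
    return {
        "city": address.get("addressLocality"),
        "state": address.get("addressRegion"),
        "latitude": geo.get("latitude"),
        "longitude": geo.get("longitude"),
        "telephone": d.get("telephone"),
    }


fields = extract_fields(content)
print(fields)
```

In the spider, each value could then be assigned to the item, for example items['city'] = fields['city']. If the page contains more than one <script> tag, a narrower query such as //script[@type="application/ld+json"]/text() may be needed to select the right one.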
Also note that this method only works if the content of the <script> tag is valid JSON. If a closing brace is missing, for example, json.loads raises an error.
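To illustrate the point about valid content, here is a small sketch of how a spider might guard against malformed JSON. The helper name `load_json_or_none` and the two sample strings are hypothetical. The exception type, json.JSONDecodeError, is what json.loads raises in Python 3.5 and later.

```python
import json


def load_json_or_none(content):
    """Return the parsed JSON value, or None if the text is not valid JSON."""
    try:
        return json.loads(content)
    except json.JSONDecodeError:
        return None


valid = '{"geo": {"latitude": "32.9252435"}}'
broken = '{"geo": {"latitude": "32.9252435"}'  # missing the final closing brace

print(load_json_or_none(valid))
print(load_json_or_none(broken))
```

Returning None lets the caller skip a page with broken markup rather than crash the whole crawl. Whether to skip, log, or retry is a design choice that depends on the spider.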
Source: https://stackoverflow.com/questions/49327937/how-to-get-contents-of-html-script-tag