I'm new to web crawling with Python 2.7.
1. Background
I want to collect data from AQICN.org, a great website offering air quality data from all over the world.
I want to use Python to fetch the data of all of China's monitoring sites every hour, but I'm stuck right now.
2. My trouble
Take this page (http://aqicn.org/city/shenyang/usconsulate/) for example.
It offers the air pollution and meteorology parameters measured at a U.S. Consulate in China. With code like the following, I can't get any useful information:
import urllib
import json
from bs4 import BeautifulSoup

html_aqi = urllib.urlopen("http://aqicn.org/city/shenyang/usconsulate/json").read().decode('utf-8')
soup = BeautifulSoup(html_aqi, "html.parser")
l = soup.p.get_text()
aqi = json.loads(l)
Running it produces:
> ValueError: No JSON object could be decoded
So, following someone else's work, I changed the URL that html_aqi is fetched from to this format:
http://aqicn.org/aqicn/json/android/shenyang/usconsulate/json
With that URL the code works well.
3. My target
format 1: (http://aqicn.org/city/shenyang/usconsulate/json)
format 2: (http://aqicn.org/aqicn/json/android/shenyang/usconsulate/json)
In general, I can deal with format 2. But I have already collected the URLs of all the sites in China, and they are in format 1. So, can anyone offer me some help with format 1? Thanks a lot.
Update
Format 1 is hard to transform into format 2 (lots of conditions need to be considered).
It can't be done with simple code like this:
city_name = url_format1.split("/")[4]
site_name = url_format1.split("/")[5]
url_format2 = "http://aqicn.org/aqicn/json/android/" + city_name + "/" + site_name + "/json"
Reason why it's hard in practice:
1559 sites need to be handled, and they differ by location.
Some are in a city, some in a county, so their URLs do not follow the same pattern.
For example:
Type1 --> http://aqicn.org/city/hebi/json
Type2 --> http://aqicn.org/city/jiangsu/huaian/json
Type3 --> http://aqicn.org/city/china/xinzhou/jiyin/json
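To see why a single pair of split indices can't cover all three types, here is a quick offline sketch (pure string handling, no network; path_segments is just an illustrative helper, not part of any library):

```python
def path_segments(url):
    """Return the path segments between the domain and the trailing 'json'."""
    parts = url.split("/")
    # parts[0] is the scheme, parts[1] is empty, parts[2] is the domain,
    # and the last element is the literal "json" suffix.
    return parts[3:-1]

# The three URL types carry 2, 3 and 4 middle segments respectively,
# so no fixed split index picks out the city and site for all of them.
assert path_segments("http://aqicn.org/city/hebi/json") == ["city", "hebi"]
assert path_segments("http://aqicn.org/city/jiangsu/huaian/json") == ["city", "jiangsu", "huaian"]
assert path_segments("http://aqicn.org/city/china/xinzhou/jiyin/json") == ["city", "china", "xinzhou", "jiyin"]
```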
If you are interested in the Air Quality Index number, find the div with the aqivalue class:
>>> import urllib
>>> from bs4 import BeautifulSoup
>>>
>>> url = "http://aqicn.org/city/shenyang/usconsulate/json"
>>> soup = BeautifulSoup(urllib.urlopen(url), "html.parser")
>>> soup.find("div", class_="aqivalue").get_text()
u'171'
The first URL, http://aqicn.org/city/shenyang/usconsulate/json, actually does not give back JSON data; it gives back HTML. If you're really interested in this content, you have to parse that HTML.
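That is exactly what the ValueError in the question means; a minimal offline sketch of the same failure (using a stand-in HTML string rather than the live page):

```python
import json

# Feeding an HTML body to json.loads fails the same way the question's
# code did: the ValueError just says "this is not JSON".
try:
    json.loads("<html><body><p>AQI page</p></body></html>")
    decoded = True
except ValueError:  # json.JSONDecodeError subclasses ValueError on Python 3
    decoded = False

assert decoded is False
```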
You can do this with BeautifulSoup's HTML parser, though the lxml.html package is somewhat more straightforward.
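For instance, a minimal lxml.html sketch; it parses an inline snippet here instead of the live page, whose markup is only assumed to contain a div with the aqivalue class:

```python
import lxml.html

# Stand-in for the downloaded page source; the real page is assumed
# to hold the index number in a div with class "aqivalue".
snippet = '<html><body><div class="aqivalue">171</div></body></html>'

doc = lxml.html.fromstring(snippet)
aqi = doc.xpath('//div[@class="aqivalue"]/text()')[0]
assert aqi == '171'
```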
Source: https://stackoverflow.com/questions/36102858/how-to-read-the-content-of-an-website