I'm new to web crawling with Python 2.7.
1. Background
I want to collect data from AQICN.org, a great website offering air quality data from all over the world.
I want to use Python to fetch the data of all of China's monitoring sites every hour, but I'm stuck right now.
2. My trouble
Take this page (http://aqicn.org/city/shenyang/usconsulate/) for example.
It offers the air pollution and meteorology parameters measured at a U.S. Consulate in China. With code like the following, I can't get any useful information:
import urllib
import json
from bs4 import BeautifulSoup

html_aqi = urllib.urlopen("http://aqicn.org/city/shenyang/usconsulate/json").read().decode('utf-8')
soup = BeautifulSoup(html_aqi, "html.parser")
l = soup.p.get_text()
aqi = json.loads(l)
Running it produces:
> ValueError: No JSON object could be decoded
So, following someone else's work, I changed the URL that html_aqi is fetched from to this format:
http://aqicn.org/aqicn/json/android/shenyang/usconsulate/json
With that URL the code works well.
3. My target
format 1: (http://aqicn.org/city/shenyang/usconsulate/json)
format 2: (http://aqicn.org/aqicn/json/android/shenyang/usconsulate/json)
In general, I can deal with format 2. But I have already collected the URLs of all the sites in China, and they are in format 1. So, can anyone offer me some help with format 1? Thanks a lot.
Update
Format 1 is hard to transform into format 2 (lots of conditions need to be considered).
It can't be done with simple code like this:
city_name = url_format1.split("/")[4]
site_name = url_format1.split("/")[5]
url_format2 = "http://aqicn.org/aqicn/json/android/" + city_name + "/" + site_name + "/json"
Reason why it's hard in practice:
1559 sites need to be handled, and they differ by location.
Some are in a city, some in a county, so their URLs do not follow the same pattern.
For example:
Type1 --> http://aqicn.org/city/hebi/json
Type2 --> http://aqicn.org/city/jiangsu/huaian/json
Type3 --> http://aqicn.org/city/china/xinzhou/jiyin/json
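To see why a single pair of split indices can't cover all three types, here is a quick offline sketch (pure string handling, no network; path_segments is just an illustrative helper, not part of any library):

```python
def path_segments(url):
    """Return the path segments between the domain and the trailing 'json'."""
    parts = url.split("/")
    # parts[0] is the scheme, parts[1] is empty, parts[2] is the domain,
    # and the last element is the literal "json" suffix.
    return parts[3:-1]

# The three URL types carry 2, 3 and 4 middle segments respectively,
# so no fixed split index picks out the city and site for all of them.
assert path_segments("http://aqicn.org/city/hebi/json") == ["city", "hebi"]
assert path_segments("http://aqicn.org/city/jiangsu/huaian/json") == ["city", "jiangsu", "huaian"]
assert path_segments("http://aqicn.org/city/china/xinzhou/jiyin/json") == ["city", "china", "xinzhou", "jiyin"]
```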
If you are interested in the Air Quality Index number, find the div with the aqivalue class:
>>> import urllib
>>> from bs4 import BeautifulSoup
>>>
>>> url = "http://aqicn.org/city/shenyang/usconsulate/json"
>>> soup = BeautifulSoup(urllib.urlopen(url), "html.parser")
>>> soup.find("div", class_="aqivalue").get_text()
u'171'
The first URL, http://aqicn.org/city/shenyang/usconsulate/json, actually does not give back JSON data; it gives back HTML. If you're really interested in this content, you have to parse that HTML.
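That is exactly what the ValueError in the question means; a minimal offline sketch of the same failure (using a stand-in HTML string rather than the live page):

```python
import json

# Feeding an HTML body to json.loads fails the same way the question's
# code did: the ValueError just says "this is not JSON".
try:
    json.loads("<html><body><p>AQI page</p></body></html>")
    decoded = True
except ValueError:  # json.JSONDecodeError subclasses ValueError on Python 3
    decoded = False

assert decoded is False
```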
You can do this with BeautifulSoup's HTML parser, though the lxml.html package is somewhat more straightforward.
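For instance, a minimal lxml.html sketch; it parses an inline snippet here instead of the live page, whose markup is only assumed to contain a div with the aqivalue class:

```python
import lxml.html

# Stand-in for the downloaded page source; the real page is assumed
# to hold the index number in a div with class "aqivalue".
snippet = '<html><body><div class="aqivalue">171</div></body></html>'

doc = lxml.html.fromstring(snippet)
aqi = doc.xpath('//div[@class="aqivalue"]/text()')[0]
assert aqi == '171'
```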
Source: https://stackoverflow.com/questions/36102858/how-to-read-the-content-of-an-website