Question
I am very new to Python (and web scraping). Let me ask you a question.
Many websites do not expose a page-specific URL in Firefox or other browsers. For example, the Social Security Administration shows popular baby names with ranks (since 1880), but the URL does not change when I change the year from 1880 to 1881. It stays at:
http://www.ssa.gov/cgi-bin/popularnames.cgi
Because I don't know the specific URL, I could not download the page using urllib.
The page source includes:
<input type="text" name="year" id="yob" size="4" value="1880">
So presumably, if I can control this "year" value (say, "1881" or "1991"), I can solve the problem. Am I right? I just don't know how to do it.
Can anybody tell me how to solve this, please?
If you know some websites that may help my study, please let me know.
THANKS!
Answer 1:
You can still use urllib. The button performs a POST to the current URL. Using Firefox's Firebug I took a look at the network traffic and found they're sending three parameters: member, top, and year. You can send the same arguments:
import urllib

url = 'http://www.ssa.gov/cgi-bin/popularnames.cgi'
post_params = {  # member was blank, so I'm excluding it.
    'top': '25',
    'year': year,  # `year` is supplied by the loop shown below
}
post_args = urllib.urlencode(post_params)
Now, just send the url-encoded arguments:
response = urllib.urlopen(url, post_args)
html = response.read()  # the page for the requested year
If you need to send headers as well:
headers = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en-US,en;q=0.5',
    'Connection': 'keep-alive',
    'Host': 'www.ssa.gov',
    'Referer': 'http://www.ssa.gov/cgi-bin/popularnames.cgi',
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:21.0) Gecko/20100101 Firefox/21.0',
}
Note that urllib.urlopen() cannot send custom headers (its optional third argument is a proxies mapping, not a header dict), so use urllib2 for that:

import urllib2

# With POST data and headers:
request = urllib2.Request(url, post_args, headers)
response = urllib2.urlopen(request)
Execute the code in a loop:
for year in xrange(1880, 2014):
    # ... the code above, using this `year` ...
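The snippets above are Python 2. On Python 3, urllib was split into urllib.request and urllib.parse, and the POST body must be bytes; here is a minimal equivalent sketch (the trimmed User-Agent is my placeholder, not from the original answer):

from urllib.parse import urlencode
from urllib.request import Request, urlopen

url = 'http://www.ssa.gov/cgi-bin/popularnames.cgi'
headers = {'User-Agent': 'Mozilla/5.0'}  # trimmed; extend as in the dict above

for year in range(1880, 2014):
    # urlencode() returns a str; urlopen() wants bytes for a POST body.
    post_args = urlencode({'top': '25', 'year': str(year)}).encode('ascii')
    request = Request(url, data=post_args, headers=headers)
    html = urlopen(request).read()
    # ... parse `html` for this year ...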
Answer 2:
I recommend using Scrapy. It's a very powerful and easy-to-use web-scraping framework. Why it's worth trying:
Speed/performance/efficiency
Scrapy is written with Twisted, a popular event-driven networking framework for Python, so it is implemented with non-blocking (asynchronous) code for concurrency.
Database pipelining
Scrapy has an Item Pipelines feature: "After an item has been scraped by a spider, it is sent to the Item Pipeline which processes it through several components that are executed sequentially."
So, each page can be written to the database immediately after it has been downloaded.
Code organization
Scrapy offers you a nice, clear project structure: settings, spiders, items, pipelines, etc. are separated logically. That alone makes your code clearer and easier to maintain and understand.
Time to code
Scrapy does a lot of work for you behind the scenes, which lets you focus on the actual crawling logic rather than the "metal": creating processes, threads, etc.
Yeah, you got it - I love it.
In order to get started:
- official tutorial
- newcoder.io tutorial
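For this particular page, a spider might look roughly like the sketch below. It is untested: the spider name, the table/row selectors, and the yielded dict keys are my assumptions; only the POST parameters come from the answer above.

import scrapy

class BabyNamesSpider(scrapy.Spider):
    name = 'babynames'  # hypothetical name

    def start_requests(self):
        url = 'http://www.ssa.gov/cgi-bin/popularnames.cgi'
        for year in range(1880, 2014):
            # FormRequest sends the same POST the site's "Go" button does.
            yield scrapy.FormRequest(
                url,
                formdata={'top': '25', 'year': str(year)},
                meta={'year': year},
                callback=self.parse,
            )

    def parse(self, response):
        # Assumption: each result row is a <tr> of an HTML table;
        # inspect the real markup to target the right one.
        for row in response.css('table tr'):
            cells = row.css('td::text').getall()
            if cells:
                yield {'year': response.meta['year'], 'cells': cells}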
Hope that helps.
Answer 3:
I recommend using a tool such as mechanize. It lets you navigate web pages programmatically from Python, and there are many tutorials on how to use it. Basically, what you do in mechanize is the same thing you do in the browser: fill in the textbox, hit the "Go" button, and parse the page you get back in the response.
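A minimal mechanize sketch of that flow (untested; I'm assuming the year form is the first form on the page, so adjust nr after inspecting the source if it isn't):

import mechanize

br = mechanize.Browser()
br.set_handle_robots(False)  # mechanize honours robots.txt by default
br.open('http://www.ssa.gov/cgi-bin/popularnames.cgi')

br.select_form(nr=0)    # assumption: the year form is the first <form>
br['year'] = '1991'     # the <input name="year"> from the question
response = br.submit()  # "presses" the Go button
html = response.read()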
Answer 4:
I've used the mechanize/BeautifulSoup libraries for stuff like this previously. If I had a project like this now, I'd also look at https://github.com/scrapy/scrapy
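For the parsing half, here is a BeautifulSoup sketch in the same spirit. The markup handling is a guess: I'm assuming the ranked names sit in plain table rows, which you should verify against the real page source.

import urllib
from bs4 import BeautifulSoup

url = 'http://www.ssa.gov/cgi-bin/popularnames.cgi'
post_args = urllib.urlencode({'top': '25', 'year': '1991'})
soup = BeautifulSoup(urllib.urlopen(url, post_args).read(), 'html.parser')

# Assumption: results are rows of an HTML table; inspect the real
# markup to target the right table.
for row in soup.find_all('tr'):
    cells = [td.get_text(strip=True) for td in row.find_all('td')]
    if cells:
        print(cells)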
Source: https://stackoverflow.com/questions/17220997/a-presumably-basic-web-scraping-of-http-www-ssa-gov-cgi-bin-popularnames-cgi