How to build a Python crawler for websites using oauth2

Submitted by ε祈祈猫儿з on 2019-12-04 11:05:34

You should check out the python-oauth2 module. It seems to be the most stable thing out there.

In particular, this blog post has a really good rundown of how to do OAuth easily with Python. The example code uses the Foursquare API, so I would check that out first.

I recently had to get OAuth working with Dropbox, and wrote this module containing the necessary steps to perform the OAuth exchange.

For my system, the simplest thing I could think of was to pickle the OAuth client. My blog package just deserialized the pickled client and requested endpoints with the following function:

get = lambda x: client.request(x, 'GET')[1]

Just make sure your workers have this client object and you should be good to go :-)
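A minimal sketch of the pickle trick described above. `FakeOAuthClient` is a hypothetical stand-in for whatever client object your OAuth library gives you (the class name and the `request()` signature are assumptions for illustration, not from the original answer):

```python
import pickle

# Stand-in for a real OAuth client (e.g. one built with python-oauth2).
# A real client would carry the consumer key and access token and make
# signed HTTP requests.
class FakeOAuthClient:
    def __init__(self, token):
        self.token = token

    def request(self, url, method):
        # A real client performs a signed HTTP request here and
        # returns (response_headers, body).
        return ({"status": "200"}, "body of %s" % url)

# One-time setup: authenticate, then freeze the client to disk.
client = FakeOAuthClient(token="secret-token")
with open("client.pkl", "wb") as f:
    pickle.dump(client, f)

# Later, in a worker process: thaw the client and hit endpoints.
with open("client.pkl", "rb") as f:
    client = pickle.load(f)

get = lambda x: client.request(x, 'GET')[1]
print(get("https://api.example.com/feed"))
```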

Get your app authenticated with OAuth first. Here is an example of how to use OAuth and Python to connect to Twitter: http://popdevelop.com/2010/07/an-example-on-how-to-use-oauth-and-python-to-connect-to-twitter/

Similarly, you can find more examples at https://code.google.com

Then you can use BeautifulSoup or lxml for HTML parsing, extracting the relevant data from the page source returned once your request completes.

BeautifulSoup Documentation - http://www.crummy.com/software/BeautifulSoup/
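As a minimal sketch of that parsing step (the HTML snippet, heading, and link paths below are made up purely for illustration):

```python
from bs4 import BeautifulSoup

html = """
<html><body>
  <h1>Photo gallery</h1>
  <a href="/img/cat.jpg">cat</a>
  <a href="/img/dog.jpg">dog</a>
</body></html>
"""

# Parse the page source with the stdlib parser backend.
soup = BeautifulSoup(html, "html.parser")

# Extract whatever data is relevant to your crawler.
title = soup.h1.get_text()
links = [a["href"] for a in soup.find_all("a")]

print(title)   # Photo gallery
print(links)   # ['/img/cat.jpg', '/img/dog.jpg']
```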

To download images, videos, etc., you can use openers. Read more about openers at http://docs.python.org/library/urllib2.html
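The linked docs cover Python 2's urllib2; in Python 3 the same opener machinery lives in urllib.request. A hedged sketch of downloading a binary file with a custom opener (the URL, filename, and User-Agent string are placeholders):

```python
import urllib.request

# Build an opener with a custom User-Agent -- some sites reject the
# default Python one.
opener = urllib.request.build_opener()
opener.addheaders = [("User-Agent", "my-crawler/0.1")]

def download(url, path):
    """Fetch url and write the raw bytes (image, video, ...) to path."""
    with opener.open(url) as resp, open(path, "wb") as out:
        out.write(resp.read())

# Usage (requires network access):
# download("https://example.com/img/cat.jpg", "cat.jpg")
```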

You don't have to authenticate every time. The provider will give you a token that is good for X hours/day. Eventually you'll get an HTTP 403 status code and will need to re-authenticate.
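That retry-on-403 behaviour can be sketched like this; `fetch` and `reauthenticate` are hypothetical hooks you would wire up to your actual OAuth client (they are not part of any specific library):

```python
def fetch_with_reauth(fetch, reauthenticate, url):
    """Call fetch(url); if the token has expired (HTTP 403),
    re-authenticate once and retry."""
    status, body = fetch(url)
    if status == 403:
        reauthenticate()
        status, body = fetch(url)
    return status, body

# Example with stubbed-out callbacks simulating an expired token:
state = {"authed": False}

def fake_fetch(url):
    return (200, "ok") if state["authed"] else (403, "expired")

def fake_reauth():
    state["authed"] = True

print(fetch_with_reauth(fake_fetch, fake_reauth, "https://api.example.com/feed"))
# (200, 'ok')
```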
