Question
I'm trying to scrape new stories from Reddit using their API and Python's urllib2, but I keep getting JSON documents like this one:
{ u'kind': u'Listing', u'data': { u'modhash': u'', u'children': [], u'after': None, u'before': None }}
Here is my code:
import json
import time
import urllib2

def get_submissions(after=None):
    url = 'http://reddit.com/r/all/new.json?limit=100'
    if after:
        url += '&after=%s' % after

    _user_agent = 'Reddit Link Analysis Bot by PirateLogic @ github.com/jamesbrewer'
    _request = urllib2.Request(url, headers={'User-agent': _user_agent})
    _json = json.loads(urllib2.urlopen(_request).read())

    return [story for story in _json['data']['children']], _json['data']['after']

if __name__ == '__main__':
    after = None
    stories = []
    limit = 1

    while len(stories) < limit:
        new_stories, after = get_submissions(after)
        stories.extend(new_stories)
        time.sleep(2)  # The Reddit API allows one request every two seconds.
        print '%d stories collected so far .. sleeping for two seconds.' % len(stories)
What I've written is fairly short and straightforward, but I'm obviously overlooking something, or I don't have a complete understanding of the API or of how urllib2 works.
Here's an example page from the API.
What's the deal?
EDIT After trying to load the example page in another browser, I'm also seeing the JSON I posted at the top of the page. It seems to be only for //new.json though. If I try //hot.json or just /.json, I get what I want.
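For reference, the parsing step in the question can be reproduced offline against the empty listing the OP received. This is a minimal sketch in Python 3 syntax (urllib2 became urllib.request in Python 3); the `raw` string below is just the JSON from the top of the question, hard-coded so no network request is needed:

```python
import json

# The empty listing the OP kept getting back from /r/all/new.json.
raw = '{"kind": "Listing", "data": {"modhash": "", "children": [], "after": null, "before": null}}'

payload = json.loads(raw)
stories = payload['data']['children']  # the submissions, if any
after = payload['data']['after']       # pagination token for the next request

print(stories, after)  # → [] None
```

With `children` empty and `after` set to null, the pagination loop in the question never accumulates any stories, which matches the behavior described.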
Answer 1:
Edit: As of 2013/02/22, the desired new sort no longer requires sort=new to be added as a URL parameter. This is because the rising sort is no longer provided under the /new route, but is provided by /rising [source].
The problem with the URL http://reddit.com/r/all/new.json?limit=100 is that the new pages use the rising sort by default. If you are logged in and have changed the default sort to new, then what you really see is the result for the page http://reddit.com/r/all/new.json?limit=100&sort=new. Notice the addition of the parameter sort=new.
Thus the result is correct, it is just that the rising view has not been updated for /r/all.
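The fix described above amounts to appending sort=new to the query string. Here's a hypothetical helper (not from the answer) sketching that in Python 3, where urllib2's URL handling moved to urllib.parse:

```python
from urllib.parse import urlencode, urlparse, parse_qsl, urlunparse

def with_sort_new(url):
    """Return `url` with sort=new added to (or overwritten in) its query
    string. Illustrative only; the original question used Python 2's
    urllib2, while this uses Python 3's urllib.parse."""
    parts = urlparse(url)
    query = dict(parse_qsl(parts.query))
    query['sort'] = 'new'
    return urlunparse(parts._replace(query=urlencode(query)))

print(with_sort_new('http://reddit.com/r/all/new.json?limit=100'))
# → http://reddit.com/r/all/new.json?limit=100&sort=new
```

Building the query string with urllib.parse instead of string concatenation also takes care of escaping, which matters once parameters like `after` carry API-supplied tokens.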
On a related note, I strongly suggest using PRAW (the python reddit API wrapper) rather than writing your own code if you plan to use more than just a single part of the API. Here's the relevant code that you want:
import praw

r = praw.Reddit('YOUR DESCRIPTIVE USER AGENT NAME')
listing = list(r.get_subreddit('all').get_new_by_date())
print listing

If you simply want to iterate over the submissions, you can omit the list() call.
Answer 2:
I was stumped for a while on a similar problem (not the same as the OP's): no children in the API response. I figured I'd post this in case it's helpful to others reaching this question via a search engine:
If I open this url in my browser:
https://www.reddit.com/comments.json?limit=100
It seems to work fine, but when I send a request programmatically, it returns no children. I tried playing with the user-agent of the request and the like, to no avail. I ended up using the /r/all comment stream instead:
https://www.reddit.com/r/all/comments.json?limit=100
Works fine in the browser and via a programmatic request. Still have no idea why the first url doesn't work.
Answer 3:
http://www.reddit.com/r/all.json?limit=100 returns meaningful data
http://reddit.com/r/all/new?limit=100 (no .json) says there are no items...
It looks like reddit doesn't use /new the way you think it does, so the problem is in your use of the API.
If this answer is not sufficient please include a link to the reddit api docs.
Also, here's a quick note on REST. It looks like reddit is RESTful (I stand to be corrected, but that's what my experiments here tell me...). This means that dropping the .json extension from any of the URLs you are trying to access should give you a human-friendly version of the same data. This could be useful during testing: just look at stuff with your browser and you will see what info reddit thinks you are asking for.
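That browser-vs-API correspondence can be captured in a tiny helper. This is a hypothetical sketch (not part of any answer) that strips a trailing .json from a URL's path while preserving the query string, so the same resource can be opened in a browser:

```python
def human_friendly(url):
    """Drop a trailing .json from the path of `url`, keeping any query
    string, to get the browser-viewable version of the same resource.
    Illustrative helper only; assumes the .json sits at the end of the
    path, as in the reddit URLs discussed above."""
    base, _, query = url.partition('?')
    if base.endswith('.json'):
        base = base[:-len('.json')]
    return base + ('?' + query if query else '')

print(human_friendly('http://www.reddit.com/r/all.json?limit=100'))
# → http://www.reddit.com/r/all?limit=100
```

URLs without the extension pass through unchanged, so the helper is safe to apply to either form.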
Source: https://stackoverflow.com/questions/13328798/reddit-api-returning-useless-json