Question
I'm writing a Python program to crawl Twitter using a combination of urllib2, the Python Twitter wrapper for the API, and BeautifulSoup. However, when I run my program, I get the following error:
ray_krueger RafaelNadal
Traceback (most recent call last):
  File "C:\Users\Public\Documents\Columbia Job\Python Crawler\Twitter Crawler\crawlerversion9.py", line 78, in <module>
    crawl(start_follower, output, depth)
  File "C:\Users\Public\Documents\Columbia Job\Python Crawler\Twitter Crawler\crawlerversion9.py", line 74, in crawl
    crawl(y, output, in_depth - 1)
  File "C:\Users\Public\Documents\Columbia Job\Python Crawler\Twitter Crawler\crawlerversion9.py", line 74, in crawl
    crawl(y, output, in_depth - 1)
  File "C:\Users\Public\Documents\Columbia Job\Python Crawler\Twitter Crawler\crawlerversion9.py", line 64, in crawl
    request = urllib2.Request(new_url)
  File "C:\Python28\lib\urllib2.py", line 192, in __init__
    self.__original = unwrap(url)
  File "C:\Python28\lib\urllib.py", line 1038, in unwrap
    url = url.strip()
AttributeError: 'NoneType' object has no attribute 'strip'
I'm completely unfamiliar with this type of error (I'm new to Python), and searching for it online has yielded very little information. My code is below. Do you have any suggestions?
Thanx Snehizzy
import twitter
import urllib
import urllib2
import htmllib
from BeautifulSoup import BeautifulSoup
import re
start_follower = "NYTimeskrugman"
depth = 3
output = open(r'C:\Python27\outputtest.txt', 'a') # better to use SQL database than this
api = twitter.Api()
#want to also begin entire crawl with some sort of authentication service
def site(follower):
    followersite = "http://mobile.twitter.com/" + follower
    return followersite
def getPage(follower):
    thisfollowersite = site(follower)
    request = urllib2.Request(thisfollowersite)
    response = urllib2.urlopen(request)
    return response
def getSoup(response):
    html = response.read()
    soup = BeautifulSoup(html)
    return soup
def get_more_tweets(soup):
    links = soup.findAll('a', {'href': True}, {id : 'more_link'})
    for link in links:
        b = link.renderContents()
        if str(b) == 'more':
            c = link['href']
            d = 'http://mobile.twitter.com' +c
            return d
def recordlinks(soup,output):
    tags = soup.findAll('div', {'class' : "list-tweet"})#to obtain tweet of a follower
    for tag in tags:
        a = tag.renderContents()
        b = str (a)
        output.write(b)
        output.write('\n\n')
def checkforstamp(soup):
    times = soup.findAll('a', {'href': True}, {'class': 'status_link'})
    for time in times:
        stamp = time.renderContents()
        if str(stamp) == '3 months ago':
            return True
def crawl(follower, output, in_depth):
    if in_depth > 0:
        output.write(follower)
        a = getPage(follower)
        new_soup = getSoup(a)
        recordlinks(new_soup, output)
        currenttime = False
        while currenttime == False:
            new_url = get_more_tweets(new_soup)
            request = urllib2.Request(new_url)
            response = urllib2.urlopen(request)
            new_soup = getSoup(response)
            recordlinks(new_soup, output)
            currenttime = checkforstamp(new_soup)
        users = api.GetFriends(follower)
        for u in users[0:5]:
            x = u.screen_name
            y = str(x)
            print y
            crawl(y, output, in_depth - 1)
            output.write('\n\n')
        output.write('\n\n\n')
crawl(start_follower, output, depth)
print("Program done. Look at output file.")
Answer 1:
When you do request = urllib2.Request(new_url) in crawl(), new_url is None. As you're getting new_url from get_more_tweets(new_soup), that means get_more_tweets() is returning None.
That means return d is never being reached, which means either str(b) == 'more' was never true, or soup.findAll() didn't return any links, so for link in links does nothing.
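To tell those two cases apart, a quick diagnostic can help. The following is only a sketch: explain_missing_more is a hypothetical helper, and plain strings stand in for the values that link.renderContents() would produce, so that it runs without BeautifulSoup or network access.

```python
def explain_missing_more(link_texts):
    """Say why a get_more_tweets-style search would fall through to None.

    link_texts is a hypothetical stand-in: a list of the strings that
    link.renderContents() would yield for each link findAll() returned.
    """
    if not link_texts:
        # Case 1: findAll() matched nothing, so the for loop body never ran.
        return "no links found"
    if 'more' not in link_texts:
        # Case 2: links were found, but none had exactly the text 'more'.
        return "links found, but none labelled 'more'"
    return "a 'more' link exists"


print(explain_missing_more([]))
print(explain_missing_more(['Home', 'older tweets']))
print(explain_missing_more(['Home', 'more']))
```

In the real program, the same idea would amount to printing len(links) and each str(b) inside get_more_tweets() before the comparison.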
Answer 2:
AttributeError: 'NoneType' object has no attribute 'strip'
It means exactly what it says: url.strip() requires first figuring out what url.strip is, i.e. looking up the strip attribute of url. This failed because url is a 'NoneType' object, i.e. an object whose type is NoneType, i.e. the special object None.
Presumably url was expected to be a str, i.e. a text string, since those do have a strip attribute.
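The failure is easy to reproduce in isolation. This small sketch (the variable names are only for illustration) shows that None has no strip attribute while a string does:

```python
url = None

try:
    url.strip()
except AttributeError as error:
    # None has no 'strip' attribute, so the attribute lookup itself fails.
    message = str(error)
    print(message)

# A real string has the attribute, so the same call succeeds.
cleaned = "  http://mobile.twitter.com/  ".strip()
print(cleaned)
```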
This happened within File "C:\Python28\lib\urllib.py", i.e., the urllib module. That's not your code, so we look backwards through the exception trace until we find something we wrote: request = urllib2.Request(new_url). We can only presume that the new_url that we pass to the urllib2 module eventually becomes a url variable somewhere within urllib.
So where did new_url come from? We look up the line of code in question (notice that there is a line number in the exception traceback), and we see that the immediately previous line is new_url = get_more_tweets(new_soup), so we're using the result of get_more_tweets.
An analysis of this function shows that it searches through some links, tries to find one labelled 'more', and gives us the URL for the first such link that it finds. The case we haven't considered is when there are no such links. In this case, the function just reaches the end, and implicitly returns None (that's how Python handles functions that reach the end without an explicit return, since there is no specification of a return type in Python and since a value must always be returned), which is where that value is coming from.
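The implicit return is easy to see with a cut-down version of the function. This is a sketch only: find_more is a hypothetical name, and (label, href) tuples stand in for the BeautifulSoup link objects.

```python
def find_more(links):
    """Return the full URL of the first link labelled 'more'."""
    for label, href in links:
        if label == 'more':
            return 'http://mobile.twitter.com' + href
    # No explicit return here: falling off the end yields None.


found = find_more([('home', '/'), ('more', '/user?page=2')])
missing = find_more([('home', '/')])
print(found)
print(missing)
```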
Presumably, if there is no 'more' link, then we should not be attempting to follow the link at all. Therefore, we fix the error by explicitly checking for this None return value, and skipping the urllib2.Request in that case, since there is no link to follow.
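In crawl(), that check could look roughly like the sketch below. To keep it runnable without network access, a dictionary of fake pages stands in for urllib2 and BeautifulSoup; FAKE_PAGES, next_url and follow_more_links are hypothetical names, not part of the original program.

```python
# Hypothetical stand-in for the site: each URL maps to the URL of its
# 'more' link, or to None when the page has no such link.
FAKE_PAGES = {
    'page1': 'page2',
    'page2': 'page3',
    'page3': None,
}


def next_url(url):
    """Play the role of get_more_tweets(): a URL, or None if no link."""
    return FAKE_PAGES[url]


def follow_more_links(start_url):
    """Visit pages until there is no 'more' link left to follow."""
    visited = [start_url]
    url = next_url(start_url)
    while url is not None:  # only request a page when a link was found
        visited.append(url)  # the real code would fetch and record here
        url = next_url(url)
    return visited


print(follow_more_links('page1'))
```

The important part is the "is not None" test before each request; the timestamp check from the original loop would still be needed as a second stopping condition.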
By the way, this None value would be a more idiomatic "placeholder" value for the not-yet-determined currenttime than the False value that you are currently using. You might also consider being a little more consistent about separating words with underscores in your variable and method names to make things easier to read. :)
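One reason the distinction matters: None and False are different values, so a function that falls through and returns None does not compare equal to False. The sketch below uses check_for_stamp as a hypothetical stand-in for checkforstamp(), taking a list of timestamp strings instead of a soup object.

```python
def check_for_stamp(stamps):
    """Return True if the target timestamp is present, else fall through."""
    for stamp in stamps:
        if stamp == '3 months ago':
            return True
    # Falls off the end here, so the result is None, not False.


result = check_for_stamp(['2 days ago'])
print(result == False)  # the comparison used by: while currenttime == False
print(result is None)

# Starting from None and testing "is None" keeps the loop condition
# in step with what the function actually returns.
pages = iter([['2 days ago'], ['1 month ago'], ['3 months ago']])
current_time = None
checks = 0
while current_time is None:
    current_time = check_for_stamp(next(pages))
    checks += 1
print(checks)
```

With a condition of the form while currenttime == False, a None result would end the loop after a single pass, which is probably not what was intended.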
Answer 3:
When you do request = urllib2.Request(new_url), new_url is supposed to be a string, but this error says it's None.
You get new_url's value from the get_more_tweets function, so that function returned None somewhere.
def get_more_tweets(soup):
    links = soup.findAll('a', {'href': True}, {id : 'more_link'})
    for link in links:
        b = link.renderContents()
        if str(b) == 'more':
            c = link['href']
            d = 'http://mobile.twitter.com' +c
            return d
When we look at this code, the function returns a value only when str(b) == "more" for some link, so the real question is why str(b) == "more" is never true.
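One way to investigate is to print the repr() of each link's contents, since stray whitespace, nested markup, or different capitalisation would make an exact comparison fail. The values below are invented examples, not actual output from mobile.twitter.com.

```python
# Invented examples of what link.renderContents() might return.
candidates = ['more', ' more\n', '<span>more</span>', 'More']

for text in candidates:
    # repr() makes whitespace and markup visible; strip() removes
    # surrounding whitespace before comparing.
    print(repr(text), text == 'more', text.strip() == 'more')

exact_matches = [text for text in candidates if text == 'more']
stripped_matches = [text for text in candidates if text.strip() == 'more']
```

If the printed values show extra whitespace, comparing the stripped text may be enough; if they show nested tags, the comparison would need to look at the link's text rather than its rendered contents.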
Answer 4:
You're passing None rather than a string to urllib2.Request(). Looking at the code, this means that new_url is None sometimes. And looking at your get_more_tweets() function, which is the source of this variable, we see this:
def get_more_tweets(soup):
    links = soup.findAll('a', {'href': True}, {id : 'more_link'})
    for link in links:
        b = link.renderContents()
        if str(b) == 'more':
            c = link['href']
            d = 'http://mobile.twitter.com' +c
            return d
This function returns a value only if b is "more", because your return statement is indented under your if. If it is equal to any other value, no value (i.e. None) is returned.
You need to either always return a valid URL here, or you need to check for the None return value before passing it to urllib2.Request().
Source: https://stackoverflow.com/questions/6919098/attributeerror-nonetype-object-has-no-attribute-strip-with-python-webcrawle