Question
EDIT (SOLVED): When I read the values in from my file, a newline character (\n) is left on the end of each one, and this splits my request string at that point. I think it has to do with how I saved the values to the file in the first place. Many thanks.
I have the following code:
results = 'http://www.myurl.com/'+str(mystring)
print str(results)
request = urllib2.Request(results)
request.add_header('User-Agent','Mozilla/4.0 (compatible; MSIE 5.5; Windows NT 5.0)')
opener = urllib2.build_opener()
text = opener.open(request).read()
This is in a loop. After the loop has run a few times, mystring changes to give a different set of results. I can loop the script as many times as I like while keeping the value of mystring constant, but every time I change the value of mystring I get an error saying "no host given" when the code tries to build the opener:
opener = urllib2.build_opener()
Can anyone help please?
TIA,
Paul.
EDIT:
More code here.....
import sys
import string
import httplib
import urllib2
import re
import random
import time
def StripTags(text):
    finished = 0
    while not finished:
        finished = 1
        start = text.find("<")
        if start >= 0:
            stop = text[start:].find(">")
            if stop >= 0:
                text = text[:start] + text[start+stop+1:]
                finished = 0
    return text
mystring="test"
d={}
with open("myfile","r") as f:
    while True:
        page_counter=0
        print str(mystring)
        try:
            while page_counter <20:
                results = 'http://www.myurl.com/'+str(mystring)
                print str(results)
                request = urllib2.Request(results)
                request.add_header('User-Agent','Mozilla/4.0 (compatible; MSIE 5.5; Windows NT 5.0)')
                opener = urllib2.build_opener()
                text = opener.open(request).read()
                finds = (re.findall('([\w\.\-]+'+mystring+')',StripTags(text)))
                for find in finds:
                    d[find]=1
                uniq_emails=d.keys()
                page_counter = page_counter +1
                print "found this " +str(finds)
                random.seed()
                n = random.random()
                i = n * 5
                print "Pausing script for " + str(i) + " Seconds" + ""
                time.sleep(i)
            mystring=next(f)
        except IOError:
            print "No result found!"+""
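A quick way to check for a hidden character in the URL is to print its repr() rather than the string itself, since repr() shows escapes such as \n explicitly. The following is only a minimal sketch: it assumes that a line read from the file still carries its newline, and the value given to mystring is made up for illustration.

```python
# Assumption: this mimics what next(f) returns for a line of the file.
mystring = "test\n"

results = "http://www.myurl.com/" + mystring

# repr() makes the trailing newline visible instead of printing a blank line.
shown = repr(results)
print(shown)

# True when the URL has leading or trailing whitespace that should be removed.
has_stray_whitespace = results != results.strip()
print(has_stray_whitespace)
```

If has_stray_whitespace is True, the value should be cleaned up before it is passed to urllib2.Request.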
Answer 1:
In the while loop, you're setting results to something which is not a url:
results = 'myurl+str(mystring)'
It should probably be results = myurl+str(mystring)
By the way, there appears to be no need for all the casting to string (str()) that you do:
(expanded on request)
print str(foo): in such a case, str() is never necessary. Python will always print foo's string representation.
results = 'http://www.myurl.com/'+str(mystring): this is also unnecessary; mystring is already a string, so 'http://www.myurl.com/' + mystring would suffice.
print "Pausing script for " + str(i) + " Seconds": here you would get an error without str(), since you can't concatenate a string and a number. However, print "foo", 1, "bar" does work, as do print "foo %i bar" % 1 and print "foo {0} bar".format(1).
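The points about str() can be tried out in a small sketch. The variable names and the delay value below are made up for illustration, and the code avoids the print statement so that it runs on Python 2 and Python 3 alike.

```python
mystring = "test"   # already a string, so no str() is needed
delay = 2.5         # a float, like the random pause in the question

# String + string works directly.
url = "http://www.myurl.com/" + mystring

# String + number raises TypeError unless the number is converted first.
try:
    broken = "Pausing script for " + delay
except TypeError:
    broken = None

# Three ways to build the same message.
message_concat = "Pausing script for " + str(delay) + " seconds"
message_percent = "Pausing script for %.1f seconds" % delay
message_format = "Pausing script for {0:.1f} seconds".format(delay)

print(url)
print(message_format)
```

The formatting variants also let you control the number of decimal places, which str() alone does not.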
Answer 2:
I found the answer. It's as follows.
The values for mystring were read in from a file. In the script I wrote to create that file, I opened it with "w" instead of "wb".
Each line in the file ended with a newline character, "\n".
When mystring was added to the request string, the newline ended up in the middle of the request string.[1]
This would never have been apparent from my code, because I changed it before posting here in an effort to hide the real URL I am using to get my results.[2]
My actual URL looks more like this:
Myurl.com/mystring/otherstuff/page_counter/morestuff.htm
The \n read from the file split my URL and gave urllib problems.
[1] I use Windows, which adds unseen characters to text files. Opening a file for writing with "w" makes Windows store each "\n" as "\r\n", while "wb" writes the bytes unchanged. Either way, a line read back from the file still ends with its newline, so it has to be stripped before it goes into the URL.
[2] Always post your full code, kids. The good people of Stack Overflow can't help you unless they can see what you are doing.
Many thanks all, I hope this helps someone out at some point.
Paul.
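For anyone hitting the same problem, one way to guard against it is to strip each line as it is read, whatever mode the file was written in. This is only a sketch: build_urls and BASE_URL are hypothetical names, the host is the placeholder from the question, and io.StringIO stands in for the real file so the example is self-contained.

```python
import io

BASE_URL = "http://www.myurl.com/"  # placeholder host from the question


def build_urls(lines):
    """Build one URL per non-empty line, dropping line endings."""
    urls = []
    for line in lines:
        term = line.strip()  # removes "\n", "\r\n" and surrounding spaces
        if term:             # skip blank lines
            urls.append(BASE_URL + term)
    return urls


# Stand-in for open("myfile"): mixed line endings and one blank line.
fake_file = io.StringIO(u"test\nother\r\n\nlast")
urls = build_urls(fake_file)
print(urls)
```

With a real file, the same function can be called as build_urls(f) inside the with open("myfile") as f: block.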
Source: https://stackoverflow.com/questions/14649347/urllib2-error-no-host-given