urlopen

Is there a way to scrape Amazon Product Listing page using Python?

霸气de小男生 submitted on 2019-12-08 06:45:53

Question: I'm trying to scrape product listing pages that display the vendors and prices of particular products, but urllib.urlopen isn't working: it works on all other pages on Amazon, so I'm wondering if Amazon's bots prevent scraping on product listing pages. Can anyone verify this? Using Chrome I can still view the page source... Here's an example of a product listing page I would want to scrape: http://www.amazon.com/gp/offer-listing/B007E84H96/ref=dp_olp_new?ie=UTF8&condition=new

Answer 1:
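A common workaround (not verified against Amazon's current behavior) is that servers often reject requests carrying Python's default User-Agent; sending a browser-like header can change the response. A minimal sketch, assuming Python 3's urllib.request (urllib2.Request in Python 2), with the URL taken from the question:

```python
import urllib.request

# Hypothetical workaround: send a browser-like User-Agent, since some
# sites block the default Python one. URL is from the question.
url = ("http://www.amazon.com/gp/offer-listing/B007E84H96/"
       "ref=dp_olp_new?ie=UTF8&condition=new")
req = urllib.request.Request(url, headers={"User-Agent": "Mozilla/5.0"})
# html = urllib.request.urlopen(req).read()  # network call, left commented
```

Whether this works for the listing pages specifically would need testing; Amazon may also block based on request rate or other fingerprints.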

How to pass parameter to Url with Python urlopen

浪尽此生 submitted on 2019-12-08 00:08:27

Question: I'm new to Python programming. My problem is that my program doesn't seem to pass/encode the parameter properly to the ASP file that I've created. This is my sample code:

    import urllib.request
    url = 'http://www.sample.com/myASP.asp'
    full_url = url + "?data='" + str(sentData).replace("'", '"').replace(" ", "%20").replace('"', "%22") + "'"
    print(full_url)
    response = urllib.request.urlopen(full_url)
    print(response)

The output gives me something like: http://www.sample.com
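Rather than hand-rolling the escapes with a chain of .replace() calls, the standard library's urllib.parse.urlencode can build the query string. A sketch, with sentData reconstructed from the output shown in the question:

```python
import json
import urllib.parse

# sentData reconstructed from the output shown in the question.
sentData = {"mykey": [{"idno": "id123", "name": "ej"}]}

# urlencode percent-escapes the value, replacing the manual .replace() chain.
params = urllib.parse.urlencode({"data": json.dumps(sentData)})
full_url = "http://www.sample.com/myASP.asp?" + params
# response = urllib.request.urlopen(full_url)  # network call, left commented
```

Note that urlencode escapes spaces as "+" rather than "%20"; both are valid in a query string, and the ASP side should decode them identically.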

unbuffered urllib2.urlopen

匆匆过客 submitted on 2019-12-07 11:17:29

I have a client for a web interface to a long-running process. I'd like the output from that process to be displayed as it comes. This works great with urllib.urlopen(), but that doesn't have a timeout parameter. On the other hand, with urllib2.urlopen() the output is buffered. Is there an easy way to disable that buffering? A quick hack that occurred to me is to use urllib.urlopen() with threading.Timer() to emulate a timeout, but that's only a quick and dirty hack.

urllib2 is buffered when you just call read(); you can specify a size to read and thereby disable the buffering, for example:

    import urllib2
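The chunked reading the answer describes can be wrapped in a small generator: each read(size) call returns once that much data (or EOF) arrives, instead of waiting for the whole body. A sketch in Python 3 terms (urllib2 in Python 2), with the network call left commented and an offline demonstration using an in-memory buffer:

```python
import io
import urllib.request  # urllib2 in Python 2

def stream(resp, chunk_size=1024):
    """Yield successive chunks from a file-like response as they arrive."""
    while True:
        chunk = resp.read(chunk_size)
        if not chunk:  # empty bytes means EOF
            break
        yield chunk

# Real use (network call, left commented):
# for chunk in stream(urllib.request.urlopen("http://example.com/", timeout=10)):
#     print(chunk)

# Offline demonstration with an in-memory buffer:
chunks = list(stream(io.BytesIO(b"x" * 2500)))
```

In Python 3 (and Python 2.6+), urlopen also accepts a timeout argument directly, which removes the need for the threading.Timer() hack.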

Using urlopen to open list of urls

允我心安 submitted on 2019-12-06 15:26:29

Question: I have a Python script that fetches a webpage and mirrors it. It works fine for one specific page, but I can't get it to work for more than one. I assumed I could put multiple URLs into a list and then feed that to the function, but I get this error:

    Traceback (most recent call last):
      File "autowget.py", line 46, in <module>
        getUrl()
      File "autowget.py", line 43, in getUrl
        response = urllib.request.urlopen(url)
      File "/usr/lib/python3.2/urllib/request.py", line 139, in urlopen
        return opener
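The likely cause of the traceback is passing the whole list to urlopen, which accepts a single URL; iterating over the list and opening each URL in turn avoids it. A sketch, assuming Python 3's urllib.request (the URLs are placeholders):

```python
import urllib.error
import urllib.request

# Placeholder URLs; urlopen takes one URL at a time, so iterate.
urls = ["http://example.com/", "http://example.org/"]

def mirror(url_list):
    """Fetch each URL in turn, skipping any that fail."""
    pages = {}
    for url in url_list:
        try:
            with urllib.request.urlopen(url, timeout=10) as resp:
                pages[url] = resp.read()
        except urllib.error.URLError as exc:
            print("failed:", url, exc)
    return pages

# pages = mirror(urls)  # network call, left commented
```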

How to reliably process web-data in Python

戏子无情 submitted on 2019-12-06 13:16:51

Question: I'm using the following code to get data from a website:

    time_out = 4

    def tryconnect(turl, timer=time_out, retries=10):
        urlopener = None
        sitefound = 1
        tried = 0
        while (sitefound != 0) and tried < retries:
            try:
                urlopener = urllib2.urlopen(turl, None, timer)
                sitefound = 0
            except urllib2.URLError:
                tried += 1
        if urlopener:
            return urlopener
        else:
            return None

    [...]

    urlopener = tryconnect('www.example.com')
    if not urlopener:
        return None
    try:
        for line in urlopener:
            do stuff
    except httplib
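The retry loop can be written more compactly with a for loop over the retry count. This is a sketch of the same idea in Python 3 terms (urllib.request/urllib.error in place of urllib2), not the poster's exact code:

```python
import urllib.error
import urllib.request  # urllib2 in Python 2

def tryconnect(turl, timer=4, retries=10):
    """Return an open response, or None once all retries fail."""
    for _ in range(retries):
        try:
            return urllib.request.urlopen(turl, None, timer)
        except urllib.error.URLError:
            continue
    return None
```

Returning directly on success removes the sitefound/tried bookkeeping from the original while loop.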

How to pass parameter to Url with Python urlopen

浪尽此生 submitted on 2019-12-06 11:21:45

I'm new to Python programming. My problem is that my program doesn't seem to pass/encode the parameter properly to the ASP file that I've created. This is my sample code:

    import urllib.request
    url = 'http://www.sample.com/myASP.asp'
    full_url = url + "?data='" + str(sentData).replace("'", '"').replace(" ", "%20").replace('"', "%22") + "'"
    print(full_url)
    response = urllib.request.urlopen(full_url)
    print(response)

The output gives me something like:

    http://www.sample.com/myASP.asp?data='{%22mykey%22:%20[{%22idno%22:%20%22id123%22,%20%22name%22:%20%22ej%22}]}'

The asp file
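If the goal is the %20/%22-style escaping shown in the output, urllib.parse.quote percent-escapes the JSON payload without the manual .replace() chain; note it also escapes braces, brackets, and colons, so the result is equivalent but not byte-identical to the string in the question. A sketch:

```python
import json
import urllib.parse

# sentData reconstructed from the output shown in the question.
sentData = {"mykey": [{"idno": "id123", "name": "ej"}]}

# quote() uses %20 for spaces (urlencode would use '+'), matching the
# question's output style, though it also escapes '{', '[' and ':'.
payload = urllib.parse.quote(json.dumps(sentData))
full_url = "http://www.sample.com/myASP.asp?data=" + payload
```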

Urllib's urlopen breaking on some sites (e.g. StackApps api): returns garbage results

好久不见. submitted on 2019-12-05 02:31:08

I'm using urllib2's urlopen function to try to get a JSON result from the StackOverflow API. The code I'm using:

    >>> import urllib2
    >>> conn = urllib2.urlopen("http://api.stackoverflow.com/0.8/users/")
    >>> conn.readline()

The result I'm getting:

    '\x1f\x8b\x08\x00\x00\x00\x00\x00\x04\x00\xed\xbd\x07`\x1cI\x96%&/m\xca{\x7fJ\...

I'm fairly new to urllib, but this doesn't seem like the result I should be getting. I've tried it in other places and I get what I expect (the same as visiting the address with a browser gives me: a JSON object). Using urlopen on other sites (e.g. "http://google.com")
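The leading '\x1f\x8b' bytes are the gzip magic number, so the API is most likely returning a gzip-compressed body (possibly unconditionally, regardless of the Accept-Encoding header). Decompressing recovers the JSON. A sketch with an offline check, since urllib2 does not decompress automatically:

```python
import gzip
import io

def read_maybe_gzipped(resp):
    """Read a response body, decompressing it if it is gzip-compressed."""
    data = resp.read()
    if data[:2] == b"\x1f\x8b":  # gzip magic number
        data = gzip.GzipFile(fileobj=io.BytesIO(data)).read()
    return data

# Real use (network call, left commented):
# import urllib2
# conn = urllib2.urlopen("http://api.stackoverflow.com/0.8/users/")
# print(read_maybe_gzipped(conn))
```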

python urllib2.urlopen(url) process block

落花浮王杯 submitted on 2019-12-04 22:54:59

I am using urllib2.urlopen() and my process is getting blocked. I am aware that urllib2.urlopen() has a default timeout. How can I make the call non-blocking? The backtrace is:

    (gdb) bt
    #0  0x0000003c6200dc35 in recv () from /lib64/libpthread.so.0
    #1  0x00002b88add08137 in ?? () from /usr/lib64/python2.6/lib-dynload/_socketmodule.so
    #2  0x00002b88add0830e in ?? () from /usr/lib64/python2.6/lib-dynload/_socketmodule.so
    #3  0x000000310b2d8e19 in PyEval_EvalFrameEx () from /usr/lib64/libpython2.6.so.1.0

If your problem is that you need urllib to finish reading, the read() operation is a blocking operation in
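Since the backtrace shows the process stuck in recv(), a socket-level timeout is what unblocks it. Two options, sketched here (urllib2 in Python 2, urllib.request in Python 3; the per-call timeout exists since Python 2.6):

```python
import socket

# Option 1: a process-wide default timeout for every new socket.
socket.setdefaulttimeout(10)

# Option 2: per call (Python 2.6+ / Python 3):
# import urllib.request
# resp = urllib.request.urlopen("http://example.com/", timeout=5)

# Either way, the timeout applies to each recv() during read(), so a
# stalled read raises socket.timeout instead of blocking forever.
```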

How to reliably process web-data in Python

拜拜、爱过 submitted on 2019-12-04 18:25:22

I'm using the following code to get data from a website:

    time_out = 4

    def tryconnect(turl, timer=time_out, retries=10):
        urlopener = None
        sitefound = 1
        tried = 0
        while (sitefound != 0) and tried < retries:
            try:
                urlopener = urllib2.urlopen(turl, None, timer)
                sitefound = 0
            except urllib2.URLError:
                tried += 1
        if urlopener:
            return urlopener
        else:
            return None

    [...]

    urlopener = tryconnect('www.example.com')
    if not urlopener:
        return None
    try:
        for line in urlopener:
            do stuff
    except httplib.IncompleteRead:
        print 'incomplete'
        return None
    except socket.timeout:
        print 'socket'
        return None
    return stuff

Is
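The read loop with its exception handling can be isolated into a helper. This sketch keeps the question's httplib.IncompleteRead and socket.timeout cases (http.client in Python 3) and uses an in-memory buffer to demonstrate the success path offline:

```python
import http.client  # httplib in Python 2
import io
import socket

def read_lines(urlopener):
    """Collect lines, returning None on the failures the question handles."""
    lines = []
    try:
        for line in urlopener:
            lines.append(line)
    except http.client.IncompleteRead:
        return None
    except socket.timeout:
        return None
    return lines

# Offline check with an in-memory stand-in for the response object:
demo = read_lines(io.BytesIO(b"line1\nline2\n"))
```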

python urllib2 urlopen response

浪子不回头ぞ submitted on 2019-12-04 10:03:15

Question: python urllib2 urlopen response:

    <addinfourl at 1081306700 whose fp = <socket._fileobject object at 0x4073192c>>

expected:

    {"token":"mYWmzpunvasAT795niiR"}

Answer 1: You need to bind the resulting file-like object to a variable; otherwise the interpreter just dumps it via repr:

    >>> import urllib2
    >>> urllib2.urlopen('http://www.google.com')
    <addinfourl at 18362520 whose fp = <socket._fileobject object at 0x106b250>>
    >>>
    >>> f = urllib2.urlopen('http://www.google.com')
    >>> f
    <addinfourl at
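As the answer says, the repr appears because the response object itself, not its body, was displayed. Calling .read() returns the body bytes, which can then be parsed as JSON. A sketch using the token value from the question as a stand-in for what .read() would return:

```python
import json

# f = urllib2.urlopen(...) returns a file-like object; printing it shows
# its repr. Call f.read() to get the body. The literal below stands in
# for what .read() would return, using the token from the question.
body = b'{"token":"mYWmzpunvasAT795niiR"}'
data = json.loads(body)
```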