urllib

I'm trying to get proxies out of a web page using regex in Python

独自空忆成欢 submitted on 2020-01-01 22:01:11
Question:

    import urllib.request
    import re

    page = urllib.request.urlopen("http://www.samair.ru/proxy/ip-address-01.htm").read()
    re.findall('\d+\.\d+\.\d+\.\d+', page)

I don't understand why it says:

    File "C:\Python33\lib\re.py", line 201, in findall
      return _compile(pattern, flags).findall(string)
    TypeError: can't use a string pattern on a bytes-like object

Answer 1:

    import urllib
    import re

    page = urllib.urlopen("http://www.samair.ru/proxy/ip-address-01.htm").read()
    print re.findall('\d+\.\d+\.\d+\.\d+', page)
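Note that the answer above is Python 2 code, while the traceback shows Python 3.3, where read() returns bytes and the str pattern causes the TypeError. A minimal Python 3 sketch (assuming the page decodes cleanly; the encoding is an assumption) would either decode the response or use a bytes pattern:

    import re
    import urllib.request

    page = urllib.request.urlopen("http://www.samair.ru/proxy/ip-address-01.htm").read()

    # Option 1: decode the bytes, then match with a str pattern.
    text = page.decode("utf-8", errors="replace")
    print(re.findall(r'\d+\.\d+\.\d+\.\d+', text))

    # Option 2: keep everything as bytes, including the pattern.
    print(re.findall(rb'\d+\.\d+\.\d+\.\d+', page))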

Python: urlretrieve PDF downloading

僤鯓⒐⒋嵵緔 submitted on 2020-01-01 15:35:14
Question: I am using urllib's urlretrieve() function in Python in order to try to grab some PDFs from websites. It has (at least for me) stopped working and is downloading damaged data (15 KB instead of 164 KB). I have tested this with several PDFs, all with no success (e.g. random.pdf). I can't seem to get it to work, and I need to be able to download PDFs for the project I am working on. Here is an example of the kind of code I am using to download the PDFs (and parse the text using pdftotext.exe):
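A 15 KB file in place of a 164 KB PDF usually means the server sent an error or landing page instead of the document. One common cause is the server rejecting urllib's default user agent; a sketch of a workaround (the URL is hypothetical, and the header being the actual culprit is an assumption) that also checks the content type before saving:

    import urllib.request

    url = "http://example.com/random.pdf"  # hypothetical URL for illustration

    # Some servers return a small HTML error page to urllib's default
    # user agent, which would explain the undersized "PDF".
    request = urllib.request.Request(url, headers={"User-Agent": "Mozilla/5.0"})
    with urllib.request.urlopen(request) as response:
        print(response.headers.get("Content-Type"))  # should be application/pdf
        data = response.read()

    with open("random.pdf", "wb") as f:
        f.write(data)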

How to test if a webpage is an image

醉酒当歌 submitted on 2020-01-01 05:41:12
Question: Sorry that the title wasn't very clear. Basically, I have a list with a whole series of URLs, with the intention of downloading the ones that are pictures. Is there any way to check whether the webpage is an image, so that I can just skip over the ones that aren't? Thanks in advance.

Answer 1: You can use the requests module. Make a HEAD request and check the content type. A HEAD request will not download the response body.

    import requests

    response = requests.head(url)
    print(response.headers.get('content-type'))
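Building on that answer, a short sketch that filters a hypothetical list of URLs, keeping only the ones whose content type starts with image/:

    import requests

    urls = ["http://example.com/a.png", "http://example.com/page.html"]  # hypothetical list

    image_urls = []
    for url in urls:
        response = requests.head(url, allow_redirects=True)
        content_type = response.headers.get("content-type", "")
        # image/png, image/jpeg, image/gif, etc. all count as pictures.
        if content_type.startswith("image/"):
            image_urls.append(url)

    print(image_urls)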

UnicodeDecodeError: 'utf-8' codec can't decode byte error

孤者浪人 submitted on 2020-01-01 04:47:08
Question: I'm trying to get a response from urllib and decode it to a readable format. The text is in Hebrew and also contains characters like { and /. The coding declaration at the top of the page is:

    # -*- coding: utf-8 -*-

The raw string is:

    b'\xff\xfe{\x00 \x00\r\x00\n\x00"\x00i\x00d\x00"\x00 \x00:\x00 \x00"\x001\x004\x000\x004\x008\x003\x000\x000\x006\x004\x006\x009\x006\x00"\x00,\x00\r\x00\n\x00"\x00t\x00i\x00t\x00l\x00e\x00"\x00 \x00:\x00 \x00"\x00\xe4\x05\xd9\x05\xe7\x05\xd5\x05\xd3\x05 \x00\xd4\x05\xe2\x05\xd5\x05\xe8\x05\xe3
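The b'\xff\xfe' prefix is the UTF-16 little-endian byte order mark, which is why decoding as UTF-8 fails; the # -*- coding: utf-8 -*- line only affects the script's own source, not the response body. A minimal sketch (the URL is a hypothetical stand-in for the real endpoint) that decodes the response as UTF-16:

    import urllib.request

    url = "http://example.com/data"  # hypothetical; the real endpoint returns the bytes above

    raw = urllib.request.urlopen(url).read()
    # The \xff\xfe BOM marks UTF-16 little-endian, so decode accordingly.
    text = raw.decode("utf-16")
    print(text)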

Python: Importing urllib.quote

三世轮回 submitted on 2019-12-31 08:11:31
Question: I would like to use urllib.quote(), but Python (Python 3) is not finding the module. Suppose I have this line of code:

    print(urllib.quote("châteu", safe=''))

How do I import urllib.quote? Both import urllib and import urllib.quote give:

    AttributeError: 'module' object has no attribute 'quote'

What confuses me is that urllib.request is accessible via import urllib.request.

Answer 1: In Python 3.x, you need to import urllib.parse and use urllib.parse.quote:

    >>> import urllib.parse
    >>> urllib.parse.quote("châteu", safe='')
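For code that has to run on both Python 2 and Python 3, a common pattern (a sketch, not part of the original answer, and assuming a UTF-8 source encoding declaration under Python 2) is to try the Python 3 location first and fall back to the Python 2 one:

    try:
        from urllib.parse import quote  # Python 3
    except ImportError:
        from urllib import quote        # Python 2

    # Encode to UTF-8 bytes first so the same call works on both versions.
    print(quote(u"châteu".encode("utf-8"), safe=''))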

A specific site is returning a different response in Python and in Chrome

一曲冷凌霜 submitted on 2019-12-31 07:25:10
Question: I am trying to access a specific site using Python, and no matter which library I use I just can't seem to access it. I have tried Selenium+PhantomJS, and I have tried requests and urllib. Whenever I try to access the site from the browser I get a JSON file, and whenever I try to access it from a Python script I get an HTML file (which has a huge minified script inside it). I suspect this site is detecting that I'm sending the request headlessly and is blocking my requests, but I can't figure out how. The
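One thing worth ruling out before assuming headless detection: browsers send Accept and User-Agent headers that many sites use to decide between a JSON and an HTML response. A sketch (the URL and the exact headers the site keys on are assumptions) that copies browser-like headers into a requests call:

    import requests

    url = "https://example.com/api/data"  # hypothetical; stands in for the real site

    headers = {
        # Copied from a normal Chrome request; the site may key its response on these.
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
                      "(KHTML, like Gecko) Chrome/67.0.3396.99 Safari/537.36",
        "Accept": "application/json, text/plain, */*",
        "X-Requested-With": "XMLHttpRequest",
    }

    response = requests.get(url, headers=headers)
    print(response.headers.get("Content-Type"))
    print(response.text[:500])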

urllib2.HTTPError: HTTP Error 400: Bad Request - Python

匆匆过客 submitted on 2019-12-31 05:48:05
Question: I'm trying to POST using urllib and urllib2, but it keeps giving me this error:

    Traceback (most recent call last):
      File "/Users/BaDRaN/Desktop/untitled text.py", line 39, in <module>
        response = urllib2.urlopen(request)
      File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 127, in urlopen
        return _opener.open(url, data, timeout)
      File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 410, in open
        response = meth(req,
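The traceback is cut off before the HTTP Error 400 line, but the usual shape of a urllib2 POST looks like the sketch below (Python 2, with a hypothetical URL and form fields); a 400 generally means the server rejected the encoded body or headers rather than anything failing on the Python side, and reading the error body often shows why:

    import urllib
    import urllib2

    url = "http://example.com/login"          # hypothetical endpoint
    data = urllib.urlencode({"user": "me",    # hypothetical form fields
                             "password": "secret"})
    headers = {"User-Agent": "Mozilla/5.0",
               "Content-Type": "application/x-www-form-urlencoded"}

    request = urllib2.Request(url, data, headers)
    try:
        response = urllib2.urlopen(request)
        print response.read()
    except urllib2.HTTPError as e:
        # The response body frequently explains what the server objected to.
        print e.code, e.read()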

Find final redirected URL in Python

十年热恋 submitted on 2019-12-31 05:34:07
Question:

    import requests
    import time

    def extractlink():
        with open('extractlink.txt', 'r') as g:
            print("opened extractlink.txt for reading")
            contents = g.read()
        headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.99 Safari/537.36'}
        r = requests.get(contents, headers=headers)
        print(("Links to " + r.url))
        time.sleep(2)

Currently, r.url is just linking to the URL found in 'extractlink.txt'. I'm looking to fix this script to find the final
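For HTTP-level redirects, requests already follows them by default, so r.url should be the final address and r.history the chain of hops; if the page redirects via a meta refresh or JavaScript instead, the target has to be pulled out of the HTML. A short sketch of the HTTP case (the short link is hypothetical):

    import requests

    r = requests.get("http://example.com/short", allow_redirects=True)  # hypothetical short link

    # Each intermediate 3xx response is kept in r.history; r.url is the final destination.
    for hop in r.history:
        print(hop.status_code, hop.url)
    print("Final URL:", r.url)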

web2py URL validator

旧巷老猫 submitted on 2019-12-31 04:13:14
Question: In a URL shortener built with web2py I want to validate URLs first; if a URL is not valid, go back to the first page with an error message. This is my code in the controller (MVC architecture), but I don't get what's wrong..!!

    import urllib
    import random
    import string

    def index():
        return dict()

    def random_maker():
        url = request.vars.url
        try:
            urllib.urlopen(url)
            return dict(rand_url = ''.join(random.choice(string.ascii_uppercase + string.digits + string.ascii_lowercase) for x in range(6)), input_url=url)
        except IOError:
            return index()

Answer 1:
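The answer text is cut off above, but one common approach is to reject obviously malformed input before trying to open it, since urlopen(None) or urlopen('not a url') can fail in ways other than IOError. A sketch of a stricter check (Python 2 to match the controller code; the helper name is made up for illustration):

    import urllib2
    from urlparse import urlparse

    def is_valid_url(url):
        # Reject empty input and anything without an http(s) scheme and a host.
        if not url:
            return False
        parts = urlparse(url)
        if parts.scheme not in ('http', 'https') or not parts.netloc:
            return False
        # Only then try to actually reach it.
        try:
            urllib2.urlopen(url, timeout=5)
            return True
        except (urllib2.URLError, ValueError):
            return False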

How to scrape URL data from an intranet site using Python?

荒凉一梦 submitted on 2019-12-30 10:33:26
Question: I need a Python warrior to help me (I'm a noob)! I'm trying to scrape certain data from an intranet site using the urllib module. However, since it is my company's website, which is only available to employees and not to the public, I think this is why I get this error:

    IOError: ('http error', 401, 'Unauthorized', )

How do I get around this? It won't even read the site using htmlfile.read(). Sample code to get a public site:

    import urllib
    import re

    htmlfile = urllib.urlopen("http://finance.yahoo
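A 401 means the intranet server is asking for credentials before it will serve the page. If it uses HTTP Basic authentication (an assumption; corporate intranets often use NTLM or Kerberos instead, which needs a different handler), a Python 2 urllib2 sketch with a hypothetical intranet URL looks like this:

    import urllib2

    url = "http://intranet.example.com/page"   # hypothetical intranet address

    # Register the username/password for this URL, then open it through
    # an opener that knows how to answer the 401 challenge.
    password_mgr = urllib2.HTTPPasswordMgrWithDefaultRealm()
    password_mgr.add_password(None, url, "my_username", "my_password")
    auth_handler = urllib2.HTTPBasicAuthHandler(password_mgr)
    opener = urllib2.build_opener(auth_handler)

    htmlfile = opener.open(url)
    print htmlfile.read()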