urllib

Python: urlretrieve PDF downloading

馋奶兔 submitted on 2019-12-04 14:33:17

I am using urllib's urlretrieve() function in Python to grab some PDFs from websites. It has (at least for me) stopped working and now downloads damaged data (15 KB instead of 164 KB). I have tested this with several PDFs, all without success (e.g. random.pdf). I can't get it to work, and I need to be able to download PDFs for the project I am working on. Here is an example of the kind of code I use to download the PDFs (and parse the text with pdftotext.exe):

    def get_html(url):  # gets the HTML of a page from the Internet
        import os
        import urllib2
        import urllib
        from …
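A common cause of a truncated download like this is the server returning an HTML error or redirect page instead of the PDF, often because urllib's default User-Agent is blocked. A minimal Python 3 sketch (the URL and filename are placeholders) that sends a browser-like User-Agent and checks the payload actually starts with the PDF magic bytes:

    import urllib.request

    url = "http://example.com/random.pdf"  # placeholder URL

    # Some servers reject urllib's default User-Agent, so send a browser-like one.
    req = urllib.request.Request(url, headers={"User-Agent": "Mozilla/5.0"})
    with urllib.request.urlopen(req) as resp:
        data = resp.read()

    # A real PDF starts with the "%PDF" magic bytes; anything else is likely
    # an HTML error page masquerading as the download.
    if data[:4] != b"%PDF":
        raise ValueError("Server did not return a PDF (got %r...)" % data[:20])

    with open("random.pdf", "wb") as f:
        f.write(data)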

page scraping to get prices from google finance

元气小坏坏 submitted on 2019-12-04 13:25:51

Question: I am trying to get stock prices by scraping Google Finance pages. I am doing this in Python, using the urllib package and then a regex to extract the price data. When I leave my Python script running, it works initially for some time (a few minutes) and then starts throwing the exception [HTTP Error 503: Service Unavailable]. I guess this is happening because the web server detects the frequent page requests as a robot and throws this exception after a while. Is there a way around this, i.e. deleting …
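Rate limiting like this is usually best handled by slowing down and retrying with backoff, rather than trying to defeat the detection. A minimal Python 3 sketch (the URL is a placeholder) that backs off exponentially whenever a 503 comes back:

    import time
    import urllib.error
    import urllib.request

    def fetch_with_backoff(url, max_retries=5):
        delay = 1.0
        for attempt in range(max_retries):
            try:
                req = urllib.request.Request(
                    url, headers={"User-Agent": "Mozilla/5.0"})
                with urllib.request.urlopen(req) as resp:
                    return resp.read()
            except urllib.error.HTTPError as e:
                if e.code != 503:
                    raise
                time.sleep(delay)  # back off before retrying
                delay *= 2
        raise RuntimeError("still getting 503 after %d retries" % max_retries)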

Python: basic usage of the urllib library

非 Y 不嫁゛ submitted on 2019-12-04 12:06:18

Contents: basic usage of Python's urllib library · official documentation · what urllib is · urlopen · using the url parameter · using the data parameter · using the timeout parameter · responses · response type, status code, response headers · request · exception handling · URL parsing · function one: urlunparse · urljoin · urlencode.

Basic usage of Python's urllib library. Official documentation: https://docs.python.org/3/library/urllib.html. What is urllib? urllib is Python's built-in HTTP request library and comprises the following modules: urllib.request (making requests), urllib.error (exception handling), urllib.parse (URL parsing), and urllib.robotparser (robots.txt parsing). urlopen: the signature of urllib.request.urlopen is

    urllib.request.urlopen(url, data=None, [timeout, ]*, cafile=None, capath=None, cadefault=False, context=None)

Using the url parameter: start with a simple example:

    import urllib.request
    response = urllib.request.urlopen('http://www.baidu…
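Since the excerpt cuts off at the first example, here is a minimal Python 3 sketch covering the three urlopen parameters the outline lists (url, data, timeout); httpbin.org is used here only as a convenient test endpoint:

    import urllib.parse
    import urllib.request

    # url parameter: a plain GET request, with a timeout in seconds
    response = urllib.request.urlopen('http://httpbin.org/get', timeout=10)
    print(response.status, response.read(100))

    # data parameter: passing bytes turns the request into a POST
    data = urllib.parse.urlencode({'word': 'hello'}).encode('utf-8')
    response = urllib.request.urlopen('http://httpbin.org/post',
                                      data=data, timeout=10)
    print(response.status)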

Memory usage with concurrent.futures.ThreadPoolExecutor in Python3

橙三吉。 submitted on 2019-12-04 11:58:32

Question: I am building a script to download and parse benefits information for health-insurance plans on the Obamacare exchanges. Part of this requires downloading and parsing the plan-benefit JSON files from each individual insurance company. To do this, I am using concurrent.futures.ThreadPoolExecutor with 6 workers to download each file (with urllib), then parsing and looping through the JSON and extracting the relevant info (which is stored in a nested dictionary within the script). (Running Python 3.5.1 (v3.5…
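If the memory concern comes from submitting every URL as a future up front, note that queued work items and finished results all stay referenced until the script ends. A sketch (fetch_plan and the URL list are hypothetical placeholders) that processes results as they complete and drops each future reference so it can be garbage-collected:

    import json
    import urllib.request
    from concurrent.futures import ThreadPoolExecutor, as_completed

    def fetch_plan(url):
        # Download and decode one plan-benefit JSON file.
        with urllib.request.urlopen(url, timeout=30) as resp:
            return json.loads(resp.read().decode('utf-8'))

    urls = ['http://example.com/plan1.json']  # placeholder list

    summary = {}
    with ThreadPoolExecutor(max_workers=6) as pool:
        futures = {pool.submit(fetch_plan, u): u for u in urls}
        for fut in as_completed(futures):
            url = futures.pop(fut)   # drop the reference so the finished
            data = fut.result()      # future and its result can be freed
            summary[url] = len(data) # keep only the extracted info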

Setting proxy to urllib.request (Python3)

不问归期 submitted on 2019-12-04 11:24:06

How can I set a proxy for the latest urllib in Python 3? I am doing the following:

    from urllib import request as urlrequest
    ask = urlrequest.Request(url)  # note that Request has a capital R here, unlike previous versions
    open = urlrequest.urlopen(ask)
    open.read()

I tried adding a proxy as follows:

    ask = urlrequest.Request.set_proxy(ask, proxies, 'http')

However, I don't know how correct that is, since I am getting the following error:

    336 def set_proxy(self, host, type):
    --> 337 if self.type == 'https' and not self._tunnel_host:
    338     self._tunnel_host = self.host
    339 else:
    AttributeError: 'NoneType' object has no attribute 'type'
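Two things go wrong in the snippet above: set_proxy expects a single "host:port" string rather than a dict, and it returns None, so reassigning its result throws the request away. A minimal sketch of the two usual approaches (the proxy address is a placeholder):

    import urllib.request

    url = 'http://httpbin.org/ip'
    proxy_host = '127.0.0.1:8080'  # placeholder proxy

    # Option 1: set the proxy on a single Request. set_proxy mutates the
    # request in place, so do not reassign its (None) return value.
    req = urllib.request.Request(url)
    req.set_proxy(proxy_host, 'http')
    resp = urllib.request.urlopen(req)

    # Option 2: install a ProxyHandler so every open() call uses the proxy.
    opener = urllib.request.build_opener(
        urllib.request.ProxyHandler({'http': 'http://' + proxy_host}))
    resp = opener.open(url)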

Python: Log in a website using urllib

谁说我不能喝 submitted on 2019-12-04 10:48:46

Question: I want to log in to this website: https://www.fitbit.com/login. This is the code I use:

    import urllib2
    import urllib
    import cookielib

    login_url = 'https://www.fitbit.com/login'
    acc_pwd = {'login': 'Log In', 'email': 'username', 'password': 'pwd'}
    cj = cookielib.CookieJar()  # add cookies
    opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))
    opener.addheaders = [('User-agent', 'Mozilla/5.0 \
        (compatible; MSIE 6.0; Windows NT 5.1)')]
    data = urllib.urlencode(acc_pwd)
    try:
        opener.open(login_url…
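For reference, a Python 3 equivalent of this cookie-carrying POST uses http.cookiejar and urllib.request. Note that many login forms also embed hidden fields such as CSRF tokens that must be scraped from the login page first, so the field names below are assumptions carried over from the question, not a verified Fitbit API:

    import http.cookiejar
    import urllib.parse
    import urllib.request

    login_url = 'https://www.fitbit.com/login'
    # Field names are assumptions; a real form may require hidden CSRF
    # fields scraped from the login page before the POST will succeed.
    acc_pwd = {'email': 'username', 'password': 'pwd'}

    cj = http.cookiejar.CookieJar()
    opener = urllib.request.build_opener(
        urllib.request.HTTPCookieProcessor(cj))
    opener.addheaders = [('User-Agent', 'Mozilla/5.0')]

    data = urllib.parse.urlencode(acc_pwd).encode('utf-8')
    resp = opener.open(login_url, data)  # session cookies land in cj
    print(resp.status, [c.name for c in cj])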

urllib3 MaxRetryError

偶尔善良 submitted on 2019-12-04 09:25:55

I have just started using urllib3, and I ran into a problem straight away. Following the manual, I started with the simple example:

    Python 2.7.1+ (r271:86832, Apr 11 2011, 18:13:53)
    [GCC 4.5.2] on linux2
    Type "help", "copyright", "credits" or "license" for more information.
    >>> import urllib3
    >>>
    >>> http = urllib3.PoolManager()
    >>> r = http.request('GET', 'http://google.com/')

I get the following error:

    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "/usr/local/lib/python2.7/dist-packages/urllib3/request.py", line 65, in request
        **urlopen_kw…
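The truncated traceback ends in a MaxRetryError, which in urllib3 wraps whatever low-level failure kept recurring (DNS trouble, a refused connection, or, reportedly in some older versions, the 301 redirect from google.com to www.google.com). A hedged sketch against a modern urllib3, surfacing the underlying cause via the exception's reason attribute:

    import urllib3

    http = urllib3.PoolManager()
    try:
        # Requesting the canonical www host avoids the google.com redirect
        # that is sometimes blamed for this error on old urllib3 versions.
        r = http.request('GET', 'http://www.google.com/', retries=3)
        print(r.status)
    except urllib3.exceptions.MaxRetryError as e:
        # MaxRetryError wraps the underlying failure in e.reason.
        print('request kept failing:', e.reason)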

parse query string with urllib in Python 2.4

让人想犯罪 __ submitted on 2019-12-04 09:19:00

Question: Using Python 2.4.5 (don't ask!) I want to parse a query string and get a dict in return. Do I have to do it "manually", as follows?

    >>> qs = 'first=1&second=4&third=3'
    >>> d = dict([x.split("=") for x in qs.split("&")])
    >>> d
    {'second': '4', 'third': '3', 'first': '1'}

I didn't find any useful method in urlparse.

Answer 1: You have two options:

    >>> cgi.parse_qs(qs)
    {'second': ['4'], 'third': ['3'], 'first': ['1']}

or

    >>> cgi.parse_qsl(qs)
    [('first', '1'), ('second', '4'), ('third', '3')]

The values from parse_qs come back as lists, because a key can appear more than once in a query string.
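For anyone on a newer interpreter: these functions moved out of cgi in Python 2.6 and live in urllib.parse in Python 3. A quick sketch of the modern equivalent:

    from urllib.parse import parse_qs, parse_qsl

    qs = 'first=1&second=4&third=3'
    print(parse_qs(qs))   # {'first': ['1'], 'second': ['4'], 'third': ['3']}
    print(parse_qsl(qs))  # [('first', '1'), ('second', '4'), ('third', '3')]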

Python: Post Request with image files

倾然丶 夕夏残阳落幕 submitted on 2019-12-04 08:56:11

I have a server and I am trying to build a POST request to get the data back. I think one way to achieve this is to add the parameters in the header and make the request, but I am getting a few errors that I don't understand well enough to move forward.

HTML form:

    <html>
    <head>
    <meta http-equiv="content-type" content="text/html; charset=ISO-8859-1">
    </head>
    <body>
    <form method="POST" action="http://some.server.com:61235/imgdigest"
          enctype="multipart/form-data">
      quality:<input type="text" name="quality" value="2"><br>
      category:<input type="text" name="category" value="1"><br>
      debug:<input type="text…
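Posting a file through urllib alone means building the multipart/form-data body by hand; the parameters go in the body, not the header. A minimal Python 3 sketch matching the form above (the URL comes from the form's action; the image path, the "image" field name, and the debug value are assumptions, since the form excerpt is cut off):

    import uuid
    import urllib.request

    url = 'http://some.server.com:61235/imgdigest'
    # 'debug' value is a guess; the excerpt truncates before its value.
    fields = {'quality': '2', 'category': '1', 'debug': '1'}

    with open('test.jpg', 'rb') as f:  # placeholder image file
        file_data = f.read()

    boundary = uuid.uuid4().hex
    parts = []
    for name, value in fields.items():
        parts.append('--%s\r\nContent-Disposition: form-data; '
                     'name="%s"\r\n\r\n%s\r\n' % (boundary, name, value))
    body = ''.join(parts).encode('ascii')
    # The file part: the field name "image" is an assumption, since the
    # form excerpt ends before any <input type="file"> element.
    body += ('--%s\r\nContent-Disposition: form-data; name="image"; '
             'filename="test.jpg"\r\nContent-Type: image/jpeg\r\n\r\n'
             % boundary).encode('ascii')
    body += file_data + ('\r\n--%s--\r\n' % boundary).encode('ascii')

    req = urllib.request.Request(url, data=body, headers={
        'Content-Type': 'multipart/form-data; boundary=%s' % boundary})
    print(urllib.request.urlopen(req).read())

In practice the third-party requests library does all of this in one call: requests.post(url, data=fields, files={'image': open('test.jpg', 'rb')}).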

Looping through a directory on the web and displaying its contents (files and other directories) via Python

人走茶凉 submitted on 2019-12-04 06:39:51

Question: In the same vein as "Process a set of files from a source directory to a destination directory in Python", I'm wondering whether it is possible to create a function that, given a web directory, lists out the files in that directory. Something like...

    files = []
    for file in urllib.listdir(dir):
        if file.isdir:
            # handle this as a directory
        else:
            # handle as a file

I assume I would need to use the urllib library, but there doesn't seem to be an easy way of doing this, at least none that I've seen. Answer 1: …
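HTTP has no real directory-listing call, so urllib.listdir cannot exist; this only works when the server exposes an auto-generated HTML index (Apache/nginx style), in which case you can scrape the links. A Python 3 sketch under that assumption (the URL is a placeholder):

    from html.parser import HTMLParser
    import urllib.request

    class LinkCollector(HTMLParser):
        # Collect the href of every <a> tag in an auto-index page.
        def __init__(self):
            super().__init__()
            self.links = []

        def handle_starttag(self, tag, attrs):
            if tag == 'a':
                self.links.extend(v for k, v in attrs if k == 'href' and v)

    url = 'http://example.com/pub/'  # placeholder directory URL
    parser = LinkCollector()
    parser.feed(urllib.request.urlopen(url).read().decode('utf-8', 'replace'))

    for href in parser.links:
        if href.startswith(('?', '/')):
            continue  # skip sort links and the parent-directory entry
        if href.endswith('/'):
            print('directory:', href)
        else:
            print('file:', href)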