urllib

urlopen trouble while trying to download a gzip file

ⅰ亾dé卋堺 submitted on 2020-01-23 19:39:47
Question: I am going to use the wiktionary dump for the purpose of POS tagging. Somehow it gets stuck when downloading. Here is my code:

import nltk
from urllib import urlopen
from collections import Counter
import gzip

url = 'http://dumps.wikimedia.org/enwiktionary/latest/enwiktionary-latest-all-titles-in-ns0.gz'
fStream = gzip.open(urlopen(url).read(), 'rb')
dictFile = fStream.read()
fStream.close()

text = nltk.Text(word.lower() for word in dictFile())
tokens = nltk.word_tokenize(text)

Here is the …
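The likely sticking point is that gzip.open() expects a filename or a file object, while urlopen(url).read() hands it raw bytes (and dictFile() later calls a string, which also fails). A minimal network-free sketch of the fix, wrapping the downloaded bytes in io.BytesIO; the sample titles stand in for the real dump:

```python
import gzip
import io

# Stand-in for urlopen(url).read(): the dump arrives as gzip-compressed bytes.
titles = b"apple\nbanana\ncherry\n"
downloaded = gzip.compress(titles)

# gzip.open() accepts a file object, so wrap the bytes in io.BytesIO.
with gzip.open(io.BytesIO(downloaded), "rb") as fStream:
    data = fStream.read()

# Iterate over the decompressed lines rather than calling the data.
words = [line.decode("utf-8").lower() for line in data.splitlines()]
print(words)  # ['apple', 'banana', 'cherry']
```

The same pattern works unchanged with the real URL: pass io.BytesIO(urlopen(url).read()) to gzip.open().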

Make an http POST request to upload a file using python urllib/urllib2

非 Y 不嫁゛ submitted on 2020-01-20 00:59:05
Question: I would like to make a POST request to upload a file to a web service (and get a response) using Python. For example, I can make the following POST request with curl:

curl -F "file=@style.css" -F output=json http://jigsaw.w3.org/css-validator/validator

How can I make the same request with Python urllib/urllib2? The closest I have got so far is the following:

with open("style.css", 'r') as f:
    content = f.read()
post_data = {"file": content, "output": "json"}
request = urllib2.Request("http://jigsaw.w3 …
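urllib has no multipart helper, so the usual pure-stdlib route is to build the multipart/form-data body by hand. A hedged sketch in Python 3 (build_multipart is an illustrative helper, not a library function, and no request is actually sent here):

```python
import uuid
import urllib.request

def build_multipart(fields, files):
    """Build a multipart/form-data body by hand (illustrative helper)."""
    boundary = uuid.uuid4().hex
    lines = []
    for name, value in fields.items():
        lines += [f"--{boundary}",
                  f'Content-Disposition: form-data; name="{name}"',
                  "",
                  value]
    for name, (filename, content) in files.items():
        lines += [f"--{boundary}",
                  f'Content-Disposition: form-data; name="{name}"; filename="{filename}"',
                  "Content-Type: application/octet-stream",
                  "",
                  content]
    lines += [f"--{boundary}--", ""]
    body = "\r\n".join(lines).encode("utf-8")
    return body, f"multipart/form-data; boundary={boundary}"

body, ctype = build_multipart(
    {"output": "json"},
    {"file": ("style.css", "body { color: red; }")},
)
req = urllib.request.Request(
    "http://jigsaw.w3.org/css-validator/validator",
    data=body,
    headers={"Content-Type": ctype},
)
print(req.get_method())  # POST, because data is attached
```

Passing req to urllib.request.urlopen() would then send the upload; in practice the requests library (files= parameter) does all of this in one line.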

requests:

元气小坏坏 submitted on 2020-01-17 00:51:37
1. Saving an image:

import requests

url = "http://dmimg.5054399.com/allimg/pkm/pk/22.jpg"
response = requests.get(url=url)

print(response.status_code)   # check the status code
print(response.text)          # view the text data
print(response.content)       # view the raw bytes

with open("豆芽.jpg", "wb") as f:
    # "wb": write the bytes out as an image file
    f.write(response.content)

2.

import requests

response = requests.get(url="https://www.cnblogs.com/Neeo/articles/10669652.html%E8%BD%AF%E4%BB%B6%E6%B5%8B%E8%AF%95%E5%87%BA%E7%8E%B0%E5%8E%9F%E5%9B%A0")
print(response.text)

print(response.encoding)      # check the page encoding
response.encoding = "utf-8"   # re-encode
print(response.url)           # view the URL
print(response.headers)       # view the response headers

with open("a …

Urllib combined with ElementTree

折月煮酒 submitted on 2020-01-16 03:28:11
Question: I'm having a few problems parsing simple HTML with the ElementTree module from the standard Python library. This is my source code:

from urllib.request import urlopen
from xml.etree.ElementTree import ElementTree
import sys

def main():
    site = urlopen("http://1gabba.in/genre/hardstyle")
    try:
        html = site.read().decode('utf-8')
        xml = ElementTree(html)
        print(xml)
        print(xml.findall("a"))
    except:
        print(sys.exc_info())

if __name__ == '__main__':
    main()

Either this fails, I get the …
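There are two separate bugs in that snippet: ElementTree(html) treats the string as a root element rather than parsing it (fromstring() is the parser), and findall("a") only searches direct children, so the ".//a" path is needed. A network-free sketch on a small well-formed document; note that real-world pages are rarely valid XML, so html.parser or lxml is usually the safer tool:

```python
import xml.etree.ElementTree as ET

html = "<html><body><div><a href='/x'>x</a><a href='/y'>y</a></div></body></html>"

root = ET.fromstring(html)  # parses the markup; ElementTree(html) does not
links = [a.get("href") for a in root.findall(".//a")]  # ".//" searches all depths
print(links)  # ['/x', '/y']
```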

Counting HTML images with Python

戏子无情 submitted on 2020-01-16 02:54:07
Question: I need some feedback on how to count HTML images with Python 3.01 after extracting them; maybe my regular expressions are not used properly. Here is my code:

import re, os
import urllib.request

def get_image(url):
    url = 'http://www.google.com'
    total = 0
    try:
        f = urllib.request.urlopen(url)
        for line in f.readline():
            line = re.compile('<img.*?src="(.*?)">')
            if total > 0:
                x = line.count(total)
                total += x
        print('Images total:', total)
    except:
        pass

Answer 1: A couple of points about your code: It's much …
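The loop above recompiles the pattern on every iteration and never matches it against anything, and f.readline() returns a single line, so the for loop iterates over its characters. A sketch of the counting logic on a sample string: compile the pattern once, then run findall over the whole document:

```python
import re

# Compile once, outside any loop.
IMG_RE = re.compile(r'<img[^>]*\bsrc="([^"]*)"', re.IGNORECASE)

html = '<p><img src="a.png"><IMG src="b.jpg" alt="b"></p>'
srcs = IMG_RE.findall(html)  # one src string per <img> tag
print('Images total:', len(srcs))  # Images total: 2
```

For anything beyond a quick count, an HTML parser (html.parser, lxml, BeautifulSoup) is more robust than a regex.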

using urllib2 to execute URL and return rendered HTML output, not the HTML itself [duplicate]

和自甴很熟 submitted on 2020-01-16 01:01:21
Question: This question already has answers here: Python library for rendering HTML and javascript [closed] (2 answers). Closed 6 years ago.

urllib2.urlopen("http://www.someURL.com/pageTracker.html").read();

The code above will return the source HTML at http://www.google.com. What do I need to do to actually return the rendered HTML that you see when you visit google.com? I'm essentially trying to 'execute' a URL to trigger a view, not retrieve the HTML. To clarify a few things: I'm not actually …

Python for Beginners (Part 12)

。_饼干妹妹 submitted on 2020-01-14 19:53:45
050 A module is just a program. There are three ways to import one:
1. import module_name
2. from module_name import function_name
3. import module_name as new_name

051 Modules

052 The Python standard library; pip install; import timeit

053 url + lib = urllib. The general format of a URL is (square brackets [ ] mark optional parts):

protocol://hostname[:port]/path/[;parameters][?query]#fragment

A URL consists of three parts:
1. The protocol: http, https, ftp, file, ed2k
2. The domain name (or IP address) of the server hosting the resource, sometimes including a port number; every transfer protocol has a default port, e.g. 80 for http
3. The specific address of the resource, such as a directory or file name

import urllib.request

response = urllib.request.urlopen('http://www.fish.com')
html = response.read()
html = html.decode("utf-8")
print(html)

054 While crawling a cat website, I found that on the campus network the request fails with TimeoutError: [WinError 10060] — the connection attempt failed because the connected party did not properly respond after a period of time, or the connected host failed to respond.
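For the TimeoutError in 054: passing an explicit timeout and a browser-style User-Agent header often helps on networks that throttle or drop bare urllib clients. A sketch, keeping the placeholder hostname from the notes (the request may still fail, which the except branch absorbs):

```python
import urllib.request

req = urllib.request.Request(
    "http://www.fish.com",                  # placeholder host from the notes
    headers={"User-Agent": "Mozilla/5.0"},  # look like a browser
)
try:
    with urllib.request.urlopen(req, timeout=5) as response:
        html = response.read().decode("utf-8")
except OSError as exc:  # TimeoutError and URLError are both OSError subclasses
    html = None
    print("request failed:", exc)
```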

urllib (Python 3): making requests, logging in, downloading pages

岁酱吖の submitted on 2020-01-14 06:49:43
urllib.request sends requests and fetches their results; urllib.error contains the exceptions raised by urllib.request; urllib.parse parses and processes URLs; urllib.robotparser parses a site's robots.txt file.

urllib.request.urlopen(url, data=None, [timeout,]*, cafile=None, capath=None, cadefault=False, context=None)

- url can be a URL string or a Request object.
- data is the payload sent to the server. It accepts bytes, file-like objects, and iterables. If neither Content-Length nor Transfer-Encoding is supplied in the headers, HTTPHandler sets them based on the type of data. For a POST request, data should be in the standard application/x-www-form-urlencoded format; urllib.parse.urlencode() converts a dict or a sequence of 2-element tuples into an ASCII string in that format. data must be encoded to bytes before use.
- timeout sets the connection timeout and applies only to HTTP …
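The data handling described above can be sketched in a few lines: urlencode the parameters, encode to bytes, and attach them to the request (httpbin.org here is just an example endpoint; nothing is sent):

```python
import urllib.parse
import urllib.request

# urlencode() turns a dict (or sequence of 2-tuples) into
# application/x-www-form-urlencoded text; urlopen() needs it as bytes.
params = {"q": "urllib", "page": 1}
data = urllib.parse.urlencode(params).encode("ascii")
print(data)  # b'q=urllib&page=1'

# Attaching data switches the request method from GET to POST.
req = urllib.request.Request("http://httpbin.org/post", data=data)
print(req.get_method())  # POST
```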

HTTP Authentication in URL with backslash in username

假装没事ソ submitted on 2020-01-14 05:43:07
Question: I need HTTP Basic Auth for a REST call. In the username I have to provide a domain (which has a hyphen) and then a backslash to separate it from the username, like this: DOM-AIN\user_name. The password is pretty benign. This works fine with curl:

curl 'https://DOM-AIN\user_name:password@myurl.com'

I need to put this into Python now, but I've tried with requests and urllib/2/3... they don't like the \, the :, or the @. Even when I URL-encode to %40, etc., those get interpreted as an actual …
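One way around the quoting trouble is to keep the credentials out of the URL entirely and set the Authorization header yourself, so the backslash, colon, and @ never have to survive URL parsing. A sketch keeping the placeholder host from the question (no request is sent):

```python
import base64
import urllib.request

username = r"DOM-AIN\user_name"  # raw string keeps the backslash literal
password = "password"

# HTTP Basic Auth is just base64("user:pass") in a header.
token = base64.b64encode(f"{username}:{password}".encode("utf-8")).decode("ascii")

req = urllib.request.Request("https://myurl.com")
req.add_header("Authorization", f"Basic {token}")
print(req.get_header("Authorization"))
```

With the requests library the equivalent is requests.get(url, auth=(username, password)), which builds the same header.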

Download from EXPLOSM.net Comics Script [Python]

烂漫一生 submitted on 2020-01-14 04:59:43
Question: So I wrote this short script (correct word?) to download the comic images from explosm.net comics, because I somewhat recently found out about it and I want to... put it on my iPhone... 3G. It works fine and all: urllib2 for getting the web page HTML, and urllib for image.retrieve(). Why I posted this on SO: how do I optimize this code? Would REGEX (regular expressions) make it faster? Is it an internet limitation? A poor algorithm...? Any improvements in speed or general code aesthetics would be greatly …
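A script like this is I/O-bound, so regex tuning will not move the needle; overlapping the downloads will. A hedged sketch of the shape using a thread pool, where fetch() is a stub standing in for the urllib retrieval in the original script:

```python
from concurrent.futures import ThreadPoolExecutor

def fetch(url):
    # Stub for the real download (e.g. urllib.request.urlretrieve).
    return f"saved {url}"

urls = [f"http://explosm.net/comics/{i}/" for i in range(1, 6)]

# Downloads are network-bound, so a few threads overlap the waiting.
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(fetch, urls))
print(results)
```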