urllib

A basic Baidu Baike crawler

﹥>﹥吖頭↗ Submitted on 2019-12-11 13:01:45
Source: Python爬虫开发与项目实战 (Python Crawler Development and Project Practice), by Fan Chuanhui.

Basic crawler framework:
- Crawler scheduler: coordinates the other four modules
- URL manager: maintains the set of already-crawled URLs and hands out new, uncrawled URL links
- HTML downloader: takes a URL from the URL manager and downloads the HTML page
- HTML parser: extracts the useful data from the downloaded page
- Data store: persists the extracted data

1. URL manager (URLManager.py)

Deduplication. Without it, repeated links easily lead to infinite crawl loops. Options: (1) in-memory deduplication, (2) deduplication in a relational database, (3) deduplication in a cache database. In a small crawler, a set makes deduplication easy.

Interface the URL manager should expose:
- has_new_url(): whether any uncrawled URLs remain
- add_new_url(url): add a new URL to the uncrawled set
- add_new_urls(urls): add a batch of new URLs to the uncrawled set
- get_new_url(): take one uncrawled URL
- new_url_size(): size of the uncrawled set
- old_url_size(): size of the crawled set

The code (cut off in the original):

```python
class URLManager:
    def __init__(self):
        self.new_urls = set()
        self.old_urls = set()
    def has_new_url(self):
        return self.new_url_size() != 0
    def add_new_url(self
```
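The truncated class can be completed from the interface table above; a minimal in-memory sketch following the listed method names:

```python
class URLManager:
    """Deduplicating URL manager backed by two in-memory sets."""

    def __init__(self):
        self.new_urls = set()   # URLs waiting to be crawled
        self.old_urls = set()   # URLs already crawled

    def has_new_url(self):
        return self.new_url_size() != 0

    def add_new_url(self, url):
        # dedup: skip URLs already seen, crawled or pending
        if url is None:
            return
        if url not in self.new_urls and url not in self.old_urls:
            self.new_urls.add(url)

    def add_new_urls(self, urls):
        if not urls:
            return
        for url in urls:
            self.add_new_url(url)

    def get_new_url(self):
        # move one URL from the pending set to the crawled set
        new_url = self.new_urls.pop()
        self.old_urls.add(new_url)
        return new_url

    def new_url_size(self):
        return len(self.new_urls)

    def old_url_size(self):
        return len(self.old_urls)
```

Because both sets are consulted in add_new_url, a link that was already crawled is never re-queued, which is what breaks the crawl loop the text warns about.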

The default path when downloading a file via HTTP with Python's urlretrieve

拥有回忆 Submitted on 2019-12-11 12:37:23
Question: We know that we can use urllib.urlretrieve to download a file via HTTP to the local file system. For example:

```python
import urllib
urllib.urlretrieve("http://www.example.com/songs/mp3.mp3", "mp3.mp3")
```

I wonder where the default path is if we download a file like mp3.mp3? I have read the Python documentation:

urllib.urlretrieve(url[, filename[, reporthook[, data]]])

Copy a network object denoted by a URL to a local file, if necessary. If the URL points to a local file, or a valid cached copy of the object …
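A short sketch of where the two cases land (the URL is the example one from the question; the actual downloads are left commented out, so no network access is needed):

```python
import os
import tempfile

# With an explicit relative filename, the download lands in the current
# working directory:
#   urllib.request.urlretrieve("http://www.example.com/songs/mp3.mp3", "mp3.mp3")
# saves to os.path.join(os.getcwd(), "mp3.mp3").
target = os.path.join(os.getcwd(), "mp3.mp3")

# With no filename at all, urlretrieve creates a named temporary file and
# returns its path as the first element of the result tuple:
#   path, headers = urllib.request.urlretrieve("http://www.example.com/songs/mp3.mp3")
# path then lives under the system temp directory:
tmp_dir = tempfile.gettempdir()
```

So "mp3.mp3" is not a fixed default path; it is always resolved relative to wherever the script happens to be running.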

How can I retrieve files with User-Agent headers in Python 3?

喜欢而已 Submitted on 2019-12-11 12:12:47
Question: I'm trying to write a (simple) piece of code to download files off the internet. The problem is, some of these files are on websites that block the default Python User-Agent header. For example:

```python
import urllib.request as html
html.urlretrieve('http://stackoverflow.com', 'index.html')
```

returns

urllib.error.HTTPError: HTTP Error 403: Forbidden

Normally, I would set the headers in the request, such as:

```python
import urllib.request as html
request = html.Request('http://stackoverflow.com', headers={ …
```
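Two common workarounds, sketched under the assumption that a browser-style User-Agent is enough to get past the block:

```python
import urllib.request

# Option 1: skip urlretrieve and stream the response of a Request,
# which does accept a headers dict:
def retrieve_with_ua(url, filename, user_agent="Mozilla/5.0"):
    req = urllib.request.Request(url, headers={"User-Agent": user_agent})
    with urllib.request.urlopen(req) as resp, open(filename, "wb") as out:
        out.write(resp.read())

# Option 2: install a global opener; urlretrieve routes through the
# installed opener, so its headers apply to the download too:
opener = urllib.request.build_opener()
opener.addheaders = [("User-Agent", "Mozilla/5.0")]
urllib.request.install_opener(opener)
# urllib.request.urlretrieve('http://stackoverflow.com', 'index.html')
```

Option 2 keeps the familiar urlretrieve call; option 1 avoids mutating global state.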

Remove 'urllib.error.HTTPError: HTTP Error 302:' from urlReq(url)

耗尽温柔 Submitted on 2019-12-11 10:58:51
Question: Hey guys, what's up? :) I'm trying to scrape a website with some URL parameters. If I use url1, url2, or url3, it works properly and prints the regular output I want (HTML):

```python
import bs4
from urllib.request import urlopen as urlReq
from bs4 import BeautifulSoup as soup

# create urls
url1 = 'https://en.titolo.ch/sale'
url2 = 'https://en.titolo.ch/sale?limit=108'
url3 = 'https://en.titolo.ch/sale?category_styles=29838_21212'
url4 = 'https://en.titolo.ch/sale?category_styles=31066&limit=108'
# …
```
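One frequent cause of urllib's "HTTP Error 302" on a URL the browser handles fine is a redirect loop that only resolves when cookies are kept across the redirect chain. A sketch using a cookie-aware opener (the User-Agent value is an assumption, and the actual network call is left commented out):

```python
import urllib.request
from http.cookiejar import CookieJar

# url4 is the failing URL from the question
url4 = 'https://en.titolo.ch/sale?category_styles=31066&limit=108'

# HTTPCookieProcessor stores cookies set during the redirect chain and
# sends them back, which is what browsers do and bare urlopen does not
cookie_jar = CookieJar()
opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(cookie_jar))
opener.addheaders = [('User-Agent', 'Mozilla/5.0')]

# html = opener.open(url4).read()  # network call, commented out here
```

If the loop persists even with cookies, comparing the browser's request headers against urllib's is the next step.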

Python 3.4: HTTP Error 505 when retrieving JSON from a URL

独自空忆成欢 Submitted on 2019-12-11 10:04:47
Question: I am trying to connect to a page that takes in some values and returns some data in JSON format, using urllib in Python 3.4. I want to save the values returned from the JSON into a CSV file. This is what I tried:

```python
import json
import urllib.request

url = 'my_link/select?wt=json&indent=true&f=value'
response = urllib.request.Request(url)
response = urllib.request.urlopen(response)
data = response.read()
```

I am getting the error below:

urllib.error.HTTPError: HTTP Error 505: HTTP Version Not …
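A 505 often means the request line itself was malformed, for example an unencoded space in the URL makes the server misread the rest of the line as the HTTP version. A sketch of percent-encoding the URL first (the host is the placeholder from the question, and the space in the parameter value is an assumed illustration, not part of the original URL):

```python
import urllib.parse

# an unencoded space in a query value corrupts the HTTP request line
raw = 'http://my_link/select?wt=json&indent=true&f=some value'

# quote the unsafe characters but leave the URL delimiters intact
safe = urllib.parse.quote(raw, safe='/:?&=')
print(safe)
```

After encoding, the urlopen call from the question can be made with `safe` instead of the raw string.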

How to send a request without 'Host Header' using Python?

大兔子大兔子 Submitted on 2019-12-11 09:22:39
Question: I have been trying for many days now, so here I am finally asking; this may be a dumb question for most of the experts. I am using PyUnit for API testing of my application. The application (to be tested) is deployed on one of the local servers here. The application prevents hackers from doing malicious activities, so I access any website (protected by this application) through it, e.g. http://my-security-app/stackoverflow/login , http://my-security-app/website-to-be …
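urllib (like any HTTP/1.1 client) always adds a Host header, so omitting it means writing the request by hand over a raw socket with HTTP/1.0, where Host is optional. A minimal sketch; the server, port, and path are placeholders:

```python
import socket

def build_request(path="/"):
    # Hand-built HTTP/1.0 request with no Host header at all.
    # HTTP/1.1 requires Host, which is why urllib always sends one.
    return "GET {} HTTP/1.0\r\n\r\n".format(path)

def get_without_host(server, port=80, path="/"):
    """Send the bare request and collect the raw response bytes."""
    with socket.create_connection((server, port), timeout=10) as s:
        s.sendall(build_request(path).encode("ascii"))
        chunks = []
        while True:
            data = s.recv(4096)
            if not data:
                break
            chunks.append(data)
    return b"".join(chunks)
```

Note that many servers (and most virtual-hosting setups) will refuse or misroute a request without Host, which may be exactly the behavior the security application is being tested for.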

How to follow a redirect with urllib?

柔情痞子 Submitted on 2019-12-11 09:02:43
Question: I'm creating a script in Python 3 which accesses a page like example.com/daora/zz.asp?x=qqrzzt using urllib.request.urlopen("example.com/daora/zz.asp?x=qqrzzt"), but this code just gives me the same page (example.com/daora/zz.asp?x=qqrzzt), while in the browser I get redirected to a page like example.com/egg.aspx. What could I do to retrieve example.com/egg.aspx instead of example.com/daora/zz.asp?x=qqrzzt? I think this is the relevant code, from "example.com/daora/zz.asp?x …
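urlopen follows real HTTP 3xx redirects on its own, and `geturl()` reports the final address; when the browser is redirected but urllib is not, the page is usually redirecting via a meta-refresh tag or JavaScript, which has to be parsed out of the HTML. A sketch of both (the regex is a rough heuristic, not a full HTML parser):

```python
import re
import urllib.request

def final_url(url):
    """Return the URL urllib actually ended up at after HTTP redirects."""
    with urllib.request.urlopen(url) as resp:
        return resp.geturl()

def meta_refresh_target(html):
    """Extract the target of a <meta http-equiv="refresh"> redirect, if any."""
    m = re.search(r'http-equiv=["\']refresh["\'][^>]*url=([^"\'>]+)',
                  html, re.IGNORECASE)
    return m.group(1) if m else None
```

If `meta_refresh_target` returns a relative path like /egg.aspx, join it with the original URL (urllib.parse.urljoin) and fetch that.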

Timeout error when downloading .html files from URLs

China☆狼群 Submitted on 2019-12-11 09:00:30
Question: I get the following error when downloading HTML pages from the URLs.

Error:

```
raise URLError(err)
urllib2.URLError: <urlopen error [Errno 10060] A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond>
```

Code:

```python
import urllib2

hdr = {'User-Agent': 'Mozilla/5.0'}
for i, site in enumerate(urls[index]):
    print(site)
    req = urllib2.Request(site, headers=hdr)
    page = urllib2 …
```
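A sketch of the usual mitigation, written for Python 3's urllib.request (the question uses Python 2's urllib2): pass an explicit timeout so a dead host fails fast instead of hanging on the OS default, and retry with backoff. The `opener` parameter is an added injection point for testing, not part of urllib:

```python
import time
import urllib.error
import urllib.request

hdr = {'User-Agent': 'Mozilla/5.0'}

def fetch(site, retries=3, timeout=30, opener=None):
    """Fetch one page with an explicit socket timeout and simple retries.

    Errno 10060 is the Windows connection timeout; an unreachable host in
    the URL list will raise it no matter how long we wait, so bounded
    retries plus a timeout keep the loop moving.
    """
    open_fn = opener or (lambda req: urllib.request.urlopen(req, timeout=timeout))
    for attempt in range(retries):
        try:
            req = urllib.request.Request(site, headers=hdr)
            return open_fn(req).read()
        except urllib.error.URLError:
            if attempt == retries - 1:
                raise
            time.sleep(2 ** attempt)  # back off: 1 s, 2 s, ...
```

In the question's loop, wrapping the per-site fetch in try/except also lets one bad URL be logged and skipped instead of aborting the whole run.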

urlretrieve returning TypeError

僤鯓⒐⒋嵵緔 Submitted on 2019-12-11 08:54:00
Question: I don't know why my code is returning this error; I can't seem to debug it.

TypeError: expected string or bytes-like object

Here is what I'm using to download:

```python
self.headers = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8'
}
self.request = urllib.request.Request(url, headers=self.headers)
urllib.request.urlretrieve(self.request, reporthook=report)
```

Answer 1: It appears that urlretrieve doesn't allow the sending of headers, and the error you're getting is because …
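The answer's point can be sketched directly: urlretrieve's first argument must be a URL string, and handing it a Request object is what triggers the TypeError. To send headers, open the Request yourself (which does accept them) and write the body to disk:

```python
import urllib.request

# the Accept header is copied from the question
headers = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8'
}

def retrieve_with_headers(url, filename):
    """urlopen accepts a Request, so headers travel with the download."""
    req = urllib.request.Request(url, headers=headers)
    with urllib.request.urlopen(req) as resp, open(filename, 'wb') as out:
        out.write(resp.read())
```

A progress reporthook can be recreated, if needed, by reading the response in chunks and counting bytes against the Content-Length header.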

Python urllib simple login script

删除回忆录丶 Submitted on 2019-12-11 08:16:39
Question: I am trying to make a script to log into the "check card balance" service of my university using Python. Basically it's a web form where we fill in our PIN and PASS and it shows us how much $$$ is left on our card (for food). This is the webpage: http://www.wcu.edu/11407.asp. This is the form I am filling in:

```html
<FORM method=post action=https://itapp.wcu.edu/BanAuthRedirector/Default.aspx>
<INPUT value=https://cf.wcu.edu/busafrs/catcard/idsearch.cfm type=hidden name=wcuirs_uri>
<P><B …
```
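A sketch of posting that form with urllib: the action URL and the hidden wcuirs_uri field are taken from the form above, while the PIN/PASS field names and values are placeholders, since the part of the form that names them is cut off:

```python
import urllib.parse
import urllib.request

# wcuirs_uri is the hidden field from the form; 'pin' and 'pass' are
# assumed field names and must be replaced with the real ones from the
# full form HTML
form = {
    'wcuirs_uri': 'https://cf.wcu.edu/busafrs/catcard/idsearch.cfm',
    'pin': '00000000',
    'pass': 'secret',
}

# passing data= makes urllib issue a POST, matching method=post in the form
data = urllib.parse.urlencode(form).encode('ascii')
req = urllib.request.Request(
    'https://itapp.wcu.edu/BanAuthRedirector/Default.aspx', data=data)

# with urllib.request.urlopen(req) as resp:  # network call, commented out
#     balance_html = resp.read()
```

If the login sets a session cookie before showing the balance, the request should go through an opener built with http.cookiejar, as in the redirect question earlier on this page.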