urllib

A basic Baidu Baike crawler

﹥>﹥吖頭↗ Submitted on 2019-12-11 13:01:45
Source: Python爬虫开发与项目实战 (Python Crawler Development and Project Practice), by Fan Chuanhui.

Basic crawler framework:
- Crawler scheduler: coordinates the other four modules
- URL manager: maintains the set of already-crawled URLs and hands out new, uncrawled URL links
- HTML downloader: takes a URL from the URL manager and downloads the HTML page
- HTML parser: extracts the useful data from the downloaded page
- Data store: persists the extracted data

1. URL manager (URLManager.py)

Deduplication. Without it, repeated links easily lead to infinite crawl loops. Options: (1) in-memory deduplication, (2) deduplication in a relational database, (3) deduplication in a cache database. In a small crawler, a set makes deduplication easy.

Interface the URL manager should expose:
- has_new_url(): whether any uncrawled URLs remain
- add_new_url(url): add a new URL to the uncrawled set
- add_new_urls(urls): add a batch of new URLs to the uncrawled set
- get_new_url(): take one uncrawled URL
- new_url_size(): size of the uncrawled set
- old_url_size(): size of the crawled set

The code (cut off in the original):

```python
class URLManager:
    def __init__(self):
        self.new_urls = set()
        self.old_urls = set()
    def has_new_url(self):
        return self.new_url_size() != 0
    def add_new_url(self
```
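The truncated class can be completed from the interface table above; a minimal in-memory sketch following the listed method names:

```python
class URLManager:
    """Deduplicating URL manager backed by two in-memory sets."""

    def __init__(self):
        self.new_urls = set()   # URLs waiting to be crawled
        self.old_urls = set()   # URLs already crawled

    def has_new_url(self):
        return self.new_url_size() != 0

    def add_new_url(self, url):
        # dedup: skip URLs already seen, crawled or pending
        if url is None:
            return
        if url not in self.new_urls and url not in self.old_urls:
            self.new_urls.add(url)

    def add_new_urls(self, urls):
        if not urls:
            return
        for url in urls:
            self.add_new_url(url)

    def get_new_url(self):
        # move one URL from the pending set to the crawled set
        new_url = self.new_urls.pop()
        self.old_urls.add(new_url)
        return new_url

    def new_url_size(self):
        return len(self.new_urls)

    def old_url_size(self):
        return len(self.old_urls)
```

Because both sets are consulted in add_new_url, a link that was already crawled is never re-queued, which is what breaks the crawl loop the text warns about.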

The default path when downloading a file via HTTP with Python's urlretrieve

拥有回忆 Submitted on 2019-12-11 12:37:23
Question: We know that we can use urllib.urlretrieve to download a file via HTTP to the local file system. For example:

```python
import urllib
urllib.urlretrieve("http://www.example.com/songs/mp3.mp3", "mp3.mp3")
```

I wonder where the default path is if we download a file like mp3.mp3? I have read the Python documentation:

urllib.urlretrieve(url[, filename[, reporthook[, data]]])

Copy a network object denoted by a URL to a local file, if necessary. If the URL points to a local file, or a valid cached copy of the object …
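A short sketch of where the two cases land (the URL is the example one from the question; the actual downloads are left commented out, so no network access is needed):

```python
import os
import tempfile

# With an explicit relative filename, the download lands in the current
# working directory:
#   urllib.request.urlretrieve("http://www.example.com/songs/mp3.mp3", "mp3.mp3")
# saves to os.path.join(os.getcwd(), "mp3.mp3").
target = os.path.join(os.getcwd(), "mp3.mp3")

# With no filename at all, urlretrieve creates a named temporary file and
# returns its path as the first element of the result tuple:
#   path, headers = urllib.request.urlretrieve("http://www.example.com/songs/mp3.mp3")
# path then lives under the system temp directory:
tmp_dir = tempfile.gettempdir()
```

So "mp3.mp3" is not a fixed default path; it is always resolved relative to wherever the script happens to be running.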

How can I retrieve files with User-Agent headers in Python 3?

喜欢而已 Submitted on 2019-12-11 12:12:47
Question: I'm trying to write a (simple) piece of code to download files off the internet. The problem is, some of these files are on websites that block the default Python User-Agent header. For example:

```python
import urllib.request as html
html.urlretrieve('http://stackoverflow.com', 'index.html')
```

returns

urllib.error.HTTPError: HTTP Error 403: Forbidden

Normally, I would set the headers in the request, such as:

```python
import urllib.request as html
request = html.Request('http://stackoverflow.com', headers={ …
```
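Two common workarounds, sketched under the assumption that a browser-style User-Agent is enough to get past the block:

```python
import urllib.request

# Option 1: skip urlretrieve and stream the response of a Request,
# which does accept a headers dict:
def retrieve_with_ua(url, filename, user_agent="Mozilla/5.0"):
    req = urllib.request.Request(url, headers={"User-Agent": user_agent})
    with urllib.request.urlopen(req) as resp, open(filename, "wb") as out:
        out.write(resp.read())

# Option 2: install a global opener; urlretrieve routes through the
# installed opener, so its headers apply to the download too:
opener = urllib.request.build_opener()
opener.addheaders = [("User-Agent", "Mozilla/5.0")]
urllib.request.install_opener(opener)
# urllib.request.urlretrieve('http://stackoverflow.com', 'index.html')
```

Option 2 keeps the familiar urlretrieve call; option 1 avoids mutating global state.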

Remove 'urllib.error.HTTPError: HTTP Error 302:' from urlReq(url)

耗尽温柔 Submitted on 2019-12-11 10:58:51
Question: Hey guys, what's up? :) I'm trying to scrape a website with some URL parameters. If I use url1, url2, or url3, it works properly and prints the regular output I want (HTML):

```python
import bs4
from urllib.request import urlopen as urlReq
from bs4 import BeautifulSoup as soup

# create urls
url1 = 'https://en.titolo.ch/sale'
url2 = 'https://en.titolo.ch/sale?limit=108'
url3 = 'https://en.titolo.ch/sale?category_styles=29838_21212'
url4 = 'https://en.titolo.ch/sale?category_styles=31066&limit=108'
# …
```
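One frequent cause of urllib's "HTTP Error 302" on a URL the browser handles fine is a redirect loop that only resolves when cookies are kept across the redirect chain. A sketch using a cookie-aware opener (the User-Agent value is an assumption, and the actual network call is left commented out):

```python
import urllib.request
from http.cookiejar import CookieJar

# url4 is the failing URL from the question
url4 = 'https://en.titolo.ch/sale?category_styles=31066&limit=108'

# HTTPCookieProcessor stores cookies set during the redirect chain and
# sends them back, which is what browsers do and bare urlopen does not
cookie_jar = CookieJar()
opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(cookie_jar))
opener.addheaders = [('User-Agent', 'Mozilla/5.0')]

# html = opener.open(url4).read()  # network call, commented out here
```

If the loop persists even with cookies, comparing the browser's request headers against urllib's is the next step.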

Python 3.4: HTTP Error 505 when retrieving JSON from a URL

独自空忆成欢 Submitted on 2019-12-11 10:04:47
Question: I am trying to connect to a page that takes in some values and returns some data in JSON format, using urllib in Python 3.4. I want to save the values returned from the JSON into a CSV file. This is what I tried:

```python
import json
import urllib.request

url = 'my_link/select?wt=json&indent=true&f=value'
response = urllib.request.Request(url)
response = urllib.request.urlopen(response)
data = response.read()
```

I am getting the error below:

urllib.error.HTTPError: HTTP Error 505: HTTP Version Not …
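A 505 often means the request line itself was malformed, for example an unencoded space in the URL makes the server misread the rest of the line as the HTTP version. A sketch of percent-encoding the URL first (the host is the placeholder from the question, and the space in the parameter value is an assumed illustration, not part of the original URL):

```python
import urllib.parse

# an unencoded space in a query value corrupts the HTTP request line
raw = 'http://my_link/select?wt=json&indent=true&f=some value'

# quote the unsafe characters but leave the URL delimiters intact
safe = urllib.parse.quote(raw, safe='/:?&=')
print(safe)
```

After encoding, the urlopen call from the question can be made with `safe` instead of the raw string.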

How to send a request without 'Host Header' using Python?

大兔子大兔子 Submitted on 2019-12-11 09:22:39
Question: I have been trying for many days now, so here I am finally asking; this may be a dumb question for most of the experts. I am using PyUnit for API testing of my application. The application (to be tested) is deployed on one of the local servers here. The application prevents hackers from doing malicious activities, so I access any website (protected by this application) through it, e.g. http://my-security-app/stackoverflow/login , http://my-security-app/website-to-be …
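urllib (like any HTTP/1.1 client) always adds a Host header, so omitting it means writing the request by hand over a raw socket with HTTP/1.0, where Host is optional. A minimal sketch; the server, port, and path are placeholders:

```python
import socket

def build_request(path="/"):
    # Hand-built HTTP/1.0 request with no Host header at all.
    # HTTP/1.1 requires Host, which is why urllib always sends one.
    return "GET {} HTTP/1.0\r\n\r\n".format(path)

def get_without_host(server, port=80, path="/"):
    """Send the bare request and collect the raw response bytes."""
    with socket.create_connection((server, port), timeout=10) as s:
        s.sendall(build_request(path).encode("ascii"))
        chunks = []
        while True:
            data = s.recv(4096)
            if not data:
                break
            chunks.append(data)
    return b"".join(chunks)
```

Note that many servers (and most virtual-hosting setups) will refuse or misroute a request without Host, which may be exactly the behavior the security application is being tested for.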

How to follow a redirect with urllib?

柔情痞子 Submitted on 2019-12-11 09:02:43
Question: I'm creating a script in Python 3 which accesses a page like example.com/daora/zz.asp?x=qqrzzt using urllib.request.urlopen("example.com/daora/zz.asp?x=qqrzzt"), but this code just gives me the same page (example.com/daora/zz.asp?x=qqrzzt), while in the browser I get redirected to a page like example.com/egg.aspx. What could I do to retrieve example.com/egg.aspx instead of example.com/daora/zz.asp?x=qqrzzt? I think this is the relevant code, from "example.com/daora/zz.asp?x …
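urlopen follows real HTTP 3xx redirects on its own, and `geturl()` reports the final address; when the browser is redirected but urllib is not, the page is usually redirecting via a meta-refresh tag or JavaScript, which has to be parsed out of the HTML. A sketch of both (the regex is a rough heuristic, not a full HTML parser):

```python
import re
import urllib.request

def final_url(url):
    """Return the URL urllib actually ended up at after HTTP redirects."""
    with urllib.request.urlopen(url) as resp:
        return resp.geturl()

def meta_refresh_target(html):
    """Extract the target of a <meta http-equiv="refresh"> redirect, if any."""
    m = re.search(r'http-equiv=["\']refresh["\'][^>]*url=([^"\'>]+)',
                  html, re.IGNORECASE)
    return m.group(1) if m else None
```

If `meta_refresh_target` returns a relative path like /egg.aspx, join it with the original URL (urllib.parse.urljoin) and fetch that.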

Timeout error when downloading .html files from URLs

China☆狼群 Submitted on 2019-12-11 09:00:30
Question: I get the following error when downloading HTML pages from the URLs.

Error:

```
raise URLError(err)
urllib2.URLError: <urlopen error [Errno 10060] A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond>
```

Code:

```python
import urllib2

hdr = {'User-Agent': 'Mozilla/5.0'}
for i, site in enumerate(urls[index]):
    print(site)
    req = urllib2.Request(site, headers=hdr)
    page = urllib2 …
```
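A sketch of the usual mitigation, written for Python 3's urllib.request (the question uses Python 2's urllib2): pass an explicit timeout so a dead host fails fast instead of hanging on the OS default, and retry with backoff. The `opener` parameter is an added injection point for testing, not part of urllib:

```python
import time
import urllib.error
import urllib.request

hdr = {'User-Agent': 'Mozilla/5.0'}

def fetch(site, retries=3, timeout=30, opener=None):
    """Fetch one page with an explicit socket timeout and simple retries.

    Errno 10060 is the Windows connection timeout; an unreachable host in
    the URL list will raise it no matter how long we wait, so bounded
    retries plus a timeout keep the loop moving.
    """
    open_fn = opener or (lambda req: urllib.request.urlopen(req, timeout=timeout))
    for attempt in range(retries):
        try:
            req = urllib.request.Request(site, headers=hdr)
            return open_fn(req).read()
        except urllib.error.URLError:
            if attempt == retries - 1:
                raise
            time.sleep(2 ** attempt)  # back off: 1 s, 2 s, ...
```

In the question's loop, wrapping the per-site fetch in try/except also lets one bad URL be logged and skipped instead of aborting the whole run.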

urlretrieve returning TypeError

僤鯓⒐⒋嵵緔 Submitted on 2019-12-11 08:54:00
Question: I don't know why my code is returning this error; I can't seem to debug it.

TypeError: expected string or bytes-like object

Here is what I'm using to download:

```python
self.headers = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8'
}
self.request = urllib.request.Request(url, headers=self.headers)
urllib.request.urlretrieve(self.request, reporthook=report)
```

Answer 1: It appears that urlretrieve doesn't allow the sending of headers, and the error you're getting is because …
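The answer's point can be sketched directly: urlretrieve's first argument must be a URL string, and handing it a Request object is what triggers the TypeError. To send headers, open the Request yourself (which does accept them) and write the body to disk:

```python
import urllib.request

# the Accept header is copied from the question
headers = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8'
}

def retrieve_with_headers(url, filename):
    """urlopen accepts a Request, so headers travel with the download."""
    req = urllib.request.Request(url, headers=headers)
    with urllib.request.urlopen(req) as resp, open(filename, 'wb') as out:
        out.write(resp.read())
```

A progress reporthook can be recreated, if needed, by reading the response in chunks and counting bytes against the Content-Length header.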

Python urllib simple login script

删除回忆录丶 Submitted on 2019-12-11 08:16:39
Question: I am trying to make a script to log into the "check card balance" service of my university using Python. Basically it's a web form where we fill in our PIN and PASS and it shows us how much $$$ is left on our card (for food). This is the webpage: http://www.wcu.edu/11407.asp. This is the form I am filling in:

```html
<FORM method=post action=https://itapp.wcu.edu/BanAuthRedirector/Default.aspx>
<INPUT value=https://cf.wcu.edu/busafrs/catcard/idsearch.cfm type=hidden name=wcuirs_uri>
<P><B …
```
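A sketch of posting that form with urllib: the action URL and the hidden wcuirs_uri field are taken from the form above, while the PIN/PASS field names and values are placeholders, since the part of the form that names them is cut off:

```python
import urllib.parse
import urllib.request

# wcuirs_uri is the hidden field from the form; 'pin' and 'pass' are
# assumed field names and must be replaced with the real ones from the
# full form HTML
form = {
    'wcuirs_uri': 'https://cf.wcu.edu/busafrs/catcard/idsearch.cfm',
    'pin': '00000000',
    'pass': 'secret',
}

# passing data= makes urllib issue a POST, matching method=post in the form
data = urllib.parse.urlencode(form).encode('ascii')
req = urllib.request.Request(
    'https://itapp.wcu.edu/BanAuthRedirector/Default.aspx', data=data)

# with urllib.request.urlopen(req) as resp:  # network call, commented out
#     balance_html = resp.read()
```

If the login sets a session cookie before showing the balance, the request should go through an opener built with http.cookiejar, as in the redirect question earlier on this page.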