urllib

Fetching Image from URL using BeautifulSoup

爷,独闯天下 submitted on 2019-12-02 08:14:26
I am trying to fetch the important images, not thumbnails or other GIFs, from the Wikipedia page using the following code. However, "imgs" comes back with a length of 0. Any suggestion on how to rectify it?

    import urllib
    import urllib2
    from bs4 import BeautifulSoup
    import os

    html = urllib2.urlopen("http://en.wikipedia.org/wiki/Main_Page")
    soup = BeautifulSoup(html)
    imgs = soup.findAll("div", {"class": "image"})

Also, if someone could explain in detail how to use findAll by looking at the "source element" in a web page, that would be awesome.

Answer: The a tags on the page have an image class, not div:
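Following that hint, a minimal sketch (Python 2 to match the question; it assumes Wikipedia's usual markup, where each full-size image link is an <a class="image"> element wrapping an <img>):

    import urllib2
    from bs4 import BeautifulSoup

    html = urllib2.urlopen("http://en.wikipedia.org/wiki/Main_Page")
    soup = BeautifulSoup(html)

    # Full-size image links are <a class="image"> elements, each wrapping an <img>.
    for link in soup.findAll("a", {"class": "image"}):
        img = link.find("img")
        if img is not None and img.get("src"):
            print img["src"]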

Find final redirected url in Python

旧城冷巷雨未停 submitted on 2019-12-02 08:01:53
    import time
    import requests

    def extractlink():
        with open('extractlink.txt', 'r') as g:
            print("opened extractlink.txt for reading")
            contents = g.read()
        headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.99 Safari/537.36'}
        r = requests.get(contents, headers=headers)
        print("Links to " + r.url)
        time.sleep(2)

Currently, r.url just points to the URL found in 'extractlink.txt'. I'm looking to fix this script to find the final redirected URL and print the result. It appears the issue lies somewhere in the request for the URL, despite …
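A hedged sketch of one likely fix: requests already follows redirects by default, so r.url should be the final address. A common gotcha is a trailing newline in the file, and r.history shows whether any redirects actually happened; the stripping and the history loop are the additions here:

    import requests

    with open('extractlink.txt', 'r') as g:
        url = g.read().strip()          # drop a stray trailing newline/whitespace

    r = requests.get(url, allow_redirects=True)   # allow_redirects is the default
    for hop in r.history:               # each intermediate redirect response
        print(hop.status_code, hop.url)
    print("Final URL: " + r.url)        # URL after all redirects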

Python urllib cache

不打扰是莪最后的温柔 submitted on 2019-12-02 07:57:55
I'm writing a script in Python that should determine whether it has internet access.

    import urllib

    CHECK_PAGE = "http://64.37.51.146/check.txt"
    CHECK_VALUE = "true\n"
    PROXY_VALUE = "Privoxy"
    OFFLINE_VALUE = ""

    page = urllib.urlopen(CHECK_PAGE)
    response = page.read()
    page.close()

    if response.find(PROXY_VALUE) != -1:
        urllib.getproxies = lambda x=None: {}
        page = urllib.urlopen(CHECK_PAGE)
        response = page.read()
        page.close()

    if response != CHECK_VALUE:
        print "'" + response + "' != '" + CHECK_VALUE + "'"
    # else: print "You are online!"

I use a proxy on my computer, so correct proxy handling is …
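A hedged alternative to monkey-patching urllib.getproxies: in Python 2, urllib.urlopen accepts a proxies argument, and an empty dict forces a direct, proxy-free connection. A sketch of that one change, not a full solution:

    import urllib

    CHECK_PAGE = "http://64.37.51.146/check.txt"

    # proxies={} bypasses any system/environment proxy for this call only.
    page = urllib.urlopen(CHECK_PAGE, proxies={})
    response = page.read()
    page.close()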

web2py url validator

泄露秘密 submitted on 2019-12-02 03:24:10
In a URL shortener built with web2py, I want to validate the URL first; if it isn't valid, the user should go back to the first page with an error message. This is my code in the controller (MVC architecture), but I don't see what's wrong:

    import urllib
    import random
    import string

    def index():
        return dict()

    def random_maker():
        url = request.vars.url
        try:
            urllib.urlopen(url)
            return dict(rand_url=''.join(random.choice(string.ascii_uppercase + string.digits + string.ascii_lowercase) for x in range(6)),
                        input_url=url)
        except IOError:
            return index()

BigHandsome: Couldn't you check the HTTP response code using httplib? If it is 200, then the page is valid, …
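A minimal sketch of the httplib status check BigHandsome suggests (Python 2; the is_valid_url helper name is hypothetical, and HTTPS targets would need httplib.HTTPSConnection instead):

    import httplib
    import urlparse

    def is_valid_url(url):
        try:
            parts = urlparse.urlparse(url)
            conn = httplib.HTTPConnection(parts.netloc, timeout=5)
            conn.request("HEAD", parts.path or "/")   # HEAD avoids downloading the body
            status = conn.getresponse().status
            conn.close()
            return status < 400   # 2xx/3xx: reachable; 4xx/5xx: treat as invalid
        except Exception:
            return False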

urllib.error.URLError: <urlopen error [Errno 11002] getaddrinfo failed>?

╄→гoц情女王★ submitted on 2019-12-02 03:17:10
Question: My code is only four lines. I am trying to connect to a website; what I am trying to do after that is irrelevant, because the error arises without the rest of the code.

    import urllib.request
    from bs4 import BeautifulSoup

    html = urllib.request.urlopen('http://python-data.dr-chuck.net/known_by_Fikret.html').read()
    soup = BeautifulSoup(html, 'html.parser')

And the error (succinctly summarized):

    for res in _socket.getaddrinfo(host, port, family, type, proto, flags):
    socket.gaierror: [Errno 11002] …
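Errno 11002 comes from the DNS lookup inside getaddrinfo, before any HTTP traffic is sent, so a reasonable first diagnostic is to test name resolution on its own. A sketch (Python 3; the host is taken from the question's URL, and typical causes are no network connection, a proxy that must be configured, or a firewall blocking lookups):

    import socket

    try:
        # Resolve the host exactly as urlopen would, but without the request.
        socket.getaddrinfo('python-data.dr-chuck.net', 80)
        print('DNS lookup OK')
    except socket.gaierror as e:
        print('DNS lookup failed:', e)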

Getting a file from an authenticated site (with python urllib, urllib2)

佐手、 submitted on 2019-12-02 02:57:18
I'm trying to get a queried Excel file from a site. When I enter the direct link, it leads to a login page, and once I've entered my username and password, it proceeds to download the Excel file automatically. I am trying to avoid installing any module that is not part of standard Python (this script will be running on a "standardized machine", and it won't work if the module is not installed). I've tried the following, but I see the login page's content in the Excel file itself:

    import urllib
    url = "myLink_queriedResult/result.xls"
    urllib.urlretrieve(url, "C:\\test.xls")

So …
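A hedged standard-library sketch of the usual fix: authenticate once through a cookie-aware urllib2 opener, then download the file through the same opener so the session cookie rides along. The login URL and form field names below are placeholders; the site's actual login form determines them:

    import cookielib
    import urllib
    import urllib2

    # One cookie jar shared by every request made through this opener.
    cj = cookielib.CookieJar()
    opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))

    # Placeholder login endpoint and field names.
    login_data = urllib.urlencode({'username': 'me', 'password': 'secret'})
    opener.open('http://example.com/login', login_data)

    # The session cookie set above is sent automatically on this request.
    with open('C:\\test.xls', 'wb') as f:
        f.write(opener.open('http://example.com/result.xls').read())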

Multi threaded web scraper using urlretrieve on a cookie-enabled site

若如初见. submitted on 2019-12-01 23:44:26
I am trying to write my first Python script, and with lots of Googling I think I am just about done; however, I need some help getting myself across the finish line. I need to write a script that logs onto a cookie-enabled site, scrapes a bunch of links, and then spawns a few processes to download the files. I have the program running single-threaded, so I know the code works. But when I tried to create a pool of download workers, I ran into a wall.

    # manager.py
    import Fetch  # the module where the worker lives
    from multiprocessing import pool

    def FetchReports(links, Username …
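One immediate detail to check: the worker-pool class is multiprocessing.Pool with a capital P; "from multiprocessing import pool" imports a module, not the class. Beyond that, a hedged sketch of the usual shape, where each worker process performs its own login in a Pool initializer, since cookie jars do not cross process boundaries (Fetch.login and Fetch.download are stand-ins for the question's own worker module):

    from multiprocessing import Pool

    import Fetch  # hypothetical worker module, as in the question

    def init_worker(username, password):
        # Runs once inside each worker process; each worker gets its own session.
        global session
        session = Fetch.login(username, password)

    def download(link):
        return Fetch.download(session, link)

    def FetchReports(links, username, password):
        p = Pool(4, initializer=init_worker, initargs=(username, password))
        try:
            p.map(download, links)
        finally:
            p.close()
            p.join()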

Using PDFMiner (Python) with online pdf files. Encode the url?

江枫思渺然 submitted on 2019-12-01 21:18:01
I want to extract the content of PDF files available online using PDFMiner. My code is based on the example in the documentation for extracting the content of PDF files on the hard disk:

    # Open a PDF file.
    fp = open('mypdf.pdf', 'rb')
    # Create a PDF parser object associated with the file object.
    parser = PDFParser(fp)
    # Create a PDF document object that stores the document structure.
    document = PDFDocument(parser)

That works quite well with some small changes. Now I have tried urllib2.urlopen for online PDFs, but that doesn't work. I get an error message: coercing to Unicode: …
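A common cause of a "coercing to Unicode" error here is handing the parser something other than a binary file-like object. A hedged sketch of the usual workaround (Python 2; the imports follow the PDFDocument(parser) API the question already uses, and the URL is a placeholder): read the response into an in-memory buffer, which gives PDFParser the seekable file object it expects:

    import urllib2
    from StringIO import StringIO
    from pdfminer.pdfparser import PDFParser
    from pdfminer.pdfdocument import PDFDocument

    response = urllib2.urlopen('http://example.com/some.pdf')  # placeholder URL
    fp = StringIO(response.read())   # in-memory, seekable stand-in for open(..., 'rb')
    parser = PDFParser(fp)
    document = PDFDocument(parser)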