urllib

How to increment alphanumeric values

こ雲淡風輕ζ submitted on 2019-12-12 18:31:12

Question: Currently I'm working on a program. I'd like it to increment a 5-character alphanumeric value. (Sorry if "increment" is not the correct word.) I'd like the program to start at 55aa0 and end at 99zz9. The reason I'd like it to start at 55aa0 and not 00aa0 is that, for what I'm doing, anything earlier would be a waste of time. I'd also like to assign that value to a variable and append it to the end of another variable, and call the result url. So for example the url could be: domain.de
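From the two endpoints 55aa0 and 99zz9, the pattern appears to be digit-digit-letter-letter-digit. A minimal sketch under that assumption, walking the values in order with itertools.product (the domain.de base URL is the questioner's placeholder):

```python
import itertools
import string

def codes(start="55aa0", end="99zz9"):
    """Yield 5-character codes in the assumed digit-digit-letter-letter-digit
    pattern, from `start` through `end` inclusive."""
    digits = string.digits
    letters = string.ascii_lowercase
    emitting = False
    for d1, d2, a, b, d3 in itertools.product(digits, digits, letters, letters, digits):
        code = d1 + d2 + a + b + d3
        if code == start:
            emitting = True       # skip everything before the start value
        if emitting:
            yield code
        if code == end:
            return

# Build a URL from each code (base URL is a placeholder from the question):
urls = ("http://domain.de/" + c for c in codes())
```

Because itertools.product varies the last position fastest, the sequence runs 55aa0, 55aa1, ... 55aa9, 55ab0, and so on up to 99zz9.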

Why does text retrieved from pages sometimes look like gibberish?

你。 submitted on 2019-12-12 17:42:05

Question: I'm using urllib and urllib2 in Python to open and read webpages, but sometimes the text I get is unreadable. For example, if I run this: import urllib text = urllib.urlopen('http://tagger.steve.museum/steve/object/141913').read() print text I get some unreadable text. I've read these posts: "Gibberish from urlopen" and "Does python urllib2 automatically uncompress gzip data fetched from webpage?" but can't seem to find my answer. Thank you in advance for your help! UPDATE: I fixed the problem by
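A frequent cause of this "gibberish" is that the server returned gzip-compressed bytes, as the second linked post suggests. A sketch that decompresses and decodes the body; splitting out the decode helper is my choice, not from the question:

```python
import gzip
import urllib.request

def decode_body(data, content_encoding, charset="utf-8"):
    """Decompress gzip-encoded response bytes, then decode them to text."""
    if content_encoding == "gzip":
        data = gzip.decompress(data)
    return data.decode(charset, errors="replace")

def fetch_text(url):
    with urllib.request.urlopen(url) as resp:
        return decode_body(
            resp.read(),
            resp.headers.get("Content-Encoding"),
            resp.headers.get_content_charset() or "utf-8",
        )
```

Checking the Content-Encoding header first means plain responses pass through untouched.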

Python urllib freezes with specific URL

你离开我真会死。 submitted on 2019-12-12 14:21:21

Question: I am trying to fetch a page, and urlopen hangs and never returns anything, although the web page is very light and can be opened with any browser without any problems. import urllib.request with urllib.request.urlopen("http://www.planalto.gov.br/ccivil_03/_Ato2007-2010/2008/Lei/L11882.htm") as response: print(response.read()) This simple code just freezes while retrieving the response, but if you try to open http://www.planalto.gov.br/ccivil_03/_Ato2007-2010/2008/Lei/L11882.htm it opens without
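Two defensive changes usually expose what is going on: pass a timeout so the call fails fast instead of hanging forever, and send a browser-like User-Agent, since some servers stall or drop requests carrying Python's default client string. A sketch, not a guaranteed fix for this particular server:

```python
import urllib.request

def build_request(url):
    # Browser-like User-Agent; some servers never respond to Python's default one.
    return urllib.request.Request(url, headers={"User-Agent": "Mozilla/5.0"})

def fetch(url, timeout=10):
    # The timeout makes a stalled connection raise instead of freezing the script.
    with urllib.request.urlopen(build_request(url), timeout=timeout) as resp:
        return resp.read()
```

If the call now raises a timeout error, the server is the one stalling, and the User-Agent header is the first thing worth varying.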

How to get only first class' data between two same classes

房东的猫 submitted on 2019-12-12 12:33:56

Question: On the https://www.hltv.org/matches page, matches are divided by date but the classes are the same. I mean, this is today's match block: <div class="match-day"><div class="standard-headline">2018-05-01</div> and this is tomorrow's match block: <div class="match-day"><div class="standard-headline">2018-05-02</div> What I'm trying to do is get the links under the "standard-headline" class, but only for today's matches, i.e. only the first block. Here is my code. import urllib.request from bs4 import
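Since the blocks appear in document order, soup.find() (as opposed to find_all) returns only the first match-day div, and you can then search for links inside just that element. A sketch against a simplified copy of the structure described; the sample HTML and link paths are illustrative, not taken from the real page:

```python
from bs4 import BeautifulSoup

html = """
<div class="match-day"><div class="standard-headline">2018-05-01</div>
  <a href="/matches/1">first match</a>
</div>
<div class="match-day"><div class="standard-headline">2018-05-02</div>
  <a href="/matches/2">second match</a>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
today = soup.find("div", class_="match-day")      # find() stops at the first block
links = [a["href"] for a in today.find_all("a")]  # anchors inside that block only
```

Restricting find_all to the `today` element is what keeps tomorrow's links out of the result.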

a (presumably basic) web scraping of http://www.ssa.gov/cgi-bin/popularnames.cgi in urllib

佐手、 submitted on 2019-12-12 10:44:22

Question: I am very new to Python (and web scraping). Let me ask you a question. Many websites do not actually expose their specific URLs in Firefox or other browsers. For example, the Social Security Administration shows popular baby names with ranks (since 1880), but the URL does not change when I change the year from 1880 to 1881. It stays constant: http://www.ssa.gov/cgi-bin/popularnames.cgi Because I don't know the specific URL, I could not download the webpage using urllib. In this page source, it includes:
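The URL stays the same because the page is a CGI form submitted via POST: the year travels in the request body, not in the URL. Under that assumption, you can replay the form with urllib; the field names below ("year", "top", "number") are guesses, so inspect the <form> element in the page source for the real ones:

```python
import urllib.parse
import urllib.request

def build_year_request(year):
    # Field names are assumptions; check the page's <form> for the real ones.
    body = urllib.parse.urlencode(
        {"year": str(year), "top": "25", "number": "n"}
    ).encode()
    # Passing data= makes urllib issue a POST instead of a GET.
    return urllib.request.Request(
        "http://www.ssa.gov/cgi-bin/popularnames.cgi", data=body
    )

# Usage sketch:
# with urllib.request.urlopen(build_year_request(1881)) as resp:
#     html = resp.read()
```

Watching the request in the browser's network inspector while changing the year shows the exact field names and values to replicate.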

Logging into quora using python

痞子三分冷 submitted on 2019-12-12 09:09:48

Question: I tried logging into Quora using Python, but it gives me the following error: urllib2.HTTPError: HTTP Error 500: Internal Server Error This is my code so far. I also work behind a proxy. import urllib2 import urllib import re import cookielib class Quora: def __init__(self): '''Initialising and authentication''' auth = 'http://name:password@proxy:port' cj = cookielib.CookieJar() logindata = urllib.urlencode({'email' : 'email' , 'password' : 'password'}) handler = urllib2.ProxyHandler({'http
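The question's code is Python 2 (urllib2/cookielib). A Python 3 sketch of the same handler chain, proxy plus cookie jar, is below; note that a 500 from a login endpoint often means the site expects extra form fields (such as a CSRF token), so the handler chain alone may not be enough:

```python
import http.cookiejar
import urllib.parse
import urllib.request

def make_opener(proxy_url=None):
    """Build an opener that keeps cookies across requests and, optionally,
    routes through a proxy (e.g. "http://name:password@proxy:port")."""
    jar = http.cookiejar.CookieJar()
    handlers = [urllib.request.HTTPCookieProcessor(jar)]
    if proxy_url:
        handlers.append(
            urllib.request.ProxyHandler({"http": proxy_url, "https": proxy_url})
        )
    return urllib.request.build_opener(*handlers)

opener = make_opener()
login_data = urllib.parse.urlencode(
    {"email": "email", "password": "password"}
).encode()
# opener.open(login_url, login_data)  # login_url elided in the question
```

The cookie jar is what lets the session survive the redirect that normally follows a successful login.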

Download file from Blob URL with Python

£可爱£侵袭症+ submitted on 2019-12-12 07:23:46

Question: I wish to have my Python script download the Master data (Download, XLSX) Excel file from this Frankfurt stock exchange webpage. When I retrieve it with urllib and wget, it turns out that the URL leads to a Blob, and the file downloaded is only 289 bytes and unreadable. http://www.xetra.com/blob/1193366/b2f210876702b8e08e40b8ecb769a02e/data/All-tradable-ETFs-ETCs-and-ETNs.xlsx I'm entirely unfamiliar with Blobs and have these questions: Can the file "behind the Blob" be successfully
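The 289-byte download is almost certainly a short error page rather than the spreadsheet. Two checks help: send browser-like headers (an assumption about what this server wants), and verify the payload starts with the ZIP magic bytes "PK", since .xlsx files are ZIP archives. A sketch:

```python
import urllib.request

def looks_like_xlsx(first_bytes):
    # A real .xlsx file is a ZIP archive and starts with the "PK" magic bytes;
    # an HTML error page will not.
    return first_bytes[:2] == b"PK"

def download(url, dest):
    req = urllib.request.Request(url, headers={"User-Agent": "Mozilla/5.0"})
    with urllib.request.urlopen(req) as resp:
        data = resp.read()
    if not looks_like_xlsx(data):
        raise ValueError("got %d bytes that are not an xlsx file" % len(data))
    with open(dest, "wb") as out:
        out.write(data)
```

The magic-byte check turns a silent bad download into a loud error, which makes the underlying server behavior much easier to debug.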

urllib.error.HTTPError: HTTP Error 403: Forbidden

守給你的承諾、 submitted on 2019-12-12 06:47:14

Question: I get the error "urllib.error.HTTPError: HTTP Error 403: Forbidden" when scraping certain pages, and understand that adding something like hdr = {'User-Agent': 'Mozilla/5.0'} to the headers is the solution for this. However, I can't make it work when the URLs I'm trying to scrape are in a separate source file. How/where can I add the User-Agent to the code below? from bs4 import BeautifulSoup import urllib.request as urllib2 import time list_open = open("source-urls.txt") read_list = list_open
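Reading the URLs from a file does not change where the header goes: build a Request carrying the User-Agent for each URL inside the loop. A sketch; the filename source-urls.txt is from the question, while the one-URL-per-line format is an assumption:

```python
import urllib.request

HEADERS = {"User-Agent": "Mozilla/5.0"}

def read_urls(path):
    """One URL per line; blank lines are skipped (format is an assumption)."""
    with open(path) as fh:
        return [line.strip() for line in fh if line.strip()]

def fetch(url):
    req = urllib.request.Request(url, headers=HEADERS)  # header attached per request
    with urllib.request.urlopen(req) as resp:
        return resp.read()

# Usage sketch:
# for url in read_urls("source-urls.txt"):
#     page = fetch(url)
```

Because the header lives on each Request object, it applies no matter where the URL string came from.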

Retrying on Connection Reset

旧时模样 submitted on 2019-12-12 05:48:29

Question: I'm using urllib.request to download files from the internet. However, sometimes I get Connection Reset by Peer and I want to retry. I tried the following, but it seems that e.errno contains a socket error and not an actual errno: while True: try: filename, headers = urllib.request.urlretrieve(url) break except IOError as e: if e.errno != errno.ECONNRESET: raise except Exception as e: raise Any suggestions? Answer 1: Well, this part is not needed, first of all: except Exception as e: raise And the
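In Python 3 the clean way is to catch ConnectionResetError directly, since it is the OSError subclass raised for ECONNRESET, and to bound the retries rather than loop forever. A generic sketch; the helper name is mine:

```python
import urllib.request

def retry_on_reset(func, attempts=3):
    """Call func(), retrying only when the connection is reset by the peer."""
    for attempt in range(attempts):
        try:
            return func()
        except ConnectionResetError:
            if attempt == attempts - 1:
                raise  # out of retries: re-raise the last reset

# Usage sketch:
# filename, headers = retry_on_reset(lambda: urllib.request.urlretrieve(url))
```

Any other exception (DNS failure, 404, and so on) propagates immediately, which matches the questioner's intent of retrying only on resets.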

Python project: little issue I can't seem to figure out when printing something

时光怂恿深爱的人放手 submitted on 2019-12-12 04:42:27

Question: So I've recently been adventuring around with Python, and I've been attempting to learn a few things by mixing code that I find and making it into something I could end up using in the future. I've almost completed the project, although when I print out the links it says https://v3rmillion.net/showthread.php instead of what I would prefer, something like: https://v3rmillion.net/showthread.php?tid=393794 import requests,os,urllib,sys, webbrowser, bs4 from bs4 import
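A common cause of the missing ?tid=... part is splitting or rebuilding the URL instead of reading the anchor's href attribute, which keeps the query string intact. A sketch using urljoin to resolve a relative href against the site root; the href value is taken from the question's desired output:

```python
from urllib.parse import urljoin

base = "https://v3rmillion.net/"
href = "showthread.php?tid=393794"  # e.g. link.get("href") from a bs4 anchor

full_url = urljoin(base, href)     # the ?tid=... query string is preserved
```

By contrast, something like href.split("?")[0] would produce exactly the truncated https://v3rmillion.net/showthread.php the questioner is seeing.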