urllib

BeautifulSoup not extracting all HTML

只谈情不闲聊 submitted on 2020-01-04 06:26:14
Question: We are trying to get product URLs from this page of Forever 21's site (http://www.forever21.com/Product/Category.aspx?br=f21&category=dress&pagesize=100&page=1). For some reason, BeautifulSoup is not getting the elements with class "item_pic", even though they are in the site's HTML. We have tried using requests, mechanize, and selenium, with no luck. All the commented code is from previous attempts to get the HTML (none of which worked). Here is our code:

    from bs4 import BeautifulSoup
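
A minimal sketch of the approach the question describes, using requests and BeautifulSoup; the URL and the class name "item_pic" come from the question, while the User-Agent header is an assumption (some sites serve stripped-down markup to clients without one). If the elements still do not appear, they are most likely rendered by JavaScript, and a browser-driven tool such as Selenium would be needed:

    import requests
    from bs4 import BeautifulSoup

    url = ("http://www.forever21.com/Product/Category.aspx"
           "?br=f21&category=dress&pagesize=100&page=1")
    # A browser-like User-Agent, in case the server varies its response by client
    resp = requests.get(url, headers={"User-Agent": "Mozilla/5.0"})
    soup = BeautifulSoup(resp.text, "html.parser")

    # Collect the href of every element carrying the class from the question
    for tag in soup.find_all(class_="item_pic"):
        link = tag.find("a") or tag
        print(link.get("href"))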

Verifying HTTPS certificates with urllib.request

本秂侑毒 submitted on 2020-01-03 16:49:27
Question: I am trying to open an https URL using the urlopen method in Python 3's urllib.request module. It seems to work fine, but the documentation warns that "[i]f neither cafile nor capath is specified, an HTTPS request will not do any verification of the server's certificate". I am guessing I need to specify one of those parameters if I don't want my program to be vulnerable to man-in-the-middle attacks, problems with revoked certificates, and other vulnerabilities. cafile and capath are supposed
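
A hedged sketch of one way to address this: build an ssl.SSLContext with ssl.create_default_context(), which loads the platform's trusted CA bundle and enables certificate and hostname checking, then pass it to urlopen via the context parameter (the URL below is a placeholder):

    import ssl
    import urllib.request

    # create_default_context() loads the system CA certificates and
    # turns on certificate and hostname verification
    ctx = ssl.create_default_context()

    with urllib.request.urlopen("https://example.com/", context=ctx) as resp:
        body = resp.read()

Note that since Python 3.4.3 (PEP 476), urlopen verifies certificates by default even without an explicit context; the warning quoted above applies to older releases.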

How do I remove a spurious tag in BeautifulSoup

为君一笑 submitted on 2020-01-03 05:05:38
Question: I'm pulling text from the Presidential debates. I got to one that has an issue: it errantly turns every mention of the word "debate" into a tag <debate> . Go ahead, search for "Welcome back to the Republican presidential"; notice an obvious word missing? Cool, so BeautifulSoup does a superb job of cleaning up messy HTML and adding closing tags where they should have been. But in this case, that mucks me up, because <debate> is now a child of a <p> and the closing </debate> is added allllll the
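
A minimal sketch of one way to undo the damage: find every spurious <debate> tag, re-insert the word the parser swallowed, and call unwrap() to splice the tag's children back into the parent. The raw_html string below is a toy stand-in for the mangled transcript markup:

    from bs4 import BeautifulSoup

    # Hypothetical fragment mimicking the mangled transcript
    raw_html = "<p>Welcome back to the Republican presidential <debate>in Miami.</p>"

    soup = BeautifulSoup(raw_html, "html.parser")
    for tag in soup.find_all("debate"):
        tag.insert_before("debate ")  # restore the word that became a tag name
        tag.unwrap()                  # keep the tag's children, drop the tag itself

    print(soup)  # <p>Welcome back to the Republican presidential debate in Miami.</p>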

QPX Express API from Python

一个人想着一个人 submitted on 2020-01-03 03:20:14
Question: I am trying to use Google's QPX Express API from Python. I keep running into a pair of issues when sending the request. At first, what I tried is this:

    url = "https://www.googleapis.com/qpxExpress/v1/trips/search?key=MY_KEY_HERE"
    values = {"request": {"passengers": {"kind": "qpxexpress#passengerCounts", "adultCount": 1},
                          "slice": [{"kind": "qpxexpress#sliceInput", "origin": "RDU",
                                     "destination": location, "date": dateGo}]}}
    data = json.dumps(values)
    req = urllib2.Request(url, data, {'Content
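
A sketch of the usual shape of such a request, under assumptions: MY_KEY_HERE is the question's placeholder, and location/dateGo (variables from the question) are given illustrative values here. The body is the json.dumps string, and an explicit Content-Type: application/json header tells the API not to treat it as form-encoded data:

    import json
    import urllib2

    url = "https://www.googleapis.com/qpxExpress/v1/trips/search?key=MY_KEY_HERE"
    location = "LAX"        # hypothetical destination
    dateGo = "2020-03-01"   # hypothetical travel date
    values = {"request": {
        "passengers": {"kind": "qpxexpress#passengerCounts", "adultCount": 1},
        "slice": [{"kind": "qpxexpress#sliceInput",
                   "origin": "RDU",
                   "destination": location,
                   "date": dateGo}],
    }}
    data = json.dumps(values)

    # The third argument is the headers dict; the JSON content type is essential
    req = urllib2.Request(url, data, {"Content-Type": "application/json"})
    response = urllib2.urlopen(req)
    print(response.read())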

urllib.request: POST data should be bytes, an iterable of bytes, or a file object

浪子不回头ぞ submitted on 2020-01-03 02:47:28
Question: I need to access an HTML website and search that website for images. It might not be pretty, but I am able to access the website; I just need some guidance on the best way to search for the <img> tags. I tried to treat it like a file, but I am getting an error saying I need to convert the data to bytes. Let me know what you think.

    from urllib import request
    import re

    website = request.urlopen('https://www.google.com', "rb")
    html = website.read()
    hand = html.decode("UTF-8")
    for line in hand:
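
Two things appear to be going wrong in the snippet: urlopen has no mode argument (its second positional parameter is the POST body, which is why it demands bytes), and iterating over a decoded string yields single characters, not lines. A hedged sketch of one way to pull image URLs out of the fetched page (the regex is a rough approximation; an HTML parser such as BeautifulSoup is more robust):

    import re
    from urllib import request

    # No second argument: passing "rb" makes urlopen treat it as POST data
    with request.urlopen("https://www.google.com") as resp:
        html = resp.read().decode("utf-8", errors="replace")

    # Grab the src attribute of every <img> tag
    for src in re.findall(r'<img[^>]+src="([^"]+)"', html):
        print(src)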

unable to send data using urllib and urllib2 (python)

吃可爱长大的小学妹 submitted on 2020-01-03 02:01:04
Question: Hello everybody (first post here). I am trying to send data to a webpage. This webpage requests two fields (a file and an e-mail address); if everything is OK, the webpage returns a page saying "everything is ok" and sends a file to the provided e-mail address. I execute the code below and get nothing in my e-mail account.

    import urllib, urllib2

    params = urllib.urlencode({'uploaded': open('file'), 'email': 'user@domain.com'})
    req = urllib2.urlopen('http://webpage.com', params)
    print req.read()
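
The likely culprit: urlencode() calls str() on the open file object, so the server receives the object's repr rather than the file's contents, and a plain urlencoded POST cannot carry a file upload anyway; that requires a multipart/form-data body. A minimal sketch using the requests library (field names, URL, and filename are the question's own placeholders):

    import requests

    # files= makes requests build a multipart/form-data body for the upload
    files = {"uploaded": open("file", "rb")}
    data = {"email": "user@domain.com"}

    resp = requests.post("http://webpage.com", files=files, data=data)
    print(resp.text)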

Python: Need to request only 20 times per minute

試著忘記壹切 submitted on 2020-01-02 20:43:12
Question: I have written Python code that uses an API to request some data, but the API only allows 20 requests per minute. I am using urllib to request the data, and a for loop because the data is located in a file:

    for i in hashfile:
        hash = i
        url1 = "https://hashes.org/api.php?act=REQUEST&key="+key+"&hash="+hash
        print(url1)
        response = urllib.request.urlopen(url2).read()
        strr = str(response)
        if "plain" in strr:
            parsed_json = json.loads(response.decode("UTF-8"))
            print(parsed_json[
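
(Note in passing that urlopen(url2) in the snippet looks like a typo for url1.) A minimal way to stay under the cap, sketched with assumptions: hashfile and key are the question's variables, given placeholder values here, and sleeping 3 seconds after each request follows from 60 s / 20 requests = 3 s:

    import json
    import time
    import urllib.request

    key = "YOUR_API_KEY"           # placeholder for the question's key
    hashfile = open("hashes.txt")  # hypothetical input file, one hash per line

    for line in hashfile:
        h = line.strip()
        url = "https://hashes.org/api.php?act=REQUEST&key=" + key + "&hash=" + h
        response = urllib.request.urlopen(url).read()
        if b"plain" in response:
            parsed_json = json.loads(response.decode("utf-8"))
            print(parsed_json)
        time.sleep(3)  # 60 seconds / 20 requests keeps the rate at the limit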

Python Urllib UrlOpen Read

谁说胖子不能爱 submitted on 2020-01-02 09:39:31
Question: Say I am retrieving a list of URLs from a server using the urllib2 library in Python. I noticed that it took about 5 seconds to get one page, so it would take a long time to finish all the pages I want to collect. Thinking about those 5 seconds: most of the time was consumed on the server side, so I am wondering whether I could just start using the threading library. Say 5 threads in this case; then the average time could be dramatically reduced. Maybe 1 or 2 seconds per page. (might make
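
A hedged sketch of the threading idea using concurrent.futures, which handles the thread management: with 5 workers, up to 5 of those ~5-second server waits overlap, so wall-clock time per page drops roughly by the number of workers. The URL list below is hypothetical:

    import urllib.request
    from concurrent.futures import ThreadPoolExecutor

    def fetch(url):
        with urllib.request.urlopen(url) as resp:
            return resp.read()

    # Hypothetical stand-in for the list of URLs from the server
    urls = ["http://example.com/page%d" % i for i in range(10)]

    # Five workers fetch concurrently; map preserves the input order
    with ThreadPoolExecutor(max_workers=5) as pool:
        pages = list(pool.map(fetch, urls))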

Urllib's urlopen breaking on some sites (e.g. StackApps API): returns garbage results

只愿长相守 submitted on 2020-01-02 01:46:32
Question: I'm using urllib2's urlopen function to try to get a JSON result from the StackOverflow API. The code I'm using:

    >>> import urllib2
    >>> conn = urllib2.urlopen("http://api.stackoverflow.com/0.8/users/")
    >>> conn.readline()

The result I'm getting:

    '\x1f\x8b\x08\x00\x00\x00\x00\x00\x04\x00\xed\xbd\x07`\x1cI\x96%&/m\xca{\x7fJ\...

I'm fairly new to urllib, but this doesn't seem like the result I should be getting. I've tried it in other places and I get what I expect (the same as visiting the
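
Those leading bytes '\x1f\x8b' are the gzip magic number, which suggests the API is returning gzip-compressed JSON regardless of the request headers. A sketch of decompressing the response (Python 2, to match the question's urllib2):

    import gzip
    import urllib2
    from StringIO import StringIO

    conn = urllib2.urlopen("http://api.stackoverflow.com/0.8/users/")
    raw = conn.read()

    # '\x1f\x8b' marks a gzip stream; wrap it in a file object and inflate it
    data = gzip.GzipFile(fileobj=StringIO(raw)).read()
    print(data)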

Python urllib urlopen not working

帅比萌擦擦* submitted on 2020-01-02 01:15:10
Question: I am just trying to fetch data from a live web page using the urllib module, so I wrote a simple example. Here is my code:

    import urllib

    sock = urllib.request.urlopen("http://diveintopython.org/")
    htmlSource = sock.read()
    sock.close()
    print (htmlSource)

But I got an error like:

    Traceback (most recent call last):
      File "D:\test.py", line 3, in <module>
        sock = urllib.request.urlopen("http://diveintopython.org/")
    AttributeError: 'module' object has no attribute 'request'

Answer 1: You are reading the wrong
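
For context on the error: either the code is running under Python 2, where urllib.request does not exist at all, or under Python 3, where urllib is a package and the submodule must be imported explicitly (a bare "import urllib" does not bind urllib.request). A sketch of the Python 3 form:

    # Python 3: import the submodule explicitly
    import urllib.request

    sock = urllib.request.urlopen("http://diveintopython.org/")
    htmlSource = sock.read()
    sock.close()
    print(htmlSource)

On Python 2, the equivalent call would be urllib.urlopen(...) after a plain "import urllib".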