urllib

BeautifulSoup not extracting all HTML

只谈情不闲聊 submitted on 2020-01-04 06:26:14
Question: We are trying to get product URLs from this page of Forever 21's site (http://www.forever21.com/Product/Category.aspx?br=f21&category=dress&pagesize=100&page=1). For some reason, BeautifulSoup is not getting the elements with class "item_pic", even though they are in the site's HTML. We have tried using requests, mechanize, and selenium, with no luck. All the commented code is from previous attempts to get the HTML (none of which worked). Here is our code:

    from bs4 import BeautifulSoup
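
A minimal sketch of the approach the question describes, using requests and BeautifulSoup; the URL and the class name "item_pic" come from the question, while the User-Agent header is an assumption (some sites serve stripped-down markup to clients without one). If the elements still do not appear, they are most likely rendered by JavaScript, and a browser-driven tool such as Selenium would be needed:

    import requests
    from bs4 import BeautifulSoup

    url = ("http://www.forever21.com/Product/Category.aspx"
           "?br=f21&category=dress&pagesize=100&page=1")
    # A browser-like User-Agent, in case the server varies its response by client
    resp = requests.get(url, headers={"User-Agent": "Mozilla/5.0"})
    soup = BeautifulSoup(resp.text, "html.parser")

    # Collect the href of every element carrying the class from the question
    for tag in soup.find_all(class_="item_pic"):
        link = tag.find("a") or tag
        print(link.get("href"))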

Verifying HTTPS certificates with urllib.request

本秂侑毒 submitted on 2020-01-03 16:49:27
Question: I am trying to open an https URL using the urlopen method in Python 3's urllib.request module. It seems to work fine, but the documentation warns that "[i]f neither cafile nor capath is specified, an HTTPS request will not do any verification of the server's certificate". I am guessing I need to specify one of those parameters if I don't want my program to be vulnerable to man-in-the-middle attacks, problems with revoked certificates, and other vulnerabilities. cafile and capath are supposed
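
A hedged sketch of one way to address this: build an ssl.SSLContext with ssl.create_default_context(), which loads the platform's trusted CA bundle and enables certificate and hostname checking, then pass it to urlopen via the context parameter (the URL below is a placeholder):

    import ssl
    import urllib.request

    # create_default_context() loads the system CA certificates and
    # turns on certificate and hostname verification
    ctx = ssl.create_default_context()

    with urllib.request.urlopen("https://example.com/", context=ctx) as resp:
        body = resp.read()

Note that since Python 3.4.3 (PEP 476), urlopen verifies certificates by default even without an explicit context; the warning quoted above applies to older releases.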

How do I remove a spurious tag in BeautifulSoup

为君一笑 submitted on 2020-01-03 05:05:38
Question: I'm pulling text from the Presidential debates. I got to one that has an issue: it errantly turns every mention of the word "debate" into a tag <debate> . Go ahead, search for "Welcome back to the Republican presidential"; notice an obvious word missing? Cool, so BeautifulSoup does a superb job of cleaning up messy HTML and adding closing tags where they should have been. But in this case, that mucks me up, because <debate> is now a child of a <p> and the closing </debate> is added allllll the
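
A minimal sketch of one way to undo the damage: find every spurious <debate> tag, re-insert the word the parser swallowed, and call unwrap() to splice the tag's children back into the parent. The raw_html string below is a toy stand-in for the mangled transcript markup:

    from bs4 import BeautifulSoup

    # Hypothetical fragment mimicking the mangled transcript
    raw_html = "<p>Welcome back to the Republican presidential <debate>in Miami.</p>"

    soup = BeautifulSoup(raw_html, "html.parser")
    for tag in soup.find_all("debate"):
        tag.insert_before("debate ")  # restore the word that became a tag name
        tag.unwrap()                  # keep the tag's children, drop the tag itself

    print(soup)  # <p>Welcome back to the Republican presidential debate in Miami.</p>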

QPX Express API from Python

一个人想着一个人 submitted on 2020-01-03 03:20:14
Question: I am trying to use Google's QPX Express API from Python. I keep running into a pair of issues when sending the request. At first, what I tried is this:

    url = "https://www.googleapis.com/qpxExpress/v1/trips/search?key=MY_KEY_HERE"
    values = {"request": {"passengers": {"kind": "qpxexpress#passengerCounts", "adultCount": 1},
                          "slice": [{"kind": "qpxexpress#sliceInput", "origin": "RDU",
                                     "destination": location, "date": dateGo}]}}
    data = json.dumps(values)
    req = urllib2.Request(url, data, {'Content
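
A sketch of the usual shape of such a request, under assumptions: MY_KEY_HERE is the question's placeholder, and location/dateGo (variables from the question) are given illustrative values here. The body is the json.dumps string, and an explicit Content-Type: application/json header tells the API not to treat it as form-encoded data:

    import json
    import urllib2

    url = "https://www.googleapis.com/qpxExpress/v1/trips/search?key=MY_KEY_HERE"
    location = "LAX"        # hypothetical destination
    dateGo = "2020-03-01"   # hypothetical travel date
    values = {"request": {
        "passengers": {"kind": "qpxexpress#passengerCounts", "adultCount": 1},
        "slice": [{"kind": "qpxexpress#sliceInput",
                   "origin": "RDU",
                   "destination": location,
                   "date": dateGo}],
    }}
    data = json.dumps(values)

    # The third argument is the headers dict; the JSON content type is essential
    req = urllib2.Request(url, data, {"Content-Type": "application/json"})
    response = urllib2.urlopen(req)
    print(response.read())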

urllib.request: POST data should be bytes, an iterable of bytes, or a file object

浪子不回头ぞ submitted on 2020-01-03 02:47:28
Question: I need to access an HTML website and search that website for images. It might not be pretty, but I am able to access the website; I just need some guidance on the best way to search for the <img> tags. I tried to treat it like a file, but I am getting an error saying I need to convert the data to bytes. Let me know what you think.

    from urllib import request
    import re

    website = request.urlopen('https://www.google.com', "rb")
    html = website.read()
    hand = html.decode("UTF-8")
    for line in hand:
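
Two things appear to be going wrong in the snippet: urlopen has no mode argument (its second positional parameter is the POST body, which is why it demands bytes), and iterating over a decoded string yields single characters, not lines. A hedged sketch of one way to pull image URLs out of the fetched page (the regex is a rough approximation; an HTML parser such as BeautifulSoup is more robust):

    import re
    from urllib import request

    # No second argument: passing "rb" makes urlopen treat it as POST data
    with request.urlopen("https://www.google.com") as resp:
        html = resp.read().decode("utf-8", errors="replace")

    # Grab the src attribute of every <img> tag
    for src in re.findall(r'<img[^>]+src="([^"]+)"', html):
        print(src)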

unable to send data using urllib and urllib2 (python)

吃可爱长大的小学妹 submitted on 2020-01-03 02:01:04
Question: Hello everybody (first post here). I am trying to send data to a webpage. This webpage requests two fields (a file and an e-mail address); if everything is OK, the webpage returns a page saying "everything is ok" and sends a file to the provided e-mail address. I execute the code below and get nothing in my e-mail account.

    import urllib, urllib2

    params = urllib.urlencode({'uploaded': open('file'), 'email': 'user@domain.com'})
    req = urllib2.urlopen('http://webpage.com', params)
    print req.read()
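
The likely culprit: urlencode() calls str() on the open file object, so the server receives the object's repr rather than the file's contents, and a plain urlencoded POST cannot carry a file upload anyway; that requires a multipart/form-data body. A minimal sketch using the requests library (field names, URL, and filename are the question's own placeholders):

    import requests

    # files= makes requests build a multipart/form-data body for the upload
    files = {"uploaded": open("file", "rb")}
    data = {"email": "user@domain.com"}

    resp = requests.post("http://webpage.com", files=files, data=data)
    print(resp.text)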

Python: Need to request only 20 times per minute

試著忘記壹切 submitted on 2020-01-02 20:43:12
Question: I have written Python code that uses an API to request some data, but the API only allows 20 requests per minute. I am using urllib to request the data, and a for loop because the data is located in a file:

    for i in hashfile:
        hash = i
        url1 = "https://hashes.org/api.php?act=REQUEST&key="+key+"&hash="+hash
        print(url1)
        response = urllib.request.urlopen(url2).read()
        strr = str(response)
        if "plain" in strr:
            parsed_json = json.loads(response.decode("UTF-8"))
            print(parsed_json[
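
(Note in passing that urlopen(url2) in the snippet looks like a typo for url1.) A minimal way to stay under the cap, sketched with assumptions: hashfile and key are the question's variables, given placeholder values here, and sleeping 3 seconds after each request follows from 60 s / 20 requests = 3 s:

    import json
    import time
    import urllib.request

    key = "YOUR_API_KEY"           # placeholder for the question's key
    hashfile = open("hashes.txt")  # hypothetical input file, one hash per line

    for line in hashfile:
        h = line.strip()
        url = "https://hashes.org/api.php?act=REQUEST&key=" + key + "&hash=" + h
        response = urllib.request.urlopen(url).read()
        if b"plain" in response:
            parsed_json = json.loads(response.decode("utf-8"))
            print(parsed_json)
        time.sleep(3)  # 60 seconds / 20 requests keeps the rate at the limit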

Python Urllib UrlOpen Read

谁说胖子不能爱 submitted on 2020-01-02 09:39:31
Question: Say I am retrieving a list of URLs from a server using the urllib2 library in Python. I noticed that it took about 5 seconds to get one page, so it would take a long time to finish all the pages I want to collect. Thinking about those 5 seconds: most of the time was consumed on the server side, so I am wondering whether I could just start using the threading library. Say 5 threads in this case; then the average time could be dramatically reduced. Maybe 1 or 2 seconds per page. (might make
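
A hedged sketch of the threading idea using concurrent.futures, which handles the thread management: with 5 workers, up to 5 of those ~5-second server waits overlap, so wall-clock time per page drops roughly by the number of workers. The URL list below is hypothetical:

    import urllib.request
    from concurrent.futures import ThreadPoolExecutor

    def fetch(url):
        with urllib.request.urlopen(url) as resp:
            return resp.read()

    # Hypothetical stand-in for the list of URLs from the server
    urls = ["http://example.com/page%d" % i for i in range(10)]

    # Five workers fetch concurrently; map preserves the input order
    with ThreadPoolExecutor(max_workers=5) as pool:
        pages = list(pool.map(fetch, urls))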

Urllib's urlopen breaking on some sites (e.g. StackApps API): returns garbage results

只愿长相守 submitted on 2020-01-02 01:46:32
Question: I'm using urllib2's urlopen function to try to get a JSON result from the StackOverflow API. The code I'm using:

    >>> import urllib2
    >>> conn = urllib2.urlopen("http://api.stackoverflow.com/0.8/users/")
    >>> conn.readline()

The result I'm getting:

    '\x1f\x8b\x08\x00\x00\x00\x00\x00\x04\x00\xed\xbd\x07`\x1cI\x96%&/m\xca{\x7fJ\...

I'm fairly new to urllib, but this doesn't seem like the result I should be getting. I've tried it in other places and I get what I expect (the same as visiting the
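
Those leading bytes '\x1f\x8b' are the gzip magic number, which suggests the API is returning gzip-compressed JSON regardless of the request headers. A sketch of decompressing the response (Python 2, to match the question's urllib2):

    import gzip
    import urllib2
    from StringIO import StringIO

    conn = urllib2.urlopen("http://api.stackoverflow.com/0.8/users/")
    raw = conn.read()

    # '\x1f\x8b' marks a gzip stream; wrap it in a file object and inflate it
    data = gzip.GzipFile(fileobj=StringIO(raw)).read()
    print(data)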

Python urllib urlopen not working

帅比萌擦擦* submitted on 2020-01-02 01:15:10
Question: I am just trying to fetch data from a live web page using the urllib module, so I wrote a simple example. Here is my code:

    import urllib

    sock = urllib.request.urlopen("http://diveintopython.org/")
    htmlSource = sock.read()
    sock.close()
    print (htmlSource)

But I got an error like:

    Traceback (most recent call last):
      File "D:\test.py", line 3, in <module>
        sock = urllib.request.urlopen("http://diveintopython.org/")
    AttributeError: 'module' object has no attribute 'request'

Answer 1: You are reading the wrong
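
For context on the error: either the code is running under Python 2, where urllib.request does not exist at all, or under Python 3, where urllib is a package and the submodule must be imported explicitly (a bare "import urllib" does not bind urllib.request). A sketch of the Python 3 form:

    # Python 3: import the submodule explicitly
    import urllib.request

    sock = urllib.request.urlopen("http://diveintopython.org/")
    htmlSource = sock.read()
    sock.close()
    print(htmlSource)

On Python 2, the equivalent call would be urllib.urlopen(...) after a plain "import urllib".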