lxml.html

Scraping new ESPN site using xpath [Python]

依然范特西╮ submitted on 2019-12-03 09:08:01
I am trying to scrape the new ESPN NBA scoreboard. Here is a simple script which should return the start times for all games on 4/5/15:

    import requests
    import lxml.html
    from lxml.cssselect import CSSSelector

    doc = lxml.html.fromstring(requests.get('http://scores.espn.go.com/nba/scoreboard?date=20150405').text)

    #xpath
    print doc.xpath("//title/text()") #print page title
    print doc.xpath("//span/@time")
    print doc.xpath("//span[@class='time']")
    print doc.xpath("//span[@class='time']/text()")

    #CSS Selector
    sel = CSSSelector('span.time')
    for i in sel(doc):
        print i.text

It doesn't return anything, but
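One way to tell whether the selectors or the downloaded markup is at fault is to run the same XPath against a small static snippet. The sketch below is Python 3 and assumes lxml is installed; the `SAMPLE` markup and the `start_times` helper are made up for illustration and are not ESPN's real page structure.

```python
import lxml.html

# Hypothetical markup, invented for this check; not taken from the ESPN page.
SAMPLE = """
<html><head><title>NBA Scoreboard</title></head>
<body>
  <span class="time">7:00 PM ET</span>
  <span class="time">9:30 PM ET</span>
  <span class="score">101</span>
</body></html>
"""


def start_times(markup):
    """Return the text of every <span class="time"> in the given markup."""
    doc = lxml.html.fromstring(markup)
    return doc.xpath("//span[@class='time']/text()")


print(start_times(SAMPLE))
```

If the expression finds the spans here but not in the downloaded page, the elements are probably not present in the HTML that `requests` received, which can be confirmed by searching the response text directly.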

Extending CSS selectors in BeautifulSoup

♀尐吖头ヾ submitted on 2019-12-01 15:46:14
The Question:

BeautifulSoup provides very limited support for CSS selectors. For instance, the only supported pseudo-class is nth-of-type, and it accepts only numerical values; arguments like even or odd are not allowed.

Is it possible to extend BeautifulSoup CSS selectors or let it use lxml.cssselect internally as an underlying CSS selection mechanism?

Let's take a look at an example problem/use case. Locate only even rows in the following HTML:

    <table>
        <tr>
            <td>1</td>
        </tr>
        <tr>
            <td>2</td>
        </tr>
        <tr>
            <td>3</td>
        </tr>
        <tr>
            <td>4</td>
        </tr>
    </table>

In lxml.html and lxml.cssselect, it is
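The excerpt stops before the lxml side is shown. As a rough sketch of the even-rows use case, the example below uses a plain XPath position test instead of a CSS selector, so it needs only lxml itself (lxml.cssselect works by translating CSS selectors to XPath and additionally requires the cssselect package). The `even_row_texts` helper is a name chosen for illustration.

```python
import lxml.html

HTML = """
<table>
  <tr><td>1</td></tr>
  <tr><td>2</td></tr>
  <tr><td>3</td></tr>
  <tr><td>4</td></tr>
</table>
"""


def even_row_texts(html):
    """Return the stripped text of every even-numbered <tr> among its siblings."""
    table = lxml.html.fromstring(html.strip())
    # position() counts <tr> children of the same parent, starting at 1.
    rows = table.xpath("//tr[position() mod 2 = 0]")
    return [row.text_content().strip() for row in rows]


print(even_row_texts(HTML))
```

With the cssselect package available, a selector such as `tr:nth-child(even)` passed to `lxml.cssselect.CSSSelector` would be the CSS counterpart for this table.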

Why am I getting this ImportError?

被刻印的时光 ゝ submitted on 2019-12-01 07:10:51
I have a tkinter app that I am compiling to an .exe via py2exe. In the setup file, I have set it to include lxml, urllib, lxml.html, ast, and math. When I run python setup.py py2exe in a CMD console, it compiles fine. I then go to the dist folder it has created and run the .exe file. When I run the .exe, I get this popup window. (source: gyazo.com)

I then proceed to open the Trader.exe.log file, and the contents say the following:

    Traceback (most recent call last):
      File "Trader.py", line 1, in <module>
      File "lxml\html\__init__.pyc", line 42, in <module>
      File "lxml\etree.pyc", line
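For reference, a setup file matching that description might look roughly like the sketch below. It only builds the keyword arguments, so it runs without py2exe installed; the `build_setup_kwargs` helper is hypothetical, and the script name is taken from the traceback. It reproduces the configuration described, not a fix for the error.

```python
def build_setup_kwargs():
    """Build keyword arguments for setup() as described in the question."""
    return {
        # "windows" (rather than "console") is the py2exe target for GUI apps.
        "windows": ["Trader.py"],
        "options": {
            "py2exe": {
                # Modules the question says are listed in the setup file.
                "includes": ["lxml", "urllib", "lxml.html", "ast", "math"],
            }
        },
    }


# In a real setup.py on Windows, this would be followed by something like:
#     from distutils.core import setup
#     import py2exe
#     setup(**build_setup_kwargs())
print(build_setup_kwargs()["options"]["py2exe"]["includes"])
```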

lxml.html. Error reading file; Failed to load external entity

故事扮演 submitted on 2019-11-30 22:40:42
I am trying to get a movie trailer URL from YouTube by parsing with lxml.html:

    from lxml import html
    import lxml.html
    from lxml.etree import XPath
    import urllib  # added: urllib.urlencode is used below (Python 2)

    def get_youtube_trailer(selected_movie):
        # Create the url for the YouTube query in order to find the movie trailer
        title = selected_movie
        t = {'search_query' : title + ' movie trailer'}
        query_youtube = urllib.urlencode(t)
        search_url_youtube = 'https://www.youtube.com/results?' + query_youtube
        # Define the XPath for the YouTube movie trailer link
        movie_trailer_xpath = XPath('//ol[@class="item-section"]/li[1]/div/div/div[2]/h3/a/@href')
        # Parse the
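The code above is cut off before the parsing step, so the actual cause of the error is not visible here. A common cause of "failed to load external entity" is handing an https URL directly to lxml's parser, because the underlying libxml2 loader does not fetch HTTPS. A minimal Python 3 sketch of the usual workaround, downloading the page separately and parsing the bytes, could look like this; `build_search_url` and `parse_page` are illustrative names, not part of the original code.

```python
import urllib.parse
import urllib.request

import lxml.html


def build_search_url(title):
    """Build the YouTube search URL used in the question (Python 3 urlencode)."""
    query = urllib.parse.urlencode({"search_query": title + " movie trailer"})
    return "https://www.youtube.com/results?" + query


def parse_page(url):
    """Download the page with urllib, then let lxml parse the bytes."""
    with urllib.request.urlopen(url) as response:
        body = response.read()
    return lxml.html.fromstring(body)


print(build_search_url("Alien"))
```

Calling `parse_page(build_search_url("Alien"))` would return an element tree that XPath expressions can be run against; whether the question's XPath still matches YouTube's current markup is a separate matter.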

How can I preserve <br> as newlines with lxml.html text_content() or equivalent?

邮差的信 submitted on 2019-11-28 21:25:20
I want to preserve <br> tags as \n when extracting the text content from lxml elements. Example code:

    fragment = '<div>This is a text node.<br/>This is another text node.<br/><br/><span>And a child element.</span><span>Another child,<br> with two text nodes</span></div>'
    h = lxml.html.fromstring(fragment)

Output:

    > h.text_content()
    'This is a text node.This is another text node.And a child element.Another child, with two text nodes'

Prepending an \n character to the tail of each <br /> element should give the result you're expecting:

    >>> import lxml.html as html
    >>> fragment = '<div>This is a
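The answer's session is truncated above. A self-contained Python 3 sketch of the same idea, prepending "\n" to the tail of each <br> before calling text_content(), might look like this; the helper name `text_with_newlines` is made up for the example.

```python
import lxml.html

FRAGMENT = (
    "<div>This is a text node.<br/>This is another text node.<br/><br/>"
    "<span>And a child element.</span>"
    "<span>Another child,<br> with two text nodes</span></div>"
)


def text_with_newlines(fragment):
    """Return the fragment's text content with each <br> rendered as a newline."""
    root = lxml.html.fromstring(fragment)
    for br in root.xpath(".//br"):
        # tail is the text that follows the element; it is None when empty.
        br.tail = "\n" + (br.tail or "")
    return root.text_content()


print(text_with_newlines(FRAGMENT))
```

Note that this modifies the parsed tree in place, so the function parses a fresh copy on every call rather than altering a tree the caller still needs.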