lxml.html | 易学教程

Scraping new ESPN site using xpath [Python]

I am trying to scrape the new ESPN NBA scoreboard. Here is a simple script which should return the start times for all games on 4/5/15:

    import requests
    import lxml.html
    from lxml.cssselect import CSSSelector

    doc = lxml.html.fromstring(requests.get('http://scores.espn.go.com/nba/scoreboard?date=20150405').text)

    # xpath
    print doc.xpath("//title/text()")  # print page title
    print doc.xpath("//span/@time")
    print doc.xpath("//span[@class='time']")
    print doc.xpath("//span[@class='time']/text()")

    # CSS selector
    sel = CSSSelector('span.time')
    for i in sel(doc):
        print i.text

It doesn't return anything, but ...
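One way to narrow down a script like this is to run the same selectors against markup that is known to contain the target elements. The sketch below does that in Python 3 on a small hypothetical snippet: the `sample_html` string and its times are invented for illustration and are not taken from the ESPN page. If the expressions match here but come back empty on the live page, the spans may simply be missing from the HTML that `requests` received; that is an assumption to check, not something the excerpt confirms.

```python
import lxml.html

# Hypothetical markup standing in for a scoreboard page (an assumption for illustration).
sample_html = """
<html><head><title>Scoreboard</title></head>
<body>
  <span class="time">1:00 PM ET</span>
  <span class="time">3:30 PM ET</span>
</body></html>
"""

doc = lxml.html.fromstring(sample_html)

# Same XPath expressions as in the question, run on the static snippet.
title = doc.xpath("//title/text()")
times = doc.xpath("//span[@class='time']/text()")

# find_class is lxml.html's helper for matching elements by class name.
times_by_class = [span.text for span in doc.find_class("time")]

print(title)
print(times)
print(times_by_class)
```

Printing `len(doc.xpath("//span"))` on the real response is a quick follow-up check of whether any spans were delivered at all.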

Extending CSS selectors in BeautifulSoup

The Question: BeautifulSoup provides very limited support for CSS selectors. For instance, the only supported pseudo-class is nth-of-type, and it can only accept numerical values; arguments like even or odd are not allowed. Is it possible to extend BeautifulSoup CSS selectors, or let it use lxml.cssselect internally as the underlying CSS selection mechanism?

Let's take a look at an example problem/use case. Locate only even rows in the following HTML:

    <table>
        <tr>
            <td>1</td>
        </tr>
        <tr>
            <td>2</td>
        </tr>
        <tr>
            <td>3</td>
        </tr>
        <tr>
            <td>4</td>
        </tr>
    </table>

In lxml.html and lxml.cssselect, it is ...
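As a rough illustration of the use case, the even rows can be picked out with lxml.html alone. This sketch uses an XPath expression rather than lxml.cssselect, so it does not need the separate cssselect package; the `table_html`, `even_rows`, and `even_values` names are made up for the example, and this is not necessarily the approach the truncated excerpt goes on to describe.

```python
import lxml.html

table_html = """
<table>
  <tr><td>1</td></tr>
  <tr><td>2</td></tr>
  <tr><td>3</td></tr>
  <tr><td>4</td></tr>
</table>
"""

table = lxml.html.fromstring(table_html)

# position() counts <tr> elements among their siblings, starting at 1,
# so "mod 2 = 0" keeps the 2nd, 4th, ... rows.
# With the cssselect package installed, table.cssselect("tr:nth-of-type(even)")
# expresses the same selection as a CSS selector.
even_rows = table.xpath("//tr[position() mod 2 = 0]")
even_values = [row.findtext("td") for row in even_rows]

print(even_values)
```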

Why am I getting this ImportError?

I have a tkinter app that I am compiling to an .exe via py2exe. In the setup file, I have set it to include lxml, urllib, lxml.html, ast, and math. When I run python setup.py py2exe in a CMD console, it compiles fine. I then go to the dist folder it has created and run the .exe file. When I run the .exe, I get this popup window. (source: gyazo.com) I then proceed to open the Trader.exe.log file, and the contents say the following:

    Traceback (most recent call last):
      File "Trader.py", line 1, in <module>
      File "lxml\html\__init__.pyc", line 42, in <module>
      File "lxml\etree.pyc", line ...

lxml.html. Error reading file; Failed to load external entity

I am trying to get a movie trailer URL from YouTube by parsing with lxml.html:

    import urllib

    from lxml import html
    import lxml.html
    from lxml.etree import XPath

    def get_youtube_trailer(selected_movie):
        # Create the url for the YouTube query in order to find the movie trailer
        title = selected_movie
        t = {'search_query': title + ' movie trailer'}
        query_youtube = urllib.urlencode(t)
        search_url_youtube = 'https://www.youtube.com/results?' + query_youtube

        # Define the XPath for the YouTube movie trailer link
        movie_trailer_xpath = XPath('//ol[@class="item-section"]/li[1]/div/div/div[2]/h3/a/@href')

        # Parse the ...
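The excerpt cuts off before the parsing step, so the cause of the error is not shown. A common trigger for this message is handing an https:// URL directly to lxml's parser, and the sketch below assumes that is what happens here. It is written in Python 3 (the question's code is Python 2) and downloads the page separately, then parses the bytes already in memory. The helper names `build_search_url`, `fetch`, and `first_link` are hypothetical, as are the sample markup and the simplified XPath.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

import lxml.html


def build_search_url(title):
    """Build the YouTube search URL the same way the question does."""
    query = urlencode({'search_query': title + ' movie trailer'})
    return 'https://www.youtube.com/results?' + query


def fetch(url):
    """Download the page bytes with the standard library (not called below)."""
    with urlopen(url) as response:
        return response.read()


def first_link(markup, xpath_expr):
    """Parse markup that is already in memory; return the first match or None."""
    doc = lxml.html.fromstring(markup)
    matches = doc.xpath(xpath_expr)
    return matches[0] if matches else None


# Offline demonstration on a hypothetical snippet, not real YouTube markup.
sample = ('<ol class="item-section"><li><h3>'
          '<a href="/watch?v=abc">Trailer</a></h3></li></ol>')
search_url = build_search_url('Alien')
link = first_link(sample, '//ol[@class="item-section"]/li[1]//h3/a/@href')

print(search_url)
print(link)
```

In real use the call would look like `first_link(fetch(search_url), some_xpath)`, with the XPath adjusted to whatever markup the live page actually serves.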

How can I preserve <br> as newlines with lxml.html text_content() or equivalent?

I want to preserve <br> tags as \n when extracting the text content from lxml elements. Example code:

    fragment = '<div>This is a text node.<br/>This is another text node.<br/><br/><span>And a child element.</span><span>Another child,<br> with two text nodes</span></div>'
    h = lxml.html.fromstring(fragment)

Output:

    > h.text_content()
    'This is a text node.This is another text node.And a child element.Another child, with two text nodes'

Prepending an \n character to the tail of each <br /> element should give the result you're expecting:

    >>> import lxml.html as html
    >>> fragment = '<div>This is a ...
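Since the answer's code is cut off, here is a minimal sketch of the idea it describes, written in Python 3 and reusing the fragment from the question. The loop and the variable names `doc` and `text` are one possible way to do it, not necessarily the answer's exact code.

```python
import lxml.html

fragment = ('<div>This is a text node.<br/>This is another text node.<br/><br/>'
            '<span>And a child element.</span>'
            '<span>Another child,<br> with two text nodes</span></div>')

doc = lxml.html.fromstring(fragment)

# The tail of an element is the text that follows its closing tag.
# Prepend "\n" to each <br> tail, or set it to "\n" when there is no tail.
for br in doc.xpath('.//br'):
    br.tail = '\n' + br.tail if br.tail else '\n'

text = doc.text_content()
print(repr(text))
```

Note that this modifies the parsed tree in place, so work on a copy (for example via `copy.deepcopy(doc)`) if the original tree is still needed.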