lxml.html | 易学教程

Python Xpath: lxml.etree.XPathEvalError: Invalid predicate

Question: I'm trying to learn how to scrape web pages, and in the tutorial I'm following the code below throws this error: lxml.etree.XPathEvalError: Invalid predicate. The website I'm querying is (don't judge me, it was the one used in the training vid :/ ): https://itunes.apple.com/us/app/candy-crush-saga/id553834731 The XPath string that causes the error is: links = tree.xpath('//div[@class="center-stack"//*/a[@class="name"]/@href') I'm using the lxml and requests libraries. If you need any
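The XPath string in the question opens the predicate [@class="center-stack" but never closes it, which is the kind of mistake that produces "Invalid predicate". Below is a minimal sketch with the bracket closed, assuming lxml is installed. The HTML snippet and the /developer/example href are made-up stand-ins for the real page, not its actual markup.

```python
import lxml.html
from lxml import etree

# Hypothetical stand-in for the downloaded page; only the nesting matters here.
SAMPLE_HTML = """
<div class="center-stack">
  <div><a class="name" href="/developer/example">Example</a></div>
</div>
"""
tree = lxml.html.fromstring(SAMPLE_HTML)

# The expression from the question: the first predicate is missing its "]".
BROKEN = '//div[@class="center-stack"//*/a[@class="name"]/@href'
# The same expression with the predicate closed.
FIXED = '//div[@class="center-stack"]//*/a[@class="name"]/@href'

try:
    tree.xpath(BROKEN)
    broken_raised = False
except etree.XPathEvalError:
    broken_raised = True

links = tree.xpath(FIXED)
print(broken_raised, links)
```

Whether the fixed expression matches anything on the live page depends on that page's current structure, which this sketch does not check.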

Make Urllib2 move through pages

Question: I am trying to scrape http://targetstudy.com/school/schools-in-chhattisgarh.html using lxml.html and urllib2. I want to follow all the pages by clicking the next-page link, download each page's source, and stop at the last page. The href for the next page is ['?recNo=25']. Could someone please advise how to do that? Thanks in advance. Here is my code: import urllib2 import lxml.html import itertools url = "http://targetstudy.com/school/schools-in-chhattisgarh.html" req = urllib2.Request
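One way to follow a relative next-page href such as '?recNo=25' is to resolve it against the current URL and loop until no next link is found. The sketch below is written for Python 3, where urllib.request replaces urllib2. The names crawl, fetch, and max_pages are made up for illustration, and the XPath used to locate the next link is an assumption that has to be adapted to the real page.

```python
from urllib.parse import urljoin
from urllib.request import urlopen

import lxml.html


def fetch(url):
    """Download one page and return its raw bytes."""
    with urlopen(url) as response:
        return response.read()


def crawl(start_url, next_xpath, fetch_page=fetch, max_pages=1000):
    """Collect page sources by following the link selected by next_xpath.

    next_xpath is a hypothetical XPath that must return the href of the
    next-page link; the loop stops when it matches nothing, when a URL
    repeats, or after max_pages pages.
    """
    url, seen, pages = start_url, set(), []
    while url and url not in seen and len(pages) < max_pages:
        seen.add(url)
        source = fetch_page(url)
        pages.append(source)
        hrefs = lxml.html.fromstring(source).xpath(next_xpath)
        # Resolve relative hrefs such as "?recNo=25" against the current URL.
        url = urljoin(url, hrefs[0]) if hrefs else None
    return pages


# Example call (not executed here; the XPath is an assumption):
# pages = crawl("http://targetstudy.com/school/schools-in-chhattisgarh.html",
#               '//a[contains(text(), "Next")]/@href')
```

The seen set and the max_pages cap are there only to guard against a last page that links back to itself.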

Why does lxml.html sometimes swallow/remove whitespace instead of preserving it?

Question: Given the following code, one might reasonably expect almost the exact same string of HTML that was fed into lxml to be spat back out. from lxml import html HTML_TEST_STRING = r""" <pre> abc def ghi jkl mno pqr </pre> """ parser = html.HTMLParser( remove_blank_text=False ) doc = html.fromstring( HTML_TEST_STRING, parser=parser ) print( html_out_string ) Instead, even though everything is contained within a <pre> pre-formatted code

How can one replace an element in lxml?

Question: I get text (data entered by CRM users) from a web service, which returns it in a "terrifying format". I filter it with Python before using the data, but when I remove the line breaks (br), the text that follows them is removed as well. The code is as follows: description = ''' <div id="highlight" class="section"> text............... <h1>TITLE</h1> Multiple text <ul> <li>bad layer....</li> </ul> subTitle Text1 <br
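In lxml, the text that follows an element up to the next tag is stored as that element's tail, so parent.remove(br) discards the text after the <br> along with the tag. A minimal sketch of one way around this uses drop_tree() from lxml.html, which joins the tail onto the preceding text before removing the element. The sample markup is a made-up reduction of the description in the question.

```python
import lxml.html

# Hypothetical, shortened version of the question's input.
description = '<div id="highlight" class="section">Text1<br>Text2<br>Text3</div>'
root = lxml.html.fromstring(description)

# drop_tree() removes each <br> but keeps its tail text in the tree.
for br in root.xpath('.//br'):
    br.drop_tree()

cleaned = lxml.html.tostring(root, encoding='unicode')
print(cleaned)
```

If the pieces of text should stay separated, a space or newline can be prepended to br.tail before calling drop_tree().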

Extending CSS selectors in BeautifulSoup

Question: BeautifulSoup provides very limited support for CSS selectors. For instance, the only supported pseudo-class is nth-of-type, and it only accepts numerical values; arguments like even or odd are not allowed. Is it possible to extend BeautifulSoup CSS selectors or let it use lxml.cssselect internally as the underlying CSS selection mechanism? Let's take a look at an example problem/use case. Locate only the even rows in the following HTML: <table> <tr> <td>1</td> <tr> <td>2</td>
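If handing the selection step to lxml is acceptable, one hedged workaround is to express "even rows" in plain XPath rather than CSS. This does not extend BeautifulSoup itself. It is only a sketch of the alternative, and the table below is a made-up, well-formed version of the one in the question.

```python
import lxml.html

# Hypothetical sample table with every <tr> closed.
TABLE_HTML = """
<table>
  <tr><td>1</td></tr>
  <tr><td>2</td></tr>
  <tr><td>3</td></tr>
  <tr><td>4</td></tr>
</table>
"""
table = lxml.html.fromstring(TABLE_HTML)

# position() counts <tr> siblings from 1, so "mod 2 = 0" keeps rows 2, 4, ...
even_rows = table.xpath('//tr[position() mod 2 = 0]')
even_values = [row.text_content().strip() for row in even_rows]
print(even_values)
```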

Why am I getting this ImportError?

Question: I have a tkinter app that I am compiling to an .exe via py2exe. In the setup file, I have set it to include lxml, urllib, lxml.html, ast, and math. When I run python setup.py py2exe in a CMD console, it compiles fine. I then go to the dist folder it has created and run the .exe file. When I run the .exe I get this popup window. (source: gyazo.com) I then proceed to open the Trader.exe.log file, and the contents say the following: Traceback (most recent call last): File "Trader.py",

How can I preserve <br> as newlines with lxml.html text_content() or equivalent?

Question: I want to preserve <br> tags as \n when extracting the text content from lxml elements. Example code: fragment = '<div>This is a text node. This is another text node. And a child element.Another child, with two text nodes</div>' h = lxml.html.fromstring(fragment) Output: > h.text_content() 'This is a text node.This is another text node.And a child element.Another child, with two text nodes' Answer 1: Prepending an \n character to the tail of each
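The approach the answer starts to describe, prepending \n to a tail, can be sketched as follows, assuming it is the tail of each <br> element that is meant. The helper name text_with_newlines and the sample fragment are made up for illustration, and the helper modifies the tree in place.

```python
import lxml.html


def text_with_newlines(element):
    """Return the element's text content with each <br> rendered as a newline.

    Note: this changes the tails of the <br> elements inside `element`.
    """
    for br in element.xpath('.//br'):
        br.tail = '\n' + (br.tail or '')
    return element.text_content()


# Hypothetical fragment; the tags in the question's own fragment were lost.
fragment = '<div>First line.<br>Second line.<br/>Third line.</div>'
h = lxml.html.fromstring(fragment)
text = text_with_newlines(h)
print(repr(text))
```

Working on a copy of the element (for example via copy.deepcopy) avoids changing the original tree.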

LXML unable to retrieve webpage with error “failed to load HTTP resource”

Question: Hi, I tried opening the link below in a browser and it works, but it does not work from the code. The link is a combination of a news site and the extension of the article, which is read from another file, url.txt. I tried the code with a normal website (www.google.com) and it works perfectly. import sys import MySQLdb from mechanize import Browser from bs4 import BeautifulSoup, SoupStrainer from nltk import word_tokenize from nltk.tokenize import * import urllib2 import nltk, re, pprint import

Python: Convert Raw String to Bytes String without adding escape characters

Question: I have a string: 'BZh91AY&SYA\xaf\x82\r\x00\x00\x01\x01\x80\x02\xc0\x02\x00 \x00!\x9ah3M\x07<]\xc9\x14\xe1BA\x06\xbe\x084' And I want: b'BZh91AY&SYA\xaf\x82\r\x00\x00\x01\x01\x80\x02\xc0\x02\x00 \x00!\x9ah3M\x07<]\xc9\x14\xe1BA\x06\xbe\x084' But I keep getting: b'BZh91AY&SYA\\xaf\\x82\\r\\x00\\x00\\x01\\x01\\x80\\x02\\xc0\\x02\\x00 \\x00!\\x9ah3M\\x07<]\\xc9\\x14\\xe1BA\\x06\\xbe\\x084' Context: I scraped a string off of a webpage and stored it in the variable un. Now I want to decompress it
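The doubled backslashes in the unwanted output suggest that the scraped text holds literal backslash sequences (the four characters \xaf) rather than the bytes themselves. Under that assumption, one possible sketch decodes the escapes with the unicode_escape codec and then maps each code point to a byte with latin-1. The helper name escaped_text_to_bytes and the sample payload are made up for illustration.

```python
import bz2


def escaped_text_to_bytes(text):
    """Turn text containing literal escapes such as \\xaf into real bytes.

    Assumes every character in `text` fits in latin-1 (code points 0-255).
    """
    # 1. str -> bytes without changing anything (backslashes stay literal).
    # 2. unicode_escape interprets \xNN, \r, \\ and similar sequences.
    # 3. latin-1 maps each resulting code point 0-255 to the same byte value.
    return text.encode("latin-1").decode("unicode_escape").encode("latin-1")


# Build a stand-in for the scraped value: the printed form of some bz2 data,
# without the leading b' and the trailing quote.
payload = bz2.compress(b"hello")
scraped = repr(payload)[2:-1]

converted = escaped_text_to_bytes(scraped)
print(bz2.decompress(converted))
```

If the string already contains the actual characters instead of backslash sequences, un.encode("latin-1") alone should be enough.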