lxml.html

Python Xpath: lxml.etree.XPathEvalError: Invalid predicate

主宰稳场 submitted on 2021-01-26 09:19:07
Question: I'm trying to learn how to scrape web pages, and the code below, taken from the tutorial I'm using, throws this error: lxml.etree.XPathEvalError: Invalid predicate. The website I'm querying is (don't judge me, it was the one used in the training vid :/ ): https://itunes.apple.com/us/app/candy-crush-saga/id553834731 The XPath string that causes the error is here: links = tree.xpath('//div[@class="center-stack"//*/a[@class="name"]/@href') I'm using the lxml and requests libraries. If you need any
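
One likely cause, visible in the expression itself, is that the first predicate is never closed: `[@class="center-stack"` has no matching `]` before the `//`. The sketch below is an assumption-laden illustration rather than the tutorial's code. The `SAMPLE` markup is a hypothetical stand-in for the real page, and the names `BROKEN`, `FIXED`, and `broken_error` are introduced only for this example.

```python
from lxml import etree, html

# Hypothetical stand-in markup; the real iTunes page is not reproduced here.
SAMPLE = """
<div class="center-stack">
  <div><a class="name" href="/app/one">One</a></div>
  <div><a class="name" href="/app/two">Two</a></div>
</div>
"""

tree = html.fromstring(SAMPLE)

# The expression from the question: the first predicate is missing its "]".
BROKEN = '//div[@class="center-stack"//*/a[@class="name"]/@href'
# The same expression with the predicate closed before the "//".
FIXED = '//div[@class="center-stack"]//*/a[@class="name"]/@href'

try:
    tree.xpath(BROKEN)
    broken_error = None
except etree.XPathError as exc:  # XPathEvalError is a subclass of XPathError
    broken_error = exc

links = tree.xpath(FIXED)
print(type(broken_error).__name__, broken_error)
print(links)
```

Whether the corrected expression matches anything on the live page depends on that page's current markup, which this sketch does not check.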

Make Urllib2 move through pages

只谈情不闲聊 submitted on 2020-01-07 03:42:31
Question: I am trying to scrape http://targetstudy.com/school/schools-in-chhattisgarh.html using lxml.html and urllib2. I want to follow all the pages by way of the next-page link, download the source of each one, and stop at the last page. The href for the next page is ['?recNo=25']. Could someone please advise how to do that? Thanks in advance. Here is my code: import urllib2 import lxml.html import itertools url = "http://targetstudy.com/school/schools-in-chhattisgarh.html" req = urllib2.Request

Why does lxml.html sometimes swallow/remove whitespace instead of preserving it?

橙三吉。 submitted on 2020-01-03 05:27:07
Question: Given the following code, one might reasonably expect almost exactly the same string of HTML that was fed into lxml to be spat back out. from lxml import html HTML_TEST_STRING = r""" <pre> <em>abc</em> <em>def</em> <sub>ghi</sub> <sub>jkl</sub> <em>mno</em> <em>pqr</em> </pre> """ parser = html.HTMLParser( remove_blank_text=False ) doc = html.fromstring( HTML_TEST_STRING, parser=parser ) print( html_out_string ) Instead, even though everything is contained within a <pre> pre-formatted code

How can one replace an element in lxml?

馋奶兔 submitted on 2019-12-25 02:55:22
Question: I have text (data entered by users of a CRM) that I get from a web service, which returns it in a "terrifying format". I am filtering it with Python before using the data, but when I remove the line breaks (br), the text is removed as well. The code is as follows: description = ''' <div id="highlight" class="section"> <p> text............... </p> <br> <h1>TITLE</h1> <p>Multiple text <br>  </p> <ul> <li>bad layer....</li> </ul> <p> <br>subTitle </p> <p> </p> <p style="text-align: center;"> <br>Text1 <br

Extending CSS selectors in BeautifulSoup

拈花ヽ惹草 submitted on 2019-12-19 12:28:11
Question: BeautifulSoup provides very limited support for CSS selectors. For instance, the only supported pseudo-class is nth-of-type, and it can only accept numerical values; arguments like even or odd are not allowed. Is it possible to extend BeautifulSoup CSS selectors or let it use lxml.cssselect internally as an underlying CSS selection mechanism? Let's take a look at an example problem/use case. Locate only even rows in the following HTML: <table> <tr> <td>1</td> <tr> <td>2</td>

Why am I getting this ImportError?

这一生的挚爱 submitted on 2019-12-19 09:11:17
Question: I have a tkinter app that I am compiling to an .exe via py2exe. In the setup file, I have set it to include lxml, urllib, lxml.html, ast, and math. When I run python setup.py py2exe in a CMD console, it compiles fine. I then go to the dist folder it has created and run the .exe file. When I run the .exe I get this popup window. (source: gyazo.com) I then proceed to open the Trader.exe.log file, and its contents say the following: Traceback (most recent call last): File "Trader.py",

How can I preserve <br> as newlines with lxml.html text_content() or equivalent?

拟墨画扇 submitted on 2019-12-17 23:44:56
Question: I want to preserve <br> tags as \n when extracting the text content from lxml elements. Example code: fragment = '<div>This is a text node.<br/>This is another text node.<br/><br/><span>And a child element.</span><span>Another child,<br> with two text nodes</span></div>' h = lxml.html.fromstring(fragment) Output: > h.text_content() 'This is a text node.This is another text node.And a child element.Another child, with two text nodes' Answer 1: Prepending an \n character to the tail of each <br />
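
The idea the answer starts to describe, prepending a newline to the tail of each <br> element before calling text_content(), can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the answer's own code: the helper name `text_with_newlines` is hypothetical, and the function works on a deep copy so that the caller's tree is left unmodified.

```python
import copy

import lxml.html

fragment = ('<div>This is a text node.<br/>This is another text node.<br/><br/>'
            '<span>And a child element.</span>'
            '<span>Another child,<br> with two text nodes</span></div>')


def text_with_newlines(element):
    """Return the element's text content with each <br> rendered as a newline.

    Hypothetical helper: works on a copy so the original tree is not changed.
    """
    clone = copy.deepcopy(element)
    for br in clone.xpath('.//br'):
        # The text that follows a <br> lives in its .tail (None when empty).
        br.tail = '\n' + br.tail if br.tail else '\n'
    return clone.text_content()


h = lxml.html.fromstring(fragment)
text = text_with_newlines(h)
print(repr(text))
```

Because only the <br> tails are touched, adjacent inline elements such as the two <span> elements are still joined without a separator, exactly as text_content() would join them.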

LXML unable to retrieve webpage with error “failed to load HTTP resource”

北战南征 submitted on 2019-12-12 04:12:46
Question: Hi, I tried opening the link below in a browser and it works, but it does not work from the code. The link is a combination of a news site's address and the article's path, which is read from another file, url.txt. I tried the code with a normal website (www.google.com) and it works perfectly. import sys import MySQLdb from mechanize import Browser from bs4 import BeautifulSoup, SoupStrainer from nltk import word_tokenize from nltk.tokenize import * import urllib2 import nltk, re, pprint import

Python: Convert Raw String to Bytes String without adding escape characters

坚强是说给别人听的谎言 submitted on 2019-12-08 06:01:34
Question: I have a string: 'BZh91AY&SYA\xaf\x82\r\x00\x00\x01\x01\x80\x02\xc0\x02\x00 \x00!\x9ah3M\x07<]\xc9\x14\xe1BA\x06\xbe\x084' And I want: b'BZh91AY&SYA\xaf\x82\r\x00\x00\x01\x01\x80\x02\xc0\x02\x00 \x00!\x9ah3M\x07<]\xc9\x14\xe1BA\x06\xbe\x084' But I keep getting: b'BZh91AY&SYA\\xaf\\x82\\r\\x00\\x00\\x01\\x01\\x80\\x02\\xc0\\x02\\x00 \\x00!\\x9ah3M\\x07<]\\xc9\\x14\\xe1BA\\x06\\xbe\\x084' Context: I scraped a string off of a webpage and stored it in the variable un. Now I want to decompress it
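
The excerpt does not make clear whether the scraped string holds the byte values as characters or holds literal backslash sequences (the four characters `\`, `x`, `a`, `f`), so the sketch below covers both readings. It is only an illustration under those assumptions: the two helper names and the shortened sample strings are hypothetical, modelled on the start of the string in the question.

```python
def codepoints_to_bytes(text):
    # Case 1: every character is already a code point in U+0000..U+00FF,
    # so Latin-1 maps each character to the byte with the same value.
    return text.encode("latin-1")


def escaped_text_to_bytes(text):
    # Case 2: the text contains literal backslash escapes (backslash, x, a, f).
    # Interpret the escapes first, then map the resulting code points to bytes.
    return text.encode("latin-1").decode("unicode_escape").encode("latin-1")


# Shortened, hypothetical samples based on the beginning of the question's string.
as_codepoints = 'BZh91AY&SYA\xaf\x82\r\x00'      # 15 characters
as_escaped_text = r'BZh91AY&SYA\xaf\x82\r\x00'   # literal backslashes, 25 characters

print(codepoints_to_bytes(as_codepoints))
print(escaped_text_to_bytes(as_escaped_text))
```

Both helpers assume that every character fits in Latin-1; a string containing characters above U+00FF would raise UnicodeEncodeError and would need different handling.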