html-parsing | 易学教程

Extracting a specific row of a table by DOMDocument

Question: How can I extract information from an HTML file using DOMDocument in PHP? The source of my HTML page contains the part below. It is the third table on the page and the one I need to work on:

    <table> <tbody> <tr> <td>A</td> <td>B</td> <td>C</td> <td>D</td> </tr> <tr> <td>1</td> <td>2</td> <td>3</td> <td>4</td> </tr> </tbody> </table>

If my user asks me to show the row containing B and D, how should I extract the first row of this table and print it using DOMDocument?

Answer 1: This would do it, it simply grabs
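The answer above is cut off and was written for PHP. As a rough Python analog (not the answerer's code), the same row lookup can be sketched with the standard library's `xml.dom.minidom`, which exposes DOM-style methods such as `getElementsByTagName`. The sketch assumes the table fragment is well-formed XML, which real-world HTML often is not, and every variable name is illustrative.

```python
from xml.dom.minidom import parseString

# The table from the question, assumed here to be well-formed XML.
markup = (
    "<table><tbody>"
    "<tr><td>A</td><td>B</td><td>C</td><td>D</td></tr>"
    "<tr><td>1</td><td>2</td><td>3</td><td>4</td></tr>"
    "</tbody></table>"
)

doc = parseString(markup)

# Take the first <tr> in document order and read the text of each cell.
first_row = doc.getElementsByTagName("tr")[0]
cells = [td.firstChild.nodeValue for td in first_row.getElementsByTagName("td")]

print(cells)                 # ['A', 'B', 'C', 'D']
print(cells[1], cells[3])    # B D
```

On the full page described in the question, the third table would first be selected with something like `doc.getElementsByTagName("table")[2]` before looking up its rows.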

Webscraping Using BeautifulSoup: Retrieving source code of a website

Question: Good day! I am building a web scraper for the Alibaba website. My problem is that the returned source code is missing some parts that I am interested in. The data is there when I view the source in the browser, but I can't retrieve it with BeautifulSoup. Any tips?

    from urllib.request import urlopen  # assumed import (Python 3); the excerpt omits it
    from bs4 import BeautifulSoup

    def make_soup(url):
        try:
            html = urlopen(url).read()
        except Exception:
            # Return None when the page cannot be fetched.
            return None
        return BeautifulSoup(html, "lxml")

    url = "http://www.alibaba.com/Agricultural-Growing-Media_pid144"
    soup2

Get text with BeautifulSoup CSS Selector

Question: Example HTML:

    <h2 id="name"> ABC <span class="numbers">123</span> <span class="lower">abc</span> </h2>

I can get the numbers with something like:

    soup.select('#name > span.numbers')[0].text

How do I get the text ABC using BeautifulSoup and the select function? What about in this case?

    <div id="name"> <div id="numbers">123</div> ABC </div>

Answer 1: In the first case, get the previous sibling:

    soup.select_one('#name > span.numbers').previous_sibling

In the second case, get the next sibling: soup
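The sibling approach from the answer can be tried end to end with the sketch below. The first lookup mirrors the selector quoted in the answer. The second selector, `#name > #numbers`, is an assumption, since the answer is cut off before showing it. The built-in `html.parser` backend is used so that no extra parser is required.

```python
from bs4 import BeautifulSoup

# First case: the text "ABC" comes right before the <span class="numbers">.
first = BeautifulSoup(
    '<h2 id="name"> ABC <span class="numbers">123</span> '
    '<span class="lower">abc</span> </h2>',
    "html.parser",
)
before = first.select_one("#name > span.numbers").previous_sibling
print(before.strip())

# Second case: the text "ABC" comes right after the inner <div id="numbers">.
# The selector here is assumed; the original answer is truncated.
second = BeautifulSoup(
    '<div id="name"> <div id="numbers">123</div> ABC </div>',
    "html.parser",
)
after = second.select_one("#name > #numbers").next_sibling
print(after.strip())
```

Both siblings are text nodes that include the surrounding whitespace, which is why `strip()` is applied before printing.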

Find Specific Text Within HTML Tag in Python

Question: I've tried a million different ways to parse out the Zestimate, but have yet to succeed. Here's the HTML tag with the Zestimate info:

    <span> <span tabindex="0" role="button"> <span class="sc-bGbJRg iiEDXU ds-dashed-underline"> Zestimate <sup>®</sup> </span> </span> : <span>$331,425</span> </span>

Honestly, I thought this would get me close, but I get an empty list:

    link = 'https://www.zillow.com/homedetails/1404-Clearwing-Cir-Georgetown-TX-78626/121721750_zpid/'
    searched_word = '<span
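No answer survives in this excerpt, so the following is only one possible sketch, run against the exact snippet quoted in the question rather than the live page. It assumes the markup arrives as shown (a page rendered by JavaScript may not contain it at all) and that the `ds-dashed-underline` class is a stable hook, which is a guess.

```python
from bs4 import BeautifulSoup

# The snippet quoted in the question, used here as static input.
html = (
    '<span> <span tabindex="0" role="button"> '
    '<span class="sc-bGbJRg iiEDXU ds-dashed-underline"> Zestimate <sup>®</sup> </span> '
    '</span> : <span>$331,425</span> </span>'
)
soup = BeautifulSoup(html, "html.parser")

# Locate the label, climb to the enclosing role="button" span,
# then take the next <span> sibling, which holds the value.
label = soup.find("span", class_="ds-dashed-underline")
button = label.find_parent("span")
value = button.find_next_sibling("span").get_text(strip=True)

print(value)
```

Navigating by tree position instead of searching for a raw `'<span ...'` string avoids depending on the exact serialization of the tag.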

How to convert the html object to string type?

Question: I use a jQuery method to get an HTML object:

    var content = $('#cke_ckeditor iframe').contents().find('.cke_show_borders').clone();

Then I want to convert it to a string:

    console.log(content[0].toString());

but the result is: [object HTMLBodyElement]. How can I turn it into an actual string? Also, can I turn the resulting HTML string back into an HTML object?

Answer 1: I believe you want to use Element.outerHTML. Because content is a jQuery object, read the property from the underlying element:

    console.log(content[0].outerHTML)

Answer 2: I had the same problem. var docString = "

Beautifulsoup 4: Remove comment tag and its content

Question: The page that I'm scraping contains the HTML below. How do I remove the comment tag (<!-- ... -->) along with its content with bs4?

    <div class="foo"> cat dog sheep goat <!-- ... --> </div>

Answer 1: You can use extract() (solution is based on this answer): PageElement.extract() removes a tag
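A minimal sketch of the `extract()` approach mentioned in the answer follows. The comment text in the sample markup is hypothetical, because the original comment did not survive in this excerpt, and the `html.parser` backend is an arbitrary choice.

```python
from bs4 import BeautifulSoup, Comment

# Sample markup; the comment body is a placeholder, not the original one.
html = '<div class="foo"> cat dog sheep goat <!-- hypothetical comment --> </div>'
soup = BeautifulSoup(html, "html.parser")

# Comments are parsed as a special kind of string node, so they can be
# found by type and detached from the tree with extract().
for comment in soup.find_all(string=lambda text: isinstance(text, Comment)):
    comment.extract()

result = str(soup)
print(result)
```

`extract()` returns the removed node, so the comment text is still available if it needs to be logged or inspected before being discarded.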

Append markup string to a tag in BeautifulSoup

Question: Is it possible to set markup as tag content (akin to setting innerHTML in JavaScript)? For the sake of example, let's say I want to add 10 <a> elements to a <div>, but have them separated by commas:

    soup = BeautifulSoup(<<some document here>>)
    a_tags = ["<a>1</a>", "<a>2</a>", ...]  # list of strings
    div = soup.new_tag("div")
    a_str = ",".join(a_tags)

Using div.append(a_str) escapes < and > into &lt; and &gt;, so I end up with <div> &lt;a&gt;1&lt;/a&gt; ... </div>. BeautifulSoup(a_str) wraps this in <html
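The excerpt ends before any answer, so the following is only one possible sketch of a workaround, not a quoted solution. It assumes the `html.parser` backend, which parses a fragment without wrapping it in `<html>` and `<body>`, and uses a three-item list and an empty document as stand-ins for the elided values in the question.

```python
from bs4 import BeautifulSoup

# Stand-ins for the document and the tag list elided in the question.
soup = BeautifulSoup("<html><body></body></html>", "html.parser")
a_tags = ["<a>1</a>", "<a>2</a>", "<a>3</a>"]

div = soup.new_tag("div")

# Parse the joined markup as a fragment, then move its nodes into the div.
fragment = BeautifulSoup(",".join(a_tags), "html.parser")
for node in list(fragment.contents):  # copy: append() detaches each node from the fragment
    div.append(node)

result = str(div)
print(result)  # <div><a>1</a>,<a>2</a>,<a>3</a></div>
```

Appending parsed nodes instead of a raw string is what avoids the escaping, since a plain string is always inserted as text.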