html-parsing

Extracting a specific row of a table by DOMDocument

拥有回忆 posted on 2020-08-24 08:27:45
Question: How can I extract information from an HTML file using DOMDocument in PHP? My page's source contains the part below. It is the third table on the page and the one I need to work on: <table> <tbody> <tr> <td>A</td> <td>B</td> <td>C</td> <td>D</td> </tr> <tr> <td>1</td> <td>2</td> <td>3</td> <td>4</td> </tr> </tbody> </table> If my user asks me to show the row containing B and D, how should I extract the first row of this table and print it using DOMDocument? Answer 1: This would do it, it simply grabs

Webscraping Using BeautifulSoup: Retrieving source code of a website

我怕爱的太早我们不能终老 posted on 2020-08-24 01:36:18
Question: Good day! I am building a web scraper for the Alibaba website. My problem is that the returned source code is missing some of the parts I am interested in. The data is there when I view the source in the browser, but I can't retrieve it with BeautifulSoup. Any tips?

    from urllib.request import urlopen  # import added; the original snippet omitted it
    from bs4 import BeautifulSoup

    def make_soup(url):
        # Return parsed HTML, or None if the page could not be fetched.
        try:
            html = urlopen(url).read()
        except Exception:
            return None
        return BeautifulSoup(html, "lxml")

    url = "http://www.alibaba.com/Agricultural-Growing-Media_pid144"
    soup2

Get text with BeautifulSoup CSS Selector

时光毁灭记忆、已成空白 posted on 2020-08-22 05:12:11
Question: Example HTML: <h2 id="name"> ABC <span class="numbers">123</span> <span class="lower">abc</span> </h2> I can get the numbers with something like: soup.select('#name > span.numbers')[0].text How do I get the text ABC using BeautifulSoup and the select function? What about in this case? <div id="name"> <div id="numbers">123</div> ABC </div> Answer 1: In the first case, get the previous sibling: soup.select_one('#name > span.numbers').previous_sibling In the second case, get the next sibling: soup
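
The sibling approach from Answer 1 can be sketched as follows. This is a minimal example, not the answerer's full code: the answer's second expression is cut off in the excerpt, so the `#name > #numbers` selector is an assumption, as are the variable names and the choice of the built-in "html.parser" backend.

```python
from bs4 import BeautifulSoup

# First case: the wanted text sits just before <span class="numbers">.
html_first = '<h2 id="name"> ABC <span class="numbers">123</span> <span class="lower">abc</span> </h2>'
soup_first = BeautifulSoup(html_first, "html.parser")
text_first = soup_first.select_one("#name > span.numbers").previous_sibling.strip()

# Second case: the wanted text sits just after <div id="numbers">.
# The selector below is an assumption; the original answer is truncated here.
html_second = '<div id="name"> <div id="numbers">123</div> ABC </div>'
soup_second = BeautifulSoup(html_second, "html.parser")
text_second = soup_second.select_one("#name > #numbers").next_sibling.strip()

print(text_first, text_second)
```

Both siblings are text nodes that still carry the surrounding whitespace, which is why the sketch calls strip() on them.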

Find Specific Text Within HTML Tag in Python

ε祈祈猫儿з posted on 2020-06-28 09:21:14
Question: I've tried a million different ways to parse out the Zestimate, but have yet to succeed. Here's the HTML with the Zestimate info: <span> <span tabindex="0" role="button"> <span class="sc-bGbJRg iiEDXU ds-dashed-underline"> Zestimate <sup>®</sup> </span> </span> :  <span>$331,425</span> </span> Honestly, I thought this would get me close, but I get an empty list: link = 'https://www.zillow.com/homedetails/1404-Clearwing-Cir-Georgetown-TX-78626/121721750_zpid/' searched_word = '<span
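
One possible way to pull the value out of the fragment quoted in the question is sketched below. It works only on that static snippet: it assumes the ds-dashed-underline class is stable (the sc-bGbJRg and iiEDXU names look generated and are not relied on), and it makes no claim that the live page serves this markup to a plain HTTP request. The variable names are illustrative.

```python
import re
from bs4 import BeautifulSoup

# The HTML fragment from the question, used here as a static string.
html = (
    '<span> <span tabindex="0" role="button"> '
    '<span class="sc-bGbJRg iiEDXU ds-dashed-underline"> Zestimate <sup>®</sup> </span> '
    '</span> : <span>$331,425</span> </span>'
)
soup = BeautifulSoup(html, "html.parser")

# Locate the label, then climb two levels to the outermost <span>.
label = soup.find("span", class_="ds-dashed-underline")
outer = label.find_parent("span").find_parent("span")

# Take the first dollar amount in the outer span's text.
match = re.search(r"\$[\d,]+", outer.get_text())
zestimate = match.group(0) if match else None
print(zestimate)
```

Searching for a literal tag string such as '<span' in parsed text finds nothing because get_text() returns only text content, which may be one reason the original attempt returned an empty list.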

How to convert the html object to string type?

社会主义新天地 posted on 2020-06-09 11:53:25
Question: I use a jQuery method to get an HTML object: var content = $('#cke_ckeditor iframe').contents().find('.cke_show_borders').clone(); Then I want to convert it to a string: console.log(content[0].toString()); but the result is: [object HTMLBodyElement] How can I turn it into an actual HTML string? Also, can I turn the converted HTML string back into an HTML object? Answer 1: I believe you want to use Element.outerHTML. Because content is a jQuery object, read the property from the underlying element: console.log(content[0].outerHTML) Answer 2: I had the same problem. var docString = "

Beautifulsoup 4: Remove comment tag and its content

夙愿已清 posted on 2020-06-07 21:08:12
Question: The page that I'm scraping contains this HTML. How do I remove the comment tag <!-- --> along with its content using bs4? <div class="foo"> cat dog sheep goat <!-- <p>NewPP limit report Preprocessor node count: 478/300000 Post‐expand include size: 4852/2097152 bytes Template argument size: 870/2097152 bytes Expensive parser function count: 2/100 ExtLoops count: 6/100 </p> --> </div> Answer 1: You can use extract() (solution is based on this answer): PageElement.extract() removes a tag
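
A minimal sketch of the extract() approach named in Answer 1, since the answer's own code is cut off in this excerpt. The shortened comment text, the variable names, and the "html.parser" backend are assumptions made for illustration.

```python
from bs4 import BeautifulSoup, Comment

# Shortened version of the question's markup (comment body abbreviated).
html = (
    '<div class="foo"> cat dog sheep goat '
    '<!-- <p>NewPP limit report Preprocessor node count: 478/300000</p> --> </div>'
)
soup = BeautifulSoup(html, "html.parser")

# Comments are a special kind of string node; find them all, then detach each one.
comments = soup.find_all(string=lambda node: isinstance(node, Comment))
for comment in comments:
    comment.extract()

cleaned = str(soup)
print(cleaned)
```

extract() detaches the node from the tree and returns it, so the comment's content is dropped from the serialized output along with the <!-- --> markers.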

Append markup string to a tag in BeautifulSoup

孤者浪人 posted on 2020-05-15 03:51:29
Question: Is it possible to set markup as tag content (akin to setting innerHTML in JavaScript)? For the sake of example, let's say I want to add 10 <a> elements to a <div>, but have them separated by commas: soup = BeautifulSoup(<<some document here>>) a_tags = ["<a>1</a>", "<a>2</a>", ...] # list of strings div = soup.new_tag("div") a_str = ",".join(a_tags) Using div.append(a_str) escapes < and > into &lt; and &gt;, so I end up with <div> &lt;a&gt;1&lt;/a&gt; ... </div> BeautifulSoup(a_str) wraps this in <html
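
No answer is visible in this excerpt, so the following is only one possible approach, not the accepted one: parse the markup string with the built-in "html.parser" backend, which does not add <html> or <body> wrappers, and move the parsed nodes into the target tag. The three-element list and the variable names are assumptions for illustration.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("", "html.parser")
a_tags = [f"<a>{i}</a>" for i in range(1, 4)]  # shortened stand-in for the 10 links
div = soup.new_tag("div")

# Parse the joined markup as a fragment instead of appending it as plain text.
fragment = BeautifulSoup(",".join(a_tags), "html.parser")

# Copy the child list first: appending a node moves it out of the fragment.
for node in list(fragment.contents):
    div.append(node)

print(div)
```

Appending the raw string treats it as text, which is why the angle brackets were escaped; appending parsed nodes keeps them as real <a> elements with the commas as text nodes between them.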