html-content-extraction

Using Java to extract a single value from an HTML page

Question: I am continuing work on a project that I've been at for some time now, and I have been struggling to pull some data from a website. The website has an iframe that pulls in some data from an unknown source. The data is in the iframe in a tag something like this: <DIV id="number_forecast"><LABEL id="lblDay">9,000</LABEL></DIV> There is a BUNCH of other crap above it but this div id / label is totally unique and is not used anywhere else in the code. Answer 1: jsoup is probably what you want, it
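The idea of looking an element up by its unique id can be sketched without any third-party library. The example below is written in Python rather than Java, uses only the standard-library `html.parser`, and assumes the iframe's HTML has already been fetched into a string. The class name `TextById` and the helper `text_by_id` are hypothetical names chosen for this illustration; it is a minimal sketch, not the approach the truncated answer goes on to describe.

```python
from html.parser import HTMLParser


class TextById(HTMLParser):
    """Collect the text inside the first element whose id equals target_id."""

    def __init__(self, target_id):
        super().__init__()
        self.target_id = target_id
        self.target_tag = None  # tag name of the matched element
        self.depth = 0          # > 0 while inside the matched element
        self.done = False       # True once the matched element has closed
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if self.done:
            return
        if self.depth:
            # Track nested elements with the same tag name so the
            # matching end tag is identified correctly.
            if tag == self.target_tag:
                self.depth += 1
        elif dict(attrs).get("id") == self.target_id:
            self.target_tag = tag
            self.depth = 1

    def handle_endtag(self, tag):
        if self.depth and tag == self.target_tag:
            self.depth -= 1
            if self.depth == 0:
                self.done = True

    def handle_data(self, data):
        if self.depth:
            self.chunks.append(data)


def text_by_id(html, target_id):
    """Return the stripped text of the element with the given id, or ''."""
    parser = TextById(target_id)
    parser.feed(html)
    parser.close()
    return "".join(parser.chunks).strip()


# The markup quoted in the question; HTMLParser lowercases tag names.
page = '<DIV id="number_forecast"><LABEL id="lblDay">9,000</LABEL></DIV>'
value = text_by_id(page, "lblDay")
print(value)  # 9,000
```

Because the id is unique on the page, the surrounding markup does not matter to this lookup; only the element carrying the id is inspected.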

Beautifulsoup get value in table

Question: I am trying to scrape http://www.co.jefferson.co.us/ats/displaygeneral.do?sch=000104 and get the "owner Name(s)". What I have works but is really ugly and not the best I am sure, so I am looking for a better way. Here is what I have: soup = BeautifulSoup(url_opener.open(url)) x = soup('table', text = re.compile("Owner Name")) print 'And the owner is', x[0].parent.parent.parent.tr.nextSibling.nextSibling.next.next.next The relevant HTML is <td valign="top"> <table border="1" cellpadding="1"

Is there a way to use readability (text extraction algorithm) and a custom algorithm in python to extract links from text?

Question: Is there a way to use readability (text extraction algorithm) and a custom algorithm in Python to extract links from text? I'd like to figure out a way of extracting links that are in the body of text. 1.) I use readability in Python https://github.com/gfxmonk/python-readability 2.) I'd like to somehow compare the extracted text to the original HTML text in order to extract links in the actual body of an article. Answer 1: Well, it looks like it returns a BeautifulSoup tree. So you should be able
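Once the article body is available as an HTML fragment, collecting its links is a small parsing job. The sketch below assumes the extraction step has already produced such a fragment as a string; it does not call readability itself, and `LinkCollector` and `extract_links` are hypothetical names. It uses Python's standard-library `html.parser` rather than the BeautifulSoup tree mentioned in the answer.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect (href, anchor text) pairs from <a href="..."> elements."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None  # href of the <a> currently open, if any
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:  # skip named anchors that carry no href
                self._href = href
                self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


def extract_links(html_fragment):
    """Return every (href, text) pair found in the fragment."""
    collector = LinkCollector()
    collector.feed(html_fragment)
    collector.close()
    return collector.links


# Hypothetical article body standing in for the extracted text.
article_html = (
    '<div><p>See <a href="https://example.com/a">this post</a> '
    'and <a name="x">an anchor</a>.</p></div>'
)
links = extract_links(article_html)
print(links)  # [('https://example.com/a', 'this post')]
```

Running the collector over the extracted body only, rather than the whole page, is what limits the result to links inside the article.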

Unable to show JSON HTML content data in a TextView on Android

Question: Right now I am trying to display images and text from HTML content in a TextView on Android. I am getting the HTML content from JSON, but with the code below I am only able to show the text, as in the image below, and not the images. Can anyone tell me how to display both images and text from HTML content? Suggestions please. Thanks for your precious time! String htmlcontent = "\u003ch3 style=\"text-align: justify;\"\u003e \u003cspan style=\"color:

How to extract data from a raw HTML file?

Question: Is there a way to extract desired data from raw HTML which has been written unsemantically, with no IDs and classes? I mean, suppose there is a saved HTML file of a webpage (profile) and I want to extract data like (say) 'hobbies'. Is it possible to do this using PHP? Answer 1: Use regex! I kid, I kid. If you know the state of the same page, and the format is guaranteed to remain similar enough, then you can try writing a manual parser. Alternatively, there are a lot of libraries out there

Allowing basic HTML markup in Django

Question: I'm creating an app that will process user-submitted content. I would like to enable users to make their text-based content look pretty with basic HTML markup, i.e. <i>, <b>, <br>. However, I do want to prevent them from using things like script tags. Django will auto-escape everything, therefore it will also disable all safe markup. I can disable this by using: {{ somevar|safe }} or {% autoescape off %} However, this will also enable all harmful script tags. Django does provide the
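One common way to get this behaviour is to escape everything first and then restore only a short whitelist of attribute-free tags. The following is a minimal sketch of that idea in plain Python, not the Django feature the truncated sentence refers to; `ALLOWED_TAGS` and `allow_basic_markup` are hypothetical names. In a Django project the returned string would presumably still need to be marked safe before rendering, and a dedicated sanitizer library is likely the better choice for anything beyond this small tag set.

```python
import html
import re

# Tags that survive sanitizing; they are restored only when they
# carry no attributes at all.
ALLOWED_TAGS = ("b", "i", "br")

# Matches an escaped allowed tag such as &lt;b&gt;, &lt;/i&gt; or &lt;br/&gt;.
_ALLOWED_RE = re.compile(
    r"&lt;(/?)(%s)\s*(/?)&gt;" % "|".join(ALLOWED_TAGS),
    re.IGNORECASE,
)


def allow_basic_markup(text):
    """Escape all markup, then re-enable the whitelisted bare tags."""
    escaped = html.escape(text)
    return _ALLOWED_RE.sub(
        lambda m: "<%s%s%s>" % (m.group(1), m.group(2).lower(), m.group(3)),
        escaped,
    )


user_input = '<b>bold</b> <i>it</i><br> <script>alert("x")</script>'
cleaned = allow_basic_markup(user_input)
print(cleaned)
```

Because the whitelist is applied after escaping, a tag with attributes such as `<b onclick="...">` does not match the pattern and stays escaped.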

How do I get content from a table using its ID with a regex?

Question: I need to sort an HTML string so I get the content I need. Now I need to loop through the table rows in a table that has an ID. How do I do this with a regex? Answer 1: Regular expressions cannot be used to parse HTML; HTML is not regular. Use a proper HTML parser library. Answer 2: It depends on how regular the HTML text is. For example, given this table: <table> <tr><td>1</td><td>Apple</td></tr> <tr><td>2</td><td>Ball</td></tr> <tr><td>3</td><td>Cookie</td></tr> </table> The following regular expression finds the IDs in the first column: (?<=<tr><td>).*?(?=</td>) If you run the page through an html-parser like
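The lookbehind pattern from the second answer can be tried directly against the sample table. The sketch below does so with Python's `re` module (the variable names are chosen for this illustration) and then shows the kind of input that supports the first answer's warning: the same rows, reformatted, no longer match.

```python
import re

# The sample table from the answer, with a proper closing tag.
table_html = (
    "<table> "
    "<tr><td>1</td><td>Apple</td></tr> "
    "<tr><td>2</td><td>Ball</td></tr> "
    "<tr><td>3</td><td>Cookie</td></tr> "
    "</table>"
)

# Text that directly follows "<tr><td>" up to the next "</td>".
FIRST_CELL = re.compile(r"(?<=<tr><td>).*?(?=</td>)")

first_column = FIRST_CELL.findall(table_html)
print(first_column)  # ['1', '2', '3']

# The same data with a line break and an attribute: the pattern
# depends on the exact byte sequence "<tr><td>" and finds nothing.
reformatted = '<tr>\n  <td class="id">1</td></tr>'
fragile_result = FIRST_CELL.findall(reformatted)
print(fragile_result)  # []
```

The pattern works only while the markup stays byte-for-byte regular, which is the condition the second answer attaches to it.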

How do you parse a poorly formatted HTML file?

Question: I have to parse a series of web pages in order to import data into an application. Each type of web page provides the same kind of data. The problem is that the HTML of each page is different, so the location of the data varies. Another problem is that the HTML code is poorly formatted, making it impossible to use an XML-like parser. So far, the best strategy I can think of is to define a template for each kind of page, like: Template A: <html> ... <tr><td>Table column that is missing a td

Extracting the body text of an HTML document using PHP

I know it's better to use DOM for this purpose but let's try to extract the text in this way: <?php $html=<<<EOD <html> <head> </head> <body> <p>Some text</p> </body> </html> EOD; preg_match('/<body.*?>/', $html, $matches, PREG_OFFSET_CAPTURE); if (empty($matches)) exit; $matched_body_start_tag = $matches[0][0]; $index_of_body_start_tag = $matches[0][1]; $index_of_body_end_tag = strpos($html, '</body>'); $body = substr( $html, $index_of_body_start_tag + strlen($matched_body_start_tag), $index_of_body_end_tag - $index_of_body_start_tag + strlen($matched_body_start_tag) ); echo $body; The result
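The offset arithmetic is easier to check when written out with named positions. In the snippet above, the length passed to `substr` works out to the end index minus the start index plus the tag length, whereas the body content itself spans the end index minus the start index minus the tag length. The sketch below mirrors the same approach in Python, where slicing by two positions avoids computing a length at all; `body_inner_html` is a hypothetical helper name, and this is an illustration of the arithmetic rather than the answer the truncated post goes on to give.

```python
import re


def body_inner_html(html):
    """Return the text between the opening <body ...> tag and </body>."""
    start_tag = re.search(r"<body.*?>", html)
    if start_tag is None:
        return None
    content_start = start_tag.end()  # index just past the opening tag
    content_end = html.find("</body>", content_start)
    if content_end == -1:
        return None
    # Equivalent length: content_end - content_start.
    return html[content_start:content_end]


document = "<html> <head> </head> <body> <p>Some text</p> </body> </html>"
body = body_inner_html(document)
print(repr(body))  # ' <p>Some text</p> '
```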