html-content-extraction

RegEx for extracting HTML Image properties

独自空忆成欢 submitted on 2020-01-11 13:21:08
Question: I need a RegEx pattern for extracting all the properties of an image tag. As we all know, there is a lot of malformed HTML out there, so the pattern has to cover those possibilities. I was looking at this solution https://stackoverflow.com/questions/138313/how-to-extract-img-src-title-and-alt-from-html-using-php but it didn't quite get it all. I came up with something like: (alt|title|src|height|width)\s*=\s*["'][\W\w]+?["'] Are there any possibilities I'll be missing, or a more efficient simple
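
For comparison with the regex in the question above, here is a minimal sketch that collects the same five attributes with a tolerant HTML parser instead of a pattern. It is written in Python rather than PHP and uses only the standard library's html.parser; the names WANTED, ImgAttrCollector and extract_img_attrs are made up for this illustration and do not come from the original thread.

```python
from html.parser import HTMLParser

# The attribute names listed in the question's pattern.
WANTED = {"alt", "title", "src", "height", "width"}


class ImgAttrCollector(HTMLParser):
    """Collects the wanted attributes of every <img> tag it sees."""

    def __init__(self):
        super().__init__()
        self.images = []

    def handle_starttag(self, tag, attrs):
        # html.parser lowercases tag and attribute names, and it also
        # routes self-closing tags such as <img ... /> through this method.
        if tag == "img":
            self.images.append({k: v for k, v in attrs if k in WANTED})


def extract_img_attrs(html):
    """Return one dict of wanted attributes per <img> tag, in document order."""
    parser = ImgAttrCollector()
    parser.feed(html)
    parser.close()
    return parser.images
```

The parser accepts single-quoted, double-quoted and unquoted values, which the pattern in the question does not cover in the unquoted case. How it treats more badly broken markup (for example an unterminated quote) is not shown here and would need testing against real pages.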

Cleaning text string after getting body text using Beautifulsoup

白昼怎懂夜的黑 submitted on 2020-01-05 09:34:58
Question: I'm trying to get text from articles on various webpages and write them as clean text documents. I don't want all visible text because that often includes irrelevant links on the side of webpages. I'm using BeautifulSoup to extract the information from pages. But extra links, not just on the side of the page but sometimes also in the middle of the body text and at the bottom of the articles, make it into the final product. Does anyone know how to deal with the problem of extra

php, get between function improvement - add array support

六眼飞鱼酱① submitted on 2020-01-04 06:13:05
Question: I have a function which extracts the content between two strings. I use it to extract specific information between HTML tags. However, it currently extracts only the first match, so I would like to know whether it can be improved to extract all the matches and return them in an array, similar to the preg_match_all function. function get_between($content,$start,$end){ $r = explode($start, $content); if (isset($r[1])){ $r = explode($end, $r[1]); return $r[0]; }
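
The "return every match" behaviour asked for above can be sketched as a simple scan loop. This is an illustration in Python rather than PHP, and get_all_between is a hypothetical name chosen to mirror the question's get_between; it is not the answer from the original thread.

```python
def get_all_between(content, start, end):
    """Return every substring of content found between start and end.

    Matches are non-overlapping and are collected left to right.
    """
    if not start or not end:
        # Empty delimiters would match at every position and never advance.
        raise ValueError("start and end must be non-empty")

    results = []
    pos = 0
    while True:
        begin = content.find(start, pos)
        if begin == -1:
            break
        begin += len(start)
        stop = content.find(end, begin)
        if stop == -1:
            break  # an opening delimiter without a closing one is ignored
        results.append(content[begin:stop])
        pos = stop + len(end)
    return results
```

The same loop translates directly to PHP with strpos and substr. Like the original function, it matches delimiters literally and knows nothing about nested tags.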

In java how to fix HTTP error 416 Requested Range Not Satisfiable? (While downloading web content from a web page)

不问归期 submitted on 2020-01-03 05:01:09
Question: I am trying to download the HTML content of a web page and am getting the 416 status. I found one solution which changes the status code to 200, but it still does not download the proper content. I am very close but missing something. Please help. Code with 416 status: public static void main(String[] args) { String URL="http://www.xyzzzzzzz.com.sg/"; HttpClient client = new org.apache.commons.httpclient.HttpClient(); org.apache.commons.httpclient.methods.GetMethod method = new org.apache

How to write a regular expression for html parsing?

一个人想着一个人 submitted on 2019-12-29 08:12:10
Question: I'm trying to write a regular expression for my HTML parser. I want to match an HTML tag with a given attribute (e.g. <div> with class="tab news selected") that contains one or more <a href> tags. The regexp should match the entire tag (from <div> to </div>). I always seem to get "memory exhausted" errors; my program probably takes every tag it can find as a matching one. I'm using the Boost regex libraries. Answer 1: You may also find these questions helpful: Can you provide some examples of why it is

Reading web page source code in Java differs from the original webpage source code

这一生的挚爱 submitted on 2019-12-25 16:44:35
Question: I am trying to implement a program that reads a web page's source code, saves it to a text file, and then performs some operations on it. The problem is that the source code my Java program reads differs from the original web page source code. My program: String inputLine; URL link = new URL("http://www.ammanu.edu.jo/English/Articles/newsArticle.aspx?id=2935"); BufferedReader in = new BufferedReader( new InputStreamReader(link.openStream(),"UTF-8"));

How to extract blocks of text from a HTML page?

爱⌒轻易说出口 submitted on 2019-12-25 01:49:22
Question: I would like to extract blocks of text with more than 100 words from a large HTML page using PHP. Whether the text is contained in <p>...</p> doesn't matter. I only care about the number of words that make up a coherent text block, so text outside of HTML paragraphs should also be taken into consideration. How can this be done? Answer 1: I use phpQuery. Are you familiar with jQuery? They share the same syntax. You might be concerned about installing a new library, but trust me this library is well
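
One way to read the requirement above is: treat every run of text between block-level tags as one block, then keep the blocks with more than 100 words. The sketch below follows that reading. It is in Python rather than PHP, uses only the standard library, and is not the phpQuery approach from the answer. The INLINE_TAGS set and the names TextBlockCollector and long_text_blocks are assumptions made for this illustration.

```python
from html.parser import HTMLParser

# Assumption: these tags do not interrupt a text block; any other tag does.
INLINE_TAGS = {"a", "abbr", "b", "br", "code", "em", "i", "small",
               "span", "strong", "sub", "sup", "u"}
# Text inside these tags is never visible, so it is dropped.
SKIPPED_TAGS = {"script", "style"}


class TextBlockCollector(HTMLParser):
    """Splits a page into text blocks at every non-inline tag."""

    def __init__(self):
        super().__init__()
        self.blocks = []
        self._current = []
        self._skip_depth = 0

    def flush(self):
        # Collapse whitespace and store the block if it is not empty.
        text = " ".join("".join(self._current).split())
        if text:
            self.blocks.append(text)
        self._current = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIPPED_TAGS:
            self._skip_depth += 1
        elif tag == "br":
            self._current.append(" ")  # a line break still separates words
        elif tag not in INLINE_TAGS:
            self.flush()

    def handle_endtag(self, tag):
        if tag in SKIPPED_TAGS:
            self._skip_depth = max(0, self._skip_depth - 1)
        elif tag not in INLINE_TAGS:
            self.flush()

    def handle_data(self, data):
        if not self._skip_depth:
            self._current.append(data)


def long_text_blocks(html, min_words=100):
    """Return the text blocks that contain more than min_words words."""
    collector = TextBlockCollector()
    collector.feed(html)
    collector.close()
    collector.flush()  # keep trailing text that no closing tag ended
    return [b for b in collector.blocks if len(b.split()) > min_words]
```

Text outside <p> elements is handled the same way as text inside them, which matches the question. Which tags count as inline is a judgment call and would likely need tuning for real pages.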

Extract a content of a html page in php

守給你的承諾、 submitted on 2019-12-23 07:05:31
Question: Is there any way to extract the content of an HTML page that starts at <body> and ends at </body> in PHP? If there is, can anyone post some sample code? Answer 1: You should have a look at the DOMDocument reference. This example reads an HTML document, creates a DOMDocument and gets the body tag: libxml_use_internal_errors(true); $dom = new DOMDocument; $dom->loadHTMLFile('http://example.com'); libxml_use_internal_errors(false); $body = $dom->getElementsByTagName('body')->item(0); echo $body-