Question
I would like to extract blocks of text with more than 100 words from a large HTML page using PHP. Whether the text is contained in <p>...</p> doesn't matter. I only care about the number of words that make up a coherent text block, so text outside of HTML paragraphs should also be taken into consideration.
How can this be done?
Answer 1:
I use phpQuery. If you are familiar with jQuery, it shares the same syntax. You might be concerned about installing a new library, but this one is well worth the extra overhead.
You can then access it like this:
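// Assumes phpQuery is loaded and $html holds the page markup.
$doc = phpQuery::newDocument($html);
foreach ($doc->find('p') as $element) {
    $element = pq($element);
    // Print the word count of each paragraph's text.
    echo str_word_count($element->text()), "\n";
}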
Answer 2:
Use the PHP Simple HTML DOM Parser.
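// Assumes simple_html_dom.php is included and $html came from str_get_html() or file_get_html().
foreach ($html->find('p') as $element) {
    // plaintext is the element's text with tags stripped; src would only read an attribute.
    echo str_word_count($element->plaintext), "\n";
}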
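Both snippets only look at <p> elements and print counts, while the question also asks about text outside paragraphs and a 100-word threshold. The following is a minimal sketch of one way to cover that with PHP's built-in DOM extension instead of either library. The function name extractTextBlocks, the $blockTags and $skipTags lists, and the idea of treating block-level tags as block boundaries are all assumptions made for illustration, not something taken from the answers.

```php
<?php
// Hypothetical helper (not from either answer): splits a page into text
// blocks at block-level tags and keeps blocks with more than $minWords words.
function extractTextBlocks(string $html, int $minWords = 100): array
{
    if (trim($html) === '') {
        return [];
    }

    // Assumed lists; adjust them to the markup you actually have.
    $blockTags = ['p', 'div', 'br', 'li', 'ul', 'ol', 'td', 'th', 'tr', 'table',
                  'h1', 'h2', 'h3', 'h4', 'h5', 'h6', 'blockquote', 'pre',
                  'section', 'article', 'body'];
    $skipTags = ['script', 'style', 'noscript', 'head'];

    // Real pages are rarely valid HTML, so silence libxml warnings while parsing.
    // Note: loadHTML() may misread UTF-8 input that declares no charset.
    $previous = libxml_use_internal_errors(true);
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $blocks = [];
    $buffer = '';

    // Close the current block: normalize whitespace, keep it if long enough.
    $flush = function () use (&$blocks, &$buffer, $minWords) {
        $text = trim((string) preg_replace('/\s+/', ' ', $buffer));
        $buffer = '';
        if (str_word_count($text) > $minWords) {
            $blocks[] = $text;
        }
    };

    // Depth-first walk: text nodes go into the buffer, inline tags are
    // transparent, and block-level tags end the block on entry and on exit.
    $walk = function (DOMNode $node) use (&$walk, &$buffer, $flush, $blockTags, $skipTags) {
        foreach ($node->childNodes as $child) {
            if ($child instanceof DOMText) {
                $buffer .= $child->nodeValue;
            } elseif ($child instanceof DOMElement) {
                $tag = strtolower($child->nodeName);
                if (in_array($tag, $skipTags, true)) {
                    continue;
                }
                $isBlock = in_array($tag, $blockTags, true);
                if ($isBlock) {
                    $flush();
                }
                $walk($child);
                if ($isBlock) {
                    $flush();
                }
            }
        }
    };

    $walk($doc);
    $flush();

    return $blocks;
}
```

It could be called as extractTextBlocks(file_get_contents('page.html')), where page.html is a placeholder file name. Keep in mind that str_word_count() counts words according to the current locale and ignores numbers, so the threshold is approximate for non-English text.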
Source: https://stackoverflow.com/questions/5239539/how-to-extract-blocks-of-text-from-a-html-page