I need a PHP script which takes a URL of a web page and then echoes how many times a word is mentioned.
This is a generic HTML page:
The previous code is a starting point. The next step is to delete the HTML tags with regular expressions. Look at the preg_* functions (the older ereg and eregi functions were deprecated in PHP 5.3 and removed in PHP 7). Some other tricks are required for style and script tags (you have to remove their content). Points and commas have to be removed too...
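As a rough sketch of the regular-expression approach described above, something like the following could work for simple pages. The function name html_to_words is made up for illustration, and the patterns assume reasonably well-formed markup.

```php
<?php
// Hypothetical helper (not from the original answer): reduce HTML to lower-case words.
function html_to_words(string $html): array
{
    // Remove <script> and <style> elements together with their content.
    $text = preg_replace('@<(script|style)\b[^>]*>.*?</\1>@si', ' ', $html);
    // Replace every remaining tag with a space so adjacent words stay apart.
    $text = preg_replace('@<[^>]*>@', ' ', $text);
    // Decode entities such as &amp; before splitting.
    $text = html_entity_decode($text, ENT_QUOTES | ENT_HTML5, 'UTF-8');
    // Split on anything that is not a letter, digit or apostrophe;
    // this is what drops points and commas. strtolower() only affects ASCII letters.
    return preg_split("/[^\p{L}\p{N}']+/u", strtolower($text), -1, PREG_SPLIT_NO_EMPTY);
}

$html = '<p>Hello, world. <b>Hello</b> again!</p><script>var hello = 1;</script>';
print_r(array_count_values(html_to_words($html)));
```

Replacing tags with a space rather than an empty string is a deliberate choice here: it keeps the last word of one element from fusing with the first word of the next.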
This is a complex job that you should not attempt on your own.
You have to extract text that is not part of tags/comments and is not a child of elements such as script and style. For this, you'll also need a lax HTML parser (like the one implemented in libxml2 and used in DOMDocument).
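A minimal sketch of that idea with DOMDocument and DOMXPath might look like this. The function name visible_text is hypothetical, and the sketch assumes the page has a body element and an encoding that loadHTML() detects correctly (it falls back to ISO-8859-1 when no charset is declared).

```php
<?php
// Hypothetical helper: collect text with DOMDocument, skipping script and style.
function visible_text(string $html): string
{
    $doc = new DOMDocument();
    // Collect parser warnings internally; libxml's HTML parser tolerates bad markup.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    // Text nodes in the body with no <script> or <style> ancestor.
    // Comments are separate node types, so text() never returns them.
    $nodes = $xpath->query('//body//text()[not(ancestor::script) and not(ancestor::style)]');

    $parts = [];
    foreach ($nodes as $node) {
        $parts[] = trim($node->nodeValue);
    }
    // Join with spaces and collapse runs of whitespace.
    return trim(preg_replace('/\s+/', ' ', implode(' ', $parts)));
}

echo visible_text('<html><body><h1>Title</h1><script>var x = 1;</script><p>Some <b>bold</b> text<!-- note --></p></body></html>');
```

The returned string can then be passed to whatever tokenizer and counter you choose.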
Then you have to tokenize the text, which presents its own challenges. Finally, you'd be interested in some form of stemming before proceeding to count the terms.
I recommend you use specialized tools for this. I haven't used any of these, but you can try HTMLParser for parsing and Lucene for tokenization/stemming (the purpose of Lucene is Text Retrieval, but those operations are necessary for building the index).
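To make the tokenize, stem and count steps concrete, here is a toy sketch. Both function names are invented for illustration, and crude_stem is only a stand-in that strips a few English suffixes; it is not a real stemmer and is no substitute for the specialized tools mentioned above.

```php
<?php
// Hypothetical tokenizer: lower-case the text and keep runs of letters.
function tokenize(string $text): array
{
    preg_match_all('/\p{L}+/u', strtolower($text), $matches);
    return $matches[0];
}

// Deliberately crude stand-in for a stemmer: strips a few English suffixes
// from words that are long enough. A real stemmer handles far more cases.
function crude_stem(string $word): string
{
    foreach (['ing', 'ed', 's'] as $suffix) {
        if (strlen($word) > strlen($suffix) + 3 && substr($word, -strlen($suffix)) === $suffix) {
            return substr($word, 0, -strlen($suffix));
        }
    }
    return $word;
}

$tokens = tokenize('Counting words: he counted the words and counts this word.');
print_r(array_count_values(array_map('crude_stem', $tokens)));
```

With this input, "counting", "counted" and "counts" are all tallied under "count", which is the effect stemming is meant to have on the final numbers.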
The one line below will do a case insensitive word count after stripping all HTML tags from your string.
print_r(array_count_values(str_word_count(strip_tags(strtolower($str)), 1)));
To grab the source code of a page you can use cURL or file_get_contents():
$str = file_get_contents('http://www.example.com/');
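For the cURL route, a sketch along these lines should do, assuming the cURL extension is installed. The function name fetch_page and the 10-second timeout are arbitrary choices for illustration. Note that file_get_contents() only accepts URLs when allow_url_fopen is enabled.

```php
<?php
// Hypothetical helper: fetch a page body with the cURL extension.
function fetch_page(string $url): string
{
    $ch = curl_init($url);
    curl_setopt_array($ch, [
        CURLOPT_RETURNTRANSFER => true, // return the body instead of printing it
        CURLOPT_FOLLOWLOCATION => true, // follow redirects
        CURLOPT_TIMEOUT        => 10,   // give up after 10 seconds
    ]);
    $body = curl_exec($ch);
    if ($body === false) {
        throw new RuntimeException('Fetch failed: ' . curl_error($ch));
    }
    return $body;
}

// Usage (performs a network request, so it is left commented out):
// $str = fetch_page('http://www.example.com/');
```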
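From inside out, the one-liner works as follows:
strtolower() converts the string to lower case.
strip_tags() removes the HTML tags.
str_word_count() with a format of 1 returns an array containing all the words found inside the string.
array_count_values() counts how many times each word occurs.
print_r() prints the resulting array.

The below script will read the contents of the remote URL, remove the HTML tags, and count the occurrences of each unique word therein.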
Caveat: In your expected output, "This" has a value of 2, but the below is case-sensitive, so "this" and "This" are recorded as separate words. You could convert the whole input string to lower case before processing if the original case is not significant for your purposes.
Additionally, as only a basic strip_tags is run on the input, malformed tags will not be removed, so the assumption is that your source HTML is valid.
Edit: Charlie points out in the comments that things like the head section will still be counted. With the help of a function defined in the user notes of the strip_tags function, these are also now taken care of.
generichtml.com
<html>
<body>
<h1> This is the title </h1>
<p> some description text here, <b>this</b> is a word. </p>
</body>
</html>
parser.php
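<?php
// URL of the page to analyse (placeholder; point it at the page you want to count)
$htmlurl = 'http://generichtml.com/';
// Fetch remote html
$contents = file_get_contents($htmlurl);
// Get rid of style, script etc.
// The U modifier is left out on purpose: combined with ".*?" it makes the match greedy.
$search = array('@<script[^>]*?>.*?</script>@si', // Strip out javascript
    '@<head\b[^>]*>.*?</head>@si', // Lose the head section
    '@<style[^>]*?>.*?</style>@si', // Strip style tags properly
    '@<![\s\S]*?--[ \t\n\r]*>@' // Strip multi-line comments including CDATA
);
$contents = preg_replace($search, '', $contents);
// Count the occurrences of each word in the remaining text
$result = array_count_values(
    str_word_count(
        strip_tags($contents), 1
    )
);
print_r($result);
?>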
Output:
Array
(
[This] => 1
[is] => 2
[the] => 1
[title] => 1
[some] => 1
[description] => 1
[text] => 1
[here] => 1
[this] => 1
[a] => 1
[word] => 1
)
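Since the question asks about a single word, the counts array can simply be indexed by that word. The sketch below also lower-cases the text first, as suggested in the caveat, so that "This" and "this" are counted together. The function name count_word is made up for illustration.

```php
<?php
// Hypothetical helper: how many times does $word occur in $html, ignoring case?
function count_word(string $html, string $word): int
{
    $counts = array_count_values(
        str_word_count(strtolower(strip_tags($html)), 1)
    );
    // Words that never occur have no key, so fall back to 0.
    return $counts[strtolower($word)] ?? 0;
}

$html = '<html><body><h1> This is the title </h1>'
      . '<p> some description text here, <b>this</b> is a word. </p></body></html>';

echo count_word($html, 'This'), "\n";    // 2
echo count_word($html, 'is'), "\n";      // 2
echo count_word($html, 'missing'), "\n"; // 0
```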
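This is my code for counting the words in text that contains HTML tags:
// $sayilacak_metin is the text to count; $sonuc receives the number of words
$sayilacak_metin = str_replace("&nbsp;", " ", $sayilacak_metin); // non-breaking space entities
$sayilacak_metin = preg_replace("/<([^>]*(<|$))/", "&lt;$1", $sayilacak_metin); // escape a "<" that never closes
$sayilacak_metin = strip_tags($sayilacak_metin);
// UTF-8 non-breaking space, replaced as one sequence so other multi-byte characters stay intact
$sayilacak_metin = str_replace("\xC2\xA0", " ", $sayilacak_metin);
// Collapse every run of whitespace (spaces, tabs, line breaks) into one space
$sayilacak_metin = preg_replace('/\s+/', ' ', $sayilacak_metin);
$sayilacak_metin = trim($sayilacak_metin);
// Split into words; PREG_SPLIT_NO_EMPTY drops empty pieces but keeps a word such as "0"
$parca = preg_split('/\s+/', $sayilacak_metin, -1, PREG_SPLIT_NO_EMPTY);
$sonuc = count($parca);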