Counting words on an HTML web page using PHP

难免孤独 2020-12-30 10:40

I need a PHP script which takes a URL of a web page and then echoes how many times a word is mentioned.

Example

This is a generic HTML page:

5 Answers
  • 2020-12-30 11:08

    The previous code is a starting point. The next step is to delete the HTML tags with regular expressions. Look at the ereg and eregi functions (both were removed in PHP 7; the preg_* functions such as preg_replace are their replacements). Style and script tags need some extra tricks, because their content has to be removed as well. Periods and commas have to be removed too.

  • 2020-12-30 11:08

    This is a complex job that you should not attempt on your own.

    You have to extract text that is not part of tags or comments and is not a child of elements such as script and style. For this, you'll also need a lax HTML parser (like the one implemented in libxml2 and used in DOMDocument).
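
    As a rough illustration of that extraction step, here is a minimal sketch built on DOMDocument and DOMXPath. The helper name extractVisibleText() and the sample markup are assumptions made for this example, and the input is assumed to be a non-empty UTF-8 or ASCII string.

```php
<?php
// Sketch: pull the visible text out of an HTML string with DOMDocument.
// extractVisibleText() is a hypothetical helper, not part of any library.
function extractVisibleText(string $html): string
{
    $doc = new DOMDocument();

    // The libxml HTML parser is lenient; collect its warnings about sloppy
    // markup internally instead of printing them, then restore the setting.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    // Select text nodes that have no <script> or <style> ancestor.
    // Comments are separate node types, so text() never returns them.
    $xpath = new DOMXPath($doc);
    $nodes = $xpath->query('//text()[not(ancestor::script or ancestor::style)]');

    $parts = [];
    foreach ($nodes as $node) {
        $text = trim($node->nodeValue);
        if ($text !== '') {
            $parts[] = $text;
        }
    }

    return implode(' ', $parts);
}

$html = '<html><head><style>p { color: red; }</style></head>'
      . '<body><h1>This is the title</h1><script>var x = 1;</script>'
      . '<p>some text, <b>this</b> is a word.</p><!-- a comment --></body></html>';

echo extractVisibleText($html), "\n";
// This is the title some text, this is a word.
```

    The XPath filter is the part that does the work: it keeps the parser's tree intact and simply skips any text that sits under script or style.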

    Then you have to tokenize the text, which presents its own challenges. Finally, you'd be interested in some form of stemming before proceeding to count the terms.
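
    To give an idea of what tokenizing involves, here is a small sketch that splits on anything that is not a letter, a digit or an apostrophe. The helper name tokenize() is an assumption for this example, the input is assumed to be UTF-8, and stemming is not attempted at all.

```php
<?php
// Sketch: Unicode-aware tokenization. tokenize() is a hypothetical helper.
function tokenize(string $text): array
{
    // Lower-case the text; fall back to strtolower() if mbstring is missing.
    $text = function_exists('mb_strtolower')
        ? mb_strtolower($text, 'UTF-8')
        : strtolower($text);

    // Split on every run of characters that is neither a letter (\p{L}),
    // a digit (\p{N}) nor an apostrophe, and drop empty pieces.
    $tokens = preg_split('/[^\p{L}\p{N}\']+/u', $text, -1, PREG_SPLIT_NO_EMPTY);

    // preg_split() returns false on failure (for example invalid UTF-8).
    return $tokens === false ? [] : $tokens;
}

print_r(array_count_values(tokenize('This is a word, this is.')));
```

    Keeping the apostrophe inside the token is a deliberate choice so that "don't" stays one word; other rules (hyphens, numbers, URLs) would need their own decisions.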

    I recommend you use specialized tools for this. I haven't used any of these, but you can try HTMLParser for parsing and Lucene for tokenization/stemming (the purpose of Lucene is Text Retrieval, but those operations are necessary for building the index).

  • 2020-12-30 11:29

    The one-liner below does a case-insensitive word count after stripping all HTML tags from your string.

    print_r(array_count_values(str_word_count(strip_tags(strtolower($str)), 1)));
    

    To grab the source code of a page you can use cURL or file_get_contents():

    $str = file_get_contents('http://www.example.com/');
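
    file_get_contents() returns false when the request fails, so a slightly more defensive fetch might look like the sketch below. The function name fetchPage(), the 10-second timeout and the user-agent string are all assumptions made for this example.

```php
<?php
// Sketch: fetch a page with a timeout and an explicit failure check.
// fetchPage() is a hypothetical helper; timeout and user agent are made up.
function fetchPage(string $url, int $timeoutSeconds = 10): string
{
    $context = stream_context_create([
        'http' => [
            'timeout'    => $timeoutSeconds,
            'user_agent' => 'word-counter-sketch/0.1',
        ],
    ]);

    // file_get_contents() returns false on failure; @ hides the warning
    // so the failure can be reported as an exception instead.
    $body = @file_get_contents($url, false, $context);
    if ($body === false) {
        throw new RuntimeException("Could not fetch $url");
    }

    return $body;
}
```

    With that in place, the assignment becomes $str = fetchPage('http://www.example.com/'); wrapped in a try/catch if you want to handle the failure yourself.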
    

    From inside out:

    1. Use strtolower() to make everything lower case.
    2. Strip HTML tags using strip_tags()
    3. Create an array of the words used with str_word_count(). Passing 1 as the second argument makes it return an array containing all the words found inside the string.
    4. Use array_count_values() to capture words used more than once by counting the occurrence of each value in your array of words.
    5. Use print_r() to display the results.
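
    Since the question asks how often one particular word is mentioned, the same pipeline can be wrapped in a small function. This is only a sketch: countWordInHtml() is a made-up name, and the sample markup is borrowed from the generic page shown in another answer.

```php
<?php
// Sketch: count how often a single word occurs in an HTML string.
// countWordInHtml() is a hypothetical helper built on the one-liner's steps.
function countWordInHtml(string $html, string $word): int
{
    $counts = array_count_values(
        str_word_count(strip_tags(strtolower($html)), 1)
    );

    // The text was lower-cased, so look the word up in lower case too.
    return $counts[strtolower($word)] ?? 0;
}

$str = '<h1> This is the title </h1>'
     . '<p> some description text here, <b>this</b> is a word. </p>';

echo countWordInHtml($str, 'this'), "\n"; // 2
echo countWordInHtml($str, 'word'), "\n"; // 1
echo countWordInHtml($str, 'php'), "\n";  // 0
```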
  • 2020-12-30 11:29

    The script below reads the contents of the remote URL, removes the HTML tags, and counts the occurrences of each unique word in it.

    Caveat: In your expected output, "This" has a value of 2, but the script below is case-sensitive, so "this" and "This" are recorded as separate words. You could convert the whole input string to lower case before processing if the original case is not significant for your purposes.

    Additionally, as only a basic strip_tags is run on the input, malformed tags will not be removed, so the assumption is that your source HTML is valid.

    Edit: Charlie points out in the comments that things like the head section will still be counted. With the help of a function defined in the user notes of the strip_tags function, these are also now taken care of.

    generichtml.com

    <html>
    <body>
    <h1> This is the title </h1>
    <p> some description text here, <b>this</b> is a word. </p>
    </body>
    </html>
    

    parser.php

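    <?php
    // URL of the page to analyse (placeholder for the sample page above)
    $htmlurl = 'http://generichtml.com/';

    // Fetch remote html
    $contents = file_get_contents($htmlurl);
    if ($contents === false) {
        exit("Could not fetch $htmlurl\n");
    }

    // Get rid of style, script etc. The U modifier is left out on purpose:
    // combined with a lazy ".*?" it makes the quantifier greedy again, and
    // the comment pattern starts at "<!--" so a doctype is not swallowed.
    $search = array('@<script[^>]*?>.*?</script>@si',  // Strip out javascript
               '@<head\b[^>]*>.*?</head>@si',       // Lose the head section
               '@<style[^>]*?>.*?</style>@si',     // Strip style tags properly
               '@<!--[\s\S]*?--[ \t\n\r]*>@'       // Strip multi-line comments
    );

    $contents = preg_replace($search, '', $contents);

    // Count how often each word occurs in the remaining text
    $result = array_count_values(
                  str_word_count(
                      strip_tags($contents), 1
                      )
                  );

    print_r($result);

    ?>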

    Output:

    Array
    (
        [This] => 1
        [is] => 2
        [the] => 1
        [title] => 1
        [some] => 1
        [description] => 1
        [text] => 1
        [here] => 1
        [this] => 1
        [a] => 1
        [word] => 1
    )
    
  • 2020-12-30 11:34

    This is my code for counting the words in a string that contains HTML tags:

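    // Step 1: convert every &nbsp; entity to a plain space
    $sayilacak_metin = str_replace("&nbsp;", " ", $sayilacak_metin);
    // Step 2: escape any "<" that never gets a closing ">"
    $sayilacak_metin = preg_replace("/<([^>]*(<|$))/", "&lt;$1", $sayilacak_metin);
    // Step 3: strip the HTML tags
    $sayilacak_metin = strip_tags($sayilacak_metin);
    // Steps 4-6: replace UTF-8 non-breaking spaces (bytes 0xC2 0xA0) as a pair,
    // so other multibyte characters stay intact, then collapse whitespace runs
    $sayilacak_metin = str_replace("\xC2\xA0", " ", $sayilacak_metin);
    $sayilacak_metin = preg_replace('/\s+/', ' ', $sayilacak_metin);
    // Step 7: trim the beginning and end of the string
    $sayilacak_metin = trim($sayilacak_metin);
    // Step 8: split the text into an array of words
    $parca = explode(" ", $sayilacak_metin);
    // Step 9: count the non-empty entries (a literal "0" still counts)
    $sonuc = count(array_filter($parca, 'strlen'));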
    
    • Step 1: Convert every &nbsp; to a space
    • Step 2: Fix broken HTML tags (if they are not fixed, strip_tags() will break the string)
    • Step 3: Strip the HTML tags
    • Steps 4, 5 and 6: Clear hidden whitespace, newlines and tabs
    • Step 7: Trim the beginning and end of the string
    • Step 8: Split the text into an array of words
    • Step 9: Count the filtered array
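
    The steps above can also be sketched as one function that decodes entities and splits on Unicode whitespace, so non-breaking spaces are handled as characters rather than as single bytes. The name countWordsInHtml() is an assumption for this example, and the input is assumed to be UTF-8.

```php
<?php
// Sketch: count all words in an HTML fragment, treating NBSP as whitespace.
// countWordsInHtml() is a hypothetical helper; input is assumed to be UTF-8.
function countWordsInHtml(string $html): int
{
    // Escape "<" characters that never get a closing ">" (Step 2 above),
    // otherwise strip_tags() would drop the text that follows them.
    $html = preg_replace('/<([^>]*(<|$))/', '&lt;$1', $html) ?? $html;

    // Strip the tags, then turn entities such as &nbsp; into real characters.
    $text = html_entity_decode(strip_tags($html), ENT_QUOTES | ENT_HTML5, 'UTF-8');

    // Split on runs of whitespace, including U+00A0 (no-break space).
    $words = preg_split('/[\s\x{00A0}]+/u', $text, -1, PREG_SPLIT_NO_EMPTY);

    return $words === false ? 0 : count($words);
}

echo countWordsInHtml('<p>Hello&nbsp;world, this is <b>PHP</b></p>'), "\n"; // 5
echo countWordsInHtml('3 < 5 is true'), "\n";                               // 5
```

    Like the original, this counts whitespace-separated pieces, so a stray "<" or a punctuation mark standing on its own is counted as a word.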