PHP Detect Duplicate Text

后端 未结 9 1347
夕颜
夕颜 2021-02-05 07:52

I have a site where users can put in a description about themselves.

Most users write something appropriate but some just copy/paste the same text a number of times (to

相关标签:
9条回答
  • 2021-02-05 08:11

    Here's a the code of the function you're looking for in the description:

    <?php
    function duplicate(){
        $txt = strtolower("Love a and peace love a and peace love a and peace love a and peace love a and peace love a and peace");
        $strings = explode(" ",$txt);
        $set = 2 ;
        for($i=0;$i < sizeof($strings);$i++){
            $count = 0;
            $current = $strings[$i];
            for($j=$i+1;$j < sizeof($strings);$j++){
                if($strings[$j]!==$current){
                    continue;
                }else if($count<$set){
                    $count++;
                }else{
                    echo ("String ".$current." repeated more than ".$set." times\n");
                }
            }
        }
    }
    echo("Hello World!\n");
    duplicate();
    ?>
    
    0 讨论(0)
  • 2021-02-05 08:22
    // 3 examples of how you might detect repeating user input
    
    // use preg_match
    
    // pattern to match agains
    $pattern = '/^text goes here$/';
    
    // the user input
    $input = 'text goes here';
    
    // check if its match
    $repeats = preg_match($pattern, $input);
    
    if ($repeats) {
        var_dump($repeats);
    } else {
        // do something else
    }
    
    // use strpos
    
    $string = 'text goes here';
    $input = 'text goes here';
    $repeats = strpos($string, $input);
    
    if ($repeats !== false) {
        # code...
        var_dump($repeats);
    } else {
        // do something else
    }
    
    // or you could do something like:
    function repeatingWords($str)
    {
        $words = explode(' ', trim($str));  //Trim to prevent any extra blank
        if (count(array_unique($words)) == count($words)) {
            return true; //Same amount of words
        }
    
        return false;
    }
    
    $string = 'text goes here. text goes here. ';
    
    if (repeatingWords($string)) {
        var_dump($string);
    } else {
        // do something else
    }
    
    0 讨论(0)
  • 2021-02-05 08:23

    Another idea would be to use substr_count iteration:

    $str = "Love a and peace love a and peace love a and peace love a and peace love a and peace love a and peace";
    
    $rep = "";
    
    $str = strtolower($str);
    for($i=0,$len=strlen($str),$pattern=""; $i<$len; ++$i) {
      $pattern.= $str[$i];
      if(substr_count($str,$pattern)>1)
        $rep = strlen($rep)<strlen($pattern) ? $pattern : $rep;
      else
        $pattern = "";
    }
    
    // warn if 20%+ of the string is repetitive
    if(strlen($rep)>strlen($str)/5) echo "Repetitive string alert!";
    else echo "String seems to be non-repetitive.";
    
    echo " Longest pattern found: '$rep'";
    

    Which would output

    Repetitive string alert! Longest pattern found: 'love a and peace love a and peace love a and peace'
    
    0 讨论(0)
  • 2021-02-05 08:26

    I am not sure whether it is a good idea to combat such problem. If a person wants to put junk in aboutme field, they will always come up with the idea how to do it. But I will ignore this fact and combat the problem as an algorithmic challenge:

    Having a string S, which consists of the substrings (which can appear many times and non-overlapping) find the substring it consist of.

    The definition is louse and I assume that the string is already converted to lowercase.

    First an easier way:


    Use modification of a longest common subsequence which has an easy DP programming solution. But instead of finding a subsequence in two different sequences, you can find longest common subsequence of the string with respect to the same string LCS(s, s).

    It sounds stupid at the beginning (surely LCS(s, s) == s), but we actually do not care about the answer, we care about the DP matrix that it get.

    Let's look at the example: s = "abcabcabc" and the matrix is:

    [0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
    [0, 1, 0, 0, 1, 0, 0, 1, 0, 0]
    [0, 0, 2, 0, 0, 2, 0, 0, 2, 0]
    [0, 0, 0, 3, 0, 0, 3, 0, 0, 3]
    [0, 1, 0, 0, 4, 0, 0, 4, 0, 0]
    [0, 0, 2, 0, 0, 5, 0, 0, 5, 0]
    [0, 0, 0, 3, 0, 0, 6, 0, 0, 6]
    [0, 1, 0, 0, 4, 0, 0, 7, 0, 0]
    [0, 0, 2, 0, 0, 5, 0, 0, 8, 0]
    [0, 0, 0, 3, 0, 0, 6, 0, 0, 9]
    

    Note the nice diagonals there. As you see the first diagonal ends with 3, second with 6 and third with 9 (our original DP solution which we do not care).

    This is not a coincidence. Hope that after looking in more details about how DP matrix is constructed you can see that these diagonals correspond to duplicate strings.

    Here is an example for s = "aaabasdfwasfsdtasaaabasdfwasfsdtasaaabasdfwasfsdtasaaabasdfwasfsdtas" and the very last row in the matrix is: [0, 0, 0, 0, 0, 0, 2, 0, 0, 0, 0, 2, 0, 1, 0, 0, 0, 17, 0, 0, 0, 0, 0, 2, 0, 0, 0, 0, 2, 0, 1, 0, 0, 0, 34, 0, 0, 0, 0, 0, 2, 0, 0, 0, 0, 2, 0, 1, 0, 0, 0, 51, 0, 0, 0, 0, 0, 2, 0, 0, 0, 0, 2, 0, 1, 0, 0, 0, 68].

    As you see big numbers (17, 34, 51, 68) there correspond to the end of the diagonals (there is also some noise there just because I specifically added small duplicate letters like aaa).

    Which suggest that we can just find the gcd of biggest two numbers gcd(68, 51) = 17 which will be the length of our repeated substring.

    Here just because we know that the the whole string consists of repeated substrings, we know that it starts at the 0-th position (if we do not know it we would need to find the offset).

    And here we go: the string is "aaabasdfwasfsdtas".

    P.S. this method allows you to find repeats even if they are slightly modified.

    For people who would like to play around here is a python script (which was created in a hustle so feel free to improve):

    def longest_common_substring(s1, s2):
        m = [[0] * (1 + len(s2)) for i in xrange(1 + len(s1))]
        longest, x_longest = 0, 0
        for x in xrange(1, 1 + len(s1)):
            for y in xrange(1, 1 + len(s2)):
                if s1[x - 1] == s2[y - 1]:
                    m[x][y] = m[x - 1][y - 1] + 1
                    if m[x][y] > longest:
                        longest = m[x][y]
                else:
                    m[x][y] = 0
        return m
    
    s = "aaabasdfwasfsdtasaaabasdfwasfsdtasaaabasdfwasfsdtasaaabasdfwasfsdtas"
    m = longest_common_substring(s, s)
    import numpy as np
    import matplotlib.pyplot as plt
    import matplotlib.cm as cm
    M = np.array(m)
    print m[-1]
    arr = np.asarray(M)
    plt.imshow(arr, cmap = cm.Greys_r, interpolation='none')
    plt.show()
    

    I told about the easy way, and forgot to write about the hard way. It is getting late, so I will just explain the idea. The implementation is harder and I am not sure whether it will give you better results. But here it is:

    Use the algorithm for longest repeated substring (you will need to implement trie or suffix tree which is not easy in php).

    After this:

    s = "aaabasdfwasfsdtasaaabasdfwasfsdtasaaabasdfwasfsdtasaaabasdfwasfsdtas"
    s1 = largest_substring_algo1(s)
    

    Took the implementation of largest_substring_algo1 from here. Actually it is not the best (just for showing the idea) as it does not use the above-mention data-structures. The results for s and s1 are:

    aaabasdfwasfsdtasaaabasdfwasfsdtasaaabasdfwasfsdtasaaabasdfwasfsdtas
    aaabasdfwasfsdtasaaabasdfwasfsdtasaaabasdfwasfsdtasaa
    

    As you see the difference between them is actually the substring which was duplicated.

    0 讨论(0)
  • 2021-02-05 08:28

    I think you are on the right track breaking down the string and looking at repeated words.

    Here is some code though which does not use a PCRE and leverages PHP native string functions (str_word_count and array_count_values):

    <?php
        $words = str_word_count("Love a and peace love a and peace love a and peace love a and peace love a and peace love a and peace", 1);
        $words = array_count_values($words);
    
      var_dump($words);
      /*
      array(5) {
        ["Love"]=>
        int(1)
        ["a"]=>
        int(6)
        ["and"]=>
        int(6)
        ["peace"]=>
        int(6)
        ["love"]=>
        int(5)
      }
      */
    

    Some tweaks might be to:

    • setup a list of common words to be ignored
    • look at order of words (previous and next), not just number of occurrences
    0 讨论(0)
  • 2021-02-05 08:30

    I think the approach of finding duplicate words, will be messy. Most likely you'll get duplicate words in real descriptions "I really, really, really, like ice creme, especially vanilla ice creme".

    A better approach, is to split the string to get the words, find all the unique words, add all the character counts of the unique words, and set that too some limit. Say, you require 100 character descriptions, require around 60 unique characters from words.

    Copying @ficuscr's approach

    $words = str_word_count("Love a and peace love a and peace love a and peace love a and peace love a and peace love a and peace", 1);
    $total = 0;
    foreach ($words as $key => $count) { $total += strlen($key) }
    
    0 讨论(0)
提交回复
热议问题