PHP Detect Duplicate Text

夕颜 2021-02-05 07:52

I have a site where users can put in a description about themselves.

Most users write something appropriate but some just copy/paste the same text a number of times (to

9 Answers
  • 2021-02-05 08:33

    This is a basic text classification problem. There are lots of articles out there on how to determine if some text is spam/not spam which I'd recommend digging into if you really want to get into the details. A lot of it is probably overkill for what you need to do here.

    Granted, one approach would be to reconsider why you're requiring people to enter longer bios, but I'll assume you've already decided that forcing people to enter more text is the way to go.

    Here's an outline of what I would do:

    1. Build a histogram of word occurrences for the input string
    2. Study the histograms of some valid and invalid text
    3. Come up with a formula for classifying a histogram as valid or not

    This approach requires you to figure out what differs between the two sets. Intuitively, I'd expect spam to show fewer unique words and, if you plot the histogram values, more of the area under the curve concentrated in the top few words.

    Here's some sample code to get you going:

    $str = 'Love a and peace love a and peace love a and peace love a and peace love a and peace love a and peace';
    
    // Build a histogram mapping words to occurrence counts
    $hist = array();
    
    // Split on any number of consecutive whitespace characters
    foreach (preg_split('/\s+/', $str) as $word)
    {
      // Force all words lowercase to ignore capitalization differences
      $word = strtolower($word);
    
      // Count occurrences of the word
      if (isset($hist[$word]))
      {
        $hist[$word]++;
      }
      else
      {
        $hist[$word] = 1;
      }
    }
    
    // Once you're done, extract only the counts
    $vals = array_values($hist);
    rsort($vals); // Sort max to min
    
    // Now that you have the counts, analyze and decide valid/invalid
    var_dump($vals);
    

    When you run this code on some repetitive strings, you'll see the difference. Here's a plot of the $vals array from the example string you gave:

    Compare that with the first two paragraphs of Martin Luther King Jr.'s bio from Wikipedia:

    A long tail indicates lots of unique words. There's still some repetition, but the general shape shows some variation.
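
    As one possible way to turn step 3 into numbers, here's a minimal sketch that reduces the histogram to two metrics: the fraction of distinct words and the share of the text covered by the most frequent words. The function name wordStats, the $topN parameter, and both metrics are assumptions made for illustration; any cutoff you apply to them would still have to come from studying your own valid and invalid bios.

```php
<?php
// Sketch: summarize a word histogram with two illustrative metrics.
// wordStats(), $topN and both metrics are hypothetical names and choices.
function wordStats(string $text, int $topN = 3): array
{
    // Lowercase, split on whitespace, and drop empty pieces
    $words = preg_split('/\s+/', strtolower($text), -1, PREG_SPLIT_NO_EMPTY);
    $total = count($words);
    if ($total === 0) {
        return ['total' => 0, 'unique_ratio' => 0.0, 'top_share' => 0.0];
    }

    // Same histogram as the loop above, keeping only the counts
    $counts = array_values(array_count_values($words));
    rsort($counts); // Sort max to min

    return [
        'total'        => $total,
        // Fraction of words that are distinct: low for copy/pasted text
        'unique_ratio' => count($counts) / $total,
        // Fraction of all words covered by the $topN most frequent ones
        'top_share'    => array_sum(array_slice($counts, 0, $topN)) / $total,
    ];
}

$stats = wordStats(str_repeat('love a and peace ', 6));
printf("unique_ratio=%.2f top_share=%.2f\n", $stats['unique_ratio'], $stats['top_share']);
// prints: unique_ratio=0.17 top_share=0.75
```

    A short tail shows up here as a low unique_ratio and a high top_share; text with a long tail should move both values in the opposite direction.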

    FYI, PHP has a stats extension (available through PECL) you can install if you're going to be doing lots of math like standard deviation, distribution modeling, etc.

  • 2021-02-05 08:35

    You have a tricky problem on your hands, primarily because your requirements are somewhat unclear.

    You indicate you want to disallow repeated text, because it's "bad".

    Consider someone who puts the last stanza of Robert Frost's "Stopping by Woods on a Snowy Evening" in their profile:

    These woods are lovely, dark and deep
    but I have promises to keep
    and miles to go before I sleep
    and miles to go before I sleep
    

    You might consider this good, but it does contain a repetition. So what's good, and what's bad? (Note that this is not an implementation problem just yet; you're still looking for a way to define "bad repetitions".)

    Directly detecting duplicates thus proves tricky. So let's devolve to tricks.

    Compression works by finding redundant data and encoding it in less space, so a very repetitive text compresses very well. One trick is to gzip the text and look at the compression ratio (compressed size divided by original size), then tweak the allowed ratio until you find the results acceptable.

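    Implementation:

    $THRESHOLD = ???; // minimum acceptable compressed/original size ratio
    $bio = ???;       // the user-supplied text
    $zippedbio = gzencode($bio);
    $compression_ratio = strlen($zippedbio) / strlen($bio);
    if ($compression_ratio >= $THRESHOLD) {
      // ok: the text did not compress much, so it is not very repetitive
    } else {
      // not ok: the text compressed well, so it is very repetitive
    }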
    

    A couple of experimental results from examples found in this question/answers:

    • "Love a and peace love a and peace love a and peace love a and peace love a and peace love a and peace": 0.3960396039604
    • "These woods are lovely, dark and deep but I have promises to keep and miles to go before I sleep and miles to go before I sleep": 0.78461538461538
    • "aaabasdfwasfsdtasaaabasdfwasfsdtasaaabasdfwasfsdtasaaabasdfwasfsdtas": 0.58823529411765

    suggest a threshold value of around 0.6 before rejecting it as too repetitive.
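
    For reference, the same check can be packaged as a small function. This is only a sketch: the names compressionRatio and isTooRepetitive are made up for illustration, the 0.6 default is simply the value suggested above, and treating an empty bio as "not repetitive" is an assumption you may want to change.

```php
<?php
// Sketch: compression-ratio check with a guard for empty input.
// compressionRatio() and isTooRepetitive() are hypothetical helper names.
function compressionRatio(string $text): ?float
{
    if ($text === '') {
        return null; // The ratio is undefined for empty input
    }

    $zipped = gzencode($text); // Requires the zlib extension
    if ($zipped === false) {
        return null; // Compression failed
    }

    // Compressed size relative to original size: lower means more redundant
    return strlen($zipped) / strlen($text);
}

function isTooRepetitive(string $text, float $threshold = 0.6): bool
{
    $ratio = compressionRatio($text);

    // Assumption: empty or uncompressible input is not flagged
    return $ratio !== null && $ratio < $threshold;
}

var_dump(isTooRepetitive(str_repeat('love and peace ', 20))); // bool(true)
```

    Keep in mind that gzencode() adds about 18 bytes of gzip header and trailer, so very short texts can score a ratio above 1 even when they repeat themselves. Combining this with a minimum length is probably sensible; switching to gzdeflate() would avoid the framing, but then the ratios listed above no longer apply as-is.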

  • 2021-02-05 08:38

    You could use a regex, like this:

    if (preg_match('/(.{10,})\\1{2,}/', $theText)) {
        echo "The string is repeated.";
    }
    

    Explanation:

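    • (.{10,}) looks for and captures a string that is at least 10 characters long (note that . does not match newlines unless you add the s modifier)
    • \\1{2,} requires the captured string to repeat at least 2 more times immediately afterwards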

    Possible tweaks to suit your needs:

    • Change 10 to a higher or lower number to match longer or shorter repeated strings. I just used 10 as an example.
    • If you want to catch even one repetition (love and peace love and peace), delete the {2,}. If you want to catch a higher number of repetitions, increase the 2.
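    • Deleting the , in {2,} changes the quantifier to exactly 2 more repetitions. For a yes/no check with preg_match this gives the same result, because any text with more repetitions also contains two of them.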
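
    To make these tweaks easier to experiment with, here's a rough sketch that builds the pattern from parameters. The function name hasRepeatedBlock and its parameter names are invented for this example, and adding the s modifier (so that repeats spanning several lines are caught) is an extra assumption on top of the regex above.

```php
<?php
// Sketch: the regex above wrapped in a function with tunable limits.
// hasRepeatedBlock() and its parameter names are hypothetical.
function hasRepeatedBlock(string $text, int $minLength = 10, int $minExtraRepeats = 2): bool
{
    // The "s" modifier lets "." match newlines as well
    $pattern = sprintf('/(.{%d,})\1{%d,}/s', $minLength, $minExtraRepeats);

    // preg_match() returns 1 on a match, 0 on no match, and false on error
    return preg_match($pattern, $text) === 1;
}

var_dump(hasRepeatedBlock(str_repeat('love and peace ', 3)));            // bool(true)
var_dump(hasRepeatedBlock('I like hiking, cooking and reading books.')); // bool(false)
```

    Because the repeats must follow each other directly, the separator between copies has to be part of the captured block (here the trailing space). On long inputs a pattern like this can backtrack heavily and preg_match() may return false once it hits the PCRE limits; this sketch treats that case as "no match".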