PHP Detect Duplicate Text

夕颜 2021-02-05 07:52

I have a site where users can put in a description about themselves.

Most users write something appropriate but some just copy/paste the same text a number of times (to

9 answers
  •  梦毁少年i
    2021-02-05 08:35

    You have a tricky problem on your hands, primarily because your requirements are somewhat unclear.

    You indicate you want to disallow repeated text, because it's "bad".

    Consider someone who puts the last stanza of Robert Frost's "Stopping by Woods on a Snowy Evening" in their profile:

    These woods are lovely, dark and deep
    but I have promises to keep
    and miles to go before I sleep
    and miles to go before I sleep
    

    You might consider this good, but it does contain a repetition. So what's good, and what's bad? (Note that this is not an implementation problem just yet; you're just looking for a way to define "bad repetitions".)

    Directly detecting duplicates thus proves tricky. So let's devolve to tricks.

    Compression works by finding redundant data and encoding it in less space, so a very repetitive text compresses very well. One trick is to take the text, gzip it, and look at the compression ratio (compressed size divided by original size). Then tweak the allowed ratio to something you find acceptable.

    Implementation:

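    $THRESHOLD = 0.6;  // minimum acceptable ratio; 0.6 is the starting point suggested below, tune as needed
    $bio = '...';      // placeholder: the user's description text (must be non-empty)
    $zippedbio = gzencode($bio);  // gzip-compress the text (requires the zlib extension)
    // compressed size / original size: the lower the ratio, the more repetitive the text
    $compression_ratio = strlen($zippedbio) / strlen($bio);
    if ($compression_ratio >= $THRESHOLD) {
      // ok: the text does not compress unusually well
    } else {
      // not ok: the text is too repetitive
    }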
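
    The same check can be wrapped in a reusable function. This is only a minimal sketch: the names compressionRatio and isTooRepetitive are hypothetical, the 0.6 default is just the starting value discussed in this answer, and treating an empty description as "not repetitive" is an assumption made here to avoid a division by zero.

```php
<?php
// Hypothetical helper names; only gzencode() and strlen() are PHP built-ins.
// gzencode() requires the zlib extension.

/**
 * Ratio of gzip-compressed size to original size.
 * Lower values mean the text is more repetitive.
 */
function compressionRatio(string $text): float
{
    if ($text === '') {
        // Assumption: an empty bio is not "repetitive"; this also avoids dividing by zero.
        return 1.0;
    }
    return strlen(gzencode($text)) / strlen($text);
}

/** True when the text compresses better than the threshold allows. */
function isTooRepetitive(string $text, float $threshold = 0.6): bool
{
    return compressionRatio($text) < $threshold;
}
```

    A form handler could then call isTooRepetitive($bio) and ask the user to rewrite the description when it returns true.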
    

    A couple of experimental results from examples found in this question/answers:

    • "Love a and peace love a and peace love a and peace love a and peace love a and peace love a and peace": 0.3960396039604
    • "These woods are lovely, dark and deep but I have promises to keep and miles to go before I sleep and miles to go before I sleep": 0.78461538461538
    • "aaabasdfwasfsdtasaaabasdfwasfsdtasaaabasdfwasfsdtasaaabasdfwasfsdtas": 0.58823529411765

    These results suggest a threshold of around 0.6: text whose ratio falls below that would be rejected as too repetitive.
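
    To repeat this kind of measurement on your own data, something like the sketch below should work. The sample strings are copied from the results listed above; the variable names are illustrative, and the exact ratios may differ slightly depending on the zlib version and compression level.

```php
<?php
// Sample strings copied from the results listed above.
$samples = [
    'Love a and peace love a and peace love a and peace love a and peace love a and peace love a and peace',
    'These woods are lovely, dark and deep but I have promises to keep and miles to go before I sleep and miles to go before I sleep',
    'aaabasdfwasfsdtasaaabasdfwasfsdtasaaabasdfwasfsdtasaaabasdfwasfsdtas',
];

$threshold = 0.6; // suggested starting point; tune it against real profiles

$ratios = [];
foreach ($samples as $text) {
    // compressed size / original size: lower means more repetitive
    $ratio = strlen(gzencode($text)) / strlen($text);
    $ratios[] = $ratio;

    $verdict = $ratio < $threshold ? 'too repetitive' : 'ok';
    // Print the ratio, the verdict, and the first 30 characters of the sample.
    printf("%.3f  %-14s  %s...\n", $ratio, $verdict, substr($text, 0, 30));
}
```

    Note that the third sample sits very close to 0.6 in the figures above, so a small change in the threshold flips its verdict. Running the loop over a batch of genuine descriptions is probably the easiest way to pick a value you are comfortable with.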
