Similarity algorithm advice, using a two-dimensional associative array

Question

The main goal of this algorithm is to find similar news article titles from different web sources and group them, say at 55.5% similarity or above.

My current approach consists of the following steps:

1. Feed data from the MySQL database into a two-dimensional array, e.g. $arrayOne.
2. Make a copy of that array, e.g. $arrayTwo.
3. Create an empty array that will hold only the similar titles and their related fields, e.g. $array_smlr.
4. Loop: for each article_title in $arrayOne, check its similarity against every article_title in $arrayTwo.
5. If the similarity between two titles is at least 55.5% and the articles are not from the same news source (this way I don't compare articles from the same source), add the match to $array_smlr.
6. Sort $array_smlr by similarity percentage, so that similar titles end up grouped together.

Below is my code for the steps above.

$result = mysqli_query($conn,"SELECT id_articles,article_img,article_title,LEFT(article_content , 200),psource, date_fetched FROM project.articles WHERE " . rtrim($values,' or') . " ORDER BY date_fetched DESC LIMIT 70");

$arrayOne=array();
$arrayTwo=array();

while($row = mysqli_fetch_assoc($result)){
    $arrayOne[] = $row;
}
$arrayTwo = $arrayOne;
$array_smlr=array();
foreach ($arrayOne as $rowOne) {
    foreach($arrayTwo as $rowTwo){
        // similar_text() returns the number of matching characters and writes the percentage into $p
        $compare = similar_text($rowOne['article_title'], $rowTwo['article_title'], $p);
        // keep matches at or above the threshold that come from different sources
        if ( round($p,2) >= 55.50 and $rowOne['psource'] != $rowTwo['psource'] ){
            $data =  array('percentage' => round($p,2), 'article_title' => $rowTwo['article_title'], 'psource' => $rowTwo['psource'], 'id_articles' => $rowTwo['id_articles'], 'date_fetched' =>$rowTwo['date_fetched']);
            $array_smlr[]=$data;
        }
    }
}
array_multisort($array_smlr);
foreach($array_smlr as $row3){
    echo $row3['percentage'] . $row3['article_title'] . $row3['psource'] . $row3['id_articles'] . $row3['date_fetched'] . "<br><br>";
}

This works only in a limited way: it is fine when exactly two titles are similar, but if, say, three titles are similar, $array_smlr ends up containing duplicate rows.
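
To make the duplication concrete, here is a minimal reproduction. The three titles and the sources A, B and C are made-up sample data, and only the article id is stored so the effect is easy to count; the loop and the threshold are the same as in the code above.

```php
<?php
// Hypothetical sample rows; only the columns the loop reads are included.
$articles = array(
    array('id_articles' => 1, 'article_title' => 'Storm hits the coast',       'psource' => 'A'),
    array('id_articles' => 2, 'article_title' => 'Storm hits the coast today', 'psource' => 'B'),
    array('id_articles' => 3, 'article_title' => 'Storm hits the coast hard',  'psource' => 'C'),
);

$array_smlr = array();
foreach ($articles as $rowOne) {
    foreach ($articles as $rowTwo) {
        similar_text($rowOne['article_title'], $rowTwo['article_title'], $p);
        if (round($p, 2) >= 55.50 and $rowOne['psource'] != $rowTwo['psource']) {
            // store only the id of the matched article
            $array_smlr[] = $rowTwo['id_articles'];
        }
    }
}

// Every article is appended once per matching partner,
// so three mutually similar articles produce six rows.
echo count($array_smlr) . " rows, " . count(array_unique($array_smlr)) . " distinct articles\n";
```

With these sample rows each article matches the other two, so it is appended twice.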

I would appreciate any suggestions for optimizing this algorithm to improve its performance.

Thanks,

Answer 1:

You don't really need two arrays. Instead of a foreach loop without a key, loop over the same array twice using the $key => $value form and skip the comparison when both keys are the same. If you also store each match under its id_articles, you avoid duplicates.

$array_smlr = array();
foreach ($arrayOne as $key => $rowOne) {
    foreach ($arrayOne as $ikey => $rowTwo) {
        // skip comparing an article with itself
        if ($ikey != $key) {
            $compare = similar_text($rowOne['article_title'], $rowTwo['article_title'], $p);
            if ( round($p,2) >= 55.50 and $rowOne['psource'] != $rowTwo['psource'] ){
                $data =  array('percentage' => round($p,2), 'article_title' => $rowTwo['article_title'], 'psource' => $rowTwo['psource'], 'id_articles' => $rowTwo['id_articles'], 'date_fetched' =>$rowTwo['date_fetched']);
                // keyed by article id: each article is stored once, a later match overwrites an earlier one
                $array_smlr[$rowTwo['id_articles']] = $data;
            }
        }
    }
}
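
Going one step further toward the stated goal of grouping, here is a rough sketch of one possible approach. It is not part of the answer above: the function name groupSimilarTitles(), its $threshold parameter and the sample rows are all hypothetical. It compares each unordered pair once, which cuts the similar_text() calls roughly in half, and merges matching articles into groups with a small union-find. It assumes PHP 7 or later and that "similar" may be treated as transitive.

```php
<?php
// Sketch only: groupSimilarTitles() and $threshold are hypothetical names.
function groupSimilarTitles(array $articles, float $threshold = 55.5): array
{
    $articles = array_values($articles);   // re-index to 0..n-1
    $n = count($articles);
    $parent = array_keys($articles);       // every article starts as its own group

    // Find the representative of the group that article $i belongs to.
    $find = function (int $i) use (&$parent): int {
        while ($parent[$i] !== $i) {
            $parent[$i] = $parent[$parent[$i]];   // path halving
            $i = $parent[$i];
        }
        return $i;
    };

    for ($i = 0; $i < $n; $i++) {
        for ($j = $i + 1; $j < $n; $j++) {        // each unordered pair once
            if ($articles[$i]['psource'] === $articles[$j]['psource']) {
                continue;                          // same source: skip, as in the question
            }
            similar_text($articles[$i]['article_title'], $articles[$j]['article_title'], $p);
            if (round($p, 2) >= $threshold) {
                $rootI = $find($i);
                $rootJ = $find($j);
                $parent[$rootI] = $rootJ;          // merge the two groups
            }
        }
    }

    // Collect the articles of each group under its representative.
    $groups = array();
    foreach ($articles as $i => $row) {
        $groups[$find($i)][] = $row;
    }

    // Keep only groups that contain at least two articles.
    return array_values(array_filter($groups, function ($group) {
        return count($group) > 1;
    }));
}

// Usage with made-up rows: three similar titles and one unrelated title.
$articles = array(
    array('id_articles' => 1, 'article_title' => 'Storm hits the coast',             'psource' => 'A'),
    array('id_articles' => 2, 'article_title' => 'Storm hits the coast today',       'psource' => 'B'),
    array('id_articles' => 3, 'article_title' => 'Storm hits the coast hard',        'psource' => 'C'),
    array('id_articles' => 4, 'article_title' => 'Local bakery wins national award', 'psource' => 'D'),
);
$groups = groupSimilarTitles($articles);
echo count($groups) . " group(s)\n";
```

Two caveats on this sketch: similar_text() can return a different value when its arguments are swapped, so comparing each pair only once is a simplification, and merging transitively means two titles can land in the same group without being directly similar to each other.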

Source: https://stackoverflow.com/questions/30312819/similarity-algorithm-advice-using-two-dimensional-associative-array

Tags: php, arrays, algorithm, similarity