Question
The main goal of this algorithm is to find similar news article titles from different web sources and group them, say those with a similarity of about 55.5% or more.
My current approach consists of the following steps:
- Load the data from the MySQL database into a two-dimensional array, e.g. $arrayOne.
- Copy that array into a second one, e.g. $arrayTwo.
- Create an empty array that will hold only the similar titles and their related fields, e.g. $array_smlr.
- Loop over $arrayOne and, for each article_title, check its similarity against every article_title in $arrayTwo.
- If the similarity between the two titles is at least 55.5% and the articles are not from the same news source (so that I don't compare articles from the same source), add the row to $array_smlr.
- Sort $array_smlr by similarity percentage, so that similar titles end up grouped together.
Below is my code for the steps above.
$result = mysqli_query($conn, "SELECT id_articles, article_img, article_title, LEFT(article_content, 200), psource, date_fetched FROM project.articles WHERE " . rtrim($values, ' or') . " ORDER BY date_fetched DESC LIMIT 70");
$arrayOne = array();
$arrayTwo = array();
while ($row = mysqli_fetch_assoc($result)) {
    $arrayOne[] = $row;
}
$arrayTwo = $arrayOne;
$array_smlr = array();
foreach ($arrayOne as $rowOne) {
    foreach ($arrayTwo as $rowTwo) {
        $compare = similar_text($rowOne['article_title'], $rowTwo['article_title'], $p);
        if (round($p, 2) >= 55.50 and $rowOne['psource'] != $rowTwo['psource']) {
            $data = array('percentage' => round($p, 2), 'article_title' => $rowTwo['article_title'], 'psource' => $rowTwo['psource'], 'id_articles' => $rowTwo['id_articles'], 'date_fetched' => $rowTwo['date_fetched']);
            $array_smlr[] = $data;
        }
    }
}
array_multisort($array_smlr);
foreach ($array_smlr as $row3) {
    echo $row3['percentage'] . $row3['article_title'] . $row3['psource'] . $row3['id_articles'] . $row3['date_fetched'] . "<br><br>";
}
This works only in a limited way: it is fine when exactly two titles are similar, but with three similar titles, for example, $array_smlr ends up containing duplicated rows.
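To make the duplication concrete, here is a minimal sketch with three made-up titles from three made-up sources (the sample rows are an assumption for illustration, not data from the real table). Because every ordered pair is tested, each title is matched by both of the others and is therefore added twice:

```php
<?php
// Made-up sample rows (assumption, for illustration only).
$arrayOne = [
    ['id_articles' => 1, 'article_title' => 'Storm hits coast',       'psource' => 'A'],
    ['id_articles' => 2, 'article_title' => 'Storm hits coast today', 'psource' => 'B'],
    ['id_articles' => 3, 'article_title' => 'Storm hits coast hard',  'psource' => 'C'],
];
$arrayTwo = $arrayOne;
$array_smlr = [];

// Same nested loop as above, but only the matched id is recorded.
foreach ($arrayOne as $rowOne) {
    foreach ($arrayTwo as $rowTwo) {
        similar_text($rowOne['article_title'], $rowTwo['article_title'], $p);
        if (round($p, 2) >= 55.50 && $rowOne['psource'] != $rowTwo['psource']) {
            $array_smlr[] = $rowTwo['id_articles'];
        }
    }
}

sort($array_smlr);
echo implode(', ', $array_smlr); // prints: 1, 1, 2, 2, 3, 3
```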
I would appreciate any suggestions on how to fix this and how to optimize the algorithm to improve its performance.
Thanks,
Answer 1:
You don't really need two arrays. Instead of a foreach loop without a key variable, iterate over the same array twice using $key => $value and skip the comparison when the two keys are the same. Storing each match under its id_articles key then also avoids duplicates.
$array_smlr = array();
foreach ($arrayOne as $key => $rowOne) {
    foreach ($arrayOne as $ikey => $rowTwo) {
        // Skip comparing an article with itself.
        if ($ikey != $key) {
            $compare = similar_text($rowOne['article_title'], $rowTwo['article_title'], $p);
            if (round($p, 2) >= 55.50 and $rowOne['psource'] != $rowTwo['psource']) {
                $data = array('percentage' => round($p, 2), 'article_title' => $rowTwo['article_title'], 'psource' => $rowTwo['psource'], 'id_articles' => $rowTwo['id_articles'], 'date_fetched' => $rowTwo['date_fetched']);
                // Keyed by article id, so a later match overwrites the earlier row instead of duplicating it.
                $array_smlr[$rowTwo['id_articles']] = $data;
            }
        }
    }
}
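Building on the same idea, another option is to test each unordered pair only once, by starting the inner loop just after the current index, and to collect matches into groups rather than flat rows. The following is a minimal sketch, not a tested drop-in replacement: the function name groupSimilarTitles, the sample rows, and the greedy grouping rule (an article joins the group of the first earlier article it matches) are assumptions for illustration. Only similar_text, the 55.50 threshold and the column names come from the question.

```php
<?php
// Hypothetical helper (not from the original post): groups rows whose titles
// are similar and whose sources differ, testing each pair only once.
function groupSimilarTitles(array $articles, float $threshold = 55.50): array
{
    $articles = array_values($articles); // make sure keys are 0..n-1
    $count = count($articles);
    $groupOf = []; // article index => group number
    $groups = [];  // group number => list of rows

    for ($i = 0; $i < $count; $i++) {
        for ($j = $i + 1; $j < $count; $j++) {
            // Already placed in a group, or same source: nothing to do.
            if (isset($groupOf[$j]) || $articles[$i]['psource'] === $articles[$j]['psource']) {
                continue;
            }
            similar_text($articles[$i]['article_title'], $articles[$j]['article_title'], $p);
            if (round($p, 2) < $threshold) {
                continue;
            }
            // Open a new group for article $i if it does not have one yet.
            if (!isset($groupOf[$i])) {
                $groupOf[$i] = count($groups);
                $groups[] = [$articles[$i]];
            }
            $groupOf[$j] = $groupOf[$i];
            $groups[$groupOf[$i]][] = $articles[$j] + ['percentage' => round($p, 2)];
        }
    }
    return $groups;
}

// Made-up sample rows (assumption, for illustration only).
$sample = [
    ['id_articles' => 1, 'article_title' => 'Storm hits coast',           'psource' => 'A'],
    ['id_articles' => 2, 'article_title' => 'Storm hits coast today',     'psource' => 'B'],
    ['id_articles' => 3, 'article_title' => 'Storm hits coast hard',      'psource' => 'C'],
    ['id_articles' => 4, 'article_title' => 'Election results announced', 'psource' => 'A'],
];
$groups = groupSimilarTitles($sample);
foreach ($groups as $n => $rows) {
    echo "Group $n: " . implode(', ', array_column($rows, 'id_articles')) . "<br>";
}
```

With these sample rows, articles 1, 2 and 3 land in a single group and article 4 stays ungrouped. Note that the greedy rule can chain matches (A matches B, B matches C), so two members of a group are not guaranteed to exceed the threshold against each other, and a group may contain two articles from the same source.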
Source: https://stackoverflow.com/questions/30312819/similarity-algorithm-advice-using-two-dimensional-associative-array