Question
Currently I'm studying data mining and text comparison, and I found this: https://en.wikipedia.org/wiki/Cosine_similarity.
Since I have successfully implemented this algorithm to compare two strings, I decided to try a more complex task. I iterated over my DB, which contains about 250k documents, and compared one random document against every document in that DB.
Comparing all these items took 316.35898590088 sec, that is, more than 5 minutes to compare one document against all 250k documents!
These results raised many questions, and I want to ask for some suggestions. For clarity, I'll first describe some details which might be useful.
- The programming language chosen is PHP.
- Documents are stored in MySQL.
- The cosine similarity implementation is just that single function; there is no stop-word removal or any other fancy processing (a minimal sketch of such a function follows this list).
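
To make the setup concrete, here is a minimal sketch of what such a bare-bones implementation might look like, assuming whitespace/punctuation tokenization and raw term-frequency vectors (no stop words, no stemming); the function names are hypothetical, not the asker's actual code:

```php
<?php
// Minimal sketch of a bare-bones cosine similarity between two strings,
// assuming simple tokenization and raw term-frequency vectors.

function termFrequencies(string $text): array {
    $tokens = preg_split('/\W+/u', mb_strtolower($text), -1, PREG_SPLIT_NO_EMPTY);
    $tf = [];
    foreach ($tokens as $t) {
        $tf[$t] = ($tf[$t] ?? 0) + 1;
    }
    return $tf;
}

function cosineSimilarity(string $a, string $b): float {
    $va = termFrequencies($a);
    $vb = termFrequencies($b);
    // Only terms shared by both vectors contribute to the dot product.
    $dot = 0.0;
    foreach ($va as $term => $count) {
        if (isset($vb[$term])) {
            $dot += $count * $vb[$term];
        }
    }
    $normA = sqrt(array_sum(array_map(fn($c) => $c * $c, $va)));
    $normB = sqrt(array_sum(array_map(fn($c) => $c * $c, $vb)));
    return ($normA > 0 && $normB > 0) ? $dot / ($normA * $normB) : 0.0;
}

echo cosineSimilarity("the quick brown fox", "the lazy brown dog"); // ~0.5
```

Note that calling something like this in a loop re-tokenizes both documents and recomputes both norms on every comparison, which is exactly the kind of repeated work the questions below are about.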
Questions
- Is there any way to achieve better performance? Where should I start: by tuning the algorithm (e.g. preparing the vectors in advance; see the sketch after this list), by using other technologies, etc.?
- How and where should I store these comparison results? For example, I want to plot some graphs where I can see all 250k documents by similarity score, so that I can identify which are the most similar, and so on.
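
On the first question, one common tuning step (a hypothetical sketch, not anything taken from the answer below) is to precompute an L2-normalized term vector per document once, so that each of the 250k comparisons reduces to a sparse dot product, with no re-tokenization and no norm computation inside the loop:

```php
<?php
// Hypothetical sketch: precompute a unit-length (L2-normalized) term vector
// per document once, so each comparison is just a sparse dot product --
// no re-tokenization and no sqrt() inside the 250k-iteration loop.

function normalizedVector(string $text): array {
    $tokens = preg_split('/\W+/u', mb_strtolower($text), -1, PREG_SPLIT_NO_EMPTY);
    $tf = [];
    foreach ($tokens as $t) {
        $tf[$t] = ($tf[$t] ?? 0) + 1;
    }
    $norm = sqrt(array_sum(array_map(fn($c) => $c * $c, $tf)));
    if ($norm == 0.0) {
        return [];
    }
    foreach ($tf as $t => $c) {
        $tf[$t] = $c / $norm;
    }
    return $tf;
}

// Stand-in for rows loaded from MySQL (id => text).
$documents = [1 => "the quick brown fox", 2 => "the lazy brown dog"];

// Precompute once and cache (e.g. as a serialized column next to each row).
$vectors = [];
foreach ($documents as $id => $text) {
    $vectors[$id] = normalizedVector($text);
}

// With unit vectors, cosine similarity IS the dot product.
function dotProduct(array $va, array $vb): float {
    if (count($vb) < count($va)) {
        [$va, $vb] = [$vb, $va];   // iterate over the smaller vector
    }
    $dot = 0.0;
    foreach ($va as $term => $w) {
        if (isset($vb[$term])) {
            $dot += $w * $vb[$term];
        }
    }
    return $dot;
}
```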
Answer 1:
Both PHP and MySQL are about the worst choices you could have made.
Efficient cosine similarity is at the heart of Lucene. The key acceleration technique is compressed inverted indexes. But you really don't want to reimplement them in PHP...
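
To make the inverted-index idea concrete, here is a rough sketch of the principle (an illustration only; Lucene's actual implementation, with compression, skip lists, and so on, is far more involved). It scores only documents that share at least one term with the query, rather than touching all 250k. It reuses the `$vectors` map and `normalizedVector()` from the earlier sketch:

```php
<?php
// Rough sketch of the inverted-index principle (not Lucene's actual design):
// map each term to the documents containing it, so a query only touches
// documents sharing at least one term with it, instead of all 250k.

// Build: term => [docId => weight], from precomputed normalized vectors.
$index = [];
foreach ($vectors as $docId => $vec) {
    foreach ($vec as $term => $weight) {
        $index[$term][$docId] = $weight;
    }
}

// Query: accumulate partial dot products per candidate document, then rank.
function topMatches(array $queryVec, array $index, int $k = 10): array {
    $scores = [];
    foreach ($queryVec as $term => $qw) {
        foreach ($index[$term] ?? [] as $docId => $dw) {
            $scores[$docId] = ($scores[$docId] ?? 0.0) + $qw * $dw;
        }
    }
    arsort($scores);                          // highest cosine first
    return array_slice($scores, 0, $k, true); // keep docId keys
}

print_r(topMatches(normalizedVector("brown fox"), $index));
```

Even so, the answer's advice stands: in practice you would hand this to Lucene rather than maintain such an index in PHP.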
Source: https://stackoverflow.com/questions/31368527/cosines-similarity-on-large-data-sets