Calculating similarity measure between millions of documents


Question


I have millions of documents (close to 100 million). Each document has fields such as skills, hobbies, certification, and education. I want to compute a similarity score between every pair of documents.

Below is an example of the data.

skills  hobbies        certification    education
Java    fishing        PMP              MS
Python  reading novel  SCM              BS
C#      video game     PMP              B.Tech.
C++     fishing        PMP              MS

So what I want is the similarity between the first row and all other rows, the similarity between the second row and all other rows, and so on. Every document should be compared against every other document to get the similarity scores.

The purpose is that I query my database to get people based on skills. In addition to that, I now want people who do not have the skills but are somewhat close to the people with the specific skills. For example, if I want the people who have Java skills, the first row will appear, and the last row will appear as well, since it is close to the first row based on the similarity score.

Challenge: My primary challenge is to compute a similarity score for each document against every other document, as in the pseudocode below. How can I do this faster? Is there a different way to write this pseudocode, or some other computational (hardware/algorithmic) approach that would make it faster?

documents = all_documents_in_db   # every document fetched from the database
for i in documents:
    for j in documents:
        if i != j:
            compute_similarity(i, j)

Answer 1:


One way to speed this up is to make sure you don't calculate the similarity both ways. Your current pseudocode compares i to j and also j to i. Instead of iterating j over the whole document set, iterate over document[i+1:], i.e. only the entries after i. This cuts your calls to compute_similarity in half.

The most suitable data structure for this kind of comparison is an adjacency matrix: an n * n matrix (where n is the number of members in your data set) in which matrix[i][j] is the similarity between members i and j. You can still populate this matrix fully while only half-iterating over j, by assigning matrix[i][j] and matrix[j][i] simultaneously from a single call to compute_similarity.
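As a rough illustration (not part of the original answer), here is a minimal Python sketch of that half-iteration filling a symmetric matrix. It assumes documents is an in-memory list and compute_similarity is the asker's scoring function; note that for 100 million rows the full n * n matrix would not fit in memory, so this only shows the indexing pattern.

import numpy as np

def fill_similarity_matrix(documents, compute_similarity):
    n = len(documents)
    matrix = np.zeros((n, n))              # adjacency matrix of similarity scores
    for i in range(n):
        for j in range(i + 1, n):          # only iterate over entries after i
            score = compute_similarity(documents[i], documents[j])
            matrix[i][j] = score           # one call fills both halves
            matrix[j][i] = score
    return matrix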

Beyond this, I can't think of any way to speed up this process; you will need to make at least n * (n - 1) / 2 calls to compute_similarity. Think of it like a handshake problem; if every member must be compared to ('shake hands with') every other member at least once, then the lower bound is n * (n - 1) / 2. But I welcome other input!




Answer 2:


I think what you want is some sort of clustering algorithm. Think of each row of your data as a point in a multi-dimensional space, and then look for other 'points' that are nearby. Not all dimensions of your data will produce good clusters, so you want to analyze which dimensions are significant for forming clusters, and reduce the cost of looking for similar records by mapping the data to a lower dimension.

scikit-learn has good routines for dimensionality reduction and clustering, as well as some of the best documentation for deciding which routines to apply to your data.

For actually running the analysis you might do well to purchase cloud time with AWS or Google AppEngine. I believe both can give you access to Hadoop clusters with Anaconda (which includes scikit-learn) available on the nodes. Detailed instructions on either of these topics (clustering, cloud computing) are beyond a simple answer; when you get stuck, post another question.
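The answer gives no code, but a minimal sketch of this idea with scikit-learn might look like the following. The toy DataFrame, the number of SVD components, and the number of clusters are all illustrative assumptions, not recommendations.

import pandas as pd
from sklearn.preprocessing import OneHotEncoder
from sklearn.decomposition import TruncatedSVD
from sklearn.cluster import MiniBatchKMeans

# Toy data shaped like the question's table.
df = pd.DataFrame({
    "skills":        ["Java", "Python", "C#", "C++"],
    "hobbies":       ["fishing", "reading novel", "video game", "fishing"],
    "certification": ["PMP", "SCM", "PMP", "PMP"],
    "education":     ["MS", "BS", "B.Tech.", "MS"],
})

# One-hot encode the categorical fields into a sparse matrix.
X = OneHotEncoder(handle_unknown="ignore").fit_transform(df)

# Reduce to a handful of dimensions so clustering scales to millions of rows.
X_reduced = TruncatedSVD(n_components=2, random_state=0).fit_transform(X)

# Cluster; "similar people" candidates only need to be scored against
# members of the same cluster instead of the whole data set.
labels = MiniBatchKMeans(n_clusters=2, random_state=0).fit_predict(X_reduced)
print(labels)   # rows 0 and 3 (fishing, PMP, MS) tend to land in the same cluster

At full scale you would fit the encoder and SVD on a sample and then transform and cluster the remaining rows in batches, which is what MiniBatchKMeans is designed for.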




Answer 3:


With 100 million documents, you need n * (n - 1) / 2, i.e. roughly 5 * 10^15 (five million billion), comparisons. No, you cannot do this in pure Python.

The most feasible solution (aside from using a supercomputer) is to calculate the similarity scores in C/C++.

  1. Read the whole database and enumerate each skill, hobby, certification, and education. This operation takes linear time, assuming that your index look-ups are "smart" and take constant time.
  2. Create a C/C++ struct with four numeric fields: skill, hobby, certification, and education.
  3. Run a nested loop that subtracts each struct from all other structs fieldwise and uses bit-level arithmetic to assess the similarity (a rough sketch of these steps follows below).
  4. Save the results to a file and make them available to the Python program, if necessary.
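The answer proposes doing the heavy loop in C/C++. Purely as an illustration of steps 1-3 (not the answer's actual implementation), here is a Python sketch of the enumeration and fieldwise comparison; the toy rows and the count-of-equal-fields score are my own assumptions.

from collections import defaultdict
from itertools import combinations

def make_enumerator():
    codes = defaultdict(lambda: len(codes))   # assigns 0, 1, 2, ... on first sight
    return codes

skill_ids, hobby_ids, cert_ids, edu_ids = (make_enumerator() for _ in range(4))

rows = [
    ("Java", "fishing", "PMP", "MS"),
    ("Python", "reading novel", "SCM", "BS"),
    ("C#", "video game", "PMP", "B.Tech."),
    ("C++", "fishing", "PMP", "MS"),
]

# Steps 1-2: turn every row into four small integers (the fields of the C struct).
encoded = [(skill_ids[s], hobby_ids[h], cert_ids[c], edu_ids[e])
           for s, h, c, e in rows]

# Step 3: fieldwise comparison; here "similarity" is just the number of equal fields.
for (i, a), (j, b) in combinations(enumerate(encoded), 2):
    score = sum(x == y for x, y in zip(a, b))
    print(i, j, score)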


Source: https://stackoverflow.com/questions/45340471/calculating-similarity-measure-between-millions-of-documents
