How to get item id from cosine similarity matrix?

自作多情 posted on 2019-12-06 12:47:40

Add a row index before converting the DataFrame to a matrix, and build a map from that index to the item id. The transpose turns each row index into a column index, so after the computation you can use the map to convert the column indices back to ids.

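// Pair every (id, vector) row with a Long index; .as[...] needs the session's implicits in scope
val rdd = myDataframe.as[(String, org.apache.spark.mllib.linalg.Vector)].rdd.zipWithIndex()
// Driver-side lookup table: row index -> item id
val indexMap = rdd.map{case ((id, _), index) => (index, id)}.collectAsMap()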

Calculate the cosine similarities as before, building the matrix from the indexed rows:

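import org.apache.spark.mllib.linalg.distributed.{IndexedRow, IndexedRowMatrix}
// Transpose so each item becomes a column; columnSimilarities() returns an upper triangular CoordinateMatrix
val irm = new IndexedRowMatrix(rdd.map{case ((_, vec), index) => IndexedRow(index, vec)})
  .toCoordinateMatrix().transpose().toRowMatrix().columnSimilarities()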
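
The transpose matters because columnSimilarities() compares columns, so each item vector has to become a column first. As a rough illustration only, here is a Spark-free sketch on local collections. The vectors are made up, and the object `TransposeSketch` and its helpers `column` and `cosine` are hypothetical names, not part of Spark:

```scala
object TransposeSketch {
  // Made-up item vectors, one row per item (row index = position in the list).
  val rows: List[List[Double]] = List(
    List(1.0, 0.0), // item 0
    List(2.0, 0.0), // item 1
    List(0.0, 3.0)  // item 2
  )

  // transposed(f)(k) is feature f of item k, so column k is item k's vector.
  val transposed: List[List[Double]] = rows.transpose
  def column(k: Int): List[Double] = transposed.map(_(k))

  // Plain cosine similarity: dot product divided by the product of the norms.
  def cosine(a: List[Double], b: List[Double]): Double = {
    val dot = a.zip(b).map { case (x, y) => x * y }.sum
    def norm(v: List[Double]): Double = math.sqrt(v.map(x => x * x).sum)
    dot / (norm(a) * norm(b))
  }

  // Pairs with i < j only, matching the upper triangular shape of the result.
  val sims: List[(Int, Int, Double)] =
    for {
      i <- rows.indices.toList
      j <- rows.indices.toList
      if i < j
    } yield (i, j, cosine(column(i), column(j)))
}
```

In this toy data, items 0 and 1 point in the same direction, so `sims` pairs them with similarity 1.0. Column index k after the transpose is exactly the row index k assigned beforehand, which is why the index-to-id map still applies.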

Convert column indices back to the ids:

irm.entries.map(e => (indexMap(e.i), indexMap(e.j), e.value)) 

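The result is an RDD of (id, id, similarity) triples. Because columnSimilarities() returns an upper triangular matrix, each pair of items appears at most once.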
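
Putting the steps together, a minimal sketch might look like the following. It assumes Spark with the RDD-based `spark.mllib` API on the classpath. The object `ItemSimilarity` and the helper `similarItems` are hypothetical names. Caching the indexed RDD and broadcasting the map are optional choices made here, not part of the steps above:

```scala
import org.apache.spark.mllib.linalg.{Vector, Vectors}
import org.apache.spark.mllib.linalg.distributed.{IndexedRow, IndexedRowMatrix}
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession

object ItemSimilarity {
  // Hypothetical helper: (id, vector) pairs in, (id, id, cosine similarity) out.
  def similarItems(items: RDD[(String, Vector)]): RDD[(String, String, Double)] = {
    // Cache so both uses below see the same index assignment without recomputing.
    val indexed = items.zipWithIndex().cache()

    // Row index -> item id, collected to the driver and broadcast to executors.
    val indexMap = indexed.map { case ((id, _), index) => (index, id) }.collectAsMap()
    val lookup = items.sparkContext.broadcast(indexMap)

    // Transpose so every item becomes a column, then compare columns.
    val sims = new IndexedRowMatrix(
      indexed.map { case ((_, vec), index) => IndexedRow(index, vec) }
    ).toCoordinateMatrix().transpose().toRowMatrix().columnSimilarities()

    // Column indices are the original row indices, so the map applies directly.
    sims.entries.map(e => (lookup.value(e.i), lookup.value(e.j), e.value))
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]").appName("item-similarity").getOrCreate()

    // Made-up sample data standing in for the rows of myDataframe.
    val items = spark.sparkContext.parallelize(Seq(
      ("item-a", Vectors.dense(1.0, 0.0)),
      ("item-b", Vectors.dense(2.0, 1.0)),
      ("item-c", Vectors.dense(0.0, 3.0))
    ))

    similarItems(items).collect().foreach(println)
    spark.stop()
  }
}
```

Collecting the map to the driver is only practical while the number of items fits in memory. For a very large id set, joining the entries against the indexed RDD would be an alternative.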
