Calculating the cosine similarity between all the rows of a dataframe in pyspark
Question

I have a dataset containing workers with their demographic information, such as age, gender, and address, along with their work locations. I created an RDD from the dataset and converted it into a DataFrame. There are multiple entries for each ID, so I created a DataFrame that contains only the ID of each worker and the various office locations where he/she has worked.

|----------|----------------|
| **ID**   | **Office_Loc** |
|----------|----------------|
| 1        | Delhi, Mumbai, |
|          | Gandhinagar    |
|-------