short text clustering with large dataset - user profiling

Posted by 我与影子孤独终老i on 2019-12-13 18:09:56

Question


Let me explain what I want to do:

Input

A CSV file with millions of rows, each containing a user ID and a string of the keywords used by that user, separated by spaces. The format of the second field is not important; I can change it as needed, for example by adding the counts of those keywords. The data comes from the Twitter database: the users are Twitter users and the keywords are "meaningful" words taken from their tweets (how they are extracted is not important).

SAMPLE ROW

This is what a single row of the CSV currently looks like:
(user id, keywords)

"1627498372", " play house business card"  

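For illustration, here is a minimal sketch of how a row in this format could be turned into a user ID and a set of keywords. The function name `read_users` is hypothetical, and the sketch assumes every row has exactly two quoted fields as in the sample above. Note the space after the comma in the sample: Python's `csv` module needs `skipinitialspace=True` to treat the second field as quoted.

```python
import csv

def read_users(lines):
    """Yield (user_id, keywords) pairs from rows like the sample above.

    `lines` is any iterable of text lines, e.g. an open file object,
    so a file with millions of rows can be streamed without loading it whole.
    """
    # skipinitialspace=True handles the blank between the comma and the opening quote.
    for row in csv.reader(lines, skipinitialspace=True):
        if len(row) != 2:
            continue  # assumption: silently skip malformed rows
        user_id, keyword_string = row
        # split() without arguments also drops leading/trailing/repeated spaces.
        yield user_id, frozenset(keyword_string.split())

sample = ['"1627498372", " play house business card"']
for user_id, keywords in read_users(sample):
    print(user_id, sorted(keywords))
```

A set discards how often a user used each keyword; if counts are added to the second field later, a `collections.Counter` would be the natural replacement.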
Goal

Given this input, I want to cluster users in Java based on the keywords they use, so that each cluster represents users with similar interests and therefore similar keyword usage. I want to do this without machine learning techniques, natural language processing, or parallelization techniques like MapReduce. I have searched for clustering algorithms and libraries such as BIRCH, BFR, CURE, ROCK, and CLARANS, but none of them seems to suit my needs: they are either designed for spatial points, rely on machine learning models, or struggle with large datasets.

So I am asking whether you know of clustering algorithms, libraries (preferably JARs), or reasonably implementable pseudocode that work on text or can easily be modified to work with strings.

Hope everything is clear.

UPDATE

While waiting for responses I came across the scikit-learn library for Python, in particular MiniBatchKMeans, and I am experimenting with it for now. So, as an update: if you find something in Python, feel free to share.
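As a rough sketch of what such an experiment might look like (not necessarily what was actually tried): scikit-learn's `MiniBatchKMeans` can be trained incrementally with `partial_fit`, and `HashingVectorizer` turns keyword strings into sparse vectors without first building a vocabulary, so the full dataset never has to sit in memory. The helper name `cluster_keyword_strings`, the batch contents, and the parameter values below are assumptions chosen for illustration only.

```python
from sklearn.cluster import MiniBatchKMeans
from sklearn.feature_extraction.text import HashingVectorizer

def cluster_keyword_strings(batches, n_clusters, n_features=2**12, seed=0):
    """Fit MiniBatchKMeans incrementally on batches of keyword strings.

    Each batch is a list of space-separated keyword strings and must hold
    at least n_clusters rows (needed to initialise the centroids).
    """
    # Stateless hashing: no vocabulary pass over the data is required.
    vectorizer = HashingVectorizer(n_features=n_features, alternate_sign=False)
    model = MiniBatchKMeans(n_clusters=n_clusters, random_state=seed, n_init=1)
    for batch in batches:
        model.partial_fit(vectorizer.transform(batch))
    return vectorizer, model

# Toy data standing in for chunks read from the CSV file.
batches = [
    ["play game card", "business house card", "play game"],
    ["business house", "play card game", "house business card"],
]
vectorizer, model = cluster_keyword_strings(batches, n_clusters=2)
labels = model.predict(vectorizer.transform(["play game", "business house"]))
print(labels)
```

The number of clusters has to be chosen up front, and every user is forced into one of them, which is worth keeping in mind when judging the results.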


Answer 1:


Instead of clustering (how many clusters? What about users that do not fit any cluster?), consider frequent itemset mining to find popular combinations of keywords.
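To make the idea concrete, here is a minimal sketch of the first two passes of an Apriori-style search, restricted to keyword pairs. It treats each user's keyword set as one "basket" and counts how many users share each pair. The function name `frequent_pairs` and the `min_support` threshold are illustrative assumptions; a full itemset miner (Apriori, FP-Growth) would extend this to larger combinations.

```python
from collections import Counter
from itertools import combinations

def frequent_pairs(baskets, min_support):
    """Return {(kw_a, kw_b): count} for keyword pairs shared by >= min_support users.

    `baskets` is a sequence of keyword collections, one per user. It is
    iterated twice; for a huge file, re-read the file for the second pass.
    """
    # Pass 1: count in how many baskets each single keyword appears.
    singles = Counter()
    for keywords in baskets:
        singles.update(set(keywords))
    frequent = {kw for kw, count in singles.items() if count >= min_support}

    # Pass 2: count only pairs whose members are both frequent
    # (a pair cannot be frequent if one of its keywords is not).
    pairs = Counter()
    for keywords in baskets:
        kept = sorted(set(keywords) & frequent)
        pairs.update(combinations(kept, 2))
    return {pair: count for pair, count in pairs.items() if count >= min_support}

users = [
    {"play", "house", "business", "card"},
    {"play", "card", "game"},
    {"business", "card", "house"},
    {"play", "game"},
]
for pair, count in sorted(frequent_pairs(users, min_support=2).items()):
    print(pair, count)
```

Unlike clustering, this needs no cluster count, and users who match no popular combination are simply left unassigned.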



Source: https://stackoverflow.com/questions/52115697/short-text-clustering-with-large-dataset-user-profiling
