Short text clustering with large dataset - user profiling

Question

Let me explain what I want to do:

Input

A CSV file with millions of rows, each containing a user id and a string listing the keywords used by that user, separated by spaces. The format of the second field (the string) is not important; I can change it as needed, for example by adding the counts of those keywords. The data comes from the Twitter database: the users are Twitter users, and the keywords are "meaningful" words taken from their tweets (how they are extracted is not important).

SAMPLE ROW

This is currently what a single row of the csv looks like:
(user id, keywords)

"1627498372", " play house business card"

Goal

Given this input, I want to cluster users, in Java, by the keywords they use, so that each cluster groups users with similar interests (that is, similar keyword usage). I want to do this without machine learning techniques, natural language processing, or parallelization techniques such as MapReduce. I have searched for many clustering algorithms and libraries, such as BIRCH, BFR, CURE, ROCK, and CLARANS, but none of them seems to suit my needs: they are either designed for spatial points, rely on machine learning models, or struggle with large datasets.

So I am asking whether you know of clustering algorithms, libraries (preferably jars), or reasonably implementable pseudocode that work on text, or that can easily be adapted to work with strings.

Hope everything is clear.

UPDATE

While waiting for responses I came across the scikit-learn library for Python, in particular MiniBatchKMeans, and I am experimenting with it for now. So, as an update: if you find something in Python, feel free to share.

Answer 1:

Instead of clustering (how many clusters? What about users that do not fit any cluster?), you should consider frequent itemset mining to find popular combinations of keywords.
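
To make the suggestion concrete, here is a minimal sketch of the simplest case of frequent itemset mining: finding keyword pairs used together by at least a given number of users, with Apriori-style pruning. It is an illustration only, not a tuned implementation. The class name `FrequentKeywordPairs`, the method names, the `"a|b"` key format, and every sample row except the first are made up for this example. It also assumes the keyword strings are already in memory; with millions of rows you would probably stream the file twice instead.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeSet;

// Hypothetical helper class, not part of any library.
public class FrequentKeywordPairs {

    /** Splits a keyword string on whitespace into a sorted set of distinct keywords. */
    static TreeSet<String> toItemset(String keywords) {
        TreeSet<String> items = new TreeSet<>();
        for (String k : keywords.trim().split("\\s+")) {
            if (!k.isEmpty()) {
                items.add(k);
            }
        }
        return items;
    }

    /**
     * Returns the keyword pairs used together by at least minSupport users.
     * Keys have the form "a|b" with a before b in lexicographic order.
     */
    static Map<String, Integer> frequentPairs(List<String> rows, int minSupport) {
        // Pass 1: count how many users use each single keyword.
        Map<String, Integer> single = new HashMap<>();
        for (String row : rows) {
            for (String k : toItemset(row)) {
                single.merge(k, 1, Integer::sum);
            }
        }

        // Pass 2: count pairs, skipping any keyword that is not frequent on its own
        // (a pair cannot be frequent unless both of its members are).
        Map<String, Integer> pairs = new HashMap<>();
        for (String row : rows) {
            List<String> frequent = new ArrayList<>();
            for (String k : toItemset(row)) {
                if (single.get(k) >= minSupport) {
                    frequent.add(k);
                }
            }
            for (int i = 0; i < frequent.size(); i++) {
                for (int j = i + 1; j < frequent.size(); j++) {
                    pairs.merge(frequent.get(i) + "|" + frequent.get(j), 1, Integer::sum);
                }
            }
        }

        // Keep only the pairs that reach the support threshold.
        pairs.values().removeIf(count -> count < minSupport);
        return pairs;
    }

    public static void main(String[] args) {
        // First row is the sample from the question; the others are invented.
        List<String> rows = Arrays.asList(
                " play house business card",
                "business card meeting",
                "play guitar house",
                "card business");
        System.out.println(frequentPairs(rows, 2));
    }
}
```

Larger itemsets follow the same idea (extend only combinations whose subsets are already frequent), and for real workloads an established implementation of Apriori or FP-Growth is likely a better fit than hand-rolled code.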

Source: https://stackoverflow.com/questions/52115697/short-text-clustering-with-large-dataset-user-profiling

Tags

java

text

cluster-analysis

large-data

user-profile