Mallet basic usage. First steps

Question

I'm trying to use Mallet with literally no experience in topic modeling. My goal is to get N topics from the M documents I have right now, label every document with one or more topics (doc 1 = topic 1; doc 2 = topic 2 and possibly topic 3), and use these results to classify new documents in the future. I first tried bigartm for this, but found nothing for classification in that program, only topic modeling. So, on to Mallet: I created a corpus.txt file with the following format:

Doc.num. \t(tab) Label (actually 1 everywhere) \t Text
1 1 some text of document to classify
2 1 another doc text
...

So far I can get topics from this file after converting it to Mallet's feature sequence format with

bin/mallet import-file --input corpus.txt --output foo.mallet --keep-sequence

and then get topics from it

bin/mallet train-topics --input foo.mallet --output-state state.gz --output-topic-keys topic-keys.txt --output-doc-topics doc-topics.txt

So my general question is: what should I use in Mallet (train a classifier?) to assign every existing document to the topics I found, and to save this result so I can apply it to future documents that I want to classify with these topics?

Thanks

Answer 1:

What you're looking for is described as "inference" in Mallet topic models. Training a classifier is a separate package, aimed at directly learning relationships between words and a pre-existing set of classes.

Here are directions for using inference on new documents:

When you train a model with the train-topics command, add the --inferencer-filename [FILENAME] option. This option will create a topic inference tool based on the current, trained model and save it in a file. If you already have a trained model, for example from --output-state or --output-model, you can initialize from that state or model, run 0 iterations of sampling, and output an inferencer.

Once you've created the inferencer file, use the MALLET command bin/mallet infer-topics --help to get information on using topic inference.

Note that you must make sure that the new data is compatible with your training data. Otherwise word ID 425 might mean a completely different word. This will make all topics look equally probable. Use the option --use-pipe-from [MALLET TRAINING FILE] in the MALLET command bin/mallet import-file or import-dir to specify a training file.
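The three steps above can be sketched as one shell script. This is only a sketch under assumptions: the file names inferencer.mallet, new-docs.txt, new-docs.mallet and new-doc-topics.txt are hypothetical placeholders, the MALLET variable and the run helper are not part of Mallet, and the infer-topics options shown should be checked against the output of bin/mallet infer-topics --help for your version. The helper prints each command and only executes it if the Mallet launcher is found at the given path.

```shell
#!/bin/sh
# Sketch of the inference workflow. All file names except foo.mallet are
# hypothetical placeholders; adjust them to your setup.
MALLET="${MALLET:-bin/mallet}"   # assumed path to the Mallet launcher

# Print each command; execute it only if the launcher actually exists.
run() {
    echo "+ $*"
    if [ -x "$MALLET" ]; then "$@"; fi
}

# 1. Train topics and save an inferencer alongside the usual outputs.
run "$MALLET" train-topics --input foo.mallet \
    --output-doc-topics doc-topics.txt \
    --output-topic-keys topic-keys.txt \
    --inferencer-filename inferencer.mallet

# 2. Import the new documents through the same pipe as the training data,
#    so that word IDs match.
run "$MALLET" import-file --input new-docs.txt --output new-docs.mallet \
    --use-pipe-from foo.mallet

# 3. Infer topic proportions for the new documents.
run "$MALLET" infer-topics --input new-docs.mallet \
    --inferencer inferencer.mallet \
    --output-doc-topics new-doc-topics.txt
```

The new-docs.txt file is assumed to use the same tab-separated layout as corpus.txt.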

Answer 2:

In the setting of your question, the human-readable file doc-topics.txt contains the document-topic matrix, i.e., the breakdown of every document in the corpus into topics. The topics are indicated by numbers, and their proportions are given. Within each line the topics are already sorted by proportion, so the dominant topic comes first.
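To turn these proportions into "one or more topics per document", one option is a simple cut-off. The sketch below assumes the layout described above: whitespace-separated fields, with the document index and name followed by topic/proportion pairs, and document names that contain no spaces. Some Mallet versions instead write one proportion per topic in topic order, so check your file first. The function name assign_topics and the default threshold of 0.2 are hypothetical choices, not part of Mallet.

```shell
#!/bin/sh
# assign_topics FILE [THRESHOLD]
# Assumed line layout: doc-index doc-name topic proportion topic proportion ...
# Prints: doc-name <TAB> comma-separated topics whose proportion >= THRESHOLD.
assign_topics() {
    awk -v t="${2:-0.2}" '
        /^#/ { next }                      # skip a header line, if present
        {
            out = ""
            # Walk the (topic, proportion) pairs starting at field 3.
            for (i = 3; i + 1 <= NF; i += 2)
                if ($(i + 1) + 0 >= t + 0)
                    out = out (out == "" ? "" : ",") $i
            print $2 "\t" out
        }' "$1"
}
```

It would be called as assign_topics doc-topics.txt 0.2. A document whose topics all fall below the threshold gets an empty topic list, which may be a hint that the threshold is too high for your number of topics.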

The file topic-keys.txt contains the top words for each topic. You can use this file to guess some good labels for the topics. This labelling is mainly a manual task, and it may involve going back to the documents and inspecting some of them to arrive at really good labels.
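As a starting point for manual labelling, the first few words of each topic can be pulled out of topic-keys.txt. This sketch assumes each line holds three tab-separated fields: the topic number, a weight, and a space-separated word list. Verify this against your own file. The function name top_words and the default of three words are hypothetical.

```shell
#!/bin/sh
# top_words FILE [N]
# Assumed line layout: topic-number <TAB> weight <TAB> word word word ...
# Prints the first N words of each topic as a candidate label.
top_words() {
    awk -F '\t' -v n="${2:-3}" '
        {
            count = split($3, words, " ")   # split the word list on whitespace
            label = ""
            for (i = 1; i <= n && i <= count; i++)
                label = label (i > 1 ? " " : "") words[i]
            print "topic " $1 ": " label
        }' "$1"
}
```

It would be called as top_words topic-keys.txt 3. The output is only a candidate label per topic; choosing a final label still means reading some of the documents.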

Source: https://stackoverflow.com/questions/40585370/mallet-basic-usage-first-steps

Tags

topic-modeling

mallet