Question
I'm trying to use Mallet with literally no experience in topic modeling. My goal is to get N topics from the M documents I have right now, assign one or more topics to every document (doc 1 = topic 1; doc 2 = topic 2 and possibly topic 3), and use these results to classify new documents in the future. I first tried bigartm for this, but found nothing for classification in that program, only topic modeling. So, on to Mallet: I created a corpus.txt file with the following format:
Doc. num. \t (tab) Label (actually 1 everywhere) \t Text
1 1 some text of document to classify
2 1 another doc text
...
So far I can get topics from this file after converting it to Mallet's feature sequence format with
bin/mallet import-file --input corpus.txt --output foo.mallet --keep-sequence
and then get topics from it
bin/mallet train-topics --input foo.mallet --output-state state.gz --output-topic-keys topic-keys.txt --output-doc-topics doc-topics.txt
So the general question is: what should I use in Mallet (train a classifier?) to assign every existing document to the topics I found, and to save the result so I can apply it to future documents that I want to classify with these topics?
Thanks
Answer 1:
What you're looking for is described as "inference" in Mallet topic models. Classifier training is handled by a separate package, aimed at directly learning relationships between words and a pre-existing set of classes.
Here are directions for using inference on new documents:
When you train a model with the train-topics command, add the --inferencer-filename [FILENAME] option. This option will create a topic inference tool based on the current, trained model and save it in a file.
If you already have a trained model, for example from --output-state or --output-model, you can initialize from that state or model, run 0 iterations of sampling, and output an inferencer.
Once you've created the inferencer file, use the MALLET command bin/mallet infer-topics --help to get information on using topic inference.
Note that you must make sure that the new data is compatible with your training data. Otherwise word ID 425 might mean a completely different word. This will make all topics look equally probable. Use the option --use-pipe-from [MALLET TRAINING FILE] in the MALLET command bin/mallet import-file or import-dir to specify a training file.
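The directions above can be sketched as shell commands. This is only a rough outline under some assumptions: the file names inferencer.mallet, new-docs.txt, new-docs.mallet and new-doc-topics.txt and the function names are made up for illustration, new-docs.txt is assumed to have the same tab-separated format as corpus.txt, and the --input-state plus --num-iterations 0 combination is my reading of the "initialize from that state, run 0 iterations" step. Check bin/mallet train-topics --help and bin/mallet infer-topics --help against your version before relying on it.

```shell
# Path to the MALLET launcher; adjust to your installation (assumption).
MALLET="${MALLET:-bin/mallet}"

# 1. Train as in the question, but also save an inferencer.
train_with_inferencer() {
  "$MALLET" train-topics --input foo.mallet \
    --output-state state.gz \
    --output-topic-keys topic-keys.txt \
    --output-doc-topics doc-topics.txt \
    --inferencer-filename inferencer.mallet
}

# 1b. Alternative if you already trained with --output-state:
#     reload the state, sample 0 iterations, write the inferencer.
inferencer_from_state() {
  "$MALLET" train-topics --input foo.mallet \
    --input-state state.gz \
    --num-iterations 0 \
    --inferencer-filename inferencer.mallet
}

# 2. Import new documents through the same pipe as the training data,
#    so that word IDs mean the same thing in both files.
import_new_docs() {
  "$MALLET" import-file --input new-docs.txt \
    --output new-docs.mallet \
    --use-pipe-from foo.mallet
}

# 3. Infer topic proportions for the new documents.
infer_new_docs() {
  "$MALLET" infer-topics --input new-docs.mallet \
    --inferencer inferencer.mallet \
    --output-doc-topics new-doc-topics.txt
}
```

Calling train_with_inferencer (or inferencer_from_state), then import_new_docs, then infer_new_docs should leave the topic proportions for the new documents in new-doc-topics.txt, in the same layout as doc-topics.txt.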
Answer 2:
In the setting of your question, the human-readable file doc-topics.txt contains the document-topic matrix, i.e., the decomposition of every document in the corpus into topics. The topics are identified by number, and their proportions are given. The file lists each document's topics in order of prominence, so the dominant topic comes first.
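To turn these proportions into "one or more topics per document", one option is to keep every topic whose proportion reaches some cutoff. Here is a minimal sketch, assuming each line of doc-topics.txt holds a document index, a document name without spaces, and then topic/proportion pairs as described above. The function name assign_topics and the default cutoff of 0.2 are arbitrary choices of mine. Some MALLET versions may instead write one proportion column per topic in topic order, in which case the loop would need adjusting, so look at your file first.

```shell
# Print "name<TAB>topic<TAB>proportion" for every topic whose proportion
# is at or above the cutoff.
# Usage: assign_topics FILE [CUTOFF]   (CUTOFF defaults to 0.2)
assign_topics() {
  awk -v min="${2:-0.2}" '
    /^#/ { next }                         # skip the header line
    {
      # fields: 1 = doc index, 2 = doc name, then topic/proportion pairs
      for (i = 3; i + 1 <= NF; i += 2)
        if ($(i + 1) + 0 >= min + 0)
          print $2 "\t" $i "\t" $(i + 1)
    }
  ' "$1"
}
```

For example, assign_topics doc-topics.txt 0.25 lists, for each document, all topics that make up at least a quarter of it. The same function can be run on the output of infer-topics for new documents.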
The file topic-keys.txt contains the top words for each topic. You can use this file to come up with good labels for the topics. This labelling is mainly a manual task, and it may involve going back to the documents and inspecting some of them to arrive at really good labels.
Source: https://stackoverflow.com/questions/40585370/mallet-basic-usage-first-steps