I am using Gensim for some topic modelling and I have gotten to the point where I am doing similarity queries using the LSI and tf-idf models. I get back the set of IDs and similarities, but how do I map each ID back to the original document it refers to?
I have just gone through the same process and reached the same point of having "sims" with a document ID but wanting my original "article code". Although it isn't fully documented, there is a metadata feature threaded through the Gensim library and its examples which can help. I'll answer this while I remember what I had to do, in case it helps any future visitors to this old question.
See gensim.corpora.textcorpus.TextCorpus#get_texts, which yields either the plain text, or the text plus a single item of metadata (the line number) when the metadata flag is enabled:
def get_texts(self):
    """Iterate over the collection, yielding one document at a time. A document
    is a sequence of words (strings) that can be fed into `Dictionary.doc2bow`.
    Each document will be fed through `preprocess_text`. That method should be
    overridden to provide different preprocessing steps. This method will need
    to be overridden if the metadata you'd like to yield differs from the line
    number.

    Returns:
        generator of lists of tokens (strings); each list corresponds to a
        preprocessed document from the corpus `input`.
    """
    lines = self.getstream()
    if self.metadata:
        for lineno, line in enumerate(lines):
            yield self.preprocess_text(line), (lineno,)
    else:
        for line in lines:
            yield self.preprocess_text(line)
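So with a stock TextCorpus and metadata enabled, each item yielded by get_texts is a (tokens, (lineno,)) pair. A quick sketch, assuming a hypothetical one-document-per-line text file and a gensim version whose get_texts matches the source above:

from gensim.corpora.textcorpus import TextCorpus

# 'mydocs.txt' is a hypothetical plain-text file, one document per line
corpus = TextCorpus('mydocs.txt', metadata=True)
for tokens, (lineno,) in corpus.get_texts():
    print(lineno, tokens[:5])  # line number plus the first few tokens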
I had already implemented a custom make_corpus.py script, and a trial classifier script that uses similarity to find documents related to a search document. The changes I made to utilise the metadata from that point were as follows:
In the make_corpus script, I enabled metadata in the constructor call for my TextCorpus daughter class:
corpus = SysRevArticleCorpus(inp, lemmatize=lemmatize, metadata=True)
I also needed to serialise the metadata, as I'm not doing the processing immediately after corpus generation (as some of the examples do), so the metadata flag has to be turned on in the serialisation step too:
MmCorpus.serialize(outp + '_bow.mm', corpus, progress_cnt=10000, metadata=True)
This makes gensim.matutils.MmWriter#write_corpus save an "xxx_bow.mm.metadata.cpickle" file alongside your corpus .mm files.
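If you want to check what got written, the file is (as far as I can tell) just a pickled plain dict keyed by document number, so you can inspect it directly:

import pickle

# maps the Gensim document number (the position in the corpus) to the
# metadata tuple that get_texts yielded for that document
with open('xxx_bow.mm.metadata.cpickle', 'rb') as f:
    docno2metadata = pickle.load(f)
print(docno2metadata[0])  # e.g. (0,) with the default line-number metadata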
To add more items to the metadata, you need to implement and override a few things in a TextCorpus daughter class. I had already based mine on the WikiCorpus example class, as I have my own existing corpus to read.
The constructor needs to receive the metadata flag, e.g.:
def __init__(self, fname, processes=None, lemmatize=utils.has_pattern(),
             dictionary=None, metadata=False,
             ...):
    ...
    self.metadata = metadata
    if dictionary is None:
        # temporarily disable metadata so the internal dictionary
        # build (which iterates get_texts) sees plain token lists
        metadata_setting = self.metadata
        self.metadata = False
        self.dictionary = Dictionary(self.get_texts())
        self.metadata = metadata_setting
    else:
        self.dictionary = dictionary
I'm actually reading in from a JSON corpus so I'd already written a custom parser. My articles have a "code" property which is my canonical document ID. I also want to store the "title", and the document body is in the "text" property. (This replaces the XML parsing in the wiki example).
def extract_articles(f, filter_namespaces=False):
    """
    Extract articles from a SYSREV article export JSON = open file-like object `f`.
    Return an iterable over (str, str, str) which generates (title, content, pageid) triplets.
    """
    elems = (elem for elem in f)
    for elem in elems:
        # my canonical article "code" goes in the pageid slot
        yield elem["title"], elem["text"] or "", elem["code"]
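For context, my export looks roughly like the following once json.load has parsed it (the field names are from my own format, not anything Gensim prescribes):

corpusdata = {
    "docs": [
        {
            "code": "ShiShani2008ProCarNur",  # my canonical document ID
            "title": "Jordanian nurses and physicians learning needs "
                     "for promoting smoking cessation.",
            "text": "Full article body...",
        },
        # ... more articles
    ]
}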
This is called from within the overridden get_texts (the parent class docstring notes that you need to override it if your metadata differs from the line number). Summarised:
def get_texts(self):
    ...
    with open(self.fname) as data_file:
        corpusdata = json.load(data_file)

    texts = ((text, self.lemmatize, title, pageid)
             for title, text, pageid
             in extract_articles(corpusdata['docs'], self.filter_namespaces))

    ...  # (skipping pool processing stuff for clarity)

    for tokens, title, pageid in pool.imap(process_article, group):
        if self.metadata:
            yield (tokens, (pageid, title))
        else:
            yield tokens
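If you don't need the multiprocessing pool from the WikiCorpus example, a minimal single-process version of the same loop might look like this (a sketch, using the parent class's preprocess_text as the tokeniser):

def get_texts(self):
    with open(self.fname) as data_file:
        corpusdata = json.load(data_file)
    for title, text, pageid in extract_articles(corpusdata['docs'],
                                                self.filter_namespaces):
        tokens = self.preprocess_text(text)
        if self.metadata:
            # this tuple is what ends up in the .metadata.cpickle file
            yield tokens, (pageid, title)
        else:
            yield tokens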
So this should get you saving metadata alongside your corpus .mm files. When you want to re-read it in a later script, you'll need to load the pickle file back in - there doesn't seem to be a built-in method for re-reading the metadata. Fortunately it's just a plain Python dict keyed by the Gensim-generated document ID, so it's easy to load and use. (See wiki-sim-search)
e.g. in my trial classifier, I just added two things: a metadata = pickle.load(...) call, and a metadata[docID] lookup to finally find the original article.
import pickle
from gensim import corpora, models, similarities

# re-load everything... (datapath is wherever make_corpus wrote its output)
dictionary = corpora.Dictionary.load_from_text(datapath + '/en_wordids.txt')
corpus = corpora.MmCorpus(datapath + '/xxx_bow.mm')
metadata = pickle.load(open(datapath + '/xxx_bow.mm.metadata.cpickle', 'rb'))

lsiModel = models.LsiModel(corpus, id2word=dictionary, num_topics=4)
index = similarities.MatrixSimilarity(lsiModel[corpus])

# example search
doc = "electronic cognitive simulation"
vec_bow = dictionary.doc2bow(doc.lower().split())
vec_lsi = lsiModel[vec_bow]  # convert the query to LSI space

# perform a similarity query against the corpus
sims = index[vec_lsi]
sims = sorted(enumerate(sims), key=lambda item: -item[1])

# Look up the original article metadata for the top hit
(docID, score) = sims[0]
print(metadata[docID])
# Prints (CODE, TITLE), e.g.:
# ('ShiShani2008ProCarNur', 'Jordanian nurses and physicians learning needs for promoting smoking cessation.')
I know this doesn't provide the original text as you asked (I don't need it myself), but you could very easily add the text to the "metadata" (although this rather stretches the definition of metadata and could be very big!). I guess Gensim presumes you will already have some database of your original documents, and therefore it would be out of scope. However I feel there needs to be a mapping between the Gensim-generated IDs and the original document identifiers, which the metadata feature fulfils quite well.
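If you did want the text along for the ride, it's a one-line change to what the overridden get_texts yields (an untested sketch; watch the size of the resulting pickle):

if self.metadata:
    # stash the full body in the metadata pickle too
    yield (tokens, (pageid, title, text))
else:
    yield tokens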