Dear Stack Overflow community:
Given some text, I wish to get the TOP 50 most frequent words in the text, and create a tag cloud out of it, and thus show the gist of wha
Here is an article that describes setting up a Tag Cloud - Creating a Tag Cloud with Solr and PHP. While the PHP portion may not be applicable to you, the actual generation of the tag cloud I believe is...
This article describes a method of creating a text field with a whitespace tokenizer, so that individual words are indexed, and then performing a facet search against this field. You can also set a facet limit, so in your case you can retrieve only the top 100 results.
I have come up with a STOPGAP solution (I'm calling each Solr document a "post" for example's sake):
There is a terms component in Solr, whose purpose seems to be to expose all the indexed terms of any given field. It is mainly used to implement features like auto-complete, and other features that operate at a term level. And it is by default sorted by frequency - the more frequently occurring terms in the field come up first.
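As a rough illustration of the terms component, the request below asks for the 50 most frequent terms of one field. This is only a sketch: the core URL and the field name `content` are hypothetical, and it assumes a `/terms` request handler is configured for the core (the stock example configuration has one, but yours may not).

```python
from urllib.parse import urlencode


def build_terms_url(base_url, field, limit=50):
    """Build a TermsComponent request for the most frequent terms of a field.

    base_url and field are placeholders; adjust them to your own core.
    """
    params = {
        "terms": "true",        # enable the terms component
        "terms.fl": field,      # field whose indexed terms are listed
        "terms.limit": limit,   # how many terms to return
        "terms.sort": "count",  # most frequent first (the default order)
        "wt": "json",
    }
    return f"{base_url}/terms?{urlencode(params)}"


# Hypothetical core URL and field name, for illustration only.
url = build_terms_url("http://localhost:8983/solr/posts", "content")
print(url)
```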
What I have done is create a dynamic field with the prefix content_ and index each post-set in its own field based on category. This means that there will be hundreds of instances of the dynamic field, each containing one post-set, and I can use the terms component on that field to get the TOP TERMS for that post-set.
As a picture :
content_postSetOne : contains indexed version of a set of posts
content_postSetTwo : contains indexed version of another set of posts
content_postSetThree : contains indexed version of a third set of posts
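Once the terms for one of these fields come back, turning them into a tag cloud is mostly a matter of scaling counts to font sizes. The sketch below assumes the terms arrive as a flat list alternating term and count (which is how Solr's JSON writer lays them out by default, as far as I know); the sample data and the pixel range are made up.

```python
def top_terms(flat, limit=50):
    """Pair up a flat [term, count, term, count, ...] list and keep the top entries."""
    pairs = list(zip(flat[0::2], flat[1::2]))
    pairs.sort(key=lambda pair: pair[1], reverse=True)
    return pairs[:limit]


def font_sizes(pairs, min_px=12, max_px=36):
    """Scale each term's count linearly into a font size between min_px and max_px."""
    if not pairs:
        return {}
    counts = [count for _, count in pairs]
    lowest, highest = min(counts), max(counts)
    span = (highest - lowest) or 1  # avoid dividing by zero when all counts are equal
    return {
        term: min_px + (count - lowest) * (max_px - min_px) // span
        for term, count in pairs
    }


# Made-up terms for a field such as content_postSetOne.
sample = ["solr", 40, "lucene", 25, "facet", 10]
sizes = font_sizes(top_terms(sample))
print(sizes)
```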
This solution is sort of working for me, and you can easily create a field per post as well if needed. I'm also interested in knowing the implications of using dynamic fields like this: will this be a problem?
How this differs from the answers by Paige and jPountz:
If a Lucene document is a comment, you could use faceting to do so. For example, the following request
http://solr:port/solr/select?q={!lucene}uniqueKey:(MA147LL/A OR 3007WFP)&facet=true&facet.field=includes&facet.mincount=1&facet.limit=50
would help you build a tag cloud for comments MA147LL/A and 3007WFP.
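For what it's worth, the same facet request can be assembled programmatically so that the query gets URL-encoded properly. This is a sketch only: the base URL is a placeholder, the helper name is made up, and note that Solr spells the minimum-count parameter in lowercase, `facet.mincount`.

```python
from urllib.parse import urlencode


def build_facet_url(base_url, doc_ids, field="includes", limit=50):
    """Build a facet request over `field`, restricted to the given document ids."""
    params = {
        "q": "{!lucene}uniqueKey:(" + " OR ".join(doc_ids) + ")",
        "facet": "true",
        "facet.field": field,
        "facet.mincount": 1,   # skip terms that do not occur in the matched documents
        "facet.limit": limit,  # top N terms by count
        "wt": "json",
    }
    return f"{base_url}/select?{urlencode(params)}"


# Placeholder host and port; the ids are the ones from the request above.
facet_url = build_facet_url("http://localhost:8983/solr", ["MA147LL/A", "3007WFP"])
print(facet_url)
```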
However, this approach would :
includes
field, which required memory,