Building a tag cloud with Solr

滥情空心 2021-02-06 10:47

Dear Stack Overflow community:

Given some text, I wish to get the TOP 50 most frequent words in the text, and create a tag cloud out of it, and thus show the gist of wha

3 answers
  • 2021-02-06 10:52

    Here is an article that describes setting up a Tag Cloud - Creating a Tag Cloud with Solr and PHP. While the PHP portion may not be applicable to you, the actual generation of the tag cloud I believe is...

    This article describes creating a text field with a whitespace tokenizer, so that individual words are indexed as separate terms, and then running a facet query against that field. I know that you can set facet limits, so in your case you can restrict the results to the top terms only (for example, the top 100).
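
    As a rough sketch of what happens after such a facet query returns: with the default JSON output, Solr lists each facet field as a flat array alternating term and count. The field name `tag_words`, the sample response, and the linear font-size scaling below are all assumptions made for illustration; they are not taken from the linked article.

```python
def facet_pairs(response, field):
    """Turn Solr's flat [term, count, term, count, ...] facet list into (term, count) pairs."""
    flat = response["facet_counts"]["facet_fields"][field]
    return list(zip(flat[0::2], flat[1::2]))


def font_sizes(pairs, min_px=12, max_px=36):
    """Linearly scale counts to font sizes (an illustrative choice, not from the article)."""
    if not pairs:
        return {}
    counts = [count for _, count in pairs]
    lo, hi = min(counts), max(counts)
    span = (hi - lo) or 1  # avoid dividing by zero when all counts are equal
    return {term: round(min_px + (count - lo) * (max_px - min_px) / span)
            for term, count in pairs}


# Hand-written sample in the shape of Solr's default JSON facet output.
# "tag_words" is a hypothetical field name.
sample = {"facet_counts": {"facet_fields": {"tag_words": ["solr", 10, "cloud", 6, "php", 2]}}}

pairs = facet_pairs(sample, "tag_words")
sizes = font_sizes(pairs)
```

    The most frequent term gets the largest font size and the least frequent the smallest; any other scaling (logarithmic, bucketed) would slot into `font_sizes` the same way.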

  • 2021-02-06 10:56

    I have come up with a STOPGAP solution. (I'm calling each Solr document a "post" for example's sake.)

    There is a terms component in Solr, whose purpose seems to be to expose all the indexed terms of any given field. It is mainly used to implement features like auto-complete, and other features that operate at a term level. And it is by default sorted by frequency - the more frequently occurring terms in the field come up first.

    What I have done is create a dynamic field called content_ and index each post-set in its own field, based on category. This means that there will be hundreds of instances of the dynamic field, each containing one post-set, and I can use the terms component on a given field to get the TOP TERMS for that post-set.

    As a picture:

    content_postSetOne : contains indexed version of a set of posts
    content_postSetTwo : contains indexed version of another set of posts
    content_postSetThree : contains indexed version of a third set of posts
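
    A request to the terms component for one of these fields might look roughly like the sketch below. It assumes a `/terms` request handler is configured, and the host and the core name `posts` are hypothetical; `terms.fl`, `terms.limit` and `terms.sort` are standard terms component parameters. The sample response is hand-written in the flat term/count shape of the default JSON output.

```python
from urllib.parse import urlencode


def terms_url(base_url, field, limit=50):
    """Build a terms component request for the most frequent terms of one field."""
    params = {
        "terms": "true",
        "terms.fl": field,      # the field whose indexed terms we want
        "terms.limit": limit,   # how many terms to return
        "terms.sort": "count",  # most frequent first
        "wt": "json",
    }
    return f"{base_url}/terms?{urlencode(params)}"


def top_terms(response, field):
    """Pair up the flat [term, count, ...] list returned for the field."""
    flat = response["terms"][field]
    return list(zip(flat[0::2], flat[1::2]))


# Hypothetical host and core name; the field name comes from the picture above.
url = terms_url("http://localhost:8983/solr/posts", "content_postSetOne")

# Hand-written sample response, not real output.
sample = {"terms": {"content_postSetOne": ["solr", 42, "cloud", 17]}}
top = top_terms(sample, "content_postSetOne")
```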
    

    This solution is sort of working for me, and you can easily create a field per post as well if needed. I'm also interested in knowing the implications of using dynamic fields like this: will this be a problem?

    How this differs from the Paige and jPountz answers:

    1. The term frequency is the count of words in one document or in a set of documents, not the number of documents containing the term.
    2. I can get the top-occurring terms from within ONE document and, if needed, from a set of documents.
    3. I did not use faceting because it primarily gives the frequency as the number of documents containing the word, not the number of times the word occurred regardless of which document it occurred in.
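
    The distinction these points draw, total occurrences versus number of documents containing a term, can be shown outside Solr with a small sketch. The three sample "posts" are made up, and this is plain Python counting, not what Solr does internally.

```python
from collections import Counter

# Three made-up posts.
docs = ["solr solr cloud", "solr tag", "tag tag tag cloud"]

total_occurrences = Counter()  # how many times each word occurs overall
doc_frequency = Counter()      # how many posts contain each word at least once

for text in docs:
    words = text.lower().split()
    total_occurrences.update(words)
    doc_frequency.update(set(words))  # count each word once per post
```

    Here "tag" occurs four times in total but appears in only two posts, so the two measures can rank words differently.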
  • 2021-02-06 11:09

    If a Lucene document is a comment, you could use faceting to do so. For example, the following request http://solr:port/solr/select?q={!lucene}uniqueKey:(MA147LL/A OR 3007WFP)&facet=true&facet.field=includes&facet.mincount=1&facet.limit=50 would help you build a tag cloud for comments MA147LL/A and 3007WFP.

    However, this approach would:

    • make Solr instantiate an UnInvertedField instance for the includes field, which requires memory,
    • count the number of documents that match a term instead of the total number of occurrences of that term.
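
    A minimal sketch of building such a facet request with proper URL encoding follows. The base URL is hypothetical, `uniqueKey` stands in for whatever the unique key field is actually called, and two details are my own additions rather than part of the request above: the ids are quoted (so that the `/` in `MA147LL/A` is treated literally) and `rows=0` is set because only the facet counts are needed.

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def facet_url(base_url, ids, field, limit=50):
    """Build a facet request restricted to the given document ids."""
    # Quote each id so characters such as "/" are matched literally.
    id_clause = " OR ".join(f'"{doc_id}"' for doc_id in ids)
    params = {
        "q": "{!lucene}uniqueKey:(" + id_clause + ")",
        "facet": "true",
        "facet.field": field,
        "facet.mincount": 1,   # skip terms that match no document
        "facet.limit": limit,  # keep only the top terms
        "rows": 0,             # assumption: only the facet counts are wanted
        "wt": "json",
    }
    return f"{base_url}/select?{urlencode(params)}"


# Hypothetical base URL; ids and field name are taken from the example request.
url = facet_url("http://localhost:8983/solr", ["MA147LL/A", "3007WFP"], "includes")

# Decode the query string again to check what Solr would receive.
params = parse_qs(urlsplit(url).query)
```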