How to get the N most frequent words in a given text, sorted from max to min?

放肆的年华 submitted on 2019-12-11 04:06:13

Question


I have been given a large text as input. I have built a HashMap that stores each distinct word as a key and the number of times it occurs as the value (Integer).

Now I have to write a method mostOften(int k): List that returns a List of the k most frequent words, ordered from the highest number of occurrences to the lowest (descending order), using the HashMap I built before. The catch is that whenever two words have the same number of occurrences, they should be sorted alphabetically.

My first idea was to swap the keys and values of the given HashMap and put them into a TreeMap, so that the TreeMap sorts the words by key (Integer, the number of occurrences of the word), and then just pop the last/first K entries from the TreeMap.

But I will have collisions for sure when two or three words have the same count. I can compare the words alphabetically, but what Integer should I use as the key for the second word that comes in?

Any ideas on how to implement this, or other options?


Answer 1:


Here's the solution I came up with.

  1. First, create a class MyWord that stores the String value of the word and its number of occurrences.
  2. Implement the Comparable interface for this class to sort by occurrences first (descending) and then alphabetically if the number of occurrences is the same.
  3. Then, in the mostOften method, create a new List of MyWord and add one element for each entry of your original map.
  4. Sort this list.
  5. Take the first k items of this list using subList.
  6. Add those Strings to a List<String> and return it.

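import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class Test {
    public static void main(String [] args){
        Map<String, Integer> m = new HashMap<>();
        m.put("hello",5);
        m.put("halo",5);
        m.put("this",2);
        m.put("that",2);
        m.put("good",1);
        System.out.println(mostOften(m, 3));
    }

    public static List<String> mostOften(Map<String, Integer> m, int k){
        // Wrap each map entry so it can be sorted by count, then by word
        List<MyWord> l = new ArrayList<>();
        for(Map.Entry<String, Integer> entry : m.entrySet())
            l.add(new MyWord(entry.getKey(), entry.getValue()));

        Collections.sort(l);
        List<String> list = new ArrayList<>();
        // Math.min guards against k exceeding the number of distinct words
        for(MyWord w : l.subList(0, Math.min(k, l.size())))
            list.add(w.word);
        return list;
    }
}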

class MyWord implements Comparable<MyWord>{
    public String word;
    public int occurence;

    public MyWord(String word, int occurence) {
        super();
        this.word = word;
        this.occurence = occurence;
    }

    @Override
    public int compareTo(MyWord arg0) {
        // Higher counts first (arguments swapped), then alphabetical on ties
        int cmp = Integer.compare(arg0.occurence,this.occurence);
        return cmp != 0 ? cmp : word.compareTo(arg0.word);
    }

    @Override
    public int hashCode() {
        final int prime = 31;
        int result = 1;
        result = prime * result + occurence;
        result = prime * result + ((word == null) ? 0 : word.hashCode());
        return result;
    }

    @Override
    public boolean equals(Object obj) {
        if (this == obj)
            return true;
        if (obj == null)
            return false;
        if (getClass() != obj.getClass())
            return false;
        MyWord other = (MyWord) obj;
        if (occurence != other.occurence)
            return false;
        if (word == null) {
            if (other.word != null)
                return false;
        } else if (!word.equals(other.word))
            return false;
        return true;
    }   

}

Output : [halo, hello, that]




Answer 2:


Hints:

  1. Look at the javadocs for the Collections.sort methods ... both of them!

  2. Look at the javadocs for Map.entrySet().

  3. Think about how to implement a Comparator that compares instances of a class with two fields, using the 2nd as a "tie breaker" when the other compares as equal.
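
One way these hints could fit together is sketched below. This is only an illustration, not the answerer's own code: the class name EntrySortSketch and the constant BY_COUNT_THEN_WORD are made up for this example, and the sample data is borrowed from Answer 1. It copies Map.entrySet() into a list and sorts it with the two-argument Collections.sort, using a Comparator that compares counts first and falls back to the word as the tie breaker.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical class name, used only for this sketch.
class EntrySortSketch {

    // Count descending first; on equal counts, word ascending (the tie breaker).
    static final Comparator<Map.Entry<String, Integer>> BY_COUNT_THEN_WORD =
        new Comparator<Map.Entry<String, Integer>>() {
            @Override
            public int compare(Map.Entry<String, Integer> a, Map.Entry<String, Integer> b) {
                int cmp = Integer.compare(b.getValue(), a.getValue());
                return cmp != 0 ? cmp : a.getKey().compareTo(b.getKey());
            }
        };

    static List<String> mostOften(Map<String, Integer> counts, int k) {
        // Copy the entries so they can be sorted without touching the map.
        List<Map.Entry<String, Integer>> entries = new ArrayList<>(counts.entrySet());
        Collections.sort(entries, BY_COUNT_THEN_WORD);

        List<String> result = new ArrayList<>();
        // Take at most k entries, fewer if the map is smaller than k.
        for (Map.Entry<String, Integer> e : entries.subList(0, Math.min(k, entries.size()))) {
            result.add(e.getKey());
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, Integer> counts = new HashMap<>();
        counts.put("hello", 5);
        counts.put("halo", 5);
        counts.put("this", 2);
        counts.put("that", 2);
        counts.put("good", 1);
        System.out.println(mostOften(counts, 3)); // prints [halo, hello, that]
    }
}
```

Compared with Answer 1, this avoids a helper class at the cost of a slightly noisier Comparator type; the sort itself is still O(M log M) for M distinct words.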




Answer 3:


In addition to your Map for storing word counts, I would use a PriorityQueue of fixed size K, ordered by word count. Each word then costs at most a bounded amount of queue work that depends only on K, so for a fixed K the overall complexity is O(N) in the number of words. Here is code that uses this approach:

In the constructor we read the input stream word by word, filling the counters in the Map.

At the same time we update the priority queue, keeping its maximum size at K (we need the top K words).

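import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.PriorityQueue;
import java.util.Scanner;

public class TopNWordsCounter
{
    // Fixed queue size K; this example keeps the top ten words.
    private static final int K = 10;

    public static class WordCount
    {
        String word;
        int count;

        public WordCount(String word)
        {
            this.word = word;
            this.count = 0; // incremented once per occurrence in the constructor loop
        }
    }

    // Weakest entry first: lowest count, and on equal counts the
    // alphabetically later word, so poll() always evicts the weakest.
    private static final Comparator<WordCount> WEAKEST_FIRST = new Comparator<WordCount>()
    {
        @Override
        public int compare(WordCount o1, WordCount o2)
        {
            int cmp = Integer.compare(o1.count, o2.count);
            return cmp != 0 ? cmp : o2.word.compareTo(o1.word);
        }
    };

    private PriorityQueue<WordCount> pq;
    private Map<String, WordCount> dict;

    public TopNWordsCounter(Scanner scanner)
    {
        pq = new PriorityQueue<>(K, WEAKEST_FIRST);
        dict = new HashMap<>();

        while (scanner.hasNext())
        {
            String word = scanner.next();

            WordCount wc = dict.get(word);
            if (wc == null)
            {
                wc = new WordCount(word);
                dict.put(word, wc);
            }

            // Remove before changing the count, so the queue never holds
            // an element whose ordering key changed while it was queued.
            boolean queued = pq.remove(wc);
            wc.count++;
            if (queued || pq.size() < K || WEAKEST_FIRST.compare(wc, pq.peek()) > 0)
            {
                pq.add(wc);
            }

            if (pq.size() > K)
            {
                pq.poll();
            }
        }
    }

    public List<String> getTopTenWords()
    {
        // Drain a copy (weakest first), then reverse to get max-to-min order.
        PriorityQueue<WordCount> copy = new PriorityQueue<>(pq);
        List<String> topTen = new ArrayList<>();
        while (!copy.isEmpty())
        {
            topTen.add(copy.poll().word);
        }
        Collections.reverse(topTen);
        return topTen;
    }
}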
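
For the situation in the question, where the counts already sit in a HashMap, the same bounded-queue idea can be applied to the finished map rather than to the input stream. The following is a minimal sketch under that assumption; the class name BoundedHeapSketch and the constant WORST_FIRST are hypothetical, and the sample data is taken from Answer 1. The queue keeps the weakest candidate at its head, so once it grows past k the head is evicted; the cost is about O(M log k) for M distinct words.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.PriorityQueue;

// Hypothetical class name, used only for this sketch.
class BoundedHeapSketch {

    // Weakest first: lowest count at the head; on equal counts the
    // alphabetically later word counts as weaker.
    static final Comparator<Map.Entry<String, Integer>> WORST_FIRST =
        new Comparator<Map.Entry<String, Integer>>() {
            @Override
            public int compare(Map.Entry<String, Integer> a, Map.Entry<String, Integer> b) {
                int cmp = Integer.compare(a.getValue(), b.getValue());
                return cmp != 0 ? cmp : b.getKey().compareTo(a.getKey());
            }
        };

    static List<String> mostOften(Map<String, Integer> counts, int k) {
        List<String> result = new ArrayList<>();
        if (k <= 0) {
            return result;
        }
        PriorityQueue<Map.Entry<String, Integer>> heap = new PriorityQueue<>(WORST_FIRST);
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            heap.add(e);
            if (heap.size() > k) {
                heap.poll(); // evict the current weakest candidate
            }
        }
        // The heap drains weakest first, so reverse for max-to-min order.
        while (!heap.isEmpty()) {
            result.add(heap.poll().getKey());
        }
        Collections.reverse(result);
        return result;
    }

    public static void main(String[] args) {
        Map<String, Integer> counts = new HashMap<>();
        counts.put("hello", 5);
        counts.put("halo", 5);
        counts.put("this", 2);
        counts.put("that", 2);
        counts.put("good", 1);
        System.out.println(mostOften(counts, 3)); // prints [halo, hello, that]
    }
}
```

The map must not be modified while its entries are held in the queue. When k is close to the number of distinct words, a plain sort as in Answer 1 is likely just as fast and simpler.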


Source: https://stackoverflow.com/questions/20453629/how-to-get-n-most-often-words-in-given-text-sorted-from-max-to-min
