Chinese Word Frequency Counting in Java

Anonymous (unverified), submitted 2019-12-02 20:41:15

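Yesterday I needed to count word frequencies in Chinese text. After searching Baidu for a while, I found plenty of clickbait articles whose content did not match their titles, so here is a short record of how I implemented it myself.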

ansj_seg

First, add the dependency, either by downloading the jar or through Maven:

        <dependency>
            <groupId>org.ansj</groupId>
            <artifactId>ansj_seg</artifactId>
            <version>5.1.1</version>
        </dependency>

Basic usage:
    import org.ansj.splitWord.analysis.ToAnalysis;

    String str = "欢迎使用ansj_seg,(ansj中文分词)在这里如果你遇到什么问题都可以联系我.我一定尽我所能.帮助大家.ansj_seg更快,更准,更自由!";
    System.out.println(ToAnalysis.parse(str));

Output:

    欢迎/v,使用/v,ansj/en,_,seg/en,,,(,ansj/en,中文/nz,分词/n,),在/p,这里/r,如果/c,你/r,遇到/v,什么/r,问题/n,都/d,可以/v,联系/v,我/r,./m,我/r,一定/d,尽我所能/l,./m,帮助/v,大家/r,./m,ansj/en,_,seg/en,更快/d,,,更/d,准/a,,,更/d,自由/a,!
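The printed result is a comma-separated list of word/tag pairs. A comma that occurs in the text is itself a token, so it shows up as an empty field between two separators, which is why empty strings need to be filtered out after splitting. Here is a rough sketch of taking such a string apart with only the standard library. The class TaggedOutputDemo and the helper parseTagged are made-up names for illustration and are not part of ansj_seg:

```java
import java.util.ArrayList;
import java.util.List;

public class TaggedOutputDemo {

    /** Splits "word/tag,word/tag,..." into words, dropping tags and empty fields. */
    static List<String> parseTagged(String tagged) {
        List<String> words = new ArrayList<>();
        for (String field : tagged.split(",")) {
            if (field.isEmpty()) {
                continue; // a literal comma token leaves an empty field behind
            }
            int slash = field.lastIndexOf('/');
            // Tokens without a tag (for example "_" or "(") contain no slash.
            words.add(slash > 0 ? field.substring(0, slash) : field);
        }
        return words;
    }

    public static void main(String[] args) {
        // A prefix of the output shown above.
        String tagged = "欢迎/v,使用/v,ansj/en,_,seg/en,,,(,ansj/en,中文/nz,分词/n,)";
        System.out.println(parseTagged(tagged));
    }
}
```

In real code it is simpler to let the library drop the tags, as the full listing does with toStringWithOutNature().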

Here is the full code:

    import java.io.BufferedReader;
    import java.io.FileInputStream;
    import java.io.IOException;
    import java.io.InputStreamReader;
    import java.nio.charset.StandardCharsets;
    import java.util.ArrayList;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    import org.ansj.splitWord.analysis.ToAnalysis;

    public class WordFrequency {

        public static void wordFrequency() throws IOException {
            Map<String, Integer> map = new HashMap<>();

            String article = getString();
            String result = ToAnalysis.parse(article).toStringWithOutNature();
            String[] words = result.split(",");

            for (String word : words) {
                String str = word.trim();
                // Skip empty tokens.
                if (str.equals(""))
                    continue;
                // Skip some high-frequency symbols. Inside a character class
                // '|' is a literal, so this matches exactly one such character.
                else if (str.matches("[)|(|.|,|。|+|-|“|”|:|?|\\s]"))
                    continue;
                // Skip tokens of length 1.
                else if (str.length() < 2)
                    continue;

                // Count the trimmed token (the original used the untrimmed
                // word as the key, so " x" and "x" were counted separately).
                if (!map.containsKey(str)) {
                    map.put(str, 1);
                } else {
                    map.put(str, map.get(str) + 1);
                }
            }

            // Print every word with its count, in map order.
            for (Map.Entry<String, Integer> entry : map.entrySet()) {
                System.out.println(entry.getKey() + ": " + entry.getValue());
            }

            // Build a list sorted by descending count by repeatedly
            // extracting the maximum. Note that this empties the map.
            List<Map.Entry<String, Integer>> list = new ArrayList<>();
            Map.Entry<String, Integer> entry;
            while ((entry = getMax(map)) != null) {
                list.add(entry);
            }

            System.out.println(list);
        }

        /**
         * Finds the entry with the largest value, removes it from the map
         * and returns it.
         *
         * @param map word counts; modified by this call
         * @return the entry with the largest value, or null if the map is empty
         */
        public static Map.Entry<String, Integer> getMax(Map<String, Integer> map) {
            if (map.isEmpty()) {
                return null;
            }
            Map.Entry<String, Integer> maxEntry = null;
            for (Map.Entry<String, Integer> entry : map.entrySet()) {
                if (maxEntry == null || entry.getValue() > maxEntry.getValue()) {
                    maxEntry = entry;
                }
            }
            map.remove(maxEntry.getKey());
            return maxEntry;
        }

        /**
         * Reads the article to be segmented from a file.
         * The file content comes from a popular Jianshu article:
         * https://www.jianshu.com/p/5b37403f6ba6
         *
         * @return the whole file as one string, with line breaks removed
         * @throws IOException if the file cannot be read
         */
        public static String getString() throws IOException {
            String path = "/home/as_/IdeaProjects/SpringMaven/article-txt";
            StringBuilder strBuilder = new StringBuilder();
            // Assumption: the file is saved as UTF-8. The original relied on
            // the platform default charset.
            try (BufferedReader reader = new BufferedReader(
                    new InputStreamReader(new FileInputStream(path), StandardCharsets.UTF_8))) {
                String line;
                while ((line = reader.readLine()) != null) {
                    strBuilder.append(line);
                }
            }
            return strBuilder.toString();
        }
    }
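Calling getMax repeatedly rescans the whole map once per entry and empties the map as a side effect. As an alternative sketch (not the approach used in the listing), the entries can be copied into a list and sorted once by descending count, and Map.merge shortens the counting step. The class and method names are illustrative, and the token array stands in for the segmenter output so that the example runs without ansj_seg:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class FrequencySortDemo {

    /** Counts trimmed tokens of length >= 2, mirroring the length filter in the listing. */
    static Map<String, Integer> count(String[] tokens) {
        Map<String, Integer> counts = new HashMap<>();
        for (String token : tokens) {
            String word = token.trim();
            if (word.length() < 2) {
                continue;
            }
            // Insert 1 for a new word, otherwise add 1 to the existing count.
            counts.merge(word, 1, Integer::sum);
        }
        return counts;
    }

    /** Returns the entries sorted by count, highest first, ties broken by word. */
    static List<Map.Entry<String, Integer>> sortByCount(Map<String, Integer> counts) {
        // Copy the entries so the map itself is left untouched.
        List<Map.Entry<String, Integer>> entries = new ArrayList<>(counts.entrySet());
        entries.sort(Map.Entry.<String, Integer>comparingByValue().reversed()
                .thenComparing(Map.Entry.<String, Integer>comparingByKey()));
        return entries;
    }

    public static void main(String[] args) {
        // Hypothetical tokens standing in for the segmenter output.
        String[] tokens = {"分词", "中文", "分词", "的", " 分词 ", "中文", "统计"};
        System.out.println(sortByCount(count(tokens)));
        // prints [分词=3, 中文=2, 统计=1]
    }
}
```

Sorting once costs O(n log n) instead of the O(n²) of repeated maximum extraction, and the counts remain available in the map afterwards.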

Finally, as usual, a screenshot of the result:
