Why does “uniq” count identical words as different?

I want to calculate the frequency of the words in a file, where the words are listed one per line. The file is really big, so this might be the problem (it counts 300k lines in th

4 answers

遥遥无期

2021-01-05 11:46

Or use `sort -u`, which also eliminates duplicates. Note that, unlike `uniq -c`, it does not report how many times each line occurred.

小蘑菇

2021-01-05 11:53
The size of the file has nothing to do with what you're seeing. From the man page of uniq(1):

Note: 'uniq' does not detect repeated lines unless they are adjacent. You may want to sort the input first, or use 'sort -u' without 'uniq'. Also, comparisons honor the rules specified by 'LC_COLLATE'.

So running uniq on
```
a
b
a
```
will return:
```
a
b
a
```
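To see the adjacency rule in action, here is a small sketch that feeds the same three-line input through `uniq -c` with and without sorting. The variable names (`input`, `unsorted`, `sorted`) are only illustrative and are not part of the original question.

```shell
# Same three-line input as in the example above.
input=$(printf 'a\nb\na\n')

# Without sorting, the two "a" lines are not adjacent,
# so uniq reports three separate groups, each with count 1.
unsorted=$(printf '%s\n' "$input" | uniq -c)

# After sorting, identical lines sit next to each other,
# so uniq merges them: "a" is counted twice, "b" once.
sorted=$(printf '%s\n' "$input" | sort | uniq -c)

printf 'unsorted:\n%s\n' "$unsorted"
printf 'sorted:\n%s\n' "$sorted"
```

The exact padding in front of each count differs between `uniq` implementations, so it is safer to compare the numbers than the raw text.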
生来不讨喜

2021-01-05 12:06
Sort the input first, so that identical words end up on adjacent lines:
```
sort .temp_occ | uniq -c | sort -k1,1nr -k2 > distribution.txt
```
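As a rough illustration of what the final `sort -k1,1nr -k2` does, the sketch below runs the same pipeline on a made-up word list (the sample words and the `counts` variable are assumptions for the example). The first key sorts by count, numerically and in descending order; the second key breaks ties by the word itself.

```shell
# Made-up sample: "c" appears 3 times, "b" and "d" twice each, "a" once.
counts=$(printf 'c\nc\nc\na\nb\nb\nd\nd\n' \
  | sort \
  | uniq -c \
  | sort -k1,1nr -k2)

# Highest count first; "b" comes before "d" because their counts tie.
printf '%s\n' "$counts"
```

Here the words come out in the order c, b, d, a.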
深忆病人

2021-01-05 12:06
Is it possible that some of the words have whitespace characters after them? If so, you should remove them using something like this:
```
# Delete spaces, tabs, and carriage returns, then sort so duplicates are adjacent for uniq
tr -d ' \t\r' < .temp_occ | sort | uniq -c | sort -k1,1nr -k2 > distribution.txt
```
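Before stripping anything, it may help to check whether trailing whitespace is actually present. The following is a minimal sketch on a made-up sample (the `sample`, `suspicious`, and `cleaned` names are assumptions); on the real file you would read from `.temp_occ` instead of the `printf` input.

```shell
# Made-up sample: the same word three times, once with a trailing space
# and once with a Windows-style carriage return.
sample=$(printf 'word\nword \nword\r\n')

# Count the lines that end in a whitespace character (space, tab, or CR).
suspicious=$(printf '%s\n' "$sample" | grep -c '[[:space:]]$')

# Strip trailing whitespace from every line, then count as usual.
cleaned=$(printf '%s\n' "$sample" | sed 's/[[:space:]]*$//' | sort | uniq -c)

printf 'lines with trailing whitespace: %s\n' "$suspicious"
printf '%s\n' "$cleaned"
```

Unlike `tr -d`, the `sed` expression only touches whitespace at the end of a line, which matters if an entry can legitimately contain spaces.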