Why does “uniq” count identical words as different?

南方客 2021-01-05 11:02

I want to calculate the frequency of the words in a file that has one word per line. The file is really big, so this might be the problem (it counts 300k lines in th

4 Answers
  • 2021-01-05 11:46

    Or use "sort -u", which sorts and eliminates duplicates in one step. See here.
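
A minimal sketch of what "sort -u" does, using made-up three-line input (the variable names are only for illustration). Note that it yields the distinct words but no frequencies, so on its own it does not replace the counting step asked about in the question.

```shell
# Hypothetical sample input: one word per line, duplicates not adjacent.
words=$(printf 'a\nb\na\n')

# sort -u sorts the lines and drops duplicates, but reports no counts.
distinct=$(printf '%s\n' "$words" | sort -u)

printf '%s\n' "$distinct"
```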

  • 2021-01-05 11:53

    The size of the file has nothing to do with what you're seeing. From the man page of uniq(1):

    Note: 'uniq' does not detect repeated lines unless they are adjacent. You may want to sort the input first, or use 'sort -u' without 'uniq'. Also, comparisons honor the rules specified by 'LC_COLLATE'.

    So running uniq on

    a
    b
    a
    

    will return the input unchanged, because the two a lines are not adjacent:

    a
    b
    a
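
The adjacency rule can be checked directly. The following is a small sketch on the same three-line input; the variable names are assumptions made for the example.

```shell
# Made-up three-line input with a non-adjacent duplicate.
input=$(printf 'a\nb\na\n')

# Without sorting, the two "a" lines are not adjacent, so uniq keeps both.
unsorted=$(printf '%s\n' "$input" | uniq)

# After sorting, equal lines sit next to each other and collapse into one.
sorted=$(printf '%s\n' "$input" | sort | uniq)

printf 'unsorted:\n%s\nsorted:\n%s\n' "$unsorted" "$sorted"
```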
    
  • 2021-01-05 12:06

    Try sorting first:

    sort .temp_occ | uniq -c | sort -k1,1nr -k2 > distribution.txt
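
The sort-then-count approach can be tried on a small made-up sample standing in for .temp_occ. This is only a sketch: the sample words and variable names are hypothetical, and setting LC_ALL=C is an optional assumption that forces byte-wise comparison, since the uniq man page notes that comparisons follow LC_COLLATE.

```shell
# Hypothetical sample standing in for .temp_occ (one word per line).
sample=$(printf 'pear\napple\npear\nfig\napple\npear\n')

# Sort so equal words are adjacent, count them, then order by count (descending)
# and by word. LC_ALL=C makes every stage compare lines byte by byte.
counts=$(printf '%s\n' "$sample" \
  | LC_ALL=C sort \
  | LC_ALL=C uniq -c \
  | LC_ALL=C sort -k1,1nr -k2)

# uniq -c pads the count with leading spaces; awk normalizes that for display.
normalized=$(printf '%s\n' "$counts" | awk '{print $1, $2}')

printf '%s\n' "$normalized"
```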
    
  • 2021-01-05 12:06

    Is it possible that some of the words have whitespace characters after them? If so, remove them before counting, and sort so that identical words end up adjacent, using something like this:

    tr -d ' \t\r' < .temp_occ | sort | uniq -c | sort -k1,1nr -k2 > distribution.txt
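
Before stripping anything, it may help to check whether trailing whitespace is actually present. The sketch below uses made-up input and, as an alternative to tr, uses sed to strip only trailing whitespace (including carriage returns from Windows line endings) so that characters inside a line are left alone. The variable names are assumptions for the example.

```shell
# Made-up input: "a" appears three times, once with a trailing space
# and once with a trailing carriage return.
input=$(printf 'a \nb\na\r\na\n')

# Count the lines that end in a whitespace character (space, tab, CR, ...).
dirty=$(printf '%s\n' "$input" | grep -c '[[:space:]]$')

# Strip trailing whitespace only, then sort and count as usual.
cleaned=$(printf '%s\n' "$input" \
  | sed 's/[[:space:]]*$//' \
  | sort \
  | uniq -c \
  | awk '{print $1, $2}')

printf 'lines with trailing whitespace: %s\n%s\n' "$dirty" "$cleaned"
```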
    