I want to calculate the frequency of the words in a file, where the words are one per line. The file is really big, so this might be the problem (it counts 300k lines in th
Or use "sort -u", which also eliminates duplicates, although it does not count them. See here.
The size of the file has nothing to do with what you're seeing. From the man page of uniq(1):
Note: 'uniq' does not detect repeated lines unless they are adjacent. You may want to sort the input first, or use 'sort -u' without 'uniq'. Also, comparisons honor the rules specified by 'LC_COLLATE'.
So running uniq on
a
b
a
will return:
a
b
a
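To see this for yourself, here is a minimal sketch that feeds those three lines to uniq -c. The awk step is my addition and only strips the padding that uniq -c puts in front of each count; the variable name counts is just for illustration.

```shell
# Two "a" lines that are NOT adjacent, with a "b" line between them.
# awk only normalizes the leading padding that uniq -c adds to each count.
counts=$(printf 'a\nb\na\n' | uniq -c | awk '{print $1, $2}')

# Print one "count word" pair per line.
printf '%s\n' "$counts"
```

Every line comes back with a count of 1, because the two a lines never sit next to each other.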
Try to sort first:
sort .temp_occ | uniq -c | sort -k1,1nr -k2 > distribution.txt
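As a quick check of that pipeline on a small input, here is a sketch that wraps it in a function. The name word_freq and the sample words are made up for illustration; the awk step only normalizes the padding of uniq -c so the result is easy to compare.

```shell
# word_freq is a hypothetical helper name, not part of any tool.
# It reads one word per line on stdin and prints "count word" lines,
# highest count first, ties broken by the word itself.
word_freq() {
    sort | uniq -c | sort -k1,1nr -k2
}

# Sample input (made up): pear x3, apple x2, fig x1, deliberately unsorted.
sample=$(printf 'pear\napple\npear\nfig\napple\npear\n' | word_freq | awk '{print $1, $2}')

printf '%s\n' "$sample"
```

The first sort is what makes identical words adjacent, so uniq -c can count them; the second sort only orders the finished counts.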
Is it possible that some of the words have whitespace characters after them? If so, you should remove them before sorting and counting, using something like this:
tr -d ' ' < .temp_occ | sort | uniq -c | sort -k1,1nr -k2 > distribution.txt
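If the stray whitespace might include tabs or carriage returns (for example from a file saved on Windows), one variation is to strip trailing whitespace with sed instead of tr. This is a sketch under that assumption, not something taken from the question, and the sample input is made up.

```shell
# Sample input (made up): "a" with a trailing space, "b", "a" with a trailing tab.
# sed removes any run of whitespace at the end of each line, so both
# variants of "a" become identical before they are sorted and counted.
cleaned=$(printf 'a \nb\na\t\n' \
    | sed 's/[[:space:]]*$//' \
    | sort \
    | uniq -c \
    | awk '{print $1, $2}')   # awk only normalizes the uniq -c padding

printf '%s\n' "$cleaned"
```

Unlike tr -d ' ', this leaves spaces inside a line alone and only touches the end of each line.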