Question
I can remove duplicate entries from small text files, but not large text files.
I have a file that's 4MB.
The beginning of the file looks like this:
aa
aah
aahed
aahed
aahing
aahing
aahs
aahs
aal
aalii
aalii
aaliis
aaliis
...
I want to remove the duplicates.
For example, "aahed" shows up twice, and I would only like it to show up once.
No matter what one-liner I've tried, the big list will not change.
If I type:
sort big_list.txt | uniq | less
I see:
aa
aah
aahed
aahed <-- didn't get rid of it
aahing
aahing <-- didn't get rid of it
aahs
aahs <-- didn't get rid of it
aal
...
However, if I copy a small chunk of words from the top of this text file and re-run the command on that small chunk, it does what's expected.
Are these programs refusing to sort because the file is too big? I didn't think 4MB was very big. It doesn't output a warning or anything.
I quickly wrote my own "uniq" program, and it has the same behavior. It works on a small subset of the list, but doesn't do anything to the 4MB text file. What's my issue?
EDIT: Here is a hex dump:
00000000 61 61 0a 61 61 68 0a 61 61 68 65 64 0a 61 61 68 |aa.aah.aahed.aah|
00000010 65 64 0d 0a 61 61 68 69 6e 67 0a 61 61 68 69 6e |ed..aahing.aahin|
00000020 67 0d 0a 61 61 68 73 0a 61 61 68 73 0d 0a 61 61 |g..aahs.aahs..aa|
00000030 6c 0a 61 61 6c 69 69 0a 61 61 6c 69 69 0d 0a 61 |l.aalii.aalii..a|
00000040 61 6c 69 69 73 0a 61 61 6c 69 69 73 0d 0a 61 61 |aliis.aaliis..aa|
61 61 68 65 64 0a
a  a  h  e  d  \n
61 61 68 65 64 0d
a  a  h  e  d  \r
Solved: Different line delimiters
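For anyone who hits the same symptom, here is a rough way to confirm it without reading a hex dump. This is only a sketch: the sample file is made up to mimic the dump above, and the variable names are mine.

```shell
# Build a small sample that mixes LF and CRLF endings, like the hex dump.
sample=$(mktemp)
printf 'aahed\naahed\r\naahing\naahing\r\n' > "$sample"

# Hold a literal carriage return (0x0d) in a variable so grep can search for it.
cr=$(printf '\r')

# Count the lines that contain a carriage return.
crlf_lines=$(grep -c "$cr" "$sample")
echo "lines containing CR: $crlf_lines"

# sort | uniq compares whole lines, so "aahed" and "aahed<CR>" stay distinct.
distinct=$(sort "$sample" | uniq | wc -l | tr -d ' ')
echo "distinct lines before normalizing: $distinct"

rm -f "$sample"
```

A nonzero CR count on a file that is supposed to be plain LF text is a strong hint that some "duplicates" differ only by an invisible trailing byte.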
Answer 1:
You can normalize line delimiters (convert CR+LF to LF):
sed 's/\r//' big_list.txt | sort -u
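One caveat: the \r escape inside a sed expression is not guaranteed by POSIX, so it may not behave the same in every sed. A variant with tr, which does define \r, might look like the sketch below. The function name dedupe_lines and the sample data are only for illustration.

```shell
# Hypothetical helper: strip every carriage return, then sort and deduplicate.
dedupe_lines() {
    tr -d '\r' < "$1" | sort -u
}

# Sample input mixing LF and CRLF endings, modeled on the question's file.
sample=$(mktemp)
printf 'aa\naah\naahed\naahed\r\naahing\naahing\r\n' > "$sample"

result=$(dedupe_lines "$sample")
printf '%s\n' "$result"

rm -f "$sample"
```

Note that tr -d '\r' removes every carriage return, not only the ones at the end of a line. For a word list like this one that should make no difference.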
Answer 2:
The sort(1) command accepts a -u option for uniqueness of key.
Just use
sort -u big_list.txt
Answer 3:
To answer max taldykin's question about awk '!_[$0]++' file:
awk '!_[$0]++' file
is the same as
awk '!seen[$0]++' file
which is the same as
awk '!seen[$0]++ { print; }' file
which means
awk '
{
    if (!seen[$0]) {
        print;
    }
    seen[$0]++;
}' file
Important points here:
- $0 means the current record, which is usually the current line.
- In awk, the ACTION part is optional, and the default action is { print; }.
- In an arithmetic context, an uninitialized variable is 0.
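Applied to the file in the question, the same idiom would still see "aahed" and "aahed" followed by a carriage return as different keys. A possible adaptation, offered as a sketch with made-up sample data, strips a trailing CR from each record before the seen[] lookup. Unlike sort -u, it keeps the lines in their original order.

```shell
# Unsorted sample with mixed LF and CRLF endings (made up for illustration).
sample=$(mktemp)
printf 'b\r\na\nb\na\r\nc\n' > "$sample"

# sub() rewrites $0 in place, removing one trailing carriage return if present.
# The pattern !seen[$0]++ then prints a line only the first time it appears.
result=$(awk '{ sub(/\r$/, "") } !seen[$0]++' "$sample")
printf '%s\n' "$result"

rm -f "$sample"
```

The trade-off is memory: awk keeps every distinct line in the seen array, which is unlikely to matter for a 4MB word list.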
Answer 4:
Apart from sort -u, you can also use awk '!_[$0]++' yourfile
Source: https://stackoverflow.com/questions/15493774/sort-filename-uniq-does-not-work-on-large-files