remove dups from many csv files


Question


Given n csv files where they add up to 100 GB in size, I need to remove duplicate rows based on the following rules and conditions:

  • The csv files are numbered 1.csv to n.csv, and each file is about 50MB in size.
  • The first column is a string key; two rows are considered duplicates if their first columns are the same.
  • I want to remove dups by keeping the one in a later file (2.csv is considered later than 1.csv)

My algorithm is the following; I want to know if there's a better one.

  • merge all files into one giant file

    cat *.csv > one.csv
    
  • sort the csv

    sort one.csv >one_sorted.csv
    
  • not sure how to eliminate dups at this point. uniq has a -f flag that skips the first N fields, but in my case I want to skip all but the first field, i.e. compare on the first field only.

I need help with the last step (eliminating dups in a sorted file). Also is there a more efficient algorithm?
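For the first-field-only comparison, one option (covered in more detail in the answers below) is to use awk instead of uniq. As a rough sketch, this prints only the first line it sees for each value of the first comma-separated field, so the input must already be ordered so that the copy you want to keep comes first:

awk -F, '!seen[$1]++' one_sorted.csv

Note that a plain sort of the concatenated file does not guarantee that the later-file copy comes first within a run of equal keys, which is why the answers below take different approaches.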


Answer 1:


If you can keep the lines in memory

If enough of the data will fit in memory, the awk solution by steve is pretty neat, whether you write to the sort command by pipe within awk or simply by piping the output of the unadorned awk to sort at the shell level.

If you have 100 GiB of data with perhaps 3% duplication, then you'll need to be able to store close to 100 GiB of data in memory, since only the duplicated lines can be dropped. That's a lot of main memory. A 64-bit system might handle it with virtual memory, but it is likely to run rather slowly.

If the keys fit in memory

If you can't fit enough of the data in memory, then the task ahead is much harder and will require at least two scans over the files. We need to assume, pro tem, that you can at least fit all the distinct keys in memory, along with a count of the number of times each key has appeared.

  1. Scan 1: read the files.
    • Count the number of times each key appears in the input.
    • In awk, use icount[$1]++.
  2. Scan 2: reread the files.
    • Count the number of times each key has appeared; ocount[$1]++.
    • If icount[$1] == ocount[$1], then print the line.

(This assumes you can store the keys and counts twice; the alternative is to use icount (only) in both scans, incrementing in Scan 1 and decrementing in Scan 2, and printing the line when the count decrements to zero.)
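A minimal awk sketch of this two-hash variant, written in the same explicit-getline style as the awk solution given later in this answer (the file list is passed on the command line so the input can be scanned twice); treat it as an illustration rather than something tested at the 100 GiB scale:

awk -F, '
BEGIN   {
            # Scan 1: count how many times each key appears in the input.
            for (i = 1; i < ARGC; i++)
            {
                while ((getline < ARGV[i]) > 0)
                    icount[$1]++;
                close(ARGV[i]);
            }
            # Scan 2: reread the files; the last occurrence of a key is
            # the one where the running count catches up with the total.
            for (i = 1; i < ARGC; i++)
            {
                while ((getline < ARGV[i]) > 0)
                {
                    ocount[$1]++;
                    if (icount[$1] == ocount[$1]) print;
                }
                close(ARGV[i]);
            }
        }' $(ls -v *.csv)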

I'd probably use Perl for this rather than awk, if only because it will be easier to reread the files in Perl than in awk.


Not even the keys fit?

What about if you can't even fit the keys and their counts into memory? Then you are facing some serious problems, not least because scripting languages may not report the out of memory condition to you as cleanly as you'd like. I'm not going to attempt to cross this bridge until it's shown to be necessary. And if it is necessary, we'll need some statistical data on the file sets to know what might be possible:

  • Average length of a record.
  • Number of distinct keys.
  • Number of distinct keys with N occurrences for each of N = 1, 2, ... max.
  • Length of a key.
  • Number of keys plus counts that can be fitted into memory.

And probably some others...so, as I said, let's not try crossing that bridge until it is shown to be necessary.


Perl solution

Example data

$ cat x000.csv
abc,123,def
abd,124,deg
abe,125,deh
$ cat x001.csv
abc,223,xef
bbd,224,xeg
bbe,225,xeh
$ cat x002.csv
cbc,323,zef
cbd,324,zeg
bbe,325,zeh
$ perl fixdupcsv.pl x???.csv
abd,124,deg
abe,125,deh
abc,223,xef
bbd,224,xeg
cbc,323,zef
cbd,324,zeg
bbe,325,zeh
$ 

Note the absence of gigabyte-scale testing!

fixdupcsv.pl

This uses the 'count up, count down' technique.

#!/usr/bin/env perl
#
# Eliminate duplicate records from 100 GiB of CSV files based on key in column 1.

use strict;
use warnings;

# Scan 1 - count occurrences of each key

my %count;
my @ARGS = @ARGV;   # Preserve arguments for Scan 2

while (<>)
{
    $_ =~ /^([^,]+)/;
    $count{$1}++;
}

# Scan 2 - reread the files; count down occurrences of each key.
# Print when it reaches 0.

@ARGV = @ARGS;      # Reset arguments for Scan 2

while (<>)
{
    $_ =~ /^([^,]+)/;
    $count{$1}--;
    print if $count{$1} == 0;
}

The 'while (<>)' notation destroys @ARGV (hence the copy to @ARGS before doing anything else), but that also means that if you reset @ARGV to the original value, it will run through the files a second time. Tested with Perl 5.16.0 and 5.10.0 on Mac OS X 10.7.5.

This is Perl; TMTOWTDI. You could use:

#!/usr/bin/env perl
#
# Eliminate duplicate records from 100 GiB of CSV files based on key in column 1.

use strict;
use warnings;

my %count;

sub counter
{
    my($inc) = @_;
    while (<>)
    {
        $_ =~ /^([^,]+)/;
        $count{$1} += $inc;
        print if $count{$1} == 0;
    }
}

my @ARGS = @ARGV;   # Preserve arguments for Scan 2
counter(+1);
@ARGV = @ARGS;      # Reset arguments for Scan 2
counter(-1);

There are probably ways to compress the body of the loop, too, but I find what's there reasonably clear and prefer clarity over extreme terseness.

Invocation

You need to present the fixdupcsv.pl script with the file names in the correct order. Since you have files numbered from 1.csv through about 2000.csv, it is important not to list them in alphanumeric order. The other answers suggest ls -v *.csv using the GNU ls extension option. If it is available, that's the best choice.

perl fixdupcsv.pl $(ls -v *.csv)

If that isn't available, then you need to do a numeric sort on the names:

perl fixdupcsv.pl $(ls *.csv | sort -t. -k1.1n)

Awk solution

awk -F, '
BEGIN   {
            for (i = 1; i < ARGC; i++)
            {
                while ((getline < ARGV[i]) > 0)
                    count[$1]++;
                close(ARGV[i]);
            }
            for (i = 1; i < ARGC; i++)
            {
                while ((getline < ARGV[i]) > 0)
                {
                    count[$1]--;
                    if (count[$1] == 0) print;
                }
                close(ARGV[i]);
            }
        }' $(ls -v *.csv)

This ignores awk's innate 'read' loop and does all reading explicitly (you could replace BEGIN by END and would get the same result). The logic is closely based on the Perl logic in many ways. Tested on Mac OS X 10.7.5 with both BSD awk and GNU awk. Interestingly, GNU awk insisted on the parentheses in the calls to close where BSD awk did not. The close() calls are necessary in the first loop to make the second loop work at all. The close() calls in the second loop are there to preserve symmetry and for tidiness — but they might also be relevant when you get around to processing a few hundred files in a single run.




Answer 2:


Here's one way using GNU awk:

awk -F, '{ array[$1]=$0 } END { for (i in array) print array[i] }' $(ls -v *.csv)

Explanation: Reading a numerically sorted glob of files, we use the first column of each line as the key of an associative array whose value is the whole line. In this way, the duplicate that's kept is the one that occurs in the latest file. Once complete, we loop through the keys of the array and print out the values. GNU awk does provide sorting abilities through the asort() and asorti() functions, but piping the output to sort makes things much easier to read, and is probably quicker and more efficient.

You could do this if you require numerical sorting on the first column:

awk -F, '{ array[$1]=$0 } END { for (i in array) print array[i] | "sort -nk 1" }' $(ls -v *.csv)



Answer 3:


My answer is based on steve's (Answer 2):

awk -F, '!count[$1]++' $(ls -rv *.csv)

{print $0} is implied in the awk statement.

Essentially, awk prints only the first line it sees for each distinct value of $1. Since the .csv files are listed in reverse natural order, this means that for all the lines that have the same value for $1, only the one in the latest file is printed.

Note: this will not work if you have duplicates in the same file (i.e., multiple instances of the same key within the same file).




Answer 4:


Regarding your sorting plan, it might be more practical to sort the individual files and then merge them, rather than concatenating and then sorting. The complexity of sorting with the sort program is likely to be O(n log(n)). If you have, say, 200000 lines per 50MB file and 2000 files, n will be about 400 million, and n log(n) ~ 10^10. If instead you treat F files of R records each separately, the cost of sorting is O(F*R*log(R)) and the cost of merging the sorted files is O(F*R*log(F)). These costs are high enough that separate sorting is not necessarily faster, but the process can be broken into convenient chunks, so it can be more easily checked as things go along.

Here is a small-scale example, which supposes that comma can be used as a delimiter for the sort key. (A quote-delimited key field that contains commas would be a problem for the sort as shown.) Note that -s tells sort to do a stable sort, leaving lines with the same sort key in the order they were encountered.

for i in $(seq 1 8); do sort -t, -sk1,1 $i.csv > $i.tmp; done
sort -mt, -sk1,1 [1-8].tmp > 1-8.tmp

or, if you are more cautious, you might save some intermediate results:

sort -mt, -sk1,1 [1-4].tmp > 1-4.tmp
sort -mt, -sk1,1 [5-8].tmp > 5-8.tmp
cp 1-4.tmp 5-8.tmp /backup/storage
sort -mt, -sk1,1 1-4.tmp 5-8.tmp > 1-8.tmp

Also, an advantage of doing separate sorts followed by a merge or merges is the ease of splitting the workload across multiple processors or systems.

After you sort and merge all the files (into, say, file X), it is fairly simple to write an awk program that at BEGIN reads a line from X and puts it in variable L. Thereafter, each time it reads a line from X, if the first field of $0 doesn't match the first field of L, it writes out L and sets L to $0; if it does match, it just sets L to $0. At END, it writes out L.
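A minimal awk sketch of that program, assuming the merged and stably sorted file is called X.csv (the name is illustrative); it uses an NR == 1 rule instead of an explicit getline in BEGIN, which has the same effect:

awk -F, '
NR == 1   { L = $0; key = $1; next }    # remember the first line
$1 != key { print L; key = $1 }         # key changed: emit the saved line
          { L = $0 }                    # always keep the most recent line for this key
END       { if (NR > 0) print L }       # emit the final saved line
' X.csv

Because the merge was stable and the files were supplied in numeric order, the last line of each run of equal keys is the one from the latest file, which is exactly the copy this prints.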



Source: https://stackoverflow.com/questions/12888748/remove-dups-from-many-csv-files
