Script to find duplicates in a csv file

旧巷少年郎 2021-01-17 17:02

I have a 40 MB csv file with 50,000 records. It's a giant product listing. Each row has close to 20 fields [Item#, UPC, Desc, etc.].

How can I,

a) Find and print duplicate rows, and

b) Find and print rows that are duplicates on one or more key columns?

5 Answers
  • 2021-01-17 17:29

    Try the following:

    # Sort first so repeated lines end up next to each other
    sort largefile.csv | uniq -d
    

    uniq is a very basic command and only reports duplicates that appear on adjacent lines, which is why the input must be sorted first.
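
    If the duplicates are defined by one key column rather than the whole line, the same idea works on that column alone. A minimal sketch, assuming the key sits in the comma-separated 2nd field (the field number and delimiter are assumptions, and a plain cut will mis-split quoted fields that contain commas):

    # extract the key column, sort it, and report the values that repeat
    cut -d, -f2 largefile.csv | sort | uniq -d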

  • 2021-01-17 17:38

    You could possibly use the SQLite shell to import your csv file and create indexes so that the SQL queries run faster.
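
    A minimal sketch of that approach, assuming the csv's header row names a UPC column (the database, table, and index names here are made up for the example):

    sqlite3 products.db <<'SQL'
    .mode csv
    .import largefile.csv products
    -- an index on the key column speeds up the grouping
    CREATE INDEX idx_upc ON products(UPC);
    -- report every key value that appears more than once
    SELECT UPC, COUNT(*) FROM products GROUP BY UPC HAVING COUNT(*) > 1;
    SQL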

  • 2021-01-17 17:39

    Find and print duplicate rows in Perl:

    # print every line that has already been seen once
    perl -ne 'print if $SEEN{$_}++' < input-file
    

    Find and print rows with duplicate columns in Perl -- let's say the 5th column, where fields are separated by commas:

    # -a autosplits each line into @F on the -F separator; print when the 5th field repeats
    perl -F/,/ -ane 'print if $SEEN{$F[4]}++' < input-file
    
  • 2021-01-17 17:43

    Here is my (very simple) script to do it with Ruby & the Rake gem.

    First create a Rakefile and write this code:

    namespace :csv do
      desc "find duplicates from CSV file on given column"
      task :double, [:file, :column] do |t, args|
        args.with_defaults(column: 0)
        values = []
        index  = args.column.to_i
        # read the given file row by row (fields are ';'-separated here)
        File.foreach(args.file) do |line|
          # collect the value of the given column
          values << line.split(';')[index]
        end
        # compare the length with and without the uniq method
        puts values.uniq.length == values.length ? "File does not contain duplicates" : "File contains duplicates"
      end
    end
    

    Then, to use it on the first column:

    $ rake csv:double["2017.04.07-Export.csv"] 
    File does not contain duplicates
    

    And to use it on the second column (for example):

    $ rake csv:double["2017.04.07-Export.csv",1] 
    File contains duplicates
    
  • 2021-01-17 17:47

    For the second part: read the file with Text::CSV into a hash keyed on your unique key(s), and check whether the key already exists in the hash before adding each row. Something like this:

    Data (it doesn't need to be sorted); in this example we need the first two columns to be unique:

    1142,X426,Name1,Thing1
    1142,X426,Name2,Thing2
    1142,X426,Name3,Thing3
    1142,X426,Name4,Thing4
    1144,X427,Name5,Thing5
    1144,X427,Name6,Thing6
    1144,X427,Name7,Thing7
    1144,X427,Name8,Thing8
    

    code:

    use strict;
    use warnings;
    use Text::CSV;
    
    my %data;
    my %dupes;
    my @rows;
    my $csv = Text::CSV->new ()
                            or die "Cannot use CSV: ".Text::CSV->error_diag ();
    
    open my $fh, "<", "data.csv" or die "data.csv: $!";
    while ( my $row = $csv->getline( $fh ) ) {
        # insert row into row list  
        push @rows, $row;
        # join the unique keys with the
        # perl 'multidimensional array emulation' 
        # subscript  character
        my $key = join( $;, @{$row}[0,1] ); 
        # if it was just one field, just use
        # my $key = $row->[$keyfieldindex];
        # if you were checking for full line duplicates (header lines):
        # my $key = join($;, @$row);
        # if %data has an entry for the record, add it to dupes
        if (exists $data{$key}) { # duplicate 
            # if it isn't already duplicated
            # add this row and the original 
            if (not exists $dupes{$key}) {
                push @{$dupes{$key}}, $data{$key};
            }
            # add the duplicate row
            push @{$dupes{$key}}, $row;
        } else {
            $data{ $key } = $row;
        }
    }
    
    $csv->eof or $csv->error_diag();
    close $fh;
    # print out duplicates:
    warn "Duplicate Values:\n";
    warn "-----------------\n";
    foreach my $key (keys %dupes) {
        my @keys = split($;, $key);
        warn "Key: @keys\n";
        foreach my $dupe (@{$dupes{$key}}) {
            warn "\tData: @$dupe\n";
        }
    }
    

    Which prints out something like this:

    Duplicate Values:
    -----------------
    Key: 1142 X426
        Data: 1142 X426 Name1 Thing1
        Data: 1142 X426 Name2 Thing2
        Data: 1142 X426 Name3 Thing3
        Data: 1142 X426 Name4 Thing4
    Key: 1144 X427
        Data: 1144 X427 Name5 Thing5
        Data: 1144 X427 Name6 Thing6
        Data: 1144 X427 Name7 Thing7
        Data: 1144 X427 Name8 Thing8
    