I have a 40 MB CSV file with 50,000 records. It's a giant product listing. Each row has close to 20 fields (Item#, UPC, Desc, etc.).
How can I:
a) find and print duplicate rows, and
b) find and print rows that duplicate a given key column (or columns)?
Try the following:
# Sort before using the uniq command
sort largefile.csv | uniq -d
uniq is a very basic command and only reports duplicates that are adjacent to each other, which is why the input has to be sorted first.
You could also use the SQLite shell to import your CSV file and create indexes so that SQL queries run faster.
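For example, here is a minimal sketch of that approach using the sqlite3 command-line shell. It assumes your CSV has a header row; the database, table, and column names (products.db, products, upc) are placeholders you would adjust:

# products.db, products and upc are placeholder names; adjust to your file's columns
sqlite3 products.db <<'EOF'
.mode csv
.import largefile.csv products
CREATE INDEX idx_upc ON products(upc);
-- rows whose UPC value appears more than once
SELECT * FROM products
WHERE upc IN (SELECT upc FROM products GROUP BY upc HAVING COUNT(*) > 1);
EOF

When the target table does not already exist, .import creates it and uses the first row of the file as the column names.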
Find and print duplicate rows in Perl:
perl -ne 'print if $SEEN{$_}++' < input-file
Find and print rows with a duplicate column value in Perl -- say the 5th column, where fields are separated by commas:
perl -F/,/ -ane 'print if $SEEN{$F[4]}++' < input-file
Here's my (very simple) script to do it with Ruby and the Rake gem.
First, create a Rakefile and add this code:
namespace :csv do
  desc "find duplicates from CSV file on given column"
  task :double, [:file, :column] do |t, args|
    args.with_defaults(column: 0)
    values = []
    index = args.column.to_i
    # parse given file row by row
    File.foreach(args.file) do |line|
      # get value of the given column
      values << line.split(';')[index]
    end
    # compare length with & without uniq method
    puts values.uniq.length == values.length ? "File does not contain duplicates" : "File contains duplicates"
  end
end
Then, to use it on the first column:
$ rake csv:double["2017.04.07-Export.csv"]
File does not contain duplicates
And to use it on the second column (for example):
$ rake csv:double["2017.04.07-Export.csv",1]
File contains duplicates
For the second part: read the file with Text::CSV into a hash keyed on your unique key(s), and check whether a value already exists in the hash before adding it. Something like this:
data (doesn't need to be sorted); in this example we need the combination of the first two columns to be unique:
1142,X426,Name1,Thing1
1142,X426,Name2,Thing2
1142,X426,Name3,Thing3
1142,X426,Name4,Thing4
1144,X427,Name5,Thing5
1144,X427,Name6,Thing6
1144,X427,Name7,Thing7
1144,X427,Name8,Thing8
code:
use strict;
use warnings;
use Text::CSV;

my %data;
my %dupes;
my @rows;

my $csv = Text::CSV->new()
    or die "Cannot use CSV: " . Text::CSV->error_diag();

open my $fh, "<", "data.csv" or die "data.csv: $!";
while ( my $row = $csv->getline($fh) ) {
    # insert row into row list
    push @rows, $row;
    # join the unique keys with the
    # perl 'multidimensional array emulation'
    # subscript character
    my $key = join( $;, @{$row}[0,1] );
    # if it was just one field, just use
    #   my $key = $row->[$keyfieldindex];
    # if you were checking for full line duplicates (header lines):
    #   my $key = join( $;, @$row );
    # if %data has an entry for the record, add it to dupes
    if ( exists $data{$key} ) {    # duplicate
        # if it isn't already duplicated
        # add this row and the original
        if ( not exists $dupes{$key} ) {
            push @{ $dupes{$key} }, $data{$key};
        }
        # add the duplicate row
        push @{ $dupes{$key} }, $row;
    }
    else {
        $data{$key} = $row;
    }
}
$csv->eof or $csv->error_diag();
close $fh;

# print out duplicates:
warn "Duplicate Values:\n";
warn "-----------------\n";
foreach my $key ( keys %dupes ) {
    my @keys = split( $;, $key );
    warn "Key: @keys\n";
    foreach my $dupe ( @{ $dupes{$key} } ) {
        warn "\tData: @$dupe\n";
    }
}
Which prints out something like this:
Duplicate Values:
-----------------
Key: 1142 X426
    Data: 1142 X426 Name1 Thing1
    Data: 1142 X426 Name2 Thing2
    Data: 1142 X426 Name3 Thing3
    Data: 1142 X426 Name4 Thing4
Key: 1144 X427
    Data: 1144 X427 Name5 Thing5
    Data: 1144 X427 Name6 Thing6
    Data: 1144 X427 Name7 Thing7
    Data: 1144 X427 Name8 Thing8