I have a 40 MB CSV file with 50,000 records. It's a giant product listing. Each row has close to 20 fields (Item#, UPC, Desc, etc.).
How can I:
a) Find and print the duplicate records?
Here's my (very simple) script to do it with Ruby and the Rake gem.
First, create a Rakefile and add this code:
namespace :csv do
  desc "Find duplicates in a CSV file on a given column"
  task :double, [:file, :column] do |t, args|
    args.with_defaults(column: 0)
    values = []
    index = args.column.to_i
    # Read the given file line by line
    File.foreach(args.file) do |line|
      # Collect the value of the given column (semicolon-separated fields)
      values << line.chomp.split(';')[index]
    end
    # If uniq removes anything, the column contains duplicates
    puts values.uniq.length == values.length ? "File does not contain duplicates" : "File contains duplicates"
  end
end
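That task only reports whether duplicates exist. Since the question also asks to print them, here is a minimal sketch of a companion task that lists each duplicated value with its count (the name :print_doubles is my own; Enumerable#tally needs Ruby 2.7+, otherwise group_by works). Rake merges namespaces, so it can sit in the same csv namespace:

namespace :csv do
  desc "Print duplicated values from a CSV file on a given column"
  task :print_doubles, [:file, :column] do |t, args|
    args.with_defaults(column: 0)
    index = args.column.to_i
    values = []
    File.foreach(args.file) do |line|
      values << line.chomp.split(';')[index]
    end
    # tally counts occurrences of each value; keep only those seen more than once
    values.tally.each do |value, count|
      puts "#{value} (#{count} times)" if count > 1
    end
  end
end

Invoked the same way, e.g. rake csv:print_doubles["2017.04.07-Export.csv",1].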
Then, to use it on the first column:
$ rake csv:double["2017.04.07-Export.csv"]
File does not contain duplicates
And to use it on the second column (for example):
$ rake csv:double["2017.04.07-Export.csv",1]
File contains duplicates