Script to find duplicates in a csv file

Asked by 旧巷少年郎 on 2021-01-17 17:02

I have a 40 MB CSV file with 50,000 records. It's a giant product listing; each row has close to 20 fields (Item#, UPC, Desc, etc.).

How can I,

a) Find and Pri

5 Answers
  •  别那么骄傲
    2021-01-17 17:43

    Here's my (very simple) script to do it with Ruby and the Rake gem.

    First, create a Rakefile and add this code:

    namespace :csv do
      desc "find duplicates in a CSV file on the given column"
      task :double, [:file, :column] do |t, args|
        args.with_defaults(column: 0)
        values = []
        index  = args.column.to_i
        # read the given file line by line (foreach closes the file for us)
        File.foreach(args.file) do |line|
          # collect the value of the given column (semicolon-separated file)
          values << line.chomp.split(';')[index]
        end
        # compare the length with & without uniq
        puts values.uniq.length == values.length ? "File does not contain duplicates" : "File contains duplicates"
      end
    end
    

    Then, to use it on the first column:

    $ rake csv:double["2017.04.07-Export.csv"] 
    File does not contain duplicates
    

    And to use it on the second column (for example):

    $ rake csv:double["2017.04.07-Export.csv",1] 
    File contains duplicates
    
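    The question also asks to find *and print* the duplicates, not just detect them. A minimal sketch using Ruby's standard CSV library (the `duplicate_values` helper name, the comma separator, and the header row are my assumptions, not part of the original answer):

    ```ruby
    require "csv"
    require "tempfile"

    # Hypothetical helper (name is mine, not from the answer): returns the
    # values that appear more than once in the given column of a CSV file.
    # Assumes a comma-separated file with a header row.
    def duplicate_values(path, column)
      counts = Hash.new(0)
      CSV.foreach(path, headers: true) { |row| counts[row[column]] += 1 }
      counts.select { |_value, count| count > 1 }.keys
    end

    # Small self-contained demo: two rows share the UPC "111".
    sample = Tempfile.new(["products", ".csv"])
    sample.write("Item,UPC,Desc\n1,111,Widget\n2,222,Gadget\n3,111,Widget-copy\n")
    sample.close

    puts duplicate_values(sample.path, "UPC").inspect
    # => ["111"]
    ```

    Counting with a hash keeps this a single pass over the file, which matters less at 50,000 rows but scales better than repeatedly scanning the array.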
