Remove duplicate records based on multiple columns?

灰色年华 2020-12-04 07:46

I'm using Heroku to host my Ruby on Rails application and for one reason or another, I may have some duplicate rows.

Is there a way to delete duplicate records based on multiple columns?

7 answers
  • 2020-12-04 07:59

    Based on @aditya-sanghi's answer, with a more efficient way to find duplicates using SQL.

    Add this to your ApplicationRecord to be able to deduplicate any model:

    class ApplicationRecord < ActiveRecord::Base
      # …
    
      def self.destroy_duplicates_by(*columns)
        groups = select(columns).group(columns).having(Arel.star.count.gt(1))
        groups.each do |duplicates|
          records = where(duplicates.attributes.symbolize_keys.slice(*columns))
          records.offset(1).destroy_all
        end
      end
    end
    

    You can then call destroy_duplicates_by to destroy all records (except the first) that have the same values for the given columns. For example:

    Model.destroy_duplicates_by(:name, :year, :trim, :make_id)
    
  • 2020-12-04 08:00

    To run it in a migration, I ended up doing the following (based on the answer by @aditya-sanghi):

    class AddUniqueIndexToXYZ < ActiveRecord::Migration
      def change
        # delete duplicates
        dedupe(XYZ, 'name', 'type')
    
        add_index :xyz, [:name, :type], unique: true
      end
    
      def dedupe(model, *key_attrs)
        model.select(key_attrs).group(key_attrs).having('count(*) > 1').each { |duplicates|
          # splat key_attrs: Hash#slice takes keys as separate arguments
          dup_rows = model.where(duplicates.attributes.slice(*key_attrs)).to_a
          # the first one we want to keep right?
          dup_rows.shift
    
          dup_rows.each{ |double| double.destroy } # duplicates can now be destroyed
        }
      end
    end
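
A detail worth checking in a helper like `dedupe` above: `Hash#slice` takes its keys as separate arguments, so an array of key names has to be splatted. The plain-Ruby sketch below uses a made-up attributes hash (not a real ActiveRecord object) to show the difference. Passing the array itself matches nothing and yields an empty hash, and `where` with an empty hash adds no condition at all.

```ruby
# Hypothetical attributes hash, shaped like the result of `record.attributes`
# (string keys). The values are invented for illustration.
attributes = { "id" => nil, "name" => "widget", "type" => "basic" }
key_attrs  = ["name", "type"]

# Passing the array itself looks for a single key equal to ["name", "type"].
without_splat = attributes.slice(key_attrs)

# Splatting passes "name" and "type" as two separate keys.
with_splat = attributes.slice(*key_attrs)

puts without_splat.inspect
puts with_splat.inspect
```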
    
  • 2020-12-04 08:05

    If your User table contains data like the following:

    User.all =>
    [
        #<User id: 15, name: "a", email: "a@gmail.com", created_at: "2013-08-06 08:57:09", updated_at: "2013-08-06 08:57:09">, 
        #<User id: 16, name: "a1", email: "a@gmail.com", created_at: "2013-08-06 08:57:20", updated_at: "2013-08-06 08:57:20">, 
        #<User id: 17, name: "b", email: "b@gmail.com", created_at: "2013-08-06 08:57:28", updated_at: "2013-08-06 08:57:28">, 
        #<User id: 18, name: "b1", email: "b1@gmail.com", created_at: "2013-08-06 08:57:35", updated_at: "2013-08-06 08:57:35">, 
        #<User id: 19, name: "b11", email: "b1@gmail.com", created_at: "2013-08-06 09:01:30", updated_at: "2013-08-06 09:01:30">, 
        #<User id: 20, name: "b11", email: "b1@gmail.com", created_at: "2013-08-06 09:07:58", updated_at: "2013-08-06 09:07:58">] 
    

    Some records are duplicated (ids 19 and 20 share the same name and email), so the aim is to remove the duplicates from the users table.

    Step 1:

    Get the id of one record for each distinct (email, name) combination.

    ids = User.select("MIN(id) as id").group(:email,:name).collect(&:id)
    => [15, 16, 18, 19, 17]
    

    Step 2:

    Remove every record from the users table whose id is not in that list of distinct records.

    Now the ids array holds the following ids.

    [15, 16, 18, 19, 17]
    User.where("id NOT IN (?)",ids)  # To get all duplicate records
    User.where("id NOT IN (?)",ids).destroy_all
    

    ** RAILS 4 **

    ActiveRecord 4 introduces the .not method which allows you to write the following in Step 2:

    User.where.not(id: ids).destroy_all
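
To make the two steps concrete without a database, here is a plain-Ruby sketch of the same logic on the sample rows above. The `Row` struct and the `rows` array are stand-ins invented for this illustration; in the real application the grouping and `MIN(id)` run in SQL.

```ruby
# Stand-in for the users table; only the columns used for grouping are kept.
Row = Struct.new(:id, :name, :email)

rows = [
  Row.new(15, "a",   "a@gmail.com"),
  Row.new(16, "a1",  "a@gmail.com"),
  Row.new(17, "b",   "b@gmail.com"),
  Row.new(18, "b1",  "b1@gmail.com"),
  Row.new(19, "b11", "b1@gmail.com"),
  Row.new(20, "b11", "b1@gmail.com")
]

# Step 1: one id per (email, name) group, like SELECT MIN(id) ... GROUP BY email, name.
keep_ids = rows.group_by { |r| [r.email, r.name] }
               .map { |_key, group| group.map(&:id).min }

# Step 2: everything not in keep_ids is a duplicate, like WHERE id NOT IN (...).
duplicate_ids = rows.map(&:id) - keep_ids

puts "keep: #{keep_ids.sort.inspect}"
puts "destroy: #{duplicate_ids.inspect}"
```

On the sample data only id 20 is removed, because ids 19 and 20 are the only rows that share both email and name. Grouping by email alone would be stricter.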
    
  • 2020-12-04 08:07

    Similar to @Aditya Sanghi 's answer, but this way will be more performant because you are only selecting the duplicates, rather than loading every Model object into memory and then iterating over all of them.

    # returns only duplicates in the form of [[name1, year1, trim1], [name2, year2, trim2],...]
    duplicate_row_values = Model.select('name, year, trim, count(*)').group('name, year, trim').having('count(*) > 1').pluck(:name, :year, :trim)
    
    # load the duplicates, order them however you want, and then destroy all but one
    duplicate_row_values.each do |name, year, trim|
      Model.where(name: name, year: year, trim: trim).order(id: :desc)[1..-1].map(&:destroy)
    end
    

    Also, if you truly don't want duplicate data in this table, you probably want to add a multi-column unique index to the table, something along the lines of:

    add_index :models, [:name, :year, :trim], unique: true, name: 'index_unique_models' 
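
The `GROUP BY ... HAVING count(*) > 1` step can be pictured in plain Ruby as counting each key tuple and keeping only those seen more than once. The rows below are invented for illustration, and a real table would do this counting in the database rather than in memory.

```ruby
# Invented sample rows: [id, name, year, trim]
rows = [
  [1, "Civic",  2010, "LX"],
  [2, "Civic",  2010, "LX"],
  [3, "Civic",  2011, "EX"],
  [4, "Accord", 2012, "EX"],
  [5, "Accord", 2012, "EX"],
  [6, "Accord", 2012, "EX"]
]

# Count occurrences of each (name, year, trim) tuple.
counts = Hash.new(0)
rows.each { |_id, name, year, trim| counts[[name, year, trim]] += 1 }

# Equivalent of HAVING count(*) > 1: keep only tuples that repeat.
duplicate_row_values = counts.select { |_key, n| n > 1 }.keys

# For each repeated tuple, order ids descending and skip the first,
# so the highest id is kept and the rest are marked for removal.
ids_to_destroy = duplicate_row_values.flat_map do |key|
  matching = rows.select { |_id, *rest| rest == key }
  matching.map(&:first).sort.reverse.drop(1)
end

puts duplicate_row_values.inspect
puts ids_to_destroy.sort.inspect
```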
    
  • 2020-12-04 08:09

    You could try the following: (based on previous answers)

    ids = Model.group('name, year, trim').pluck('MIN(id)')
    

    to get all valid records. And then:

    Model.where.not(id: ids).destroy_all
    

    to remove the unneeded records. And certainly, you can make a migration that adds a unique index for the three columns so this is enforced at the DB level:

    add_index :models, [:name, :year, :trim], unique: true
    
  • 2020-12-04 08:14
    class Model
    
      def self.dedupe
        # find all models and group them on keys which should be common
        grouped = all.group_by{|model| [model.name,model.year,model.trim,model.make_id] }
        grouped.values.each do |duplicates|
          # the first one we want to keep right?
          first_one = duplicates.shift # or pop for last one
          # if there are any more left, they are duplicates
          # so delete all of them
          duplicates.each{|double| double.destroy} # duplicates can now be destroyed
        end
      end
    
    end
    
    Model.dedupe
    
    • Find all records
    • Group them on the keys that must be unique
    • Loop over the values of the resulting hash
    • Remove the first value from each group, because you want to retain one copy
    • Destroy the rest
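
The steps above can be sketched in plain Ruby. `Car` is a hypothetical stand-in for the model (its `destroy` only sets a flag), so this runs without Rails. With a real model, `all.group_by` loads every row into memory, which is the cost the SQL-based answers avoid.

```ruby
# Stand-in for an ActiveRecord model; `destroy` just marks the record.
Car = Struct.new(:id, :name, :year, :trim, :make_id, :destroyed) do
  def destroy
    self.destroyed = true
  end
end

cars = [
  Car.new(1, "Civic", 2010, "LX", 7),
  Car.new(2, "Civic", 2010, "LX", 7),
  Car.new(3, "Civic", 2010, "EX", 7),
  Car.new(4, "Civic", 2010, "LX", 7)
]

# Group on the keys that define uniqueness.
grouped = cars.group_by { |c| [c.name, c.year, c.trim, c.make_id] }

grouped.each_value do |duplicates|
  # Keep the first record of each group and destroy the rest.
  duplicates.drop(1).each(&:destroy)
end

survivors = cars.reject(&:destroyed).map(&:id)
puts survivors.inspect
```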