Remove duplicate rows in MySQL

囚心锁ツ 2020-11-21 04:33

I have a table with the following fields:

id (Unique)
url (Unique)
title
company
site_id

Now I need to remove rows that have the same title, company and site_id.

25 answers
  • 2020-11-21 05:30

    To remove duplicate records where a combination of columns, e.g. COLUMN1, COLUMN2, COLUMN3, should have been unique (suppose the unique key on these three columns was left out of the table structure and duplicate entries have since been inserted):

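    -- IF EXISTS avoids an error when no copy is left over from an earlier run
    DROP TABLE IF EXISTS TABLE_NAME_copy;
    CREATE TABLE TABLE_NAME_copy LIKE TABLE_NAME;
    -- SELECT * with GROUP BY is rejected when ONLY_FULL_GROUP_BY is enabled
    -- (the default since MySQL 5.7.5); that mode must be off for this statement
    INSERT INTO TABLE_NAME_copy
    SELECT * FROM TABLE_NAME
    GROUP BY COLUMN1, COLUMN2, COLUMN3;
    DROP TABLE TABLE_NAME;
    ALTER TABLE TABLE_NAME_copy RENAME TO TABLE_NAME;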
    

    Hope this helps.

  • 2020-11-21 05:31

    If you have a large table with a huge number of records, the solutions above will either not work or take too much time. In that case there is a different solution:

    -- Create temporary table
    
    CREATE TABLE temp_table LIKE table1;
    
    -- Add constraint
    ALTER TABLE temp_table ADD UNIQUE(title, company, site_id);
    
    -- Copy data
    INSERT IGNORE INTO temp_table SELECT * FROM table1;
    
    -- Rename and drop
    RENAME TABLE table1 TO old_table1, temp_table TO table1;
    DROP TABLE old_table1;
    
  • 2020-11-21 05:31

    This deletes the duplicate rows that share the same values for title, company and site_ID. The first occurrence (the row with the lowest id) is kept and all later duplicates are deleted. To keep the last occurrence instead, flip the comparison to t1.id < t2.id.

    DELETE t1 FROM tablename t1
    INNER JOIN tablename t2 
    WHERE 
        t1.id > t2.id AND
        t1.title = t2.title AND
        t1.company = t2.company AND
        t1.site_ID = t2.site_ID;
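
    Before running a DELETE like this, it can help to list the duplicate groups first and see which ids are involved. The following is only a sketch, assuming the same tablename and columns as above; the aliases copies, lowest_id and highest_id are made up for the example:

    ```sql
    -- One row per duplicated (title, company, site_ID) combination,
    -- with the number of copies and the id range they span
    SELECT title, company, site_ID,
           COUNT(*) AS copies,
           MIN(id)  AS lowest_id,
           MAX(id)  AS highest_id
    FROM tablename
    GROUP BY title, company, site_ID
    HAVING COUNT(*) > 1;
    ```

    Running the same query again after the DELETE should return no rows, apart from groups where one of the three columns is NULL, since the = comparisons in the join never match NULL values.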
    
  • 2020-11-21 05:31

    I like to be a bit more specific about which records I delete, so here is my solution:

    -- MySQL rejects a subquery on the table being deleted from (error 1093),
    -- so the EXISTS check is written as a self-join with the same conditions
    delete c1
    from jobs c1
    inner join jobs c2
        on  c2.site_id = c1.site_id
        and c2.company = c1.company
        and c2.location = c1.location
        and c2.title = c1.title
    where not c1.location = 'Paris'
    and   c1.site_id > 64218
    and   c2.site_id > 63412
    and   c2.site_id < 64219
    
  • 2020-11-21 05:32

    Deleting duplicates from MySQL tables is a common issue, generally the result of a missing constraint that would have prevented those duplicates beforehand. But this common issue usually comes with specific needs that require specific approaches. The approach should differ depending on, for example, the size of the data, which duplicated entry should be kept (generally the first or the last one), whether there are indexes to be kept, or whether we want to perform any additional action on the duplicated data.

    MySQL also has some particularities of its own, such as not being able to reference the table being modified in the FROM clause of a subquery when performing an UPDATE or DELETE (it raises MySQL error #1093). This limitation can be overcome by using an inner query with a temporary table (as suggested in some other answers). But this inner query won't perform especially well when dealing with big data sources.
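
    For illustration, the inner-query workaround mentioned here might look like the following. This is only a sketch, assuming the jobs table from the question with its id, title, company and site_id columns; the names keep_id and keepers are made up for the example:

    ```sql
    -- Selecting from jobs directly inside the subquery raises error 1093,
    -- so the ids to keep are computed in a derived table, which MySQL
    -- materializes as an internal temporary table first
    DELETE FROM jobs
    WHERE id NOT IN (
        SELECT keep_id
        FROM (
            SELECT MIN(id) AS keep_id          -- lowest id of each group survives
            FROM jobs
            GROUP BY title, company, site_id
        ) AS keepers
    );
    ```

    Using MAX(id) instead of MIN(id) would keep the last entry of each group rather than the first.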

    However, there is a better approach to removing duplicates that is both efficient and reliable, and that can easily be adapted to different needs.

    The general idea is to create a new temporary table, usually adding a unique constraint to avoid further duplicates, and to INSERT the data from your former table into the new one, while taking care of the duplicates. This approach relies on simple MySQL INSERT queries, creates a new constraint to avoid further duplicates, and avoids both the inner query that searches for duplicates and the temporary table that has to be kept in memory (so it suits big data sources too).

    This is how it can be achieved. Given we have a table employee, with the following columns:

    employee (id, first_name, last_name, start_date, ssn)
    

    To delete the rows with a duplicate ssn value, keeping only the first entry found, the following process can be followed:

    -- create a new tmp_employee table
    CREATE TABLE tmp_employee LIKE employee;
    
    -- add a unique constraint
    ALTER TABLE tmp_employee ADD UNIQUE(ssn);
    
    -- scan over the employee table to insert employee entries
    INSERT IGNORE INTO tmp_employee SELECT * FROM employee ORDER BY id;
    
    -- rename tables
    RENAME TABLE employee TO backup_employee, tmp_employee TO employee;
    

    Technical explanation

    • Line #1 creates a new tmp_employee table with exactly the same structure as the employee table
    • Line #2 adds a UNIQUE constraint to the new tmp_employee table to avoid any further duplicates
    • Line #3 scans over the original employee table by id, inserting new employee entries into the new tmp_employee table, while ignoring duplicated entries
    • Line #4 renames the tables, so that the new employee table holds all the entries without the duplicates, and a backup copy of the former data is kept in the backup_employee table
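
    Once the tables have been renamed, a quick comparison of row counts can confirm the result before backup_employee is eventually dropped. A minimal sketch, assuming the table names above; the column aliases are illustrative only:

    ```sql
    SELECT
        (SELECT COUNT(*) FROM backup_employee)            AS rows_before,
        (SELECT COUNT(*) FROM employee)                   AS rows_after,
        (SELECT COUNT(DISTINCT ssn) FROM backup_employee) AS distinct_ssn;
    ```

    rows_after should match distinct_ssn as long as ssn is never NULL: COUNT(DISTINCT ...) skips NULL values, while a UNIQUE index accepts any number of them.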

    Using this approach, 1.6M records were reduced to 6k in less than 200s.

    Chetan, following this process, you could quickly and easily remove all your duplicates and create a UNIQUE constraint by running:

    CREATE TABLE tmp_jobs LIKE jobs;
    
    ALTER TABLE tmp_jobs ADD UNIQUE(site_id, title, company);
    
    INSERT IGNORE INTO tmp_jobs SELECT * FROM jobs ORDER BY id;
    
    RENAME TABLE jobs TO backup_jobs, tmp_jobs TO jobs;
    

    Of course, this process can be further modified to adapt it for different needs when deleting duplicates. Some examples follow.

    ✔ Variation for keeping the last entry instead of the first one

    Sometimes we need to keep the last duplicated entry instead of the first one.

    CREATE TABLE tmp_employee LIKE employee;
    
    ALTER TABLE tmp_employee ADD UNIQUE(ssn);
    
    INSERT IGNORE INTO tmp_employee SELECT * FROM employee ORDER BY id DESC;
    
    RENAME TABLE employee TO backup_employee, tmp_employee TO employee;
    
    • On line #3, the ORDER BY id DESC clause gives the highest ids priority over the rest

    ✔ Variation for performing some tasks on the duplicates, for example keeping a count on the duplicates found

    Sometimes we need to perform some further processing on the duplicated entries that are found (such as keeping a count of the duplicates).

    CREATE TABLE tmp_employee LIKE employee;
    
    ALTER TABLE tmp_employee ADD UNIQUE(ssn);
    
    ALTER TABLE tmp_employee ADD COLUMN n_duplicates INT DEFAULT 0;
    
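    -- tmp_employee now has one more column than employee, so the columns are listed explicitly
    INSERT INTO tmp_employee (id, first_name, last_name, start_date, ssn)
    SELECT id, first_name, last_name, start_date, ssn FROM employee ORDER BY id
    ON DUPLICATE KEY UPDATE n_duplicates=n_duplicates+1;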
    
    RENAME TABLE employee TO backup_employee, tmp_employee TO employee;
    
    • On line #3, a new column n_duplicates is created
    • On line #4, the INSERT INTO ... ON DUPLICATE KEY UPDATE query is used to perform an additional update when a duplicate is found (in this case, increasing a counter). The same query can be used to perform other kinds of updates on the duplicates found.
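
    After the rename, the counter can be queried like any other column. A small sketch, assuming the n_duplicates column added above:

    ```sql
    -- Entries that had at least one duplicate, most duplicated first
    SELECT id, ssn, n_duplicates
    FROM employee
    WHERE n_duplicates > 0
    ORDER BY n_duplicates DESC;
    ```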

    ✔ Variation for regenerating the auto-incremental field id

    Sometimes we use an auto-incremental field and, in order to keep the index as compact as possible, we can take advantage of the deletion of the duplicates to regenerate the auto-incremental field in the new temporary table.

    CREATE TABLE tmp_employee LIKE employee;
    
    ALTER TABLE tmp_employee ADD UNIQUE(ssn);
    
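    -- plain column lists (no parentheses in the SELECT); id is left out so AUTO_INCREMENT assigns new values
    INSERT IGNORE INTO tmp_employee (first_name, last_name, start_date, ssn)
    SELECT first_name, last_name, start_date, ssn FROM employee ORDER BY id;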
    
    RENAME TABLE employee TO backup_employee, tmp_employee TO employee;
    
    • On line #3, instead of selecting all the fields on the table, the id field is skipped so that the DB engine generates a new one automatically

    ✔ Further variations

    Many further modifications are possible depending on the desired behavior. As an example, the following queries use a second temporary table to 1) keep the last entry instead of the first one, 2) increase a counter on the duplicates found, and 3) regenerate the auto-incremental field id while keeping the entry order of the former data.

    CREATE TABLE tmp_employee LIKE employee;
    
    ALTER TABLE tmp_employee ADD UNIQUE(ssn);
    
    ALTER TABLE tmp_employee ADD COLUMN n_duplicates INT DEFAULT 0;
    
    INSERT INTO tmp_employee (id, first_name, last_name, start_date, ssn)
    SELECT id, first_name, last_name, start_date, ssn FROM employee ORDER BY id DESC
    ON DUPLICATE KEY UPDATE n_duplicates=n_duplicates+1;
    
    CREATE TABLE tmp_employee2 LIKE tmp_employee;
    
    INSERT INTO tmp_employee2 (first_name, last_name, start_date, ssn, n_duplicates)
    SELECT first_name, last_name, start_date, ssn, n_duplicates FROM tmp_employee ORDER BY id;
    
    DROP TABLE tmp_employee;
    
    RENAME TABLE employee TO backup_employee, tmp_employee2 TO employee;
    
  • 2020-11-21 05:32

    You can easily delete the duplicate records with this code:

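    // Assumes $mysqli is an open mysqli connection; the mysql_* functions were removed in PHP 7
    $qry = $mysqli->query("SELECT * from cities");
    while ($qry_row = $qry->fetch_array())
    {
        // Escape the value before putting it into the query string
        $city = $mysqli->real_escape_string($qry_row['city']);
        $qry2 = $mysqli->query("SELECT * from cities2 where city = '".$city."'");
    
        if ($qry2->num_rows > 1) {
            // Start from an empty list for every city, otherwise rows from earlier cities pile up
            $city_arry = array();
            while ($row = $qry2->fetch_array()) {
                $city_arry[] = $row;
            }
    
            // Keep the first row (index 0) and delete the rest by their first column, town_id
            $total = sizeof($city_arry) - 1;
            for ($i = 1; $i <= $total; $i++) {
                $town_id = $mysqli->real_escape_string($city_arry[$i][0]);
                $mysqli->query("delete from cities2 where town_id = '".$town_id."'");
            }
        }
    }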
    