How to check if a value already exists to avoid duplicates?

小鲜肉 2020-12-02 23:13

I've got a table of URLs and I don't want any duplicate URLs. How do I check to see if a given URL is already in the table using PHP/MySQL?

17 answers
  • 2020-12-02 23:43

    First things first. If you haven't already created the table, or you created a table but do not have any data in it, then you need to add a unique constraint or a unique index. More information about choosing between an index and a constraint follows at the end of the post, but both accomplish the same thing: enforcing that the column contains only unique values.

    To create a table with a unique index on this column, you can use:

    CREATE TABLE MyURLTable(
    ID INTEGER NOT NULL AUTO_INCREMENT
    ,URL VARCHAR(512)
    ,PRIMARY KEY(ID)
    ,UNIQUE INDEX IDX_URL(URL)
    );
    

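    If you prefer the constraint syntax, you can use the following. Note that MySQL enforces a unique constraint by creating a unique index behind the scenes, so the end result is effectively the same.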

    CREATE TABLE MyURLTable(
    ID INTEGER NOT NULL AUTO_INCREMENT
    ,URL VARCHAR(512)
    ,PRIMARY KEY(ID)
    ,CONSTRAINT UNIQUE UNIQUE_URL(URL)
    );
    

    Now, if you already have a table, and there is no data in it, then you can add the index or constraint to the table with one of the following pieces of code.

    ALTER TABLE MyURLTable
    ADD UNIQUE INDEX IDX_URL(URL);
    
    ALTER TABLE MyURLTable
    ADD CONSTRAINT UNIQUE UNIQUE_URL(URL);
    

    Now, you may already have a table with some data in it, and that data may already contain duplicates. You can try creating the constraint or index shown above; it will fail if duplicate data exists. If you don't have duplicates, great. If you do, you'll have to remove them first. You can see a list of URLs with duplicates using the following query.

    SELECT URL,COUNT(*),MIN(ID) 
    FROM MyURLTable
    GROUP BY URL
    HAVING COUNT(*) > 1;
    

    To delete rows that are duplicates, and keep one, do the following:

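    -- Keep the row with the lowest ID for each URL and delete every other row.
    -- The derived table is grouped, so MySQL materializes it before deleting.
    DELETE RemoveRecords
    FROM MyURLTable AS RemoveRecords
    LEFT JOIN
    (
    -- One ID to keep per distinct URL (covers unique and duplicated URLs alike)
    SELECT MIN(ID) AS ID
    FROM MyURLTable
    GROUP BY URL
    ) AS KeepRecords
    ON RemoveRecords.ID = KeepRecords.ID
    -- Rows with no match in KeepRecords are the surplus duplicates
    WHERE KeepRecords.ID IS NULL;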
    

    Now that you have deleted all the duplicate records, you can go ahead and create your index or constraint. From then on, if you want to insert a value into your database, you should use something like:

    INSERT IGNORE INTO MyURLTable(URL)
    VALUES('http://www.example.com');
    

    That will attempt the insert, and if it finds a duplicate, nothing will happen. Now, let's say you have other columns. You can do something like this:

    INSERT INTO MyURLTable(URL,Visits) 
    VALUES('http://www.example.com',1)
    ON DUPLICATE KEY UPDATE Visits=Visits+1;
    

    That will try to insert the value, and if it finds the URL, it will update the existing record by incrementing the visits counter. Of course, you can always do a plain old insert and handle the resulting error in your PHP code.

    As for whether you should use a constraint or an index, that depends on a lot of factors. Indexes make for faster lookups, so your performance will be better as the table gets bigger, but storing the index takes up extra space. Indexes also usually make inserts and updates take longer, because the index has to be updated as well. However, since the value has to be looked up either way to enforce uniqueness, in this case it may be quicker to just have the index anyway.

    As with anything performance related, the answer is to try both options and profile the results to see which works best for your situation.

  • 2020-12-02 23:43

    Make the URL column the primary key.

  • 2020-12-02 23:46

    Are you concerned purely about URLs that are the exact same string? If so, there is a lot of good advice in the other answers. Or do you also have to worry about canonicalization?

    For example: http://google.com and http://go%4fgle.com are the exact same URL, but would be allowed as duplicates by any of the database-only techniques. If this is an issue, you should preprocess the URLs to resolve any character escape sequences.

    Depending on where the URLs are coming from, you will also have to worry about query parameters and whether they are significant in your application.
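
    As a rough illustration of that preprocessing step, here is a minimal Python sketch. The function name canonicalize_url is hypothetical, and the rules it applies are assumptions for illustration: lowercase the scheme and host, decode percent-escapes only for unreserved characters (letters, digits, "-", ".", "_", "~"), uppercase the remaining escapes, and drop the fragment. It also discards any user:password part and does not handle IPv6 hosts. Whether query parameters should be kept, sorted, or stripped is left to your application.

```python
import re
import string
from urllib.parse import urlsplit, urlunsplit

# Characters that never need percent-encoding in a URL.
_UNRESERVED = set(string.ascii_letters + string.digits + "-._~")


def _decode_unreserved(text: str) -> str:
    """Decode %XX only when it encodes an unreserved character."""
    def repl(match: re.Match) -> str:
        char = chr(int(match.group(1), 16))
        # Leave reserved characters encoded, but normalize hex digits to uppercase.
        return char if char in _UNRESERVED else match.group(0).upper()
    return re.sub(r"%([0-9A-Fa-f]{2})", repl, text)


def canonicalize_url(url: str) -> str:
    """Hypothetical helper: reduce equivalent spellings of a URL to one form."""
    parts = urlsplit(url)
    host = _decode_unreserved(parts.hostname or "").lower()
    netloc = host if parts.port is None else f"{host}:{parts.port}"
    return urlunsplit((
        parts.scheme.lower(),
        netloc,
        _decode_unreserved(parts.path),
        _decode_unreserved(parts.query),
        "",  # assumption: fragments are not significant for duplicate checks
    ))


# Both spellings from the example above collapse to "http://google.com".
print(canonicalize_url("http://go%4fgle.com"))
print(canonicalize_url("http://google.com"))
```

    Storing the canonical form, rather than the raw input, is what lets a unique index catch these duplicates.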

  • 2020-12-02 23:46

    First, prepare the database.

    • Domain names aren't case-sensitive, but you have to assume the rest of a URL is. (Not all web servers respect case in URLs, but most do, and you can't easily tell by looking.)
    • Assuming you need to store more than a domain name, use a case-sensitive collation.
    • If you decide to store the URL in two columns--one for the domain name and one for the resource locator--consider using a case-insensitive collation for the domain name, and a case-sensitive collation for the resource locator. If I were you, I'd test both ways (URL in one column vs. URL in two columns).
    • Put a UNIQUE constraint on the URL column. Or on the pair of columns, if you store the domain name and resource locator in separate columns, as UNIQUE (url, resource_locator).
    • Use a CHECK() constraint to keep encoded URLs out of the database. This CHECK() constraint is essential to keep bad data from coming in through a bulk copy or through the SQL shell. (Be aware that MySQL versions before 8.0.16 parse CHECK constraints but do not enforce them.)

    Second, prepare the URL.

    • Domain names aren't case-sensitive. If you store the full URL in one column, lowercase the domain name on all URLs. But be aware that some languages have uppercase letters that have no lowercase equivalent.
    • Think about trimming trailing characters. For example, these two URLs from amazon.com point to the same product. You probably want to store the second version, not the first.

      http://www.amazon.com/Systemantics-Systems-Work-Especially-They/dp/070450331X/ref=sr_1_1?ie=UTF8&qid=1313583998&sr=8-1

      http://www.amazon.com/Systemantics-Systems-Work-Especially-They/dp/070450331X

    • Decode encoded URLs. (See PHP's urldecode() function. Note carefully its shortcomings, as described in that page's comments.) Personally, I'd prefer to handle these kinds of transformations in the database rather than in client code. That would involve revoking permissions on the tables and views, and allowing inserts and updates only through stored procedures; the stored procedures handle all the string operations that put the URL into a canonical form. But keep an eye on performance when you try that. CHECK() constraints (see above) are your safety net.

    Third, if you're inserting only the URL, don't test for its existence first. Instead, try to insert and trap the error that you'll get if the value already exists. Testing and inserting hits the database twice for every new URL. Insert-and-trap just hits the database once. Note carefully that insert-and-trap isn't the same thing as insert-and-ignore-errors. Only one particular error means you violated the unique constraint; other errors mean there are other problems.
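
    The insert-and-trap pattern can be sketched as follows. This is only an illustration under stated assumptions: it uses Python with an in-memory SQLite database as a stand-in for PHP/MySQL so that it runs anywhere, and the table name urls and the function insert_url are hypothetical. With MySQL, the equivalent step would be checking for the duplicate-entry error (code 1062) and treating every other error as a real failure.

```python
import sqlite3

# Stand-in schema (assumption): a URL column protected by a UNIQUE constraint.
url_db = sqlite3.connect(":memory:")
url_db.execute(
    "CREATE TABLE urls (id INTEGER PRIMARY KEY, url TEXT NOT NULL UNIQUE)"
)


def insert_url(conn: sqlite3.Connection, url) -> bool:
    """Try to insert; return True if stored, False if the URL already existed."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute("INSERT INTO urls (url) VALUES (?)", (url,))
        return True
    except sqlite3.IntegrityError as exc:
        # Trap only the unique-constraint violation; anything else
        # (for example a NOT NULL violation) is a different problem.
        if "UNIQUE constraint failed" in str(exc):
            return False
        raise


print(insert_url(url_db, "http://www.example.com"))  # first insert is stored
print(insert_url(url_db, "http://www.example.com"))  # duplicate is trapped
```

    Each call makes a single round trip to the database, and errors other than the duplicate are re-raised instead of being silently ignored.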

    On the other hand, if you're inserting the URL along with some other data in the same row, you need to decide ahead of time whether you'll handle duplicate URLs by:

    • deleting the old row and inserting a new one (See MySQL's REPLACE extension to SQL)
    • updating existing values (See ON DUPLICATE KEY UPDATE)
    • ignoring the issue
    • requiring the user to take further action

    REPLACE eliminates the need to trap duplicate key errors, but it might have unfortunate side effects if there are foreign key references.
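
    To make that side effect concrete, here is a small sketch, again using Python with in-memory SQLite as a stand-in for MySQL (an assumption; SQLite's REPLACE INTO also deletes the conflicting row and inserts a new one). The table urls, its visits column, and the helper row_for are hypothetical.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute(
    "CREATE TABLE urls ("
    " id INTEGER PRIMARY KEY,"
    " url TEXT NOT NULL UNIQUE,"
    " visits INTEGER NOT NULL DEFAULT 0)"
)
db.execute("INSERT INTO urls (url, visits) VALUES ('http://www.example.com', 5)")
db.execute("INSERT INTO urls (url, visits) VALUES ('http://www.example.org', 1)")


def row_for(url: str):
    """Return (id, visits) for a URL."""
    return db.execute(
        "SELECT id, visits FROM urls WHERE url = ?", (url,)
    ).fetchone()


old_id, old_visits = row_for("http://www.example.com")

# REPLACE removes the existing row and inserts a brand-new one.
db.execute("REPLACE INTO urls (url, visits) VALUES ('http://www.example.com', 1)")

new_id, new_visits = row_for("http://www.example.com")
print(old_id != new_id)        # the row received a new id
print(old_visits, new_visits)  # the previous visits value was discarded
```

    Any other table that stored the old id as a foreign key would now point at a row that no longer exists, or the delete would cascade or be rejected, depending on how the foreign key is defined.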

  • 2020-12-02 23:49

    If you don't want to have duplicates, you can do the following:

    • add uniqueness constraint
    • use "REPLACE" or "INSERT ... ON DUPLICATE KEY UPDATE" syntax

    If multiple users can insert data into the DB, the check-then-insert method suggested by @Jeremy Ruten can lead to an error: after you have performed the check, someone else can insert the same data into the table.
