How to check if a value already exists to avoid duplicates?

小鲜肉 2020-12-02 23:13

I've got a table of URLs and I don't want any duplicate URLs. How do I check to see if a given URL is already in the table using PHP/MySQL?

17 answers
  • 2020-12-02 23:26

    The simple SQL solutions require a unique field; the logic solutions do not.

    You should normalize your URLs to ensure there is no duplication. PHP functions such as strtolower() and urldecode() or rawurldecode() can help.

    Assumptions: Your table name is 'websites', the column name for your url is 'url', and the arbitrary data to be associated with the url is in the column 'data'.

    Logic Solutions

    SELECT COUNT(*) AS UrlResults FROM websites WHERE url='http://www.domain.com'
    

    Check the result of the previous query in SQL or PHP, and continue with the INSERT statement only if the count is 0.

    Simple SQL Statements

    Scenario 1: Your db is a first-come, first-served table and you have no desire to have duplicate entries in the future.

    ALTER TABLE websites ADD UNIQUE (url)
    

    This prevents a row from being inserted if its url value already exists in that column.

    Scenario 2: You want the most up to date information for each url and don't want to duplicate content. There are two solutions for this scenario. (These solutions also require 'url' to be unique so the solution in Scenario 1 will also need to be carried out.)

    REPLACE INTO websites (url, data) VALUES ('http://www.domain.com', 'random data')
    

    This triggers a DELETE if a matching row exists, followed by an INSERT in all cases, so be careful with ON DELETE declarations.

    INSERT INTO websites (url, data) VALUES ('http://www.domain.com', 'random data')
    ON DUPLICATE KEY UPDATE data='random data'
    

    This will trigger an UPDATE action if a row exists and an INSERT if it does not.

  • 2020-12-02 23:26

    If you want to insert URLs into the table, but only those that don't already exist, you can add a UNIQUE constraint on the column and add IGNORE to your INSERT query so that you don't get an error.

    Example: INSERT IGNORE INTO urls SET url = 'url-to-insert'

  • 2020-12-02 23:28

    The answer depends on whether you want to know when an attempt is made to enter a record with a duplicate field. Either way the column needs a unique key. If you don't care, use the "INSERT ... ON DUPLICATE KEY UPDATE" syntax, as this makes your attempt quietly succeed without creating a duplicate.

    If, on the other hand, you want to know when such an event happens and prevent it, use a plain INSERT against the unique key constraint, which causes the attempted insert/update to fail with a meaningful error.
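
    A minimal PHP sketch of the second approach, catching the constraint failure so the application can react to it. The helper names isDuplicateKeyError() and insertUrlOnce() are hypothetical, the table `urls` and column `url` are assumed to match the other answers, `url` is assumed to carry a UNIQUE index, and $pdo is assumed to be a PDO connection that throws exceptions on errors.

```php
<?php
// MySQL reports a unique-key violation as error 1062 (SQLSTATE 23000).
// PDO exposes the driver error code as the second element of errorInfo.
function isDuplicateKeyError(PDOException $e): bool
{
    return isset($e->errorInfo[1]) && (int) $e->errorInfo[1] === 1062;
}

// Returns true if the URL was inserted, false if it was already present.
// Assumes $pdo uses PDO::ERRMODE_EXCEPTION (the default since PHP 8.0).
function insertUrlOnce(PDO $pdo, string $url): bool
{
    try {
        // The placeholder keeps the URL from being interpreted as SQL.
        $stmt = $pdo->prepare('INSERT INTO urls (url) VALUES (?)');
        $stmt->execute([$url]);
        return true;
    } catch (PDOException $e) {
        if (isDuplicateKeyError($e)) {
            return false; // the unique key rejected a duplicate
        }
        throw $e; // anything else is a genuine failure
    }
}
```

    Letting the database reject the duplicate avoids the gap between a separate SELECT and INSERT, during which another connection could insert the same URL.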

  • 2020-12-02 23:29

    In considering a solution to this problem, you need to first define what a "duplicate URL" means for your project. This will determine how to canonicalize the URLs before adding them to the database.

    There are at least two definitions:

    1. Two URLs are considered duplicates if they represent the same resource knowing nothing about the corresponding web service that generates the corresponding content. Some considerations include:
      • The scheme and domain name portion of the URLs are case-insensitive, so HTTP://WWW.STACKOVERFLOW.COM/ is the same as http://www.stackoverflow.com/.
      • If one URL specifies a port, but it is the conventional port for the scheme and they are otherwise equivalent, then they are the same ( http://www.stackoverflow.com/ and http://www.stackoverflow.com:80/).
      • If the parameters in the query string are simple rearrangements and the parameter names are all different, then they are the same; e.g. http://authority/?a=test&b=test and http://authority/?b=test&a=test. Note that http://authority/?a%5B%5D=test1&a%5B%5D=test2 is not the same, by this first definition of sameness, as http://authority/?a%5B%5D=test2&a%5B%5D=test1.
      • If the scheme is HTTP or HTTPS, then the hash portions of the URLs can be removed, as this portion of the URL is not sent to the web server.
      • A shortened IPv6 address can be expanded.
      • Append a trailing forward slash to the authority only if it is missing.
      • Unicode canonicalization changes the referenced resource; e.g. you can't conclude that http://google.com/?q=%C3%84 (%C3%84 represents 'Ä' in UTF-8) is the same as http://google.com/?q=A%CC%88 (%CC%88 represents U+0308, COMBINING DIAERESIS).
      • If the scheme is HTTP or HTTPS, 'www.' in one URL's authority can not simply be removed if the two URLs are otherwise equivalent, as the text of the domain name is sent as the value of the Host HTTP header, and some web servers use virtual hosts to send back different content based on this header. More generally, even if the domain names resolve to the same IP address, you can not conclude that the referenced resources are the same.
    2. Apply basic URL canonicalization (e.g. lower case the scheme and domain name, supply the default port, stable sort query parameters by parameter name, remove the hash portion in the case of HTTP and HTTPS, ...), and take into account knowledge of the web service. Maybe you will assume that all web services are smart enough to canonicalize Unicode input (Wikipedia is, for example), so you can apply Unicode Normalization Form Canonical Composition (NFC). You would strip 'www.' from all Stack Overflow URLs. You could use PostRank's postrank-uri code, ported to PHP, to remove all sorts of pieces of the URLs that are unnecessary (e.g. &utm_source=...).

    Definition 1 leads to a stable solution (i.e. there is no further canonicalization that can be performed and the canonicalization of a URL will not change). Definition 2, which I think is what a human considers the definition of URL canonicalization, leads to a canonicalization routine that can yield different results at different moments in time.
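
    Part of Definition 1 can be sketched in PHP as follows. This is only an illustration: canonicalizeUrl() is a hypothetical helper, and it covers just the scheme/host case, default port, trailing slash, and hash rules. It leaves out query parameter sorting and IPv6 expansion.

```php
<?php
// Hypothetical helper: returns a canonical form of an absolute URL,
// or null if the input has no scheme and host.
function canonicalizeUrl(string $url): ?string
{
    $parts = parse_url($url);
    if ($parts === false || !isset($parts['scheme'], $parts['host'])) {
        return null;
    }

    // Scheme and host are case-insensitive; login, path and query are not.
    $scheme = strtolower($parts['scheme']);
    $host   = strtolower($parts['host']);

    // Drop the port when it is the conventional one for the scheme.
    $defaultPorts = ['http' => 80, 'https' => 443];
    $port = $parts['port'] ?? null;
    if ($port !== null && ($defaultPorts[$scheme] ?? null) === $port) {
        $port = null;
    }

    // Keep the login untouched, since it is case-sensitive.
    $login = '';
    if (isset($parts['user'])) {
        $login = $parts['user']
            . (isset($parts['pass']) ? ':' . $parts['pass'] : '')
            . '@';
    }

    // A missing path after the authority becomes "/".
    $path  = $parts['path'] ?? '/';
    $query = isset($parts['query']) ? '?' . $parts['query'] : '';

    // The hash portion is never sent to the server for HTTP(S), so drop it.
    $fragment = '';
    if (isset($parts['fragment']) && !in_array($scheme, ['http', 'https'], true)) {
        $fragment = '#' . $parts['fragment'];
    }

    return $scheme . '://' . $login . $host
        . ($port !== null ? ':' . $port : '')
        . $path . $query . $fragment;
}
```

    With this sketch, both HTTP://WWW.STACKOVERFLOW.COM/ and http://www.stackoverflow.com:80 come out as http://www.stackoverflow.com/, so they would compare equal when stored.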

    Whichever definition you choose, I suggest that you use separate columns for the scheme, login, host, port, and path portions. This will allow you to use indexes intelligently. The columns for scheme and host can use a case-insensitive character collation (in MySQL, the collations whose names end in _ci), but the columns for the login and path need to use a binary, case-sensitive collation. Also, if you use Definition 2, you need to preserve the original scheme, authority, and path portions, as certain canonicalization rules might be added or removed from time to time.

    EDIT: Here are example table definitions:

    CREATE TABLE `urls1` (
        `id` INT UNSIGNED NOT NULL AUTO_INCREMENT,
        `scheme` VARCHAR(20) NOT NULL,
        `canonical_login` VARCHAR(100) DEFAULT NULL COLLATE 'utf8mb4_bin',
        `canonical_host` VARCHAR(100) NOT NULL COLLATE 'utf8mb4_unicode_ci', /* the "ci" stands for case-insensitive. Note that both 'utf8mb4_unicode_ci'
    and 'utf8mb4_general_ci' also treat accented and unaccented letters as equal; 'utf8mb4_unicode_ci' follows the Unicode collation rules more closely. */
        `port` INT UNSIGNED,
        `canonical_path` VARCHAR(4096) NOT NULL COLLATE 'utf8mb4_bin',
    
        PRIMARY KEY (`id`),
        INDEX (`canonical_host`(10), `scheme`)
    ) ENGINE = 'InnoDB';
    
    
    CREATE TABLE `urls2` (
        `id` INT UNSIGNED NOT NULL AUTO_INCREMENT,
        `canonical_scheme` VARCHAR(20) NOT NULL,
        `canonical_login` VARCHAR(100) DEFAULT NULL COLLATE 'utf8mb4_bin',
        `canonical_host` VARCHAR(100) NOT NULL COLLATE 'utf8mb4_unicode_ci',
        `port` INT UNSIGNED,
        `canonical_path` VARCHAR(4096) NOT NULL COLLATE 'utf8mb4_bin',
    
        `orig_scheme` VARCHAR(20) NOT NULL, 
        `orig_login` VARCHAR(100) DEFAULT NULL COLLATE 'utf8mb4_bin',
        `orig_host` VARCHAR(100) NOT NULL COLLATE 'utf8mb4_unicode_ci',
        `orig_path` VARCHAR(4096) NOT NULL COLLATE 'utf8mb4_bin',
    
        PRIMARY KEY (`id`),
        INDEX (`canonical_host`(10), `canonical_scheme`),
        INDEX (`orig_host`(10), `orig_scheme`)
    ) ENGINE = 'InnoDB';
    

    Table `urls1` is for storing canonical URLs according to definition 1. Table `urls2` is for storing canonical URLs according to definition 2.

    Unfortunately you will not be able to specify a UNIQUE constraint on the tuple (`scheme`/`canonical_scheme`, `canonical_login`, `canonical_host`, `port`, `canonical_path`), as MySQL limits the length of InnoDB index keys to 767 bytes (3072 bytes with the DYNAMIC or COMPRESSED row format) and `canonical_path` alone can exceed that.

  • 2020-12-02 23:30
    $url = "http://www.scroogle.com";
    
    // The mysql_* functions were removed in PHP 7, so this uses PDO instead.
    // Assumes $pdo is an open PDO connection to the MySQL database.
    // Prepared statements keep the URL from being interpreted as SQL.
    $stmt = $pdo->prepare("SELECT `id` FROM `urls` WHERE `url` = ?");
    $stmt->execute([$url]);
    $idtemp = $stmt->fetchColumn(); // false when no row matches
    
    if ($idtemp === false) // no row came back, so the url doesn't exist and we go ahead and insert it into the db.
    { 
       $pdo->prepare("INSERT INTO `urls` (`url`) VALUES (?)")->execute([$url]);
    } else {
       // do something else if the url already exists in the DB
    }
    
  • 2020-12-02 23:31

    I don't know the syntax for MySQL, but all you need to do is wrap your INSERT in an IF statement that queries the table to see whether a record with the given URL EXISTS. If it exists, don't insert a new record.

    In MSSQL you can do this:

    IF NOT EXISTS (SELECT 1 FROM YOURTABLE WHERE URL = 'URL')
    INSERT INTO YOURTABLE (...) VALUES (...)
    