I've got a table of URLs and I don't want any duplicate URLs. How do I check to see if a given URL is already in the table using PHP/MySQL?
The simple SQL solutions require a unique field; the logic solutions do not.
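You should normalize your URLs before storing or comparing them so that equivalent URLs are not stored twice. PHP functions such as strtolower() and urldecode() or rawurldecode() can help, but apply strtolower() only to the scheme and host, since paths and query strings can be case-sensitive.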
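As a rough illustration of that kind of normalization, here is a minimal sketch. The function name normalize_url() is made up for this example, and the rules it applies (lowercase scheme and host, drop the default port, drop the fragment) are only one possible choice, not a complete canonicalization.

```php
<?php
// Hypothetical helper, not part of PHP: a minimal normalization sketch.
// It lowercases only the case-insensitive parts (scheme and host), drops the
// default port for http/https, and drops the fragment. Path and query are
// left untouched because they can be case-sensitive.
function normalize_url(string $url): ?string
{
    $parts = parse_url(trim($url));
    if ($parts === false || !isset($parts['scheme'], $parts['host'])) {
        return null; // not an absolute URL this sketch knows how to handle
    }

    $scheme = strtolower($parts['scheme']);
    $host   = strtolower($parts['host']);

    // Keep any login information exactly as given.
    $auth = '';
    if (isset($parts['user'])) {
        $auth = $parts['user'] . (isset($parts['pass']) ? ':' . $parts['pass'] : '') . '@';
    }

    // Only keep the port when it is not the scheme's default.
    $defaultPorts = ['http' => 80, 'https' => 443];
    $port = '';
    if (isset($parts['port']) && ($defaultPorts[$scheme] ?? null) !== $parts['port']) {
        $port = ':' . $parts['port'];
    }

    $path  = $parts['path'] ?? '/';
    $query = isset($parts['query']) ? '?' . $parts['query'] : '';

    // The fragment (#...) is intentionally left out of the result.
    return $scheme . '://' . $auth . $host . $port . $path . $query;
}

echo normalize_url('HTTP://WWW.Domain.com:80/Some/Path?b=2#top'), "\n";
```

You would run every URL through the same function both before inserting it and before looking it up, so that the stored value and the searched value always have the same form.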
Assumptions: Your table name is 'websites', the column name for your url is 'url', and the arbitrary data to be associated with the url is in the column 'data'.
Logic Solutions
SELECT COUNT(*) AS UrlResults FROM websites WHERE url='http://www.domain.com'
Test the previous query with if statements in SQL or PHP to ensure that it is 0 before you continue with an INSERT statement.
Simple SQL Statements
Scenario 1: Your db is a first-come, first-served table and you have no desire to have duplicate entries in the future.
ALTER TABLE websites ADD UNIQUE (url)
This will prevent a row from being inserted into the table if its url value already exists in that column.
Scenario 2: You want the most up to date information for each url and don't want to duplicate content. There are two solutions for this scenario. (These solutions also require 'url' to be unique so the solution in Scenario 1 will also need to be carried out.)
REPLACE INTO websites (url, data) VALUES ('http://www.domain.com', 'random data')
This will trigger a DELETE action if a row exists followed by an INSERT in all cases, so be careful with ON DELETE declarations.
INSERT INTO websites (url, data) VALUES ('http://www.domain.com', 'random data')
ON DUPLICATE KEY UPDATE data='random data'
This will trigger an UPDATE action if a row exists and an INSERT if it does not.
If you want to insert urls into the table, but only those that don't exist already, you can add a UNIQUE constraint on the column and add IGNORE to your INSERT query so that you don't get an error.
Example: INSERT IGNORE INTO urls
SET url = 'url-to-insert'
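From PHP you can still tell whether the row was actually inserted by looking at the affected row count. A minimal sketch, assuming $mysqli is an open mysqli connection and that the url column already has the UNIQUE constraint; the function name insert_url_if_new() is made up for this example:

```php
<?php
// Hypothetical helper. Assumes `urls`.`url` has a UNIQUE index.
function insert_url_if_new(mysqli $mysqli, string $url): bool
{
    // A prepared statement keeps the URL out of the SQL text itself.
    $stmt = $mysqli->prepare('INSERT IGNORE INTO urls SET url = ?');
    $stmt->bind_param('s', $url);
    $stmt->execute();

    // affected_rows is 1 when a row was inserted and 0 when the
    // duplicate was silently ignored.
    return $stmt->affected_rows === 1;
}
```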
The answer depends on whether you want to know when an attempt is made to enter a record with a duplicate field. If you don't care, then use the "INSERT ... ON DUPLICATE KEY UPDATE" syntax, as this will make your attempt quietly succeed without creating a duplicate.
If on the other hand you want to know when such an event happens and prevent it, then you should use a unique key constraint which will cause the attempted insert/update to fail with a meaningful error.
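To react to that error in PHP, you can check for MySQL's duplicate-entry error code (1062, ER_DUP_ENTRY). This is only a sketch under some assumptions: $mysqli is an open mysqli connection, the table and column names are borrowed from the 'websites' example earlier in the thread, and insert_url() is a made-up name.

```php
<?php
// MySQL reports a violated UNIQUE key with error code 1062 (ER_DUP_ENTRY).
const MYSQL_ER_DUP_ENTRY = 1062;

// Hypothetical helper. Assumes `websites`.`url` has a UNIQUE index.
// Returns true if the row was inserted, false if the url already existed.
function insert_url(mysqli $mysqli, string $url, string $data): bool
{
    // Make mysqli throw mysqli_sql_exception instead of failing silently.
    mysqli_report(MYSQLI_REPORT_ERROR | MYSQLI_REPORT_STRICT);

    $stmt = $mysqli->prepare('INSERT INTO websites (url, data) VALUES (?, ?)');
    $stmt->bind_param('ss', $url, $data);

    try {
        $stmt->execute();
        return true;
    } catch (mysqli_sql_exception $e) {
        if ($e->getCode() === MYSQL_ER_DUP_ENTRY) {
            return false; // the unique key rejected a duplicate url
        }
        throw $e; // any other failure is not ours to swallow
    }
}
```

Letting the database reject the duplicate also avoids the race between a separate SELECT and INSERT when two requests submit the same URL at the same time.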
In considering a solution to this problem, you need to first define what a "duplicate URL" means for your project. This will determine how to canonicalize the URLs before adding them to the database.
There are at least two definitions:
[...] (%C3%84 represents 'Ä' in UTF-8) is the same as http://google.com/?q=A%CC%88 (%CC%88 represents U+0308, COMBINING DIAERESIS).
[...] 'www.' in one URL's authority can not simply be removed if the two URLs are otherwise equivalent, as the text of the domain name is sent as the value of the Host HTTP header, and some web servers use virtual hosts to send back different content based on this header. More generally, even if the domain names resolve to the same IP address, you can not conclude that the referenced resources are the same.
[...] 'www.' from all Stack Overflow URLs. You could use PostRank's postrank-uri code, ported to PHP, to remove all sorts of pieces of the URLs that are unnecessary (e.g. &utm_source=...).
Definition 1 leads to a stable solution (i.e. there is no further canonicalization that can be performed and the canonicalization of a URL will not change). Definition 2, which I think is what a human considers the definition of URL canonicalization, leads to a canonicalization routine that can yield different results at different moments in time.
Whichever definition you choose, I suggest that you use separate columns for the scheme, login, host, port, and path portions. This will allow you to use indexes intelligently. The columns for scheme and host can use a case-insensitive character collation (in MySQL, the collations whose names end in _ci), but the columns for the login and path need to use a binary, case-sensitive collation. Also, if you use Definition 2, you need to preserve the original scheme, authority, and path portions, as certain canonicalization rules might be added or removed from time to time.
EDIT: Here are example table definitions:
CREATE TABLE `urls1` (
    `id` INT UNSIGNED NOT NULL AUTO_INCREMENT,
    `scheme` VARCHAR(20) NOT NULL,
    `canonical_login` VARCHAR(100) DEFAULT NULL COLLATE 'utf8mb4_bin',
    `canonical_host` VARCHAR(100) NOT NULL COLLATE 'utf8mb4_unicode_ci', /* the "ci" stands for case-insensitive. We use 'utf8mb4_unicode_ci'
        rather than 'utf8mb4_general_ci' because it follows the Unicode Collation Algorithm more closely. Note that both
        collations treat accented characters as equal to their base letters. */
    `port` INT UNSIGNED,
    `canonical_path` VARCHAR(4096) NOT NULL COLLATE 'utf8mb4_bin',
    PRIMARY KEY (`id`),
    INDEX (`canonical_host`(10), `scheme`)
) ENGINE = 'InnoDB';
CREATE TABLE `urls2` (
    `id` INT UNSIGNED NOT NULL AUTO_INCREMENT,
    `canonical_scheme` VARCHAR(20) NOT NULL,
    `canonical_login` VARCHAR(100) DEFAULT NULL COLLATE 'utf8mb4_bin',
    `canonical_host` VARCHAR(100) NOT NULL COLLATE 'utf8mb4_unicode_ci',
    `port` INT UNSIGNED,
    `canonical_path` VARCHAR(4096) NOT NULL COLLATE 'utf8mb4_bin',
    `orig_scheme` VARCHAR(20) NOT NULL,
    `orig_login` VARCHAR(100) DEFAULT NULL COLLATE 'utf8mb4_bin',
    `orig_host` VARCHAR(100) NOT NULL COLLATE 'utf8mb4_unicode_ci',
    `orig_path` VARCHAR(4096) NOT NULL COLLATE 'utf8mb4_bin',
    PRIMARY KEY (`id`),
    INDEX (`canonical_host`(10), `canonical_scheme`),
    INDEX (`orig_host`(10), `orig_scheme`)
) ENGINE = 'InnoDB';
Table `urls1` is for storing canonical URLs according to definition 1. Table `urls2` is for storing canonical URLs according to definition 2.
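Unfortunately you will not be able to specify a UNIQUE constraint on the tuple (`scheme`/`canonical_scheme`, `canonical_login`, `canonical_host`, `port`, `canonical_path`), as MySQL limits the length of InnoDB index keys (767 bytes with the older COMPACT and REDUNDANT row formats, 3072 bytes with DYNAMIC and COMPRESSED) and these columns together exceed that limit.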
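One common way around that limit, which is my own suggestion and not part of the table definitions above, is to store a fixed-length hash of the canonical tuple in an extra column and put the UNIQUE constraint on that instead. A minimal sketch, assuming a hypothetical `url_hash` BINARY(32) column and a made-up helper name:

```php
<?php
// Hypothetical helper: computes a fixed-length fingerprint of the canonical
// URL parts so that a UNIQUE index can sit on a short column
// (e.g. `url_hash` BINARY(32), which is an assumption of this sketch).
// The parts must already be canonicalized (for example, host lowercased),
// because the hash compares bytes exactly.
function url_hash(string $scheme, ?string $login, string $host, ?int $port, string $path): string
{
    // Encoding the parts as a JSON array keeps the boundaries between
    // them unambiguous, so ("ab", "c") and ("a", "bc") hash differently.
    $tuple = json_encode([$scheme, $login, $host, $port, $path], JSON_THROW_ON_ERROR);

    // The third argument asks for raw binary output: 32 bytes for SHA-256.
    return hash('sha256', $tuple, true);
}
```

The hash only enforces uniqueness; lookups by host or scheme would still go through the regular indexes on the individual columns.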
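// Assumes $mysqli is an open mysqli connection (the old mysql_* functions were removed in PHP 7).
mysqli_report(MYSQLI_REPORT_ERROR | MYSQLI_REPORT_STRICT); // throw an exception on any query error
$url = "http://www.scroogle.com";
// Prepared statements keep the URL out of the SQL text and so prevent SQL injection.
$stmt = $mysqli->prepare("SELECT `id` FROM `urls` WHERE `url` = ?");
$stmt->bind_param("s", $url);
$stmt->execute();
$stmt->store_result();
if ($stmt->num_rows === 0) // no matching row, so the url doesn't exist and we go ahead and insert it into the db.
{
    $insert = $mysqli->prepare("INSERT INTO `urls` (`url`) VALUES (?)");
    $insert->bind_param("s", $url);
    $insert->execute();
} else {
    // do something else if the url already exists in the DB
}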
I don't know the syntax for MySQL, but all you need to do is wrap your INSERT in an IF statement that queries the table to see whether a record with the given url EXISTS. If it exists, don't insert a new record.
In MSSQL you can do this:
IF NOT EXISTS (SELECT 1 FROM YOURTABLE WHERE URL = 'URL')
INSERT INTO YOURTABLE (...) VALUES (...)