I've got a table of URLs and I don't want any duplicate URLs. How do I check to see if a given URL is already in the table using PHP/MySQL?
First things first. If you haven't already created the table, or you created a table but don't have data in it yet, then you need to add a unique constraint or a unique index. More information about choosing between an index and a constraint follows at the end of this post, but both accomplish the same thing: enforcing that the column only contains unique values.
To create a table with a unique index on this column, you can use:
CREATE TABLE MyURLTable(
ID INTEGER NOT NULL AUTO_INCREMENT
,URL VARCHAR(512)
,PRIMARY KEY(ID)
,UNIQUE INDEX IDX_URL(URL)
);
If you just want a unique constraint rather than a named index on that table, you can use:
CREATE TABLE MyURLTable(
ID INTEGER NOT NULL AUTO_INCREMENT
,URL VARCHAR(512)
,PRIMARY KEY(ID)
,CONSTRAINT UNIQUE_URL UNIQUE (URL)
);
Now, if you already have a table, and there is no data in it, then you can add the index or constraint to the table with one of the following pieces of code.
ALTER TABLE MyURLTable
ADD UNIQUE INDEX IDX_URL(URL);
ALTER TABLE MyURLTable
ADD CONSTRAINT UNIQUE_URL UNIQUE (URL);
Now, you may already have a table with some data in it, in which case you may already have duplicate data in it. You can try creating the constraint or index shown above; the statement will fail if the table already contains duplicates. If you don't have duplicate data, great; if you do, you'll have to remove the duplicates first. You can see a list of URLs with duplicates using the following query:
SELECT URL,COUNT(*),MIN(ID)
FROM MyURLTable
GROUP BY URL
HAVING COUNT(*) > 1;
To delete the duplicate rows while keeping one of each, do the following:
DELETE RemoveRecords
FROM MyURLTable As RemoveRecords
LEFT JOIN
(
SELECT MIN(ID) AS ID
FROM MyURLTable
GROUP BY URL
HAVING COUNT(*) > 1
UNION
SELECT ID
FROM MyURLTable
GROUP BY URL
HAVING COUNT(*) = 1
) AS KeepRecords
ON RemoveRecords.ID = KeepRecords.ID
WHERE KeepRecords.ID IS NULL;
Now that you have deleted all the duplicate records, you can go ahead and create your index or constraint. Now, if you want to insert a value into your database, you should use something like:
INSERT IGNORE INTO MyURLTable(URL)
VALUES('http://www.example.com');
That will attempt the insert, and if it finds a duplicate, nothing will happen. Now, let's say you have other columns; you can do something like this:
INSERT INTO MyURLTable(URL,Visits)
VALUES('http://www.example.com',1)
ON DUPLICATE KEY UPDATE Visits=Visits+1;
That will try to insert the value, and if it finds the URL, it will update the record by incrementing the visits counter. Of course, you can always do a plain old insert and handle the resulting error in your PHP code.
Now, as for whether you should use constraints or indexes, that depends on a lot of factors. Indexes make for faster lookups, so your performance will be better as the table gets bigger, but storing the index takes up extra space. Indexes also usually make inserts and updates take longer, because the index has to be updated. However, since the value has to be looked up either way to enforce uniqueness, in this case it may be quicker to just have the index anyway. As with anything performance related, the answer is to try both options and profile the results to see which works best for your situation.
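Since the question asks about PHP, here's a minimal sketch of running that upsert from PHP with mysqli and a bound parameter; the connection credentials, database name, and URL value are placeholders:
<?php
// Placeholder credentials; replace with your own.
$mysqli = new mysqli('localhost', 'user', 'password', 'mydb');

$url = 'http://www.example.com';
$stmt = $mysqli->prepare(
    'INSERT INTO MyURLTable (URL, Visits) VALUES (?, 1)
     ON DUPLICATE KEY UPDATE Visits = Visits + 1'
);
$stmt->bind_param('s', $url); // 's' = bind as string
$stmt->execute();
$stmt->close();
?>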
Make the column the primary key
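A minimal sketch of that approach, reusing the MyURLTable name from the answer above; note that with a long VARCHAR column you may run into your storage engine's index length limit, depending on MySQL version and character set:
CREATE TABLE MyURLTable(
URL VARCHAR(512) NOT NULL
,PRIMARY KEY(URL)
);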
Are you concerned purely about URLs that are the exact same string? If so, there is a lot of good advice in other answers. Or do you also have to worry about canonicalization?
For example: http://google.com and http://go%4Fgle.com are the same URL (%4F is just a percent-encoded letter O, and host names are case-insensitive), but would be allowed as duplicates by any of the database-only techniques. If this is an issue, you should preprocess the URLs to resolve any character escape sequences.
Depending on where the URLs are coming from, you will also have to worry about query parameters and whether they are significant in your application.
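As a rough sketch of that kind of preprocessing, here is a hypothetical helper that decodes escape sequences and lowercases only the case-insensitive parts (scheme and host), leaving the path alone:
<?php
// Hypothetical helper: put a URL into a rough canonical form.
// Caveat: urldecode() on a full URL has shortcomings (e.g. an
// encoded '&' inside a query string); see the notes on it below.
function canonicalize_url($url) {
    $url = urldecode($url); // resolve escapes like %4F -> O
    $parts = parse_url($url);
    if ($parts === false || !isset($parts['scheme'], $parts['host'])) {
        return $url; // not something we can safely normalize
    }
    // Scheme and host are case-insensitive; paths generally are not.
    $canonical = strtolower($parts['scheme']) . '://' . strtolower($parts['host']);
    if (isset($parts['path'])) {
        $canonical .= $parts['path'];
    }
    if (isset($parts['query'])) {
        $canonical .= '?' . $parts['query'];
    }
    return $canonical;
}
?>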
First, prepare the database. Declare uniqueness at the schema level, for example with a constraint along these lines:
UNIQUE (url, resource_locator)
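A minimal sketch of such a table, with hypothetical column names and sizes; note that MySQL parses but does not enforce CHECK() constraints before version 8.0.16:
CREATE TABLE urls (
  url VARCHAR(512) NOT NULL,
  resource_locator VARCHAR(512),
  UNIQUE (url, resource_locator),
  CHECK (url LIKE 'http://%' OR url LIKE 'https://%')
);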
Second, prepare the URL.
Think about trimming trailing characters. For example, these two URLs from amazon.com point to the same product. You probably want to store the second version, not the first.
http://www.amazon.com/Systemantics-Systems-Work-Especially-They/dp/070450331X/ref=sr_1_1?ie=UTF8&qid=1313583998&sr=8-1
http://www.amazon.com/Systemantics-Systems-Work-Especially-They/dp/070450331X
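A rough sketch of that kind of trimming; this only drops the query string, so the /ref=... path segment in the Amazon example would still need a site-specific rule:
<?php
// Hypothetical helper: drop the query string from a URL.
function strip_query($url) {
    $pos = strpos($url, '?');
    return ($pos === false) ? $url : substr($url, 0, $pos);
}
?>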
Decode encoded URLs. (See PHP's urldecode() function. Note carefully its shortcomings, as described in that page's comments.) Personally, I'd rather handle these kinds of transformations in the database than in client code. That would involve revoking permissions on the tables and views, and allowing inserts and updates only through stored procedures; the stored procedures handle all the string operations that put the URL into a canonical form. But keep an eye on performance when you try that. CHECK() constraints (see the sketch above) are your safety net.
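A minimal sketch of that stored-procedure approach, with hypothetical names and only a trivial canonicalization step; real rules (decoding, lowercasing the host, and so on) would go where the comment indicates:
DELIMITER //
CREATE PROCEDURE insert_url(IN p_url VARCHAR(512))
BEGIN
  -- Trivial canonicalization: trim whitespace and a trailing slash.
  -- Real canonicalization rules would go here.
  SET p_url = TRIM(TRAILING '/' FROM TRIM(p_url));
  INSERT IGNORE INTO urls (url) VALUES (p_url);
END //
DELIMITER ;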
Third, if you're inserting only the URL, don't test for its existence first. Instead, try to insert and trap the error that you'll get if the value already exists. Testing and inserting hits the database twice for every new URL. Insert-and-trap just hits the database once. Note carefully that insert-and-trap isn't the same thing as insert-and-ignore-errors. Only one particular error means you violated the unique constraint; other errors mean there are other problems.
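A minimal sketch of insert-and-trap using PDO; the DSN and credentials are placeholders, and 1062 is MySQL's error code for a duplicate key (ER_DUP_ENTRY):
<?php
$pdo = new PDO('mysql:host=localhost;dbname=mydb', 'user', 'password');
$pdo->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);

try {
    $stmt = $pdo->prepare('INSERT INTO urls (url) VALUES (?)');
    $stmt->execute(array('http://www.example.com'));
} catch (PDOException $e) {
    if ($e->errorInfo[1] == 1062) {
        // Duplicate key: the URL is already there. Not a failure.
    } else {
        throw $e; // any other error is a real problem
    }
}
?>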
On the other hand, if you're inserting the URL along with some other data in the same row, you need to decide ahead of time whether you'll handle duplicate URLs by updating the existing row or by replacing it wholesale.
REPLACE eliminates the need to trap duplicate key errors, but it might have unfortunate side effects if there are foreign key references.
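For completeness, here's what that looks like against the sketch table above. Remember that REPLACE is effectively a DELETE followed by an INSERT, which is why it interacts badly with foreign key references (an ON DELETE CASCADE, for instance, would fire) and consumes a new auto-increment ID if the table has one:
REPLACE INTO urls (url)
VALUES ('http://www.example.com');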
If you don't want to have duplicates, you can do the following: add a uniqueness constraint, or use the REPLACE or INSERT ... ON DUPLICATE KEY UPDATE syntax described in other answers.
If multiple users can insert data into the DB, the method suggested by @Jeremy Ruten can lead to an error: after you've performed a check, someone else can insert similar data into the table.
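To make the race concrete, here's a sketch of that check-then-insert pattern, reusing the $pdo connection from the earlier sketch; the comment marks the window where a second client can slip in:
<?php
// NOT safe under concurrency.
$url = 'http://www.example.com';
$stmt = $pdo->prepare('SELECT COUNT(*) FROM urls WHERE url = ?');
$stmt->execute(array($url));
if ($stmt->fetchColumn() == 0) {
    // <-- another client can insert the same URL right here
    $ins = $pdo->prepare('INSERT INTO urls (url) VALUES (?)');
    $ins->execute(array($url));
}
?>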