How do I lock read/write to MySQL tables so that I can select and then insert without other programs reading/writing to the database?

Submitted by 冷眼眸甩不掉的悲伤 on 2019-11-28 21:12:39

You can lock tables using the MySQL LOCK TABLES command like this:

LOCK TABLES tablename WRITE;

# Do other queries here

UNLOCK TABLES;

See:

http://dev.mysql.com/doc/refman/5.5/en/lock-tables.html

You probably don't want to lock the table. If you do that you'll have to worry about trapping errors when the other crawlers try to write to the database - which is what you were thinking when you said "...terribly complex and relies on many other things."

Instead you should probably wrap the group of queries in a MySQL transaction (see http://dev.mysql.com/doc/refman/5.0/en/commit.html) like this:

START TRANSACTION;
SELECT @URL:=url FROM tablewiththeurls WHERE uncrawled=1 ORDER BY somecriterion LIMIT 1;
INSERT INTO loggingtable SET url=@URL;
COMMIT;

Or something close to that.

[edit] I just realized - you could probably do everything you need in a single query and not even have to worry about transactions. Something like this:

INSERT INTO loggingtable (url)
SELECT u.url FROM tablewithurls u LEFT JOIN loggingtable l ON l.url = u.url
WHERE {some criterion used to pick the url to work on} AND l.url IS NULL;

Well, table locks are one way to deal with that; but this makes parallel requests impossible. If the table is InnoDB you could force a row lock instead, using SELECT ... FOR UPDATE within a transaction.

BEGIN;

SELECT ... FROM your_table WHERE domainname = ... FOR UPDATE;

# do whatever you have to do

COMMIT;

Please note that you will need an index on domainname (or whatever column you use in the WHERE clause) for this to lock only the matching rows; without one, InnoDB locks every row it scans. An index there makes sense in general, and I assume you will have one anyway.

Eljakim

I wouldn't use locking, or transactions.

The easiest way to go is to INSERT a record in the logging table if it's not yet present, and then check for that record.

Assume you have tblcrawels (cra_id) that is filled with your crawlers and tblurl (url_id) that is filled with the URLs, and a table tbllogging (log_cra_id, log_url_id) for your logfile.

You would run the following query if crawler 1 wants to start crawling url 2:

INSERT INTO tbllogging (log_cra_id, log_url_id)
SELECT 1, url_id FROM tblurl LEFT JOIN tbllogging ON url_id=log_url_id
WHERE url_id=2 AND log_url_id IS NULL;

The next step is to check whether this record has been inserted.

SELECT * FROM tbllogging WHERE log_url_id=2 AND log_cra_id=1;

If you get a result, crawler 1 can crawl this URL. If you get no result, another crawler has already inserted a row for that URL and is crawling it.
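To make the insert-then-check idea concrete, here is a minimal sketch in Python. It is not the answer's exact method: it uses the standard-library sqlite3 module with an in-memory database as a stand-in for MySQL, and it reads the affected-row count instead of running the follow-up SELECT. The helper name try_claim and the UNIQUE constraint on log_url_id are my own assumptions, the latter added as a safety net against two rows for the same URL.

```python
import sqlite3


def try_claim(conn, crawler_id, url_id):
    """Return True if this crawler inserted the log row for url_id (hypothetical helper)."""
    cur = conn.execute(
        "INSERT INTO tbllogging (log_cra_id, log_url_id) "
        "SELECT ?, url_id FROM tblurl "
        "LEFT JOIN tbllogging ON url_id = log_url_id "
        "WHERE url_id = ? AND log_url_id IS NULL",
        (crawler_id, url_id),
    )
    conn.commit()
    # One affected row means the insert happened; zero means the URL was already logged.
    return cur.rowcount == 1


# In-memory SQLite database standing in for the MySQL tables from the answer.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tblurl (url_id INTEGER PRIMARY KEY)")
# Assumption: UNIQUE on log_url_id, so a URL can be logged at most once.
conn.execute("CREATE TABLE tbllogging (log_cra_id INTEGER, log_url_id INTEGER UNIQUE)")
conn.execute("INSERT INTO tblurl (url_id) VALUES (2)")

first = try_claim(conn, 1, 2)   # crawler 1 asks for url 2
second = try_claim(conn, 3, 2)  # crawler 3 asks for the same url afterwards
```

Here the first call should report that crawler 1 owns the URL and the second should report that it is taken. How MySQL behaves under truly concurrent inserts depends on the storage engine and isolation level, which this single-connection sketch does not exercise.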

T. Brian Jones

I got some inspiration from @Eljakim's answer and started this new thread where I figured out a great trick. It doesn't involve locking anything and is very simple.

INSERT INTO crawlLog (companyId, timeStartCrawling)
SELECT id, now()
FROM companies
WHERE id NOT IN
(
    SELECT companyId
    FROM crawlLog AS crawlLogAlias
)
LIMIT 1;
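A rough way to try this query out is sketched below in Python. It runs against an in-memory SQLite database via the standard-library sqlite3 module rather than MySQL, so now() is swapped for SQLite's CURRENT_TIMESTAMP. The helper name claim_next_company, the UNIQUE constraint on companyId, and the sample company ids are assumptions of mine and not part of the original answer.

```python
import sqlite3


def claim_next_company(conn):
    """Log a crawl start for one not-yet-logged company; return its id, or None if none are left."""
    cur = conn.execute(
        "INSERT INTO crawlLog (companyId, timeStartCrawling) "
        "SELECT id, CURRENT_TIMESTAMP FROM companies "
        "WHERE id NOT IN (SELECT companyId FROM crawlLog AS crawlLogAlias) "
        "LIMIT 1"
    )
    conn.commit()
    if cur.rowcount == 0:
        return None  # every company already has a crawlLog row
    # Look up which company was picked, using the rowid of the row just inserted.
    row = conn.execute(
        "SELECT companyId FROM crawlLog WHERE rowid = ?", (cur.lastrowid,)
    ).fetchone()
    return row[0]


# In-memory SQLite database standing in for the MySQL tables.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE companies (id INTEGER PRIMARY KEY)")
# Assumption: UNIQUE on companyId, so a company cannot be logged twice.
conn.execute("CREATE TABLE crawlLog (companyId INTEGER UNIQUE, timeStartCrawling TEXT)")
conn.executemany("INSERT INTO companies (id) VALUES (?)", [(1,), (2,), (3,)])

# Four attempts against three companies: the last one should find nothing left.
claimed = [claim_next_company(conn) for _ in range(4)]
```

Each call either claims one company that has no log row yet or returns None once all are taken. Because the query has no ORDER BY, which company gets picked on a given call is not guaranteed.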