MySQL: Permanently getting “Waiting for table metadata lock”

Backend · 3 answers · 1336 views

Asked by 盖世英雄少女心 on 2020-12-25 14:51

My MySQL database serves as the storage backend for three webapps. However, I have recently been getting the error "Waiting for table metadata lock" almost permanently. It happens nearly all the time…

3 Answers
  • 2020-12-25 15:32

    The accepted solution is, unfortunately, wrong. It is right as far as it says,

    Kill the connection with lock

    This is indeed (almost surely; see below) what to do. But then it suggests,

    Kill 1398
    

    ...and 1398 is not the connection with the lock. How could it be? 1398 is the connection waiting for the lock. This means it does not yet have the lock, and therefore, killing it avails nothing. The process holding the lock will still hold the lock, and the next thread trying to do something will therefore also stall and enter "Waiting for metadata lock" in due order.

    You have no guarantee that the processes "waiting for metadata lock" (WFML) won't also block, but you can be certain that killing only WFML processes will achieve exactly nothing.

    The real cause is that another process is holding the lock, and more importantly, SHOW FULL PROCESSLIST will not tell you directly which it is.

    It WILL tell you if the process is doing something, yes. Usually it works. Here, the process holding the lock is doing nothing, and hides among other threads also doing nothing.

    In this case the culprit is almost certainly process 1396, which started before process 1398 and is now in Sleep state, and has been for 46 seconds. Since 1396 clearly did all that it needed to do (as proved by the fact that it is now sleeping, and has done so for 46 seconds, as far as MySQL is concerned), no thread before that could have held a lock (or 1396 would also have stalled).

    IMPORTANT: if you connected to MySQL as a limited user, SHOW FULL PROCESSLIST will not show all the processes. So the lock might be held by a process that you don't see.
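
    If you can administer the server, granting the global PROCESS privilege lets an account see every thread. A minimal sketch ('monitor'@'localhost' is just a placeholder for an account that already exists):

    GRANT PROCESS ON *.* TO 'monitor'@'localhost';   -- PROCESS is a global privilege, so it must be ON *.*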

    A better SHOW PROCESSLIST

    SELECT ID, TIME, USER, HOST, DB, COMMAND, STATE, INFO
        FROM INFORMATION_SCHEMA.PROCESSLIST WHERE DB IS NOT NULL
        AND (`INFO` NOT LIKE '%INFORMATION_SCHEMA%' OR INFO IS NULL)
        ORDER BY `DB`, `TIME` DESC
    

    The above can be tuned to show only the processes in SLEEP state, and anyway it will sort them by time descending, so it is easier to find the process that is hanging (it should be the Sleep'ing one immediately before the ones "waiting for metadata lock").
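
    For instance, a sketch of that tuning, restricted to idle connections only (same columns, same ordering):

    SELECT ID, TIME, USER, HOST, DB, COMMAND, STATE, INFO
        FROM INFORMATION_SCHEMA.PROCESSLIST
        WHERE DB IS NOT NULL
          AND COMMAND = 'Sleep'          -- only idle connections
        ORDER BY `DB`, `TIME` DESC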

    Quick and dirty solution, not really recommended

    Kill all processes in "Sleep" state, on the same database, that are older than the oldest thread in "waiting for metadata lock" state. This is what Arnaud Amaury would have done:

    • for each database that has at least one thread in WaitingForMetadataLock:
      • the oldest connection in WFML on that DB turns out to be Z seconds old
      • ALL the "Sleep" threads on that DB and older than Z must go. Start with the freshest ones, just in case.
      • If one older and non-sleeping connection exists on that DB, then maybe that is the one holding the lock, but it is doing something. You may of course kill it, but especially if it is an UPDATE/INSERT/DELETE, you do so at your own peril.

    Ninety-nine times out of one hundred, the thread to kill is the youngest among those in Sleep state that are older than the oldest one waiting for the metadata lock:

    TIME     STATUS
    319      Sleep
    205      Sleep
     19      Sleep                      <--- one of these two "19"
     19      Sleep                      <--- and probably this one(*)
     15      Waiting for metadata lock  <--- oldest WFML
     15      Waiting for metadata lock
     14      Waiting for metadata lock
    

    (*) the TIME ordering actually has sub-second resolution, or so I was told; it just doesn't show it. So while both processes display a TIME value of 19, the lower one in the list ought to be the younger.
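
    If you want to script the recipe above, here is a sketch that only generates the KILL statements rather than running them ('mydb' is a placeholder for the affected database; review the list before killing anything):

    SELECT CONCAT('KILL ', ID, ';') AS kill_stmt, ID, TIME, USER, HOST
        FROM INFORMATION_SCHEMA.PROCESSLIST
        WHERE DB = 'mydb'
          AND COMMAND = 'Sleep'
          AND TIME > (SELECT MAX(TIME)          -- older than the oldest waiter
                        FROM INFORMATION_SCHEMA.PROCESSLIST
                        WHERE DB = 'mydb'
                          AND STATE = 'Waiting for table metadata lock')
        ORDER BY TIME ASC                       -- freshest candidates first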

    More focused fix

    Run SHOW ENGINE INNODB STATUS and look at the "TRANSACTION" section. You will find, among others, something like

    TRANSACTION 1701, ACTIVE 58 sec
    2 lock struct(s), heap size 376, 1 row lock(s), undo log entries 1
    MySQL thread id 1396, OS thread handle 0x7fd06d675700, query id 1138 hostname 1.2.3.4 whatever
    

    Now check with SHOW FULL PROCESSLIST what thread id 1396 is doing with its transaction #1701. Chances are it is in "Sleep" status. So: an active transaction (#1701) holding a lock, which has even made some changes (it has an undo log entry)... but is currently idle. That, and no other, is the thread you need to kill. Killing it will lose those changes.
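
    If you prefer a query over scanning the status text, INFORMATION_SCHEMA.INNODB_TRX gives the same transaction-to-thread mapping (a sketch; the KILL target is whatever id you actually find, 1396 is just the example above):

    SELECT trx_mysql_thread_id, trx_id, trx_started, trx_rows_modified
        FROM INFORMATION_SCHEMA.INNODB_TRX
        ORDER BY trx_started;                -- oldest transactions first

    KILL 1396;                               -- the idle thread holding the lock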

    Remember that doing nothing in MySQL does not mean doing nothing in general. If you get some records from MySQL and build a CSV for FTP upload, during the FTP upload the MySQL connection is idle.

    Actually if the process using MySQL and the MySQL server are on the same machine, that machine runs Linux, and you have root privileges, there's a way to find out which process has the connection that requested the lock. This in turn allows to determine (from CPU usage or, at worst, strace -ff -p pid) whether that process is really doing something or not, to help decide if it's safe to kill.

    Why does this happen?

    I see this happening with webapps that use "persistent" or "pooled" MySQL connections, which nowadays usually save very little time: the webapp instance terminated, but the connection did not, so its lock is still alive... and blocking everyone else.

    Another way to find the culprit, if you have a reasonably recent MySQL (but not too recent, since these tables are deprecated in 5.7 and removed in 8.0), is the following (again, you need privileges on the information schema):

    SELECT * FROM INFORMATION_SCHEMA.INNODB_LOCKS 
         WHERE LOCK_TRX_ID IN 
            (SELECT BLOCKING_TRX_ID FROM INFORMATION_SCHEMA.INNODB_LOCK_WAITS);
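
    On MySQL 5.7/8.0, where those tables are deprecated or gone, a sketch of the metadata-lock equivalent, assuming the sys schema is installed and the mdl instrument is enabled (it is on by default only in 8.0):

    -- Enable metadata-lock instrumentation (needed on 5.7, default on 8.0)
    UPDATE performance_schema.setup_instruments
        SET ENABLED = 'YES', TIMED = 'YES'
        WHERE NAME = 'wait/lock/metadata/sql/mdl';

    -- Who is blocked on which table, and who is blocking them
    SELECT object_schema, object_name,
           waiting_pid, blocking_pid,
           sql_kill_blocking_connection
        FROM sys.schema_table_lock_waits;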
    

    Actual solution

    The problem is usually caused by this architecture:

     webapp (jar, php) --> container or application pool (mysqldb, php_module, fastcgi...) --> web server --> MySQL
    

    When the webapp dies, or the webapp lightweight thread instance dies, the container might not. And it is the container that keeps the connection open, so obviously the connection does not close. Quite predictably, MySQL does not consider the operation complete.

    If the webapp didn't clean after itself (no ROLLBACK or COMMIT for a transaction, no UNLOCK TABLES, etc.), then whatever that webapp started doing is still extant, and might still be blocking everyone else.

    There are then two solutions. The worse one is to lower the idle timeout. But guess what happens if you wait too long between two queries (exactly: "MySQL server has gone away"). You could then use mysql_ping if available (soon to be deprecated; there are workarounds for PDO), or check for that error and reopen the connection (this is the Python way). So, for a small performance fee, it's doable.
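
    For the record, lowering the idle timeout is just a variable change (300 seconds is an arbitrary example; it only applies to connections opened after the change):

    SHOW VARIABLES LIKE 'wait_timeout';
    SET GLOBAL wait_timeout = 300;           -- seconds; does not affect already-open sessions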

    The better, smarter solution is less straightforward to implement. Endeavour to have the script clean up after itself, making sure to catch all exceptions and deal with them properly, or, if possible, skip persistent connections altogether. Let each instance create its own connection, or use a smart pool driver (in PHP PDO, explicitly set PDO::ATTR_PERSISTENT to false). Alternatively (e.g. in PHP) you can have destructors and exception handlers force-clean the connection by committing or rolling back transactions and issuing explicit table unlocks, as sketched below.
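
    Whatever the language, that clean-up handler boils down to issuing something like this on the connection before it is reused (rolling back is the safe default when you don't know what the dying instance was doing):

    ROLLBACK;        -- or COMMIT, if the work is known to be complete
    UNLOCK TABLES;   -- release explicit LOCK TABLES, if any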

  • 2020-12-25 15:33

    If you have the HandlerSocket (HS) plugin and you try to CREATE or ALTER a table that has already been accessed via HS, you will run into a similar problem, and you have to restart the HS plugin as follows to release the table metadata lock:

    UNINSTALL PLUGIN HANDLERSOCKET;
    INSTALL PLUGIN HANDLERSOCKET SONAME 'handlersocket.so';
    
  • 2020-12-25 15:37

    Kill the connection with lock

    Kill 1398
    

    Then check whether you have autocommit set to 0 with:

    select @@autocommit;
    

    If so, you probably forgot to commit a transaction. Then another connection wants to do something with this table, which causes the lock.
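
    A minimal sketch of both ways out, assuming the forgotten transaction is still open on that connection:

    COMMIT;                  -- or ROLLBACK, to finish the open transaction
    SET autocommit = 1;      -- so future statements commit immediately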

    In your case: if you made some query against LogEntries (which exists) and did not commit it, and then try to execute CREATE TABLE IF NOT EXISTS from another connection, the metadata lock happens.

    Edit: for me the bug is somewhere in your application. Check there, or set autocommit to 1 if you are not using transactions in the application.

    PS: also check these posts:

    • not a bug on bugs.mysql
    • replication of lock on MyISAM and INNOdb tables