MongoDB replication timeout

Submitted by 北城余情 on 2019-12-12 04:01:42

Question


I use MongoDB 3.4.3 and have three machines in one replica set; call them server1, server2 and server3. server2 was in a constant rollback state, so we turned it off. server3 is in the recovering state and tries to fetch the oplog from server1, but every attempt ends in an ExceededTimeLimit exception. Here is an extract from the server3 log:

2017-06-26T14:42:14.442+0300 I REPL     [replication-0] could not find member to sync from
2017-06-26T14:42:24.443+0300 I REPL     [rsBackgroundSync] sync source candidate: server1:27017
2017-06-26T14:42:24.444+0300 I ASIO     [NetworkInterfaceASIO-RS-0] Connecting to server1:27017
2017-06-26T14:42:24.455+0300 I ASIO     [NetworkInterfaceASIO-RS-0] Successfully connected to server1:27017
2017-06-26T14:42:54.459+0300 I REPL     [replication-0] Blacklisting server1:27017 due to required optime fetcher error: 'ExceededTimeLimit: Operation timed out, request was RemoteCommand 191739 -- server1:27017 db:local expDate:2017-06-26T14:42:54.459+0300 cmd:{ find: "oplog.rs", oplogReplay: true, filter: { ts: { $gte: Timestamp 1497975676000|310, $lte: Timestamp 1497975676000|310 } } }' for 10s until: 2017-06-26T14:43:04.459+0300. required optime: { ts: Timestamp 1497975676000|310, t: 20 }
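The last log line also shows how far behind server3 is. A minimal sketch of that arithmetic follows, assuming that MongoDB 3.4 prints a Timestamp in the log as milliseconds since the epoch followed by `|` and an increment counter. The function names `oplogLagSeconds` and `formatLag` are made up for this illustration.

```javascript
// Assumption: MongoDB 3.4 prints a Timestamp in its log as
// "Timestamp <milliseconds since epoch>|<increment>".
function oplogLagSeconds(logTime, logText) {
  const match = /Timestamp (\d+)\|(\d+)/.exec(logText);
  if (match === null) {
    throw new Error("no Timestamp found in: " + logText);
  }
  const optimeMs = Number(match[1]);
  // The log writes the UTC offset as "+0300"; insert a colon so that
  // Date.parse sees the ISO 8601 form "+03:00".
  const isoTime = logTime.replace(/([+-]\d{2})(\d{2})$/, "$1:$2");
  const logMs = Date.parse(isoTime);
  return Math.floor((logMs - optimeMs) / 1000);
}

// Render a number of seconds as days, hours, minutes and seconds.
function formatLag(totalSeconds) {
  const days = Math.floor(totalSeconds / 86400);
  const hours = Math.floor((totalSeconds % 86400) / 3600);
  const minutes = Math.floor((totalSeconds % 3600) / 60);
  const seconds = totalSeconds % 60;
  return `${days}d ${hours}h ${minutes}m ${seconds}s`;
}

// Values copied from the "Blacklisting server1:27017" log line.
const lag = oplogLagSeconds(
  "2017-06-26T14:42:54.459+0300",
  "required optime: { ts: Timestamp 1497975676000|310, t: 20 }"
);
console.log(formatLag(lag)); // prints "5d 19h 21m 38s"
```

Under that assumption, the optime server3 is asking for is almost six days older than the log entry itself.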

These attempts to retrieve the oplog repeat indefinitely. According to db.currentOp(), there are a lot of long-running queries on server1 (the primary of the replica set) trying to read the oplog. These queries degrade the performance of server1, so my database runs very slowly.
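One way to pick those queries out of the db.currentOp() result is sketched below. It assumes the usual shape of that result: an `inprog` array whose entries carry `opid`, `op`, `ns` and `secs_running`. The helper name `longRunningOplogOps`, the 20-second threshold and the sample entries are all made up for illustration; they are not output from the real server.

```javascript
// Hypothetical helper: keep only operations on the oplog namespace that have
// been running for at least minSeconds.
function longRunningOplogOps(inprog, minSeconds) {
  return inprog.filter(
    (op) => op.ns === "local.oplog.rs" && (op.secs_running || 0) >= minSeconds
  );
}

// Made-up sample shaped like db.currentOp().inprog; not real output.
const sampleInprog = [
  { opid: 101, op: "query", ns: "local.oplog.rs", secs_running: 29 },
  { opid: 102, op: "query", ns: "mydb.orders", secs_running: 45 },
  { opid: 103, op: "query", ns: "local.oplog.rs", secs_running: 3 },
  { opid: 104, op: "getmore", ns: "local.oplog.rs", secs_running: 120 },
];

const slow = longRunningOplogOps(sampleInprog, 20);
console.log(slow.map((op) => op.opid)); // prints [ 101, 104 ]
```

In the mongo shell the same function could be given `db.currentOp().inprog` instead of the sample array. Whether terminating such operations with db.killOp() would help in this situation is not something the log establishes.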

server1's oplog is currently 643 GB. I think its size is the reason replication doesn't work. server2 had oplog timeout issues as well, so we turned it off temporarily. This situation has lasted for more than a week. I have more than 5 TB of data on the primary machine. How can I restore the replica set?

Update: Our servers have 64 GB of memory each. They are in fact virtual machines.


Answer 1:


Can you afford downtime? It looks like your machine (server1) doesn't have enough memory. With 5 TB of data and an oplog that big, the amount of memory needed is in the hundreds of GB. I would not try to run that system as a single replica set. I would rather use a sharded cluster of 3-5 shards (9-15 nodes in total, with a replica set of 3 for every shard). A good rule is to always keep node size under 2 TB, and 1 TB is a good starting point if you can achieve that.
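The shard and node counts above follow from simple division. Here is a small sketch of that arithmetic, taking the data size as exactly 5 TB (the question says "more than 5 TB", so real numbers would be somewhat higher). The function name `shardPlan` is made up for this illustration.

```javascript
// Hypothetical helper: how many shards are needed to keep each node at or
// below maxNodeTb, and how many nodes that means in total.
function shardPlan(dataTb, maxNodeTb, membersPerShard) {
  const shards = Math.ceil(dataTb / maxNodeTb);
  return { shards: shards, nodes: shards * membersPerShard };
}

// Upper bound of the rule: at most 2 TB per node.
console.log(shardPlan(5, 2, 3)); // prints { shards: 3, nodes: 9 }
// Suggested starting point: about 1 TB per node.
console.log(shardPlan(5, 1, 3)); // prints { shards: 5, nodes: 15 }
```

The two results correspond to the 3-5 shards and 9-15 nodes mentioned above.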

If you can have downtime, you should shrink your oplog to a more reasonable size. You could start with 50 GB. Steps can be found here.



Source: https://stackoverflow.com/questions/44798577/mongodb-replication-timeout
