This problem is killing the stability of my production servers.
To recap, the basic idea is that my node server(s) sometimes intermittently slow down, sometimes resultin
You are heavily leaking memory, try setting every object to null as soon as you don't need it anymore! Read this.
More information about hunting down memory leaks can be found here.
Give special attention to having multiple references to the same object and check if you have circular references, those are a pain to debug but will help you very much.
Try invoking the garbage collector manually every minute or so (I don't know if you can do this in node.js cause I'm more of a c++ and php coder). From my years of experience working with c++ I can tell you the most likely cause of your application slowing down over time is memory leaks, find them and plug them, you'll be ok!
Also assuming you're not caching and/or processing images, audio or video in memory or anything like that 150M heap is a lot! Those could be hundreds of thousands or even millions of small objects.
You don't have to be running out of memory for your application to slow down... just searching for free memory with that many objects already allocated is a huge job for the memory allocator, it takes a lot of time to allocate each new object and as you leak more and more memory that time only increases.
If you need to get this working now, you can go the NASA redundancy route:
Bring up a second copy of your production servers, and put a proxy in front of them which routes each request to both stacks and returns the first response. I don't recommend this as a perfect long-term solution but it should help significantly reduce issues in production now, and help you gather log data that you could replay to recreate the issues on non-production servers.
Obviously, this is straight-forward for read requests, but more complex for commands which write to the db.
My guess is Mongoose. If you are storing large payloads in Mongo, Mongoose can be pretty slow due to how it builds the Mongoose objects. See https://github.com/LearnBoost/mongoose/issues/950 for more details on the problem. If this is the problem you wouldn't see it in Mongo itself since the query returns quickly, but object instantiation could take 75x the query time.
Try setting up timers around (process.hrtime()
) before and after you the Mongoose objects are being created to see if that might be the problem. If this is the problem, I would switch to using the node Mongo driver directly instead of going through Mongoose.
Is "--nouse-idle-connection" a mistake? do you really mean "--nouse_idle_notification".
I think it's maybe some issues about gc with too many tiny objects. node is single process, so watch the most busy cpu core is much important than the load. when your program is slow, you can execute "gdb node pid" and "bt" to see what node is busy doing.
We have a similar problem with our Node.js server. It didn't scale well for weeks and we had tried almost everything as you had. Our problem was in the implicit backlog value which is set very low for high-concurrent environments.
http://nodejs.org/api/http.html#http_server_listen_port_hostname_backlog_callback
Setting the backlog to a significantly higher value (e.g. 10000) as well as tune networking in our kernel (/etc/sysctl.conf on Linux) as described in manual section helped a lot. From this time forward we don't have any timeouts in our Node.js server.
Many months after I first asked this question, I found the answer.
In a nutshell, the problem was that I was not piping a big asset when transferring it from one server to another. In other words, I was downloading an image from one server, before uploading it to a S3 bucket. Instead of streaming the download into the upload, I downloaded the file into memory, and then uploaded it.
I'm not sure why this did not show up as a memory spike, or elsewhere in my statistics.