I am running a Django app with nginx & uwsgi. Here's how I run uwsgi:
sudo uwsgi -b 25000 --chdir=/www/python/apps/pyapp --module=wsgi:application --env DJA
EDIT 1 Having seen the comment that you have 1 virtual core, I'm adding commentary on all the relevant points.
EDIT 2 More information from Maverick, so I'm eliminating ideas ruled out and developing the confirmed issues.
EDIT 3 Filled out more details about uwsgi request queue and scaling options. Improved grammar.
EDIT 4 Updates from Maverick and minor improvements
Comments are too small, so here are some thoughts:
Running at a concurrency of 100, the tests take 41 seconds at 24 requests/sec. You have 16 processes (threads), so each request takes roughly 700 ms to process (16 / 24 ≈ 0.67 s). Given your type of transaction, that is a long time per request. This may be because:
There is considerable latency around the DB: either lots of small queries whose round-trips add up, or one or two slow statements.
The split between system and user CPU cost is unusually weighted towards system, although total CPU use is low. This implies that most of the work done on behalf of Django is kernel related, such as networking or disk. In this scenario, it might be network costs (e.g. receiving and sending HTTP requests, and sending queries to and receiving results from the DB). Sometimes this will be high because of paging. If there's no paging going on, then you probably don't have to worry about this at all.
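One quick way to see that user/system split per core (and any iowait) is mpstat from the sysstat package - a sketch assuming a Linux box, with the interval and sample count picked arbitrarily:

    # Per-core CPU breakdown, one sample per second, five samples (sysstat package)
    mpstat -P ALL 1 5
    # A high %sys with low %usr points at kernel work (networking, disk);
    # a high %iowait points at waiting on the DB or disk.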
So, in summary:
There is large per-request latency, so either:
Mismatch between processes (16) and CPU cores (1): If the load average is >3, then it's probably too many processes. Try again with a smaller number of processes
If the load average is <3, the latency may be in the DB, so profile the DB to see whether lots of small requests are additively causing the latency or whether one or two SQL statements are the problem
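One way to do that DB profiling is to turn on MySQL's slow query log with a very low threshold for the duration of a test run and watch the process list - the 50 ms threshold below is just illustrative:

    # Log every query slower than 50 ms while the benchmark runs
    mysql -u root -p -e "SET GLOBAL slow_query_log = 'ON'; SET GLOBAL long_query_time = 0.05;"
    # See what the DB is actually doing under load
    mysql -u root -p -e "SHOW FULL PROCESSLIST;"

Lots of sub-millisecond entries point at a chatty request pattern; a few entries in the hundreds of ms point at one or two problem statements.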
Developing ideas
Load averages of >10 on a single-core machine are really nasty and (as you observe) lead to a lot of task switching and generally slow behaviour. I personally don't remember seeing a machine with a load average of 19 (which is what you have for 16 processes) - congratulations for getting it so high ;)
The DB performance is great, so I'd give that an all-clear right now.
Paging: To answer your question on how to see paging - you can detect OS paging in several ways. For example, in top, the header has page-ins and page-outs (see the VM: line):
    Processes: 170 total, 3 running, 4 stuck, 163 sleeping, 927 threads    15:06:31
    Load Avg: 0.90, 1.19, 1.94
    CPU usage: 1.37% user, 2.97% sys, 95.65% idle
    SharedLibs: 144M resident, 0B data, 24M linkedit.
    MemRegions: 31726 total, 2541M resident, 120M private, 817M shared.
    PhysMem: 1420M wired, 3548M active, 1703M inactive, 6671M used, 1514M free.
    VM: 392G vsize, 1286M framework vsize, 1534241(0) pageins, 0(0) pageouts.
    Networks: packets: 789684/288M in, 912863/482M out.
    Disks: 739807/15G read, 996745/24G written.
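The header above is from a macOS top; on a Linux box, one simple way to watch paging is vmstat, whose si/so columns show memory swapped in and out per second:

    # One-second samples; si/so should stay at 0 on a healthy box
    vmstat 1
    # Consistently non-zero si/so under load means memory pressure -
    # reduce the number of processes or add RAM.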
Number of processes: In your current configuration, the number of processes is way too high. Scale the number of processes back to 2. We might bring this value up later, depending on how much further load we can shift off this server.
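As a sketch, keeping your existing invocation and just pinning the worker count (re-add whatever --env settings you already pass):

    # Two workers is plenty for a single core, especially while the DB shares the box
    sudo uwsgi -b 25000 --chdir=/www/python/apps/pyapp --module=wsgi:application --processes 2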
Location of Apache Benchmark: The load average of 1.85 for one process suggests to me that you are running the load generator on the same machine as uwsgi - is that correct?
If so, you really need to run this from another machine otherwise the test runs are not representative of actual load - you're taking memory and CPU from the web processes for use in the load generator. In addition, the load generator's 100 or 500 threads will generally stress your server in a way that does not happen in real life. Indeed this might be the reason the whole test fails.
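If you do move the load generator, the ab invocation itself doesn't change; from a second box it would be something like this (the hostname and path are placeholders):

    # Run ab from a separate machine so it doesn't steal CPU and RAM from uwsgi
    ab -n 1000 -c 100 http://your-app-server/some/url/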
Location of the DB: The load average for one process also suggests that you are running the DB on the same machine as the web processes - is this correct?
If I'm correct about the DB, then the first and best way to start scaling is to move the DB to another machine. We do this for a couple of reasons:
A DB server needs a different hardware profile from a processing node: the DB is mostly bound by disk and RAM, while the web processes are mostly bound by CPU.
CPU affinity: It's better for the CPU to have a load average of 1.0 and for processes to have affinity to a single core. Doing so maximizes the use of the CPU cache and minimizes task-switching overheads. By separating the DB and processing nodes, you are enforcing this affinity in hardware.
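Should you later move to a box with several cores, uWSGI can also enforce this affinity in software; a sketch assuming a 4-core machine, using its --cpu-affinity option (cores per worker):

    # One worker per core, each pinned to a single core
    uwsgi --chdir=/www/python/apps/pyapp --module=wsgi:application --processes 4 --cpu-affinity 1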
500 concurrency with exceptions
The request queue in the diagram above is at most 100 - if uwsgi receives a request when the queue is full, the request is rejected with a 5xx error. I think this is what was happening in your 500-concurrency load test: the queue filled up with the first 100 or so threads, then the other 400 threads issued the remaining 900 requests and received immediate 5xx errors.
To handle 500 requests per second you need to ensure two things:
The request queue is deep enough to absorb the burst; set it with the --listen argument to uwsgi (see the example below).
The workers can actually drain the queue at that rate, i.e. you have enough processes and CPU for the sustained throughput.
I imagine that uwsgi has the queue set to a smaller number by default to better handle DDoS attacks; if placed under huge load, most requests fail immediately with almost no processing, allowing the box as a whole to remain responsive to the administrators.
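If you do raise the queue, remember that the kernel caps the backlog with net.core.somaxconn (128 by default on many distros), so both knobs need to move together; the 4096 below is just an illustrative value:

    # Let the kernel accept a deeper backlog than the default
    sudo sysctl -w net.core.somaxconn=4096
    # Give uwsgi a matching listen queue
    sudo uwsgi -b 25000 --chdir=/www/python/apps/pyapp --module=wsgi:application --processes 2 --listen 4096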
General advice for scaling a system
Your most important consideration is probably to maximize throughput. Another possible goal is to minimize response time, but I won't discuss that here. In maximizing throughput, you are trying to maximize the system, not individual components; some local decreases might improve overall system throughput (for example, a change that happens to add latency in the web tier in order to improve performance of the DB is a net gain).
Onto specifics:
Move the DB onto a separate machine and profile it during the tests using top and your favorite MySQL monitoring tool; you need to be able to profile the DB independently of the web tier. Moving the DB to a separate machine will introduce some additional latency (several ms) per request, so expect to slightly increase the number of processes at the web tier to keep the same throughput.
Ensure the uwsgi request queue is large enough to handle a burst of traffic, using the --listen argument. It should be several times the maximum steady-state requests-per-second your system can handle.
On the web/app tier: Balance the number of processes with the number of CPU cores and the inherent latency in the process. Too many processes slows performance; too few means you'll never fully utilize the system resources. There is no fixed balancing point, as every application and usage pattern is different, so benchmark and adjust. As a guide, use each process's latency: if a task spends none of its time waiting, one process per core is enough; if it spends half its time waiting on the DB or network, roughly two processes per core; in general, about 1 / (fraction of time on CPU) processes per core (there's a small worked example after these points).
Check top during the test to ensure that you are above 90% CPU utilisation (for every core) and that the load average is a little above 1.0. If the load average is higher, scale back the processes. If all goes well, at some point you won't be able to achieve this target, and the DB might then become the bottleneck.
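As the worked example promised above (the 300 ms CPU figure is made up purely for illustration): if a 700 ms request spends 300 ms on CPU and the rest waiting on the DB or network, each core can usefully run about 700 / 300 ≈ 2.3 processes:

    # processes per core ~= total request time / CPU time of the request
    awk 'BEGIN { total_ms = 700; cpu_ms = 300; printf "processes per core: %.1f\n", total_ms / cpu_ms }'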
Adding more workers and getting fewer r/s means that your request is pure CPU and there are no IO waits that another worker could use to serve another request.
If you want to scale you will need to use another server with more (or faster) CPUs.
However, this is a synthetic test; the number of r/s you get is an upper bound for the exact request that you are testing. Once in production there are many more variables that can affect performance.
Please run benchmarks for much longer than a minute (5-10 minutes at least); you really won't get much information from such a short test. And use uWSGI's carbon plugin to push stats to a carbon/graphite server (you will need to have one); you will have much more information for debugging.
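Assuming your uWSGI build includes the carbon plugin and you already have a carbon/graphite server listening on the usual port 2003 (the hostname is a placeholder), pushing stats is a single extra option:

    # Push uWSGI's internal metrics to carbon/graphite during the test
    uwsgi --chdir=/www/python/apps/pyapp --module=wsgi:application --processes 16 --carbon graphite.example.com:2003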
When you send 500 concurrent requests to your app and it can't handle such load, the listen queue on each backend fills up pretty quickly (it's 100 requests by default). You might want to increase that, but if the workers can't process requests fast enough and the listen queue (also known as the backlog) is full, the Linux network stack will drop requests and you will start getting errors.
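You can actually watch this happening on Linux: ss shows the configured backlog per listening socket, and the kernel keeps a counter of how often a listen queue overflowed (both commands assume standard iproute2/net-tools):

    # For listening sockets, Send-Q is the configured backlog and Recv-Q the current queue depth
    ss -lnt
    # Cumulative counters for connections dropped because the listen queue was full
    netstat -s | grep -i listen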
Your first benchmark shows that you can process a single request in ~42 ms, so a single worker could process at most 1000 ms / 42 ms = ~23 requests per second (assuming the DB and other parts of the app stack don't slow down as concurrency goes up). So to process 500 concurrent requests you would need at least 500 / 23 = ~21 workers (but in reality I would say at least 40); you have only 16, so no wonder it breaks under such load.
EDIT: I've mixed up rate with concurrency - at least 21 workers will allow you to process 500 requests per second, not 500 concurrent requests. If you really want to handle 500 concurrent requests then you simply need 500 workers, unless you run your app in async mode; check the "Gevent" section in the uWSGI docs.
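A minimal sketch of that async route, assuming a uWSGI build with the gevent plugin and an app whose blocking calls (DB driver etc.) are gevent-friendly; the 100 greenlets per worker is an arbitrary example:

    # Each worker handles up to 100 concurrent requests cooperatively via gevent
    uwsgi --chdir=/www/python/apps/pyapp --module=wsgi:application --processes 2 --gevent 100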
PS. uWSGI comes with a great load balancer with backend autoconfiguration (read the docs under "Subscription Server" and "FastRouter"). You can set it up in a way that allows you to hot-plug new backends as needed: you just start workers on a new node and they will subscribe to the FastRouter and start getting requests. This is the best way to scale horizontally, and with backends on AWS you can automate this so that new backends are started quickly when needed.
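A minimal sketch of that setup with the option names from the uWSGI docs - all addresses, ports and the subscription key are placeholders:

    # On the router box (point nginx's uwsgi_pass at 127.0.0.1:3017);
    # backends announce themselves on port 7000
    uwsgi --fastrouter 127.0.0.1:3017 --fastrouter-subscription-server 192.168.0.1:7000
    # On each new backend node: start workers and subscribe to the router under a key
    uwsgi --chdir=/www/python/apps/pyapp --module=wsgi:application --processes 4 --socket :3031 --subscribe-to 192.168.0.1:7000:myapp.example.com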