I am running a Django app with nginx & uwsgi. Here's how I run uwsgi:
sudo uwsgi -b 25000 --chdir=/www/python/apps/pyapp --module=wsgi:application --env DJA
EDIT 1 Having seen the comment that you have 1 virtual core, I'm adding commentary on all relevant points
EDIT 2 More information from Maverick, so I'm eliminating the ideas that have been ruled out and developing the confirmed issues.
EDIT 3 Filled out more details about uwsgi request queue and scaling options. Improved grammar.
EDIT 4 Updates from Maverick and minor improvements
Comments are too small, so here are some thoughts:
Running at a concurrency of 100, the tests take 41 seconds at 24 requests/sec. You have 16 processes (threads), so each request takes roughly 700 ms to process. Given your type of transaction, that is a long time per request. This may be because:
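The arithmetic behind that figure, as a quick sanity check (all numbers taken from the test results above):

```shell
# 24 requests/sec served by 16 workers means each worker completes
# 24/16 = 1.5 requests/sec, i.e. 16*1000/24 ms spent per request.
procs=16
rps=24
ms_per_request=$(( procs * 1000 / rps ))
echo "${ms_per_request} ms"   # prints "666 ms" (integer division), the "about 700 ms" above
```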
There is considerable latency around the DB: either many small queries additively causing the latency, or one or two slow SQL statements.
The split between system and user CPU cost is unusually high in system, although the total CPU usage is low. This implies that most of the work in Django is kernel-related, such as networking or disk. In this scenario, it might be network costs (e.g. receiving and sending HTTP requests, and receiving and sending requests to the DB). Sometimes this will be high because of paging. If there's no paging going on, then you probably don't have to worry about this at all.
So, in summary:
There is a large latency, so either:
Mismatch between processes (16) and CPU cores (1): if the load average is >3, then it's probably too many processes. Try again with a smaller number of processes.
If the load average is <3, the latency may be in the DB, so profile the DB to see whether loads of small requests are additively causing the latency, or whether one or two SQL statements are the problem.
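One way to do that profiling, sketched here with MySQL's slow query log (the 100 ms threshold and the log path are illustrative choices, not taken from the question; requires a privileged MySQL account):

```shell
# Illustrative only: turn on the slow query log at runtime so any statement
# taking longer than 100 ms is recorded. slow_query_log, long_query_time and
# slow_query_log_file are standard MySQL system variables.
mysql -e "SET GLOBAL slow_query_log = 'ON';"
mysql -e "SET GLOBAL long_query_time = 0.1;"
mysql -e "SET GLOBAL slow_query_log_file = '/var/log/mysql/slow.log';"
```

If the log fills up with many fast queries, the latency is additive; if one or two statements dominate, those are the problem.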
Developing ideas
A load average of >10 on a single-core machine is really nasty and (as you observe) leads to a lot of task switching and generally slow behaviour. I personally don't remember seeing a machine with a load average of 19 (which you have for 16 processes) - congratulations for getting it so high ;)
The DB performance is great, so I'd give that an all-clear right now.
Paging: To answer your question on how to see paging: you can detect OS paging in several ways. For example, in top, the header has page-ins and page-outs (see the VM: line):
Processes: 170 total, 3 running, 4 stuck, 163 sleeping, 927 threads   15:06:31
Load Avg: 0.90, 1.19, 1.94  CPU usage: 1.37% user, 2.97% sys, 95.65% idle
SharedLibs: 144M resident, 0B data, 24M linkedit.
MemRegions: 31726 total, 2541M resident, 120M private, 817M shared.
PhysMem: 1420M wired, 3548M active, 1703M inactive, 6671M used, 1514M free.
VM: 392G vsize, 1286M framework vsize, 1534241(0) pageins, 0(0) pageouts.
Networks: packets: 789684/288M in, 912863/482M out.
Disks: 739807/15G read, 996745/24G written.
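The header above is from macOS top; on Linux the equivalent counters live in /proc/vmstat (pswpin/pswpout are cumulative counts of pages swapped in and out):

```shell
# Sample the swap counters twice, a second apart; if the numbers grow
# between samples, the machine is actively paging.
grep -E '^(pswpin|pswpout) ' /proc/vmstat
sleep 1
grep -E '^(pswpin|pswpout) ' /proc/vmstat
```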
Number of processes: In your current configuration, the number of processes is way too high. Scale the number of processes back to 2. We might bring this value up later, depending on how much further load we can shift off this server.
Location of Apache Benchmark: The load average of 1.85 for one process suggests to me that you are running the load generator on the same machine as uwsgi - is that correct?
If so, you really need to run this from another machine otherwise the test runs are not representative of actual load - you're taking memory and CPU from the web processes for use in the load generator. In addition, the load generator's 100 or 500 threads will generally stress your server in a way that does not happen in real life. Indeed this might be the reason the whole test fails.
Location of the DB: The load average for one process also suggests that you are running the DB on the same machine as the web processes - is this correct?
If I'm correct about the DB, then the first and best way to start scaling is to move the DB to another machine. We do this for a couple of reasons:
A DB server needs a different hardware profile from a processing node: a DB is typically disk- and memory-bound, while a processing node is CPU-bound.
CPU affinity: It's better for the CPU to have a load average of 1.0 and for processes to have affinity to a single core. Doing so maximizes the use of the CPU cache and minimizes task-switching overheads. By separating the DB and processing nodes, you are enforcing this affinity in hardware.
500 concurrency with exceptions
The request queue in the diagram above holds at most 100 requests - if uwsgi receives a request when the queue is full, the request is rejected with a 5xx error. I think this was happening in your 500-concurrency load test - basically the queue filled up with the first 100 or so threads, then the other 400 threads issued the remaining 900 requests and received immediate 5xx errors.
To handle 500 requests per second you need to ensure two things: the request queue is large enough to absorb the burst (set it with the --listen argument to uwsgi), and the system can actually sustain that throughput.
I imagine that uwsgi has the queue set to a smaller number by default to better handle DDoS attacks; if placed under huge load, most requests fail immediately with almost no processing, allowing the box as a whole to remain responsive to the administrators.
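Putting the two settings together, the invocation from the question might become something like this (only --processes and --listen are changed; the backlog value of 1024 is an illustrative choice, and the --env value that was cut off in the question stays elided):

```shell
# Sketch only: 2 workers to match the single core, and a listen backlog of
# 1024 to absorb bursts. Note that --listen values above net.core.somaxconn
# also require raising that sysctl on Linux.
sudo uwsgi -b 25000 --chdir=/www/python/apps/pyapp \
    --module=wsgi:application \
    --processes=2 \
    --listen=1024
```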
General advice for scaling a system
Your most important consideration is probably to maximize throughput. Another possible goal is to minimize response time, but I won't discuss that here. In maximizing throughput, you are trying to optimize the system, not individual components; some local decreases may improve overall system throughput (for example, a change that happens to add latency in the web tier in order to improve performance of the DB is a net gain).
Onto specifics:
Use top and your favorite MySQL monitoring tool. You need to be able to profile the DB. Moving the DB to a separate machine will introduce some additional latency (several ms) per request, so expect to slightly increase the number of processes at the web tier to keep the same throughput.
Make sure the uwsgi request queue is large enough to handle a burst of traffic, using the --listen argument. It should be several times the maximum steady-state requests-per-second your system can handle.
On the web/app tier: balance the number of processes with the number of CPU cores and the inherent latency in the process. Too many processes slows performance; too few means you'll never fully utilize the system resources. There is no fixed balancing point, as every application and usage pattern is different, so benchmark and adjust. As a guide, use the processes' latency: if each request spends half its time waiting (on the DB or the network), you need roughly two processes per core to keep the CPU busy.
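A rough way to turn that guide into a number (the 50% wait figure is an assumption for illustration, not measured from the question):

```shell
# If each request spends wait_pct% of its time blocked (DB, network) and the
# rest on-CPU, you need about cores * 100/(100 - wait_pct) processes to keep
# every core busy.
cores=1
wait_pct=50
procs=$(( cores * 100 / (100 - wait_pct) ))
echo "$procs processes"   # 1 * 100/50 = 2 processes for this box
```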
Check top during the test to ensure that you are above 90% CPU utilisation (for every core) and that the load average is a little above 1.0. If the load average is higher, scale back the processes. If all goes well, at some point you won't be able to achieve this target, and the DB might now be the bottleneck.
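On Linux, an easy way to watch that load-average target during a run is to read the kernel's own counters rather than parsing top's screen output:

```shell
# /proc/loadavg: the 1-, 5- and 15-minute load averages are the first three
# fields; aim for the 1-minute value to sit a little above 1.0 per core.
cut -d' ' -f1-3 /proc/loadavg
```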