I'm running performance tests against ATS and it's behaving a bit weird when using multiple virtual machines against the same table / storage account.
The entire pipeline
I suspect this may have to do with TCP Nagle. See this MSDN article and this blog post.
In essence, TCP Nagle is a protocol-level optimization that batches up small requests. Since you are sending lots of small requests, this is likely to hurt your performance.
You can disable TCP Nagle by executing this code when starting your application:
ServicePointManager.UseNagleAlgorithm = false;
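For completeness, here is a rough sketch of where that call would typically go. It has to run before the first request to the storage endpoint is made; the placeholder connection string and the idea of scoping the change to just the table endpoint via FindServicePoint are illustrative assumptions on my part, not something from your test code:

    using System.Net;
    using Microsoft.WindowsAzure.Storage;   // 2.0 library; namespaces differ slightly for 1.7

    // Must run before the first request, otherwise the already-created
    // ServicePoint keeps Nagle enabled.
    ServicePointManager.UseNagleAlgorithm = false;

    // Optional: scope the change to the table endpoint only instead of process-wide.
    // "UseDevelopmentStorage=true" is just a placeholder connection string.
    CloudStorageAccount account = CloudStorageAccount.Parse("UseDevelopmentStorage=true");
    ServicePoint tableServicePoint = ServicePointManager.FindServicePoint(account.TableEndpoint);
    tableServicePoint.UseNagleAlgorithm = false;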
Are the compute instances and storage account in the same affinity group? Affinity groups ensure that network proximity between the services is optimal and should result in lower latency at the network level.
You can find affinity group configuration under the network tab.
I would tend to believe that the published maximum throughput is for an optimized load. For example, I bet you can achieve higher performance using batch requests than with the individual requests you are doing now. And of course, if you use GUIDs for your PK, you can't batch in your current test, since all entities in a batch must share the same partition key.
So what if you changed your test to batch insert entities in groups of 100 (the maximum per batch), still using GUIDs, but such that each group of 100 entities shares the same PK?
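If it helps, here is a minimal sketch of that kind of batched insert, assuming the 2.0 storage client's TableBatchOperation; the table name, the placeholder connection string, and the use of DynamicTableEntity are mine, not taken from your test:

    using System;
    using Microsoft.WindowsAzure.Storage;
    using Microsoft.WindowsAzure.Storage.Table;

    CloudStorageAccount account = CloudStorageAccount.Parse("<your connection string>");
    CloudTable table = account.CreateCloudTableClient().GetTableReference("perftest");
    table.CreateIfNotExists();

    // One batch = one partition key; the row keys can still be GUIDs.
    string partitionKey = Guid.NewGuid().ToString();
    TableBatchOperation batch = new TableBatchOperation();
    for (int i = 0; i < 100; i++)   // 100 entities is the maximum per batch
    {
        batch.Insert(new DynamicTableEntity(partitionKey, Guid.NewGuid().ToString()));
    }
    table.ExecuteBatch(batch);      // one round trip for all 100 inserts

Each successful batch then costs a single round trip instead of 100, which is where most of the gain comes from.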
A few comments:
You mention that you are using unique PK/RK to get ultimate distribution, but you have to keep in mind that PK balancing is not immediate. When you first create a table, the entire table is served by one partition server. So even though your inserts span several different PKs, they will all still go to one partition server and be bottlenecked by the scalability target for a single partition server. The partition master will only start splitting your partitions among multiple partition servers after it has identified hot partition servers. In your <2 minute test you will not see the benefit of multiple partition servers or PKs. The throughput in the article is targeted towards a well-distributed PK scheme with frequently accessed data, causing the data to be divided amongst multiple partition servers.
The size of your VM is not the issue as you are not blocked on CPU, Memory, or Bandwidth. You can achieve full storage performance from a small VM size.
Check out http://research.microsoft.com/en-us/downloads/5c8189b9-53aa-4d6a-a086-013d927e15a7/default.aspx. I just did a quick test using that tool from a WebRole VM in the same datacenter as my storage account, and from a single instance of the tool on a single VM I achieved ~2800 items per second upload and ~7300 items per second download. This is using 1024-byte entities, 10 threads, and a batch size of 100. I don't know how efficient this tool is or whether it disables Nagle's algorithm, as I was unable to get great results (~1000/second) using a batch size of 1, but at least with the 100 batch size it shows that you can achieve a high items/second rate. This was done in US West.
Are you using Storage client library 1.7 (Microsoft.WindowsAzure.StorageClient.dll) or 2.0 (Microsoft.WindowsAzure.Storage.dll)? The 2.0 library has some performance improvements and should yield better results.
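If you are not sure which one your test project actually loads, something along these lines should tell you at runtime (the exact version strings in the comments are approximate):

    using System;
    using Microsoft.WindowsAzure.Storage;   // with 1.7 the namespace is Microsoft.WindowsAzure

    // Prints the storage client assembly that is actually loaded, e.g.
    //   "Microsoft.WindowsAzure.Storage, Version=2.0...."       -> 2.0 library
    //   "Microsoft.WindowsAzure.StorageClient, Version=1.7...." -> 1.7 library
    Console.WriteLine(typeof(CloudStorageAccount).Assembly.FullName);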