This for small payloads.
I am looking to achieve 1,000,000,000 per 100ms.
The standard BinaryFormatter is very slow. The DataContractSerializer is slow than Bina
Proto-Buff is really quick but has got limitatins. => http://code.google.com/p/protobuf-net/wiki/Performance
In my experience, Marc's Protocol Buffers implementation is very good. I haven't used Jon's. However, you should be trying to use techniques to minimise the data and not serialise the whole lot.
I would have a look at the following.
If the messages are small you should look at what entropy you have. You may have fields that can be partially or completely be de-duplicated. If the communication is between two parties only you may get benefits from building a dictionary both ends.
You are using TCP which has an overhead enough without a payload on top. You should minimise this by batching your messages in to larger bundles and/or look at UDP instead. Batching itself when combined with #1 may get you closer to your requirement when you average your total communication out.
Is the full data width of double required or is it for convenience? If the extra bits are not used this will be a chance for optimisation when converting to a binary stream.
Generally generic serialisation is great when you have multiple messages you have to handle over a single interface or you don't know the full implementation details. In this case it would probably be better to build your own serialisation methods to convert a single message structure directly to byte arrays. Since you know the full implementation both sides direct conversion won't be a problem. It would also ensure that you can inline the code and prevent box/unboxing as much as possible.
If you don't want to take the time to implement a comprehensive explicit serialization/de-serialization mechanism, try this: http://james.newtonking.com/json/help/html/JsonNetVsDotNetSerializers.htm ...
In my usage with large objects (1GB+ when serialized to disk) I find that the file generated by the NewtonSoft library is 4.5 times smaller and takes 6 times fewer seconds to process than when using the BinaryFormatter.
I'd have expected Protobuf-net to be faster even for small objects... but you may want to try my Protocol Buffer port as well. I haven't used Marc's port for a while - mine was faster when I last benchmarked, but I'm aware that he's gone through a complete rewrite since then :)
I doubt that you'll achieve serializing a billion items in 100ms whatever you do though... I think that's simply an unreasonable expectation, especially if this is writing to disk. (Obviously if you're simply overwriting the same bit of memory repeatedly you'll get a lot better performance than serializing to disk, but I doubt that's really what you're trying to do.)
If you can give us more context, we may be able to help more. Are you able to spread the load out over multiple machines, for example? (Multiple cores serializing to the same IO device is unlikely to help, as I wouldn't expect this to be a CPU-bound operation if it's writing to a disk or the network.)
EDIT: Suppose each object is 10 doubles (8 bytes each) with a ulong identifier (4 bytes). That's 84 bytes per object at minimum. So you're trying to serialize 8.4GB in 100ms. I really don't think that's achievable, whatever you use.
I'm running my Protocol Buffers benchmarks now (they give bytes serialized per second) but I highly doubt they'll give you what you want.
You claim small items are slower than BinaryFormatter, but every time I'e measured it I've found the exact opposite, for example:
Performance Tests of Serializations used by WCF Bindings
I conclude, especially with the v2 code, that this may well be your fastest option. If you can post your specific benchmark scenario I'll happily help see what is "up"... If you can't post it here, if you want to email it to me directly (see profile) that would be OK too. I don't know if your stated timings are possible under any scheme, but I'm very sure I can get you a lot faster than whatever you are seeing.
With the v2 code, the CompileInPlace
gives the fastest result - it allows some IL tricks that it can't use if compiling to a physical dll.
This is the FASTEST approach i'm aware of. It does have its drawbacks. Like a rocket, you wouldn't want it on your car, but it has its place. Like you need to setup your structs and have that same struct on both ends of your pipe. The struct needs to be a fixed size, or it gets more complicated then this example.
Here is the perf I get on my machine (i7 920, 12gb ram) Release mode, without debugger attached. It uses 100% cpu during the test, so this test is CPU bound.
Finished in 3421ms, Processed 52.15 GB
For data write rate of 15.25 GB/s
Round trip passed
.. and the code...
class Program
{
unsafe
static void Main(string[] args)
{
int arraySize = 100;
int iterations = 10000000;
ms[] msa = new ms[arraySize];
for (int i = 0; i < arraySize; i++)
{
msa[i].d1 = i + .1d;
msa[i].d2 = i + .2d;
msa[i].d3 = i + .3d;
msa[i].d4 = i + .4d;
msa[i].d5 = i + .5d;
msa[i].d6 = i + .6d;
msa[i].d7 = i + .7d;
}
int sizeOfms = Marshal.SizeOf(typeof(ms));
byte[] bytes = new byte[arraySize * sizeOfms];
TestPerf(arraySize, iterations, msa, sizeOfms, bytes);
// lets round trip it.
var msa2 = new ms[arraySize]; // Array of structs we want to push the bytes into
var handle2 = GCHandle.Alloc(msa2, GCHandleType.Pinned);// get handle to that array
Marshal.Copy(bytes, 0, handle2.AddrOfPinnedObject(), bytes.Length);// do the copy
handle2.Free();// cleanup the handle
// assert that we didnt lose any data.
var passed = true;
for (int i = 0; i < arraySize; i++)
{
if(msa[i].d1 != msa2[i].d1
||msa[i].d1 != msa2[i].d1
||msa[i].d1 != msa2[i].d1
||msa[i].d1 != msa2[i].d1
||msa[i].d1 != msa2[i].d1
||msa[i].d1 != msa2[i].d1
||msa[i].d1 != msa2[i].d1)
{passed = false;
break;
}
}
Console.WriteLine("Round trip {0}",passed?"passed":"failed");
}
unsafe private static void TestPerf(int arraySize, int iterations, ms[] msa, int sizeOfms, byte[] bytes)
{
// start benchmark.
var sw = Stopwatch.StartNew();
// this cheats a little bit and reuses the same buffer
// for each thread, which would not work IRL
var plr = Parallel.For(0, iterations/1000, i => // Just to be nice to the task pool, chunk tasks into 1000s
{
for (int j = 0; j < 1000; j++)
{
// get a handle to the struc[] we want to copy from
var handle = GCHandle.Alloc(msa, GCHandleType.Pinned);
Marshal.Copy(handle.AddrOfPinnedObject(), bytes, 0, bytes.Length);// Copy from it
handle.Free();// clean up the handle
// Here you would want to write to some buffer or something :)
}
});
// Stop benchmark
sw.Stop();
var size = arraySize * sizeOfms * (double)iterations / 1024 / 1024 / 1024d; // convert to GB from Bytes
Console.WriteLine("Finished in {0}ms, Processed {1:N} GB", sw.ElapsedMilliseconds, size);
Console.WriteLine("For data write rate of {0:N} GB/s", size / (sw.ElapsedMilliseconds / 1000d));
}
}
[StructLayout(LayoutKind.Explicit, Size= 56, Pack=1)]
struct ms
{
[FieldOffset(0)]
public double d1;
[FieldOffset(8)]
public double d2;
[FieldOffset(16)]
public double d3;
[FieldOffset(24)]
public double d4;
[FieldOffset(32)]
public double d5;
[FieldOffset(40)]
public double d6;
[FieldOffset(48)]
public double d7;
}