问题
I thought the C/C++ vs C#/Java performance question was well trodden, meaning that I'd read enough evidence to suggest that the VM languages are not necessarily any slower than the "close-to-silicon" languages. Mostly because the JIT compiler can do optimizations that the statically compiled languages cannot.
However, I recently received a CV from a guy who claims that Java-based high frequency trading is always beaten by C++, and that he'd been in a situation where this was the case.
A quick browse on job sites indeed shows that HFT applicants need knowledge of C++, and a look at Wilmott forum shows all the practitioners talking about C++.
Is there any particular reason why this is the case? I would have thought that with modern financial business being somewhat complex, a VM language with type safety, managed memory, and a rich library would be preferred. Productivity is higher that way. Plus, JIT compilers are getting better and better. They can do optimizations as the program is running, so you'd think they's use that run-time info to beat the performance of the unmanaged program.
Perhaps these guys are writing the critical bits in C++ and and calling them from a managed environment (P/Invoke etc)? Is that possible?
Finally, does anyone have experience with the central question in this, which is why in this domain unmanaged code is without doubt preferred over managed?
As far as I can tell, the HFT guys need to react as fast as possible to incoming market data, but this is not necessarily a hard realtime requirement. You're worse off if you're slow, that's for sure, but you don't need to guarantee a certain speed on each response, you just need a fast average.
EDIT
Right, a couple of good answers thus far, but pretty general (well-trodden ground). Let me specify what kind of program HFT guys would be running.
The main criterion is responsiveness. When an order hits the market, you want to be the first to be able to react to it. If you're late, someone else might take it before you, but each firm has a slightly different strategy, so you might be OK if one iteration is a bit slow.
The program runs all day long, with almost no user intervention. Whatever function is handling each new piece of market data is run dozens (even hundreds) of times a second.
These firms generally have no limit as to how expensive the hardware is.
回答1:
Firstly, 1 ms is an eternity in HFT. If you think it is not then it would be good to do a bit more reading about the domain. (It is like being 100 miles away from the exchange.) Throughput and latency are deeply intertwined as the formulae in any elementary queuing theory textbook will tell you. The same formulae will show jitter values (frequently dominated by the standard deviation of CPU queue delay if the network fabric is right and you have not configured quite enough cores).
One of the problems with HFT arbitrage is that once you decide to capture a spread, there are two legs (or more) to realize the profit. If you fail to hit all legs you can be left with a position that you really don't want (and a subsequent loss) - after all you were arbitraging not investing.
You don't want positions unless your strategy is predicting the (VERY near term!!!) future (and this, believe it or not, is done VERY successfully). If you are 1 ms away from exchange then some significant fraction of your orders won't be executed and what you wanted will be picked off. Most likely the ones that have executed one leg will end up losers or at least not profitable.
Whatever your strategy is for argument's sake let us say it ends up a 55%/45% win/loss ratio. Even a small change in the win/loss ratio can have in big change in profitability.
re: "run dozens (even hundreds)" seems off by orders of magnitude Even looking at 20000 ticks a second seems low, though this might be the average for the entire day for the instrument set that he is looking at.
There is high variability in the rates seen in any given second. I will give an example. In some of my testing I look at 7 OTC stocks (CSCO,GOOG,MSFT,EBAY,AAPL,INTC,DELL) in the middle of the day the per second rates for this stream can range from 0 mps (very very rare) to almost almost 2000 quotes and trades per peak second. (see why I think the 20000 above is low.)
I build infrastructure and measurement software for this domain and the numbers we talk about are 100000's and millions per second. I have C++ producer/consumer infrastructure libraries that can push almost 5000000 (5 million) messages/second between producer and consumer, (32 bit,2.4 GHz cores). These are 64 byte messages with new, construct, enqueue, synchronize, on the producer side and synchronize,dequeue,touch every byte,run virtual destructor,free on the consumer side. Now admittedly that is a simple benchmark with no Socket IO (and socket IO can be ugly) as would be at the end points of the end point pipe stages. It is ALL custom synchronization classes that only synchronize when empty, custom allocators, custom lock free queues and lists, occasional STL(with custom allocators) but more often custom intrusive collections (of which I have a significant library). More than once I have given a vendor in this arena a quadruple (and more) in throughput without increased batching at the socket endpoints.
I have OrderBook and OrderBook::Universe classes that take less than 2us for new, insert, find, partial fill, find, second fill, erase, delete sequence when averaged over 22000 instruments. The benchmark iterates over all 22000 instruments serially between the insert first fill and last fill so there are no cheap caching tricks involved. Operations into the same book are separated by accesses of 22000 different books. These are very much NOT the caching characteristics of real data. Real data is much more localized in time and consecutive trades frequently hit the same book.
All of this work involves careful consideration of the constants and caching characteristics in any of the algorithmic costs of the collections used. (Sometimes it seems that the K's in KO(n) KO(n*log n) etc., etc., etc. are dismissed a bit too glibly)
I work on the Marketdata infrastructure side of things. It is inconceivable to even think of using java or a managed environment for this work. And when you can get this kind of performance with C++ and I think it is quite hard to get million+/mps performance with a managed environment) I can't imagine any of the significant investment banks or hedge funds (for whom a $250000 salary for a top notch C++ programmer is nothing) not going with C++.
Is anybody out there really getting 2000000+/mps performance out of a managed environment? I know a few people in this arena and no one ever bragged about it to me. And I think 2mm in a managed environment would have some bragging rights.
I know of one major player's FIX order decoder doing 12000000 field decodes/sec. (3Ghz CPU) It is C++ and the guy who wrote it almost challenged anybody to come up with something in a managed environment that is even half that speed.
Technologically it is an interesting area with lots of fun performance challenges. Consider the options market when the underlying security changes - there might be say 6 outstanding price points with 3 or 4 different expiration dates. Now for each trade there were probably 10-20 quotes. Those quotes can trigger price changes in the options. So for each trade there might be 100 or 200 changes in options quotes. It is just a ton of data - not a Large Hadron Collider collision-detector-like amount of data but still a bit of a challenge. It is a bit different than dealing with keystrokes.
Even the debate about FPGA's goes on. Many people take the position that a well coded parser running on 3GHZ commodity HW can beat a 500MHz FPGA. But even if a tiny bit slower (not saying they are) FPGA based systems can tend to have tighter delay distributions. (Read "tend" - this is not a blanket statement) Of course if you have a great C++ parser that you push through a Cfront and then push that through the FPGA image generator... But that another debate...
回答2:
A lot of it comes down to a simple difference between fact and theory. People have advanced theories to explain why Java should be (or at least might be) faster than C++. Most of the arguments have little to do with Java or C++ per se, but to dynamic versus static compilation, with Java and C++ really being little more than examples of the two (though, of course, it's possible to compile Java statically, or C++ dynamically). Most of these people have benchmarks to "prove" their claim. When those benchmarks are examined in any detail, it quickly becomes obvious that in quite a few cases, they took rather extreme measures to get the results they wanted (e.g., quite a number enable optimization when compiling the Java, but specifically disabled optimization when compiling the C++).
Compare this to the Computer Language Benchmarks Game, where pretty much anybody can submit an entry, so all of the code tends to be optimized to a reasonable degree (and, in a few cases, even an unreasonable degree). It seems pretty clear that a fair number of people treat this as essentially a competition, with advocates of each language doing their best to "prove" that their preferred language is best. Since anybody can submit an implementation of any problem, a particularly poor submission has little effect on overall results. In this situation, C and C++ emerge as clear leaders.
Worse, if anything these results probably show Java in better light than is entirely accurate. In particular, somebody who uses C or C++ and really cares about performance can (and often will) use Intel's compiler instead of g++. This will typically give at least a 20% improvement in speed compared to g++.
Edit (in response to a couple of points raised by jalf, but really too long to fit reasonably in comments):
pointers being an optimizer writers nightmare. This is really overstating things (quite) a bit. Pointers lead to the possibility of aliasing, which prevents certain optimizations under certain circumstances. That said, inlining prevents the ill effects much of the time (i.e., the compiler can detect whether there's aliasing rather than always generating code under the assumption that there could be). Even when the code does have to assume aliasing, caching minimizes the performance hit from doing so (i.e., data in L1 cache is only minutely slower than data in a register). Preventing aliasing would help performance in C++, but not nearly as much as you might think.
Allocation being a lot faster with a garbage collector. It's certainly true that the default allocator in many C++ implementations is slower than what most (current) garbage collected allocators provide. This is balanced (to at least a degree) by the fact that allocations in C++ tend to be on the stack, which is also fast, whereas in a GC language nearly all allocations are usually on the heap. Worse, in a managed language you usually allocate space for each object individually whereas in C++ you're normally allocating space for all the objects in a scope together.
It's also true that C++ directly supports replacing allocators both globally and on a class-by-class basis, so when/if allocation speed really is a problem it's usually fairly easy to fix.
Ultimately, jalf is right: both of these points undoubtedly do favor "managed" implementations. The degree of that improvement should be kept in perspective though: they're not enough to let dynamically compiled implementations run faster on much code -- not even benchmarks designed from the beginning to favor them as much as possible.
Edit2: I see Jon Harrop has attempted to insert his two (billionths of a) cent's worth. For those who don't know him, Jon's been a notorious troll and spammer for years, and seems to be looking for new ground into which to sow weeds. I'd try to reply to his comment in detail, but (as is typical for him) it consists solely of unqualified, unsupported generalizations containing so little actual content that a meaningful reply is impossible. About all that can be done is to give onlookers fair warning that he's become well known for being dishonest, self-serving, and best ignored.
回答3:
A JIT compiler could theoretically perform a lot of optimizations, yes, but how long are you willing to wait? A C++ app can take hours to compiler because it happens offline, and the user isn't sitting there tapping his fingers and waiting.
A JIT compiler has to finish within a couple of milliseconds. So which do you think can get away with the most complex optimizations?
The garbage collector is a factor too. Not because it is slower than manual memory management per se (I believe its amortized cost is pretty good, definitely comparable to manual memory handling), but it's less predictable. It can introduce a stall at pretty much any point, which might not be acceptable in systems that are required to be extremely responsive.
And of course, the languages lend themselves to different optimizations. C++ allows you to write very tight code, with virtually no memory overhead, and where a lot of high level operations are basically free (say, class construction).
In C# on the other hand, you waste a good chunk of memory. And simply instantiating a class carries a good chunk of overhead, as the base Object
has to be initialized, even if your actual class is empty.
C++ allows the compiler to strip away unused code aggressively. In C#, most of it has to be there so it can be found with reflection.
On the other hand, C# doesn't have pointers, which are an optimizing compiler's nightmare. And memory allocations in a managed language are far cheaper than in C++.
There are advantages either way, so it is naive to expect that you can get a simple "one or the other" answer. Depending on the exact source code, the compiler, the OS, the hardware it's running on, one or the other may be faster. And depending on your needs, raw performance might not be the #1 goal. Perhaps you're more interested in responsiveness, in avoiding unpredictable stalls.
In general, your typical C++ code will perform similarly to equivalent C# code. Sometimes faster, sometimes slower, but probably not a dramatic difference either way.
But again, it depends on the exact circumstances. And it depends on how much time you're willing to spend on optimization. if you're willing to spend as much time as it takes, C++ code can usually achieve better performance than C#. It just takes a lot of work.
And the other reason, of course, is that most companies who use C++ already have a large C++ code base which they don't particularly want to ditch. They need that to keep working, even if they gradually migrate (some) new components to a managed language.
回答4:
These firms generally have no limit as to how expensive the hardware is.
If they also don't care how expensive the sofware is, then I'd think that of course C++ can be faster: for example, the programmer might use custom-allocated or pre-allocated memory; and/or they can run code in the kernel (avoiding ring transitions), or on a real-time O/S, and/or have it closely-coupled to the network protocol stack.
回答5:
There are reasons to use C++ other than performance. There is a HUGE existing library of C and C++ code. Rewriting all of that in alternate languages would not be practical. In order for things like P/Invoke to work correctly, the target code has to be designed to be called from elsewhere. If nothing else you'd have to write some sort of wrapper around things exposing a completely C API because you can't P/Invoke to C++ classes.
Finally, P/Invoke is a very expensive operation.
JIT compilers are getting better and better. They can do optimizations as the program is running
Yes, they can do this. But you forget that any C++ compiler is able to do the same optimizations. Sure, compile time will be worse, but the very fact that such optimizations have to be done at runtime is overhead. There are cases where managed languages can beat C++ at certain tasks, but this is usually because of their memory models and not the result of runtime optimizations. Strictly speaking, you could of course have such a memory model in C++, EDIT: such as C#'s handling of strings, /EDIT but few C++ programmers spend as much time optimizing their code as JIT guys do.
There are some performance issues that are an inherit downside to managed languages -- namely disk I/O. It's a one time cost, but depending on the application it can be significant. Even with the best optimizers, you still need to load 30MB+ of JIT compiler from disk when the program starts; whereas it's rare for a C++ binary to approach that size.
回答6:
The simple fact is that C++ is designed for speed. C#/Java aren't.
Take the innumerable inheritance hierarchies endemic to those languages (such as IEnumerable), compared to the zero-overhead of std::sort or std::for_each being generic. C++'s raw execution speed isn't necessarily any faster, but the programmer can design fast or zero-overhead systems. Even things like buffer overruns- you can't turn their detection off. In C++, you have control. Fundamentally, C++ is a fast language- you don't pay for what you don't use. In contrast, in C#, if you use, say, stackalloc, you can't NOT do buffer overrun checking. You can't allocate classes on the stack, or contiguously.
There's also the whole compile-time thing, where C++ apps can take much longer, both to compile, and to develop.
回答7:
This might be kinda off topic, but I watched a video a couple of weeks ago which might appear to be of interest to you : http://ocaml.janestreet.com/?q=node/61
It comes from a trading company which decided to use ocaml as its main language for trading, and I think their motivations should be enlightening to you (basically, they valued speed of course, but also strong typing and functional style for quicker increments as well as easier comprehension).
回答8:
Most of our code ends up having to be run on a Grid of 1000's of machines.
I think this environment changes the argument. If the difference between c++ and c# execution speed is 25% for example then other factors come into play. When this is run on a grid it may make no difference as to how it is coded as the whole process once spread across machines may not be an issue or solved by allocating or purchasing a few more machines. The most important issue and cost may become 'time to market' where c# may prove the winner and faster option.
Which is faster c++ or c#?
C# by six months......
回答9:
It's not only a matter of programming language, the hardware and operating system will be relevant to.
The best overall performance you will get with a realtime operating system, a realtime programming language and efficient (!) programming.
So you've quite a few possibilities in choosing an operating system, and a few in choosing the language. There's C, Realtime Java, Realtime Fortran and a few others.
Or maybe you will have the best results in programming an FPGA/Processor to eliminate the cost of an operating system.
The greatest choice you have to do, how many possible performance optimizations you will ignore in favor of choosing a language that eases development and will run more stable, because you can do less bugs, which will result in a higher availiability of the system. This shouldn't be overlooked. You have no win in developing an application which performs 5% faster than any other application which crashes every few point due to some minor hard to find bugs.
回答10:
In HFT, latency is a bigger issue that throughput. Given the inherent parallelism in the data source, you can always throw more cores at the problem, but you can't make up for response time with more hardware. Whether the language is compiled beforehand, or Just-In-Time, garbage collection can destroy your latency. There exist realtime JVMs with guaranteed garbage collection latency. It's a fairly new technology, a pain to tune, and ridiculously expensive, but if you have the resources, it can be done. It'll probably become much more mainstream in coming years, as the early adopters fund the R&D that's going on now.
回答11:
One of the most interesting thing in C++ is that its performance numbers are not better, but more reliable.
It's not necessarily faster than Java/C#/..., but it is consistent accross runs.
Like in networking, sometimes the throughput isn't as important as a stable latency.
回答12:
A huge reason to prefer c++ (or lower level) in this case other than what has already been said, is that there are some adaptability benefits of being low level.
If hardware technology changes, you can always drop into an __asm { }
block and actually use it before languages/compilers catch up
For example, there is still no support for SIMD in Java.
回答13:
Virtual Execution Engines (JVM or CLR of .Net) do not permit structuring the work in time-efficient way, as process instances cannot run on as many threads as might be needed.
In contrast, plain C++ enables execution of parallel algorithms and construction of objects outside the time-critical execution paths. That’s pretty much everything – simple and elegant. Plus, with C++ you pay only for what you use.
回答14:
The elephant in the room here is the fact that C++ is faster than Java.
We all know it. But we also know that if we state it plainly, as I just did, that we can't pretend to engage in a meaningful debate about this undebatable topic. How much faster is C++ than Java for your application? That has the ring of a debatable topic, but, alas, it will always be hypothetical unless you implement your application in both languages, at which point it there will be no room for debate.
Let's go back to your first design meeting: The hard requirement for your project is high performance. Everyone in the room will think "C++" and a handful of other compiled languages. The guy in the room who suggests Java or C# will have to justify it with evidence (i.e., a prototype), not with hypotheticals, not with claims made by the vendors, not with statements on programmer gossip sites, and certainly not with "hello world" benchmarks.
As it stands now, you have to move forward with what you know, not with what is hypothetically possible.
回答15:
Nikie wrote: “Could you explain what you can do with C++ threads and not with e.g. .NET threads?”
Threading with .Net could perform virtually everything C++ threading can, except:
- Efficient execution of COM-encapsulated binary code. For examples, sensitive algorithms that might have to be kept secret from application developers. (Might be relevant in HFT)
- Creation of lean threads that do not exhaust system resources with chunky building blocks – wrapped OS APIs and synchronization & signaling OS primitives. (Extremely relevant with parallel algorithms for time-optimization of performance in HFT)
- Scaling up the throughput of a business process application 10 or more times on the same hardware and with the same latency. (Not relevant in HFT)
- Scaling up 100 and more times the number of concurrently handled user interactions per unit of hardware. (Not relevant in HFT)
Using more CPU cores cannot fully compensate exhausting of system resources by the building blocks of .Net since more CPU cores are a guarantee for appearance of memory contention.
来源:https://stackoverflow.com/questions/3175072/performance-of-c-vs-virtual-machine-languages-in-high-frequency-finance