Setting aside advanced optimization techniques such as hot-spot optimization, pre-compiled meta-algorithms, and various forms of parallelism, the fundamental speed of a language correlates strongly with the implicit behind-the-scenes complexity required to support the operations that would commonly be specified within inner loops.
Perhaps the most obvious is validity checking on indirect memory references -- such as checking pointers for null
and checking indexes against array boundaries. Most high-level languages perform these checks implicitly, but C does not. However, this is not necessarily a fundamental limitation of these other languages -- a sufficiently clever compiler may be capable of removing these checks from the inner loops of an algorithm through some form of loop-invariant code motion.
The more fundamental advantage of C (and to a similar extent the closely related C++) is a heavy reliance on stack-based memory allocation, which is inherently fast for allocation, deallocation, and access. In C (and C++) the primary call stack can be used for allocation of primitives, arrays, and aggregates (struct
/class
).
While C does offer the capability to dynamically allocate memory of arbitrary size and lifetime (using the so called 'heap'), doing so is avoided by default (the stack is used instead).
Tantalizingly, it is sometimes possible to replicate the C memory allocation strategy within the runtime environments of other programming languages. This has been demonstrated by asm.js, which allows code written in C or C++ to be translated into a subset of JavaScript and run safely in a web browser environment -- with near-native speed.
As somewhat of an aside, another area where C and C++ outshine most other languages for speed is the ability to seamlessly integrate with native machine instruction sets. A notable example of this is the (compiler and platform dependent) availability of SIMD intrinsics which support the construction of custom algorithms that take advantage of the now nearly ubiquitous parallel processing hardware -- while still utilizing the data allocation abstractions provided by the language (lower-level register allocation is managed by the compiler).