In an AI application I am writing in C++, how costly are virtual functions, and what optimizations are possible?
Virtual functions amount to a vtable lookup plus an indirect function call. On some platforms, this is fast. On others, e.g., one popular PPC architecture used in consoles, this isn't so fast.
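To make the cost concrete, here is a hand-rolled sketch of roughly what a virtual call lowers to (the names are invented for illustration; real vtables also carry RTTI and other slots):

```cpp
// Roughly what obj->f() lowers to: two dependent loads, then an
// indirect call that the branch predictor may or may not handle well.
using Fn = void (*)(void*);

struct VTable { Fn f; };
struct Obj    { const VTable* vptr; };

void call_f(Obj* obj) {
    const VTable* vt = obj->vptr;  // load 1: fetch the object's vtable pointer
    Fn fn = vt->f;                 // load 2: fetch the slot for f
    fn(obj);                       // indirect call through the loaded pointer
}
```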
Optimizations usually revolve around expressing variability higher up in the callstack so that you don't need to invoke a virtual function multiple times within hotspots.
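For example (a minimal sketch with hypothetical Entity/Updater names), instead of paying one virtual call per entity per frame, you can dispatch once per homogeneous batch:

```cpp
#include <vector>

struct Entity { float x = 0.0f, v = 1.0f; };  // plain data, no vptr

struct Updater {
    virtual ~Updater() = default;
    // One virtual call per batch instead of one per entity.
    virtual void update_all(std::vector<Entity>& batch, float dt) = 0;
};

struct LinearUpdater : Updater {
    void update_all(std::vector<Entity>& batch, float dt) override {
        for (Entity& e : batch)
            e.x += e.v * dt;  // tight inner loop, no indirection at all
    }
};
```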
The only optimization along these lines I can think of is what Java's JIT compiler does. If I understand it correctly, it monitors call sites as the code runs, and if most calls at a site go to one particular implementation, it inserts a guarded direct jump to that implementation for when the class matches. This way, most of the time, there is no vtable lookup. Of course, for the rare case where a different class comes through, the vtable is still used.
I am not aware of any C++ compiler/runtime that uses this technique.
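You can, however, hand-roll the same kind of guard in C++ when profiling shows that one implementation dominates a call site. A sketch, with made-up Enemy/Grunt types (whether this wins depends on how cheap the RTTI comparison is on your platform, so measure first):

```cpp
#include <typeinfo>

struct Enemy { virtual ~Enemy() = default; virtual void update() = 0; };
struct Grunt : Enemy { void update() override { /* the common case */ } };

void update_one(Enemy& e) {
    if (typeid(e) == typeid(Grunt)) {
        // Qualified call: bypasses the vtable entirely and can be inlined.
        static_cast<Grunt&>(e).Grunt::update();
    } else {
        e.update();  // rare case: normal virtual dispatch
    }
}
```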
An alternative to dynamic polymorphism is static polymorphism, usable if your types are known at compile time: the CRTP (curiously recurring template pattern).
http://en.wikipedia.org/wiki/Curiously_recurring_template_pattern
The explanation on Wikipedia is clear enough, and it could help you if you have determined that virtual method calls really are the source of your performance bottlenecks.
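A minimal CRTP sketch (hypothetical names), where the base class dispatches to the derived class at compile time:

```cpp
template <class Derived>
struct EntityBase {
    void update() {
        // Resolved at compile time: no vtable, fully inlinable.
        static_cast<Derived*>(this)->do_update();
    }
};

struct Soldier : EntityBase<Soldier> {
    void do_update() { /* soldier-specific behaviour */ }
};

template <class D>
void tick(EntityBase<D>& e) { e.update(); }  // works for any derived type
```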
You can implement polymorphism at runtime using virtual functions and at compile time using templates; in many cases you can replace virtual functions with templates. Take a look at this article for more information - http://www.codeproject.com/KB/cpp/SimulationofVirtualFunc.aspx
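For instance, a behaviour can be selected with a template parameter instead of a virtual call (a sketch; the policy names are invented):

```cpp
struct Aggressive { void act() { /* attack  */ } };
struct Defensive  { void act() { /* retreat */ } };

template <class Behavior>
void run_ai(Behavior& b) {
    b.act();  // direct call chosen at compile time; no vtable involved
}

int main() {
    Aggressive a;
    Defensive  d;
    run_ai(a);  // instantiates run_ai<Aggressive>
    run_ai(d);  // instantiates run_ai<Defensive>
}
```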
You rarely have to worry about the cache with such commonly used items, since they're fetched once and kept there.
Cache is generally only an issue when dealing with large data structures that either:

1) are too large to fit in cache, or
2) aren't accessed with enough locality for the cache to help.
Things like vtables are generally not going to be a performance/cache/memory issue; usually there's only one vtable per class, and each object just holds a pointer to it rather than the table itself. So unless you have a few thousand distinct classes, I don't think vtables are going to thrash your cache.
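A quick (implementation-defined, but typical) way to see the per-object cost is to compare sizes; on most ABIs the difference is exactly one vtable pointer:

```cpp
#include <cstdio>

struct Plain   { int x; };
struct Virtual { int x; virtual ~Virtual() = default; };

int main() {
    std::printf("sizeof(Plain)   = %zu\n", sizeof(Plain));    // e.g. 4
    std::printf("sizeof(Virtual) = %zu\n", sizeof(Virtual));  // e.g. 16 on a 64-bit ABI
}
```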
1), by the way, is why functions like memcpy use cache-bypassing streaming instructions such as movntq/movntdq for extremely large (multi-megabyte) inputs.
With modern, deeply pipelined, out-of-order CPUs that predict indirect branches, the overhead of a virtual function call might well be zero. Nada. Zip.