With so many implementations available, what is the fastest executing (least CPU intensive, smallest binary), cross-platform (Linux, Mac, Windows, iPhone) A* implementation for
When you have specific bounds to work with, you're usually better off writing the algorithm yourself. In particular, your small state space lends itself to optimisations that spend memory up front to reduce CPU time. And because you're searching a grid rather than an arbitrary state space, you can optimise your successor node generation, or treat all partial paths that end on the same grid square as equivalent (which a general-purpose A* search will not and cannot assume).
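To make that concrete, here is a minimal sketch of a grid-specialised A*, written in Python purely for illustration (the same layout carries over to C or C++). It assumes a 4-connected grid with unit step costs and a Manhattan-distance heuristic, none of which comes from the question; the name `grid_astar` and its parameters are made up for this example. Costs, parents and the closed set live in flat arrays allocated up front, successors are generated by simple coordinate arithmetic, and each grid square is expanded at most once.

```python
import heapq


def grid_astar(walls, width, height, start, goal):
    """A* on a 4-connected grid with unit step costs (assumed for this sketch).

    walls is a flat sequence of booleans indexed by y * width + x.
    start and goal are (x, y) tuples. Returns a list of (x, y) or None.
    """
    n = width * height
    # Memory spent up front: one slot per grid square, no hashing of states.
    g_cost = [float("inf")] * n
    parent = [-1] * n
    closed = [False] * n

    gx, gy = goal
    s = start[1] * width + start[0]
    t = gy * width + gx
    if walls[s] or walls[t]:
        return None

    def heuristic(idx):
        # Manhattan distance: admissible and consistent for 4-connected moves.
        return abs(idx % width - gx) + abs(idx // width - gy)

    g_cost[s] = 0
    open_heap = [(heuristic(s), s)]
    while open_heap:
        _, cur = heapq.heappop(open_heap)
        if closed[cur]:
            continue  # stale heap entry; this square was already expanded
        if cur == t:
            path = []
            while cur != -1:
                path.append((cur % width, cur // width))
                cur = parent[cur]
            path.reverse()
            return path
        # All partial paths ending on this square are treated as equivalent,
        # so only the cheapest one is ever expanded.
        closed[cur] = True
        x, y = cur % width, cur // width
        # Successor generation specialised for a grid: four fixed offsets.
        for nx, ny in ((x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)):
            if 0 <= nx < width and 0 <= ny < height:
                nxt = ny * width + nx
                if walls[nxt] or closed[nxt]:
                    continue
                new_g = g_cost[cur] + 1
                if new_g < g_cost[nxt]:
                    g_cost[nxt] = new_g
                    parent[nxt] = cur
                    heapq.heappush(open_heap, (new_g + heuristic(nxt), nxt))
    return None
```

Calling `grid_astar(walls, 3, 3, (0, 0), (2, 0))` with a 9-element `walls` list returns the route as a list of `(x, y)` squares, or `None` when the goal is unreachable. In a real implementation you would probably reuse the arrays between searches instead of reallocating them, which is exactly the kind of trade a generic library cannot make for you.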
(PS. OpenSteer, a collection of steering behaviours, has nothing to do with A*, which is a search algorithm, except that you can notionally use one, the other, or both to traverse a space. One isn't a replacement for the other in most reasonable circumstances.)