Pure untyped lambda calculus is a powerful concept. However, building a machine or interpreter for real-world use is often described as (close to) impossible. I want to investig
First, it is possible to compile the lambda calculus efficiently to machine code even on existing architectures. After all, scheme is the lambda calculus plus a bit extra, and it can be compiled efficiently. However, scheme & co are the lambda calculus under strict evaluation. It is also possible to compile the lambda calculus under non-strict evaluation efficiently! On this, see SPJ's two books for some background: http://research.microsoft.com/en-us/um/people/simonpj/papers/papers.html
On the other hand, it is also true that if we built hardware designed for functional languages, we could compile code to that hardware and do very well indeed. The best new stuff on this I know of is the Reduceron: http://www.cs.york.ac.uk/fp/reduceron/
The key to the performance of the Reduceron, which is quite compelling, is that it is built around parallel graph reduction, and aims to exploit the opportunities for parallelism made explicit in the reduction of lambda calculus equations.