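Of the algorithms you list, I believe RC4 is the fastest. In addition, RC4's speed does not depend on the key length once the cipher has been initialized: the key is read only during key scheduling, which fills a fixed 256-byte state, and every output byte after that costs the same amount of work. So you should be able to use the maximum key size for that one without worrying about runtime cost.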
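
To illustrate why the key length drops out of the per-byte cost, here is a minimal Python sketch of textbook RC4. The function names (`rc4_init`, `rc4_keystream`, `rc4_crypt`) are my own and not taken from any library; treat it as an illustration of the structure, not as a tuned implementation.

```python
def rc4_init(key: bytes) -> list:
    """Key scheduling: the only step that ever reads the key."""
    if not 1 <= len(key) <= 256:
        raise ValueError("RC4 keys are 1 to 256 bytes long")
    state = list(range(256))
    j = 0
    # Always exactly 256 iterations, whatever the key length is.
    for i in range(256):
        j = (j + state[i] + key[i % len(key)]) % 256
        state[i], state[j] = state[j], state[i]
    return state


def rc4_keystream(state: list, n: int) -> bytes:
    """Output generation: works on the 256-byte state only, never on the key."""
    s = state[:]  # copy so the caller's initial state is left untouched
    i = j = 0
    out = bytearray()
    for _ in range(n):
        i = (i + 1) % 256
        j = (j + s[i]) % 256
        s[i], s[j] = s[j], s[i]
        out.append(s[(s[i] + s[j]) % 256])
    return bytes(out)


def rc4_crypt(key: bytes, data: bytes) -> bytes:
    """Encrypt or decrypt (same operation) by XORing data with the keystream."""
    keystream = rc4_keystream(rc4_init(key), len(data))
    return bytes(a ^ b for a, b in zip(data, keystream))
```

For example, `rc4_crypt(b"Key", b"Plaintext").hex()` gives `bbf316e8d940af0ad3`, which matches the commonly published RC4 test vector. Note that `rc4_keystream` takes no key argument at all, so a 5-byte key and a 256-byte key produce output at the same rate once `rc4_init` has run.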