I know that devices before the Fermi architecture had 8 SPs in a single multiprocessor. Is the count the same in the Fermi architecture?
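No. The answer depends on the compute capability of the CUDA device. The numbers are:

- Compute capability 1.0 to 1.3 (pre-Fermi): 8 CUDA cores per multiprocessor
- Compute capability 2.0 (Fermi): 32 CUDA cores per multiprocessor
- Compute capability 2.1 (Fermi): 48 CUDA cores per multiprocessor
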
See appendix G of the CUDA C Programming Guide.
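
If you need this number at run time, one option is a small lookup keyed on the compute capability. The sketch below is only an illustration: `coresPerSM` is a hypothetical helper name, not part of the CUDA API, and it covers only compute capabilities 1.0 to 1.3 (8 cores), 2.0 (32 cores), and 2.1 (48 cores). Anything else returns -1 rather than guessing.

```cpp
// Hypothetical helper (not a CUDA API call): maps a compute capability
// to the number of CUDA cores (SPs) per multiprocessor.
// Covers only 1.0-1.3, 2.0 and 2.1; returns -1 for anything else.
int coresPerSM(int major, int minor) {
    if (major == 1 && minor >= 0 && minor <= 3) {
        return 8;   // pre-Fermi devices
    }
    if (major == 2 && minor == 0) {
        return 32;  // Fermi, compute capability 2.0
    }
    if (major == 2 && minor == 1) {
        return 48;  // Fermi, compute capability 2.1
    }
    return -1;      // not covered by this table
}
```

In a CUDA program, the two arguments would typically come from the `major` and `minor` fields of the `cudaDeviceProp` structure filled in by `cudaGetDeviceProperties`, and multiplying the result by the `multiProcessorCount` field gives the total number of cores on the device.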