When I can use SSE3 or AVX, are older instruction sets such as SSE2 or MMX guaranteed to be available too,
or do I still need to check for them separately?
See Chuck's answer for good advice on what you should do. See this answer for a literal answer to the question asked, in case you're curious.
AVX support absolutely guarantees support for all Intel SSE* instruction sets, since it includes VEX-encoded versions of all of them. As Chuck points out, you can check for previous ones at the same time with a bitmask, without bloating your code, but don't sweat it.
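To make the bitmask idea concrete, here is a minimal sketch, assuming GCC or Clang's `<cpuid.h>` on x86. The helper names `has_all` and `cpu_has_sse3_through_avx` are made up for illustration; the bit positions are the CPUID leaf 1 ECX feature flags. Note that a complete AVX check also has to confirm that the OS saves YMM state (OSXSAVE plus XGETBV), which this sketch leaves out.

```c
#include <stdbool.h>
#include <stdint.h>

#if defined(__x86_64__) || defined(__i386__)
#include <cpuid.h>   /* GCC/Clang helper header; an assumption of this sketch */
#endif

/* CPUID leaf 1, ECX feature bits. */
#define ECX_SSE3   (1u << 0)
#define ECX_SSSE3  (1u << 9)
#define ECX_SSE41  (1u << 19)
#define ECX_SSE42  (1u << 20)
#define ECX_AVX    (1u << 28)

/* True only if every bit in mask is also set in reg. */
static inline bool has_all(uint32_t reg, uint32_t mask)
{
    return (reg & mask) == mask;
}

/* Hypothetical helper: one CPUID call and one compare cover SSE3 through AVX,
 * so checking the older sets costs nothing extra. */
static inline bool cpu_has_sse3_through_avx(void)
{
#if defined(__x86_64__) || defined(__i386__)
    unsigned int eax, ebx, ecx, edx;
    if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx))
        return false;
    return has_all(ecx, ECX_SSE3 | ECX_SSSE3 | ECX_SSE41 | ECX_SSE42 | ECX_AVX);
#else
    return false;    /* not an x86 build */
#endif
}
```

Because the whole set is tested with a single mask, adding or removing one feature from the requirement is a one-token change.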
Note that POPCNT, TZCNT, and similar bit-manipulation instructions are not part of SSE-anything. POPCNT has its own feature bit. LZCNT has its own feature bit, too, since AMD introduced it separately from BMI1. TZCNT is just part of BMI1, though. Since some BMI1 instructions use VEX encodings, even latest-generation Pentium/Celeron CPUs (like Skylake Pentium) don't have BMI1. :( I think Intel just wanted to omit AVX/AVX2, probably so they could sell CPUs with faulty upper lanes of execution units as Pentiums, and they do this by disabling VEX support in the decoders.
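As a sketch of where those separate feature bits live: the function below takes the three relevant CPUID register values as plain integers and decodes them. The struct and function names are hypothetical; the bit positions are POPCNT in CPUID.01H:ECX bit 23, BMI1 in CPUID.(EAX=07H,ECX=0):EBX bit 3, and LZCNT in CPUID.80000001H:ECX bit 5.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical container for the three independent feature flags. */
struct bit_features {
    bool popcnt;
    bool lzcnt;
    bool bmi1;   /* includes TZCNT */
};

/* leaf1_ecx: CPUID.01H:ECX
 * leaf7_ebx: CPUID.(EAX=07H,ECX=0):EBX
 * ext1_ecx:  CPUID.80000001H:ECX */
static inline struct bit_features decode_bit_features(uint32_t leaf1_ecx,
                                                      uint32_t leaf7_ebx,
                                                      uint32_t ext1_ecx)
{
    struct bit_features f;
    f.popcnt = ((leaf1_ecx >> 23) & 1u) != 0;
    f.bmi1   = ((leaf7_ebx >> 3) & 1u) != 0;
    f.lzcnt  = ((ext1_ecx >> 5) & 1u) != 0;
    return f;
}
```

Passing a leaf 1 ECX value with only the AVX bit (bit 28) set yields all three flags false, which is the point: none of them can be inferred from the SSE/AVX bits.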
Intel SSE support has been incremental in all CPUs released so far. SSE4.1 implies SSSE3, SSE3, SSE2, and SSE. And SSE4.2 implies all of the preceding. I'm not sure whether any official x86 documentation rules out a CPU with SSE4.1 support but not SSSE3 (i.e. one that leaves out PSHUFB, which is possibly expensive to implement). It might even be officially forbidden; I didn't check carefully. Either way it's extremely unlikely in practice, since it would violate many people's assumptions.
AVX does not include AMD SSE4a or AMD XOP. AMD extensions have to be checked for separately. Also note that the newest AMD CPUs are dropping XOP support. (Intel never adopted it, so most people don't write code to take advantage of it, so for AMD those transistors are mostly wasted. It does have some nice stuff, like a 2-source byte permute, allowing a byte LUT twice as wide as PSHUFB, without the in-lane limitation of AVX2's VPSHUFB ymm.)
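A separate check for the AMD extensions might look like the sketch below, again assuming GCC or Clang's `<cpuid.h>`. The names `amd_features`, `decode_amd_features`, and `query_amd_features` are invented for this example; the bits are SSE4a at bit 6 and XOP at bit 11 of CPUID.80000001H:ECX.

```c
#include <stdbool.h>
#include <stdint.h>

#if defined(__x86_64__) || defined(__i386__)
#include <cpuid.h>   /* GCC/Clang helper header; an assumption of this sketch */
#endif

/* CPUID extended leaf 0x80000001, ECX feature bits (AMD-defined). */
#define EXT_ECX_SSE4A (1u << 6)
#define EXT_ECX_XOP   (1u << 11)

struct amd_features {
    bool sse4a;
    bool xop;
};

/* Decode the AMD-specific flags from a raw CPUID.80000001H:ECX value. */
static inline struct amd_features decode_amd_features(uint32_t ext1_ecx)
{
    struct amd_features f;
    f.sse4a = (ext1_ecx & EXT_ECX_SSE4A) != 0;
    f.xop   = (ext1_ecx & EXT_ECX_XOP) != 0;
    return f;
}

/* Hypothetical wrapper: reports both flags as false on non-x86 builds
 * or when the extended leaf is unavailable. */
static inline struct amd_features query_amd_features(void)
{
    uint32_t ecx_val = 0;
#if defined(__x86_64__) || defined(__i386__)
    unsigned int eax, ebx, ecx, edx;
    if (__get_cpuid(0x80000001u, &eax, &ebx, &ecx, &edx))
        ecx_val = ecx;
#endif
    return decode_amd_features(ecx_val);
}
```

Nothing in this check looks at the AVX bit, which mirrors the point above: AVX support says nothing about SSE4a or XOP.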
SSE2 is baseline for the x86-64 architecture. You do not have to check for SSE or SSE2 support in 64-bit builds. I forget whether MMX is baseline, too, but it almost certainly is.
The SSE instruction set includes some instructions that operate on MMX registers. (e.g. PMAXSW mm1, mm2/m64 was new with SSE. The XMM version is part of SSE2.) So even a 32-bit CPU supporting SSE needs to have MMX registers. It would be madness to have MMX registers but only support the SSE instructions that use them, not the original MMX instructions (e.g. movq mm0, [mem]). That said, I haven't found anything definitive that rules out an x86-based Deathstation 9000 with the SSE CPUID feature bit set but not the MMX one. I didn't wade into Intel's official x86 manuals, though. (See the x86 tag wiki for links.)
Don't use MMX anyway. It's generally slower, even if you only have 64 bits at a time to work on; in that case, work in the low half of an XMM register instead. The latest CPUs (like Intel Skylake) have lower throughput for the MMX versions of some instructions than for the XMM versions, and in some cases worse latency as well. For example, according to Agner Fog's testing, PACKSSWB mm0, mm1 is 3 uops, with 2c latency, on Skylake. The 128b and 256b XMM / YMM versions are 1 uop, with 1c latency.
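To illustrate handling 64-bit chunks in XMM registers rather than MMX, here is a minimal sketch using SSE2 intrinsics from `<emmintrin.h>`. The function name `pack_i16_pairs_to_i8` is hypothetical, and the scalar fallback exists only so the example also builds where SSE2 is not enabled. It computes the same result as PACKSSWB mm0, mm1 (two groups of four signed 16-bit values saturated down to eight signed bytes) without touching an MMX register.

```c
#include <stdint.h>

#if defined(__SSE2__)
#include <emmintrin.h>   /* SSE2 intrinsics */
#endif

/* Scalar signed saturation from 16 to 8 bits (used by the fallback path). */
static inline int8_t sat8(int16_t x)
{
    return x > 127 ? 127 : (x < -128 ? -128 : (int8_t)x);
}

/* Hypothetical helper: pack a[0..3] then b[0..3] into out[0..7]
 * with signed saturation. */
static inline void pack_i16_pairs_to_i8(const int16_t a[4], const int16_t b[4],
                                        int8_t out[8])
{
#if defined(__SSE2__)
    /* 64-bit loads into the low half of XMM registers (upper half zeroed). */
    __m128i va = _mm_loadl_epi64((const __m128i *)a);
    __m128i vb = _mm_loadl_epi64((const __m128i *)b);
    /* Combine the two low halves: [a0..a3, b0..b3] as eight int16 lanes. */
    __m128i v = _mm_unpacklo_epi64(va, vb);
    /* PACKSSWB xmm: the low 8 bytes hold the saturated a and b values. */
    __m128i p = _mm_packs_epi16(v, v);
    /* 64-bit store of the low half. */
    _mm_storel_epi64((__m128i *)out, p);
#else
    for (int i = 0; i < 4; i++) {
        out[i]     = sat8(a[i]);
        out[i + 4] = sat8(b[i]);
    }
#endif
}
```

A side benefit of staying in XMM registers is that no EMMS is needed before later x87 floating-point code.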