Is there any (non-micro-optimization) performance gain from coding
float f1 = 200f / 2;
in comparison to
float f2 = 200f * 0.5f;
Think about what is required to multiply two n-bit numbers. With the simplest method, you take one number x and repeatedly shift it and conditionally add it to an accumulator (based on one bit of the other number y). After n additions you are done, and the result fits in 2n bits.
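As a rough sketch of that shift-and-add scheme (names like shift_add_multiply are just for illustration, and n is fixed at 32 here):

#include <stdint.h>
#include <stdio.h>

/* Shift-and-add multiplication of two n-bit numbers (n = 32 here).
 * Each step looks at a single bit of y; if that bit is set, the
 * shifted copy of x is added to the accumulator. The full result
 * needs 2n bits. */
static uint64_t shift_add_multiply(uint32_t x, uint32_t y)
{
    uint64_t acc = 0;
    uint64_t partial = x;          /* x shifted left by the current bit index */

    for (int i = 0; i < 32; i++) {
        if (y & (1u << i))         /* one bit of y decides whether to add */
            acc += partial;
        partial <<= 1;             /* shift for the next partial product */
    }
    return acc;
}

int main(void)
{
    printf("%llu\n", (unsigned long long)shift_add_multiply(200, 3)); /* 600 */
    return 0;
}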
For division, you start with x of 2n bits and y of n bits, and you want to compute x / y. The simplest method is long division, but in binary. At each step you do a comparison and a conditional subtraction to get one more bit of the quotient, so it takes n steps.
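Here is a corresponding sketch of binary long (restoring) division. For simplicity both operands are 32-bit and the remainder register is widened to avoid overflow; the function name long_divide is, again, only illustrative:

#include <stdint.h>
#include <stdio.h>

/* Binary long division: one quotient bit per step. Each step brings
 * down one bit of x, compares the running remainder against y, and
 * conditionally subtracts. Every step depends on the remainder left
 * by the previous step. */
static uint32_t long_divide(uint32_t x, uint32_t y, uint32_t *remainder)
{
    uint32_t q = 0;
    uint64_t r = 0;                      /* widened remainder register */

    for (int i = 31; i >= 0; i--) {
        r = (r << 1) | ((x >> i) & 1u);  /* bring down the next bit of x */
        if (r >= y) {                    /* n-bit comparison ...          */
            r -= y;                      /* ... and subtraction           */
            q |= (1u << i);              /* emit one quotient bit         */
        }
    }
    *remainder = (uint32_t)r;
    return q;
}

int main(void)
{
    uint32_t r;
    printf("%u rem %u\n", long_divide(200, 3, &r), r); /* 66 rem 2 */
    return 0;
}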
Some differences: each step of the multiplication only needs to look at one bit of y, while each step of the division needs to look at all n bits during the comparison. Each step of the multiplication is independent of the others (it doesn't matter in which order you add the partial products), whereas in division each step depends on the result of the previous one. This is a big deal in hardware: work that can be done independently can happen at the same time within a clock cycle, while a chain of dependent steps cannot.
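If you want to observe the gap yourself, a crude timing sketch like the one below can help (this is only a rough illustration, not a rigorous benchmark: the volatile operand keeps the compiler from folding the constant operation out of the loop, and the actual numbers depend heavily on your CPU, compiler, and optimization flags):

#include <stdio.h>
#include <time.h>

/* Crude comparison of repeated float division vs. multiplication by a
 * precomputed reciprocal. Results vary by CPU and compiler settings. */
int main(void)
{
    const int N = 100000000;
    volatile float v = 200.0f;   /* volatile prevents constant folding/hoisting */
    float acc = 0.0f;
    clock_t t0, t1;

    t0 = clock();
    for (int i = 0; i < N; i++)
        acc += v / 3.0f;                 /* division every iteration */
    t1 = clock();
    printf("divide:   %.2fs (acc=%f)\n", (double)(t1 - t0) / CLOCKS_PER_SEC, acc);

    acc = 0.0f;
    t0 = clock();
    for (int i = 0; i < N; i++)
        acc += v * (1.0f / 3.0f);        /* multiply by the reciprocal */
    t1 = clock();
    printf("multiply: %.2fs (acc=%f)\n", (double)(t1 - t0) / CLOCKS_PER_SEC, acc);
    return 0;
}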