Data from memory is typically delivered to the processor on a set of wires that matches the bus width. E.g., if the bus is 32 bits wide, there are 32 data wires going from the bus into the processor (along with other wires for control signals).
Inside the processor, various wires and switches deliver this data to wherever it is needed. If you read 32 aligned bits into a register, the wires can deliver the data very directly to a register (or other holding location).
If you read 8 or 16 aligned bits into a register, the wires can deliver the data the same way, and the other bits in the register are typically set to zero (or, for a sign-extending load, filled with copies of the sign bit).
If you read 8 or 16 unaligned bits into a register, the wires cannot deliver the data directly. Instead, the bits must be shifted: They must go through a different set of wires, so that they can be “moved over” to line up with the wires going into the register.
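As a rough software analogy for that "moving over", here is a minimal C sketch. It is only a model, not how any particular processor is wired. The names `extract_byte` and `extract_halfword` are hypothetical, and it assumes byte lane 0 is the least significant byte of the 32-bit bus word (little-endian lane numbering).

```c
#include <stdint.h>

/* Model only: bus_word is the 32 bits that arrived on the data wires.
   byte_offset says which byte lane the wanted value starts in
   (0 = least significant byte; this lane numbering is an assumption). */

/* Pull one byte out of the bus word; byte_offset must be 0..3. */
uint32_t extract_byte(uint32_t bus_word, unsigned byte_offset)
{
    /* Shift the wanted lane down to the low end, then zero the other bits. */
    return (bus_word >> (8u * byte_offset)) & 0xFFu;
}

/* Pull 16 bits out of the bus word; byte_offset must be 0..2 so that
   both bytes lie inside this one word. */
uint32_t extract_halfword(uint32_t bus_word, unsigned byte_offset)
{
    return (bus_word >> (8u * byte_offset)) & 0xFFFFu;
}
```

With `bus_word = 0xAABBCCDD`, an offset of 0 needs no shift at all, which corresponds to the aligned case. An offset of 1 needs an 8-bit shift before the halfword `0xBBCC` lines up with the low bits of the register. In hardware, each possible shift amount needs its own paths through the wires and switches.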
In some processors, the designers have added extra wires and switches to do this moving. This can be very expensive in terms of the amount of silicon it takes. You need a lot of extra wires and switches in order to be able to move any possible unaligned bytes to the desired locations. Because this is so expensive, some processors do not have a full shifter that can do every shift immediately. Instead, the shifter might be able to move bits by only a byte or so per CPU cycle, so it takes several cycles to shift by several bytes. In some processors, there are no wires for this at all, so all loads and stores must be aligned.
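To make the cost of the cheaper design concrete, here is a small C model of a shifter that can move data only one byte per cycle. It is an illustrative sketch under assumed behavior, not a description of a real chip. The function name `shift_one_byte_per_cycle` and the cycle counter are hypothetical.

```c
#include <stdint.h>

/* Model only: a shifter that moves the data one byte lane per "cycle".
   byte_offset (0..3) is how far the wanted data sits from the low end
   of the bus word; *cycles receives the number of steps taken. */
uint32_t shift_one_byte_per_cycle(uint32_t bus_word, unsigned byte_offset,
                                  unsigned *cycles)
{
    *cycles = 0;
    while (byte_offset > 0) {
        bus_word >>= 8;   /* one byte-wide move per step */
        byte_offset--;
        (*cycles)++;
    }
    return bus_word;
}
```

In this model an aligned access (offset 0) costs zero extra cycles, while data three bytes out of position costs three. A full shifter would do any of these moves in a single step, at the price of the extra wires and switches.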