should it really be utf8_general_ci or rather utf8_bin?
You must use utf8_bin for Case-sensitive search, otherwise utf8_general_ci
is mb_internal_encoding('UTF-8') necessary in PHP 5.3 and higher and if so does this mean I have to use all multi byte functions instead of its core functions like mb_substr() instead of substr()?
Yes of course, If you have a multibyte string you need mb_* family function to work with, except for binary safe php standard function like str_replace(); (and few others)
is it still necessary to check for malformed input stings and if so what is a reliable function/class to do so? I possibly do not want to strip bad data and don't know enough about transliteration.
Hmm, no you can't check it.