Using collation xxx_german2_ci, which treats ü and ue as identical, is it possible to have all occurrences of München
be hi
In the end I decided to do it all in PHP, hence my question about which characters utf8_general_ci treats as equal.
Below is what I came up with, by example: a label is constructed from a text $description, with substrings $term highlighted and special characters converted. The substitution is not complete, but probably sufficient for the actual use case.
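mb_internal_encoding("UTF-8");

// Map accented letters to plain ASCII. The string is converted to Latin-1
// first, so every character is one byte and byte offsets in the result
// equal character offsets in the original UTF-8 string.
function withoutAccents($s) {
    // mb_convert_encoding() replaces utf8_decode(), deprecated since PHP 8.2.
    return strtr(mb_convert_encoding($s, 'ISO-8859-1', 'UTF-8'),
        mb_convert_encoding('àáâãäçèéêëìíîïñòóôõöùúûüýÿß', 'ISO-8859-1', 'UTF-8'),
        'aaaaaceeeeiiiinooooouuuuyys');
}

// Search form: lower-cased and without accents.
// strtolower() only lower-cases ASCII reliably, so upper-case accented
// letters such as Ü are not folded.
function simplified($s) {
    return withoutAccents(strtolower($s));
}

// Cut by character position from the original text and escape for HTML.
function encodedSubstr($s, $start, $length) {
    return htmlspecialchars(mb_substr($s, $start, $length));
}

function labelFromDescription($description, $term) {
    $simpleTerm = simplified($term);
    $simpleDescription = simplified($description);
    $lastEndPos = $pos = 0;
    $termLen = strlen($simpleTerm); // one byte per character in Latin-1
    $label = ''; // HTML
    while (($pos = strpos($simpleDescription,
                          $simpleTerm, $lastEndPos)) !== false) {
        $label .=
            encodedSubstr($description, $lastEndPos, $pos - $lastEndPos).
            '<strong>'.
            encodedSubstr($description, $pos, $termLen).
            '</strong>';
        $lastEndPos = $pos + $termLen;
    }
    // Remainder after the last match, counted in characters.
    $label .= encodedSubstr($description, $lastEndPos,
                            mb_strlen($description) - $lastEndPos);
    return $label;
}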
echo labelFromDescription('São Paulo <SAO>', 'SAO')."\n";
echo labelFromDescription('München <MUC>', 'ünc');
Output:
<strong>São</strong> Paulo &lt;<strong>SAO</strong>&gt;
M<strong>ünc</strong>hen &lt;MUC&gt;
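The one-to-one mapping in the code above keeps positions aligned, but it cannot express the ü = ue equivalence from the question, because one character would have to become two. Here is a minimal sketch of one way around this: fold each character separately and remember which original character every folded character came from. The fold table `FOLD_MAP` is a small hand-written assumption, and the helper names `foldWithOffsets` and `highlightFolded` are hypothetical. None of this reproduces MySQL's actual collation rules. The sketch also assumes a match should only count if it covers whole original characters.

```php
<?php
// Illustrative subset only (assumption), not the full German phone-book rules.
const FOLD_MAP = ['ä' => 'ae', 'ö' => 'oe', 'ü' => 'ue', 'ß' => 'ss'];

// Fold $s character by character. Returns three arrays:
// the folded characters, the index of the original character each folded
// character came from, and the original characters.
function foldWithOffsets(string $s): array {
    $chars = preg_split('//u', $s, -1, PREG_SPLIT_NO_EMPTY);
    $folded = [];
    $origin = [];
    foreach ($chars as $i => $ch) {
        $lower = mb_strtolower($ch, 'UTF-8');
        $replacement = FOLD_MAP[$lower] ?? $lower;
        foreach (preg_split('//u', $replacement, -1, PREG_SPLIT_NO_EMPTY) as $r) {
            $folded[] = $r;
            $origin[] = $i;
        }
    }
    return [$folded, $origin, $chars];
}

// Wrap every occurrence of $term in <strong>, comparing folded forms but
// cutting the output from the original, unfolded characters.
function highlightFolded(string $description, string $term): string {
    [$hay, $origin, $chars] = foldWithOffsets($description);
    [$needle] = foldWithOffsets($term);
    $n = count($needle);
    $total = count($hay);
    $label = '';
    $next = 0; // first original character not yet written to $label
    $i = 0;
    while ($n > 0 && $i + $n <= $total) {
        // A match must start and end on an original character boundary,
        // so "mu" does not match the first half of a folded "ü".
        $startsClean = $i === 0 || $origin[$i - 1] !== $origin[$i];
        $endsClean = $i + $n === $total || $origin[$i + $n] !== $origin[$i + $n - 1];
        if (!$startsClean || !$endsClean || array_slice($hay, $i, $n) !== $needle) {
            $i++;
            continue;
        }
        $first = $origin[$i];
        $last = $origin[$i + $n - 1];
        $label .= htmlspecialchars(implode('', array_slice($chars, $next, $first - $next)))
            . '<strong>'
            . htmlspecialchars(implode('', array_slice($chars, $first, $last - $first + 1)))
            . '</strong>';
        $next = $last + 1;
        $i += $n;
    }
    return $label . htmlspecialchars(implode('', array_slice($chars, $next)));
}
```

With this sketch, `highlightFolded('München <MUC>', 'muench')` returns `<strong>Münch</strong>en &lt;MUC&gt;`, and `highlightFolded('Muenchen', 'mün')` returns `<strong>Muen</strong>chen`. The term `mu` leaves `München` unhighlighted, since under this fold ü equals ue, not u.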
I have found these tables: http://developer.mimer.com/collations/charts/index.tml. They are, of course, language dependent. A collation is just a comparison algorithm. For general UTF-8, I am not sure how it treats special characters.
You can use them to find the desired symbols and replace them in the output to get the same result as in the example. For that, though, you will need some programming language (PHP or anything else).
Other resources:
http://collation-charts.org/
http://mysql.rjweb.org/doc.php/charcoll (down on the page)
Basically, try googling "collation algorithm mysql utf8_general_ci" or something similar.