I'm working on a feature which requires me to get the contents of a webpage, then check to see if certain text is present in that page. It's a backlink checking tool.
I'm not entirely sold on your belief that it is the encoding. PHP stores every string internally as a plain sequence of bytes. Could you try this code? It compares the byte value at each position in both strings, which might reveal something you're not seeing by visually comparing the strings.
$str1 = ...;
$str2 = ...;
if (strlen($str1) != strlen($str2)) {
    echo "Lengths are different!";
} else {
    // Walk both strings byte by byte and report the first mismatch.
    for ($i = 0; $i < strlen($str1); $i++) {
        if (ord($str1[$i]) != ord($str2[$i])) {
            echo "Character $i is different! str1: " . ord($str1[$i]) . ", str2: " . ord($str2[$i]);
            break;
        }
    }
}
What about running both through a sanitizing filter (if you have PHP >= 5.2.0)? I don't know that it will do anything, but it may.
http://www.phpro.org/tutorials/Filtering-Data-with-PHP.html#12
Try mb_strstr() and trim(), as pointed out by dcaunt.
You could try using the DOM extension to PHP. When creating a new DOMDocument you can specify the encoding of the underlying document / webpage. According to this website, internally everything is done in UTF-8. You could then find the DOM nodes you are interested in and compare the text content of each node.
If you were not working with webpages, which come with a specified character encoding, I would suggest using the multibyte functions, in particular mb_detect_encoding and mb_convert_encoding.
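To make the DOM suggestion concrete, here is a minimal sketch. The function name findBacklinkTexts, its parameters and the sample markup are made up for illustration, and prepending an XML encoding declaration is only one common way of hinting the charset to loadHTML(), so treat that part as an assumption rather than the one correct approach.

```php
<?php
// Hypothetical helper: returns the trimmed text of every <a> element
// whose href contains $targetUrl.
function findBacklinkTexts(string $html, string $targetUrl, string $encoding = 'UTF-8'): array
{
    // Real-world HTML is rarely valid; collect parser warnings instead of printing them.
    $previous = libxml_use_internal_errors(true);

    $doc = new DOMDocument();
    // Assumption: the XML declaration prefix tells the parser which charset the markup uses.
    $doc->loadHTML('<?xml encoding="' . $encoding . '">' . $html);

    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $texts = [];
    foreach ($doc->getElementsByTagName('a') as $anchor) {
        if (strpos($anchor->getAttribute('href'), $targetUrl) !== false) {
            // textContent is returned as UTF-8 regardless of the source encoding.
            $texts[] = trim($anchor->textContent);
        }
    }
    return $texts;
}

$html = '<html><body><p><a href="http://example.com/page"> My Site </a></p>'
      . '<a href="http://other.test/">Other</a></body></html>';
$found = findBacklinkTexts($html, 'example.com');
```

Comparing against the node text rather than the raw page source also sidesteps differences in markup and entity encoding around the link.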
Without application code it's difficult to say what's happening.
Try using trim() on the strings to remove leading and trailing whitespace, which is invisible to the naked eye.
You may find strcmp gives better results as well.
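As a small illustration of the two suggestions above, assuming the fetched text carries a trailing line break (the variable names and sample values are invented):

```php
<?php
$fromPage = "My Backlink\r\n"; // text as it might come out of the fetched page
$expected = "My Backlink";     // text you are looking for

// A direct comparison fails because of the invisible trailing "\r\n".
$rawMatch = ($fromPage === $expected);

// strcmp() returns 0 when both strings are byte-for-byte identical.
$trimmedResult = strcmp(trim($fromPage), $expected);
```

Note that trim() only strips ordinary ASCII whitespace by default, so something like a UTF-8 non-breaking space would still need separate handling.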
If you can't reliably get the encoding, you can use mb_convert_encoding.
$string1 = mb_convert_encoding($string1, 'utf-8', 'auto');
$string2 = mb_convert_encoding($string2, 'utf-8', 'auto');
If you can determine the encoding (from the http headers or meta tags) you should specify the encoding instead of using "auto."
$string1 = mb_convert_encoding($string1, 'utf-8', $encoding1);
$string2 = mb_convert_encoding($string2, 'utf-8', $encoding2);
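One possible way to obtain $encoding1 is to read the charset parameter of the Content-Type HTTP header. The sketch below assumes you already have that header value as a string; the helper name toUtf8, the regular expression and the ISO-8859-1 fallback are my own choices, not anything prescribed by PHP.

```php
<?php
// Hypothetical helper: converts $body to UTF-8 using the charset named in a
// Content-Type header value, falling back to $fallback when none is given.
function toUtf8(string $body, string $contentType, string $fallback = 'ISO-8859-1'): string
{
    $encoding = $fallback;
    if (preg_match('/charset\s*=\s*"?([\w.:-]+)/i', $contentType, $m)) {
        $encoding = $m[1];
    }
    // Requires the mbstring extension. An encoding name that mbstring does not
    // recognise raises an error here, so validate it first in real code.
    return mb_convert_encoding($body, 'UTF-8', $encoding);
}

$latin1 = "caf\xE9"; // "café" encoded as ISO-8859-1
$utf8   = toUtf8($latin1, 'text/html; charset=ISO-8859-1');
```

Once both strings have been normalised to UTF-8 this way, they can be compared with strcmp() or searched with mb_strstr().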