Given two equal-length strings, is there an elegant way to get the offset of the first different character?
The obvious solution would be:
for ($offs
You can use a nice property of bitwise XOR (^) to achieve this: Basically, when you xor two strings together, the characters that are the same will become null bytes ("\0"). So if we xor the two strings, we just need to find the position of the first non-null byte using strspn:
$position = strspn($string1 ^ $string2, "\0");
That's all there is to it. So let's look at an example:
$string1 = 'foobarbaz';
$string2 = 'foobarbiz';
$pos = strspn($string1 ^ $string2, "\0");
printf(
'First difference at position %d: "%s" vs "%s"',
$pos, $string1[$pos], $string2[$pos]
);
That will output:
First difference at position 7: "a" vs "i"
So that should do it. It's very efficient, since it only uses built-in functions implemented in C and needs just one temporary copy of the string (the XOR result).
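To make the intermediate step visible, here is a minimal sketch that dumps the XOR result as hex and wraps the one-liner in a function. The name firstDifferenceOffset and the null return for identical strings are my own assumptions, not part of the answer above.

```php
<?php
// Hypothetical helper (not from the answer above): returns the byte offset
// of the first difference, or null when the two strings are identical.
function firstDifferenceOffset(string $a, string $b): ?int
{
    // XOR works byte by byte; the result is as long as the shorter operand.
    $xor = $a ^ $b;
    // Count the leading "\0" bytes, i.e. the length of the common prefix.
    $pos = strspn($xor, "\0");
    // If every byte matched and the lengths agree, there is no difference.
    if ($pos === strlen($a) && $pos === strlen($b)) {
        return null;
    }
    return $pos;
}

// Equal bytes cancel out to 00; only position 7 ("a" ^ "i") is non-zero.
echo bin2hex('foobarbaz' ^ 'foobarbiz'), "\n"; // 000000000000000800
var_dump(firstDifferenceOffset('foobarbaz', 'foobarbiz')); // int(7)
var_dump(firstDifferenceOffset('foobarbaz', 'foobarbaz')); // NULL
```

Returning null rather than the string length keeps "no difference" distinguishable from "differs at the last position".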
function getCharacterOffsetOfDifference($str1, $str2, $encoding = 'UTF-8') {
return mb_strlen(
mb_strcut(
$str1,
0, strspn($str1 ^ $str2, "\0"),
$encoding
),
$encoding
);
}
First the difference at the byte level is found using the above method, and then the offset is mapped to the character level. This is done using the mb_strcut function, which is basically substr but honoring multibyte character boundaries.
var_dump(getCharacterOffsetOfDifference('foo', 'foa')); // 2
var_dump(getCharacterOffsetOfDifference('©oo', 'foa')); // 0
var_dump(getCharacterOffsetOfDifference('f©o', 'fªa')); // 1
It's not as elegant as the first solution, but it's still a one-liner (and a little simpler if you rely on the default encoding):
return mb_strlen(mb_strcut($str1, 0, strspn($str1 ^ $str2, "\0")));
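To see why the byte offset and the character offset differ, here is a small step-by-step sketch of the same idea. It assumes the mbstring extension is loaded and UTF-8 input; the variable names are just for illustration.

```php
<?php
// "\xC2\xA9" is the UTF-8 encoding of ©, "\xC2\xAA" that of ª. Byte escapes
// are used so the example does not depend on the source file's encoding.
$str1 = "f\xC2\xA9o";
$str2 = "f\xC2\xAAa";

// Byte level: "f" and the lead byte 0xC2 match, so the first differing
// byte is at index 2.
$byteOffset = strspn($str1 ^ $str2, "\0");

// mb_strcut() does not split a multibyte character, so only "f" is kept.
$prefix = mb_strcut($str1, 0, $byteOffset, 'UTF-8');

// Character level: the common prefix is one character long, so the first
// differing character has index 1.
$charOffset = mb_strlen($prefix, 'UTF-8');

var_dump($byteOffset); // int(2)
var_dump($prefix);     // string(1) "f"
var_dump($charOffset); // int(1)
```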
If you convert each string to an array of single-byte characters, you can use the array comparison functions to compare the strings.
You can achieve a similar result to the XOR method with the following.
$string1 = 'foobarbaz';
$string2 = 'foobarbiz';
$array1 = str_split($string1);
$array2 = str_split($string2);
$result = array_diff_assoc($array1, $array2);
$num_diff = count($result);
$first_diff = key($result);
echo "There are " . $num_diff . " differences between the two strings. <br />";
echo "The first difference between the strings is at position " . $first_diff . ". (Zero Index) '$string1[$first_diff]' vs '$string2[$first_diff]'.";
$string1 = 'foorbarbaz';
$string2 = 'foobarbiz';
// The third argument of preg_split() is the limit (-1 = no limit); the flags go fourth.
$array1 = preg_split('((.))u', $string1, -1, PREG_SPLIT_DELIM_CAPTURE | PREG_SPLIT_NO_EMPTY);
$array2 = preg_split('((.))u', $string2, -1, PREG_SPLIT_DELIM_CAPTURE | PREG_SPLIT_NO_EMPTY);
$result = array_diff_assoc($array1, $array2);
$num_diff = count($result);
$first_diff = key($result);
echo "There are " . $num_diff . " differences between the two strings.\n";
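// Index the character arrays rather than the strings: $first_diff is a character index, not a byte index.
echo "The first difference between the strings is at position " . $first_diff . ". (Zero Index) '{$array1[$first_diff]}' vs '{$array2[$first_diff]}'.\n";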
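On newer PHP versions the character split can probably be done without a regex. A rough sketch, assuming PHP 7.4+ with mbstring (mb_str_split() does not exist before that); the function name firstCharDifference is made up for this example.

```php
<?php
// Hypothetical helper: character-level variant of the array comparison.
// Assumes PHP 7.4+ (mb_str_split) and 7.3+ (array_key_first).
function firstCharDifference(string $a, string $b, string $encoding = 'UTF-8'): ?int
{
    $diff = array_diff_assoc(
        mb_str_split($a, 1, $encoding),
        mb_str_split($b, 1, $encoding)
    );
    // array_diff_assoc() preserves keys, so the first key is the index of
    // the first differing character; array_key_first() gives null if empty.
    return array_key_first($diff);
}

var_dump(firstCharDifference('foobarbaz', 'foobarbiz'));           // int(7)
var_dump(firstCharDifference("f\xC2\xA9obar", "f\xC2\xA9obaz"));   // int(5)
var_dump(firstCharDifference('foo', 'foo'));                       // NULL
```

One caveat: array_diff_assoc() only reports entries of the first array, so if the first string is a proper prefix of the second this returns null as well.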
string strpbrk ( string $haystack , string $char_list )
strpbrk() searches the haystack string for any of the characters in char_list.
The return value is the substring of $haystack which begins at the first matched character (or false if none is found). As a built-in function it should be zippy. Then loop through once, looking for where the returned string starts in $haystack, to obtain your offset.
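As a rough illustration of the return value only, here is how an offset can be derived from what strpbrk() hands back. The haystack and character list are arbitrary example values.

```php
<?php
$haystack = 'foobarbaz';

// strpbrk() returns the rest of $haystack starting at the first character
// that appears in the character list, or false if none of them occurs.
$rest = strpbrk($haystack, 'rz');

// The offset of that character is the length that was cut off the front.
$offset = $rest === false ? null : strlen($haystack) - strlen($rest);

var_dump($rest);   // string(4) "rbaz"
var_dump($offset); // int(5)
```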
I wanted to add this as a comment to the best answer, but I do not have enough points.
$string1 = 'foobarbaz';
$string2 = 'foobarbiz';
$pos = strspn($string1 ^ $string2, "\0");
if ($pos < min(strlen($string1), strlen($string2))) {
printf(
'First difference at position %d: "%s" vs "%s"',
$pos, $string1[$pos], $string2[$pos]
);
} else if ($pos < strlen($string1)) {
print 'String1 continues with ' . substr($string1, $pos);
} else if ($pos < strlen($string2)) {
print 'String2 continues with ' . substr($string2, $pos);
} else {
print 'String1 and String2 are equal';
}