Check for commonly mis-recognized characters in a string against a list of known strings

Question

Background

I have a list of codes in my (MySQL) database, each consisting of six (6) characters. The characters are numbers and letters chosen at random. The codes are considered case-insensitive, but they are stored as uppercase in the database. They may contain the number 0 but never the letter O. I use these codes for one-off authentication of users.

The Problem

The codes have been handwritten on cards, and unfortunately some letters and numbers may look alike to some individuals. This is why I excluded the letter O from the start: it looks too much like a handwritten 0.

What I've done so far

I am able to check a code (case-insensitively) against user input and determine if it is an exact match. If it's not, I silently replace any O's with 0's and try again.
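For reference, the two-step check described above might look roughly like the sketch below. The function name `matchesCode` is hypothetical, and the stored code is assumed to be uppercase already.

```php
<?php
// Hypothetical helper: compares user input with one stored code.
// Assumes $stored is already uppercase, as described above.
function matchesCode($input, $stored) {
    $input = strtoupper($input);
    if ($input === $stored) {
        return true; // exact, case-insensitive match
    }
    // Codes never contain the letter O, so any O must be a misread 0.
    return str_replace('O', '0', $input) === $stored;
}

var_dump(matchesCode('a0b1c2', 'A0B1C2')); // bool(true)
var_dump(matchesCode('AOB1C2', 'A0B1C2')); // bool(true)
var_dump(matchesCode('A0B1C3', 'A0B1C2')); // bool(false)
```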

Question

My question is: how can I do this for other letters and numbers, such as those I have listed below, and still be relatively confident I'm not authenticating a user as someone they are not? In this case, both characters of each pair can exist in a code. I have looked at the levenshtein() function in PHP (http://php.net/manual/en/function.levenshtein.php) as well as similar_text() (http://php.net/manual/en/function.similar-text.php), but neither is quite what I want, so I'm thinking I might have to roll my own (possibly using them) to achieve this.

Similar characters:

S <=> 5
G <=> 6
I <=> 1

Answer 1:

The problem you're describing is really one of hash collisions. You have multiple possible input values, and you want them to resolve down into a single unambiguous key. I have a couple of thoughts here.

As @bishop suggested, what you really need to determine is if any given input is unambiguous or not. My approach would be slightly different though:

For any given input, I would generate a list of all possible matching keys, and query the database for the entire list. If only one result is returned, then there is no problem and you can proceed based on that single record. It doesn't matter in this case if the user enters ABCDE5 or ABCDES because there's only one possible match in the database for either one.

In the event that more than one result is returned however, you have no way of determining if the user's input was accurate or if it was mis-keyed.

(In hindsight, it would have been best to design the keys so that none of the ambiguous character pairs were possible. Only allowing "S" and disallowing "5", for example, allows you to guarantee there will only ever be a single match for any given input, whether the user types "S" or "5", because you could always safely convert any 5's you see in input to S's knowing that they were input errors. In fact, depending on the exact values, you may be able to retroactively modify many or all of the keys in the database to follow this rule and make lookups less cumbersome.)
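To see how many existing keys could be migrated to such a rule, one option is to fold every look-alike character to a single canonical form and check which stored codes end up identical. The following is only a sketch: the function names are hypothetical, the character pairs are the ones listed in the question (plus O/0), and folding toward the letter is an arbitrary choice that mirrors the "S, not 5" example.

```php
<?php
// Hypothetical helper: folds each look-alike character to one canonical form.
// O -> 0 (codes never contain O), 5 -> S, 6 -> G, 1 -> I.
function canonicalizeCode($code) {
    return strtr(strtoupper($code), 'O561', '0SGI');
}

// Groups stored codes by canonical form. Any group with more than one
// member would become ambiguous under the single-form rule.
function findCollisions(array $codes) {
    $groups = [];
    foreach ($codes as $code) {
        $groups[canonicalizeCode($code)][] = $code;
    }
    return array_filter($groups, function ($group) {
        return count($group) > 1;
    });
}

print_r(findCollisions(['ABCDE5', 'ABCDES', 'XYZ123']));
```

If `findCollisions()` returns an empty array for the whole table, every stored code and every user input can be passed through `canonicalizeCode()` and compared directly.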

Anyway, in that ambiguous case, I don't think you have any choice but to push back to the user and ask them to re-check their input, hopefully explaining the possible gotchas in an on-screen message.

EDIT:

Here's an example for generating the possible values a user meant to enter based on the single input they actually provided:

<?php

$inputs = [
        'ABCDEF', // No ambiguity, DB should return 0 or 1 match.
        'AAAAA1', // One ambiguous char, user could have meant `AAAAAI`
                  // instead so search DB for both.
        '156ISG', // Worst case. If the DB values overlap a lot, there
                  // wouldn't be much hope of "guessing" what the user
                  // actually meant.
];

foreach ($inputs as $input) {
    print_r(generatePossibleMatches($input));
}

//----------------------------------------
function generatePossibleMatches($input) {
    $input = strtoupper($input);
    $ambiguous = [
        'I' => '1',
        'G' => '6',
        'S' => '5',
    ];
    // Map every ambiguous character to its look-alike, in both directions.
    $swap = $ambiguous + array_flip($ambiguous);
    // Build the candidates one position at a time so that every combination
    // is produced, even when the same ambiguous character appears more than
    // once (e.g. '11' must yield '11', '1I', 'I1' and 'II').
    $possibles = [''];
    foreach (str_split($input) as $char) {
        $choices = isset($swap[$char]) ? [$char, $swap[$char]] : [$char];
        $next = [];
        foreach ($possibles as $prefix) {
            foreach ($choices as $choice) {
                $next[] = $prefix . $choice;
            }
        }
        $possibles = $next;
    }
    return $possibles; // the first element is always the input itself
}
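Once the candidate list exists, it can be sent to the database in a single parameterized query. The sketch below only builds the SQL and the parameter list; the table name `codes`, the column name `code`, and the helper name `buildCandidateQuery` are assumptions for illustration, and the PDO calls in the trailing comment presume a connection that is not shown.

```php
<?php
// Hypothetical helper: builds a parameterized IN (...) lookup for a
// non-empty list of candidate codes.
// Assumed schema: table `codes` with a `code` column.
function buildCandidateQuery(array $candidates) {
    $placeholders = implode(', ', array_fill(0, count($candidates), '?'));
    $sql = "SELECT code FROM codes WHERE code IN ($placeholders)";
    return [$sql, array_values($candidates)];
}

list($sql, $params) = buildCandidateQuery(['AAAAA1', 'AAAAAI']);
echo $sql, "\n"; // SELECT code FROM codes WHERE code IN (?, ?)

// With an existing PDO connection in $pdo, the lookup could then be:
//   $stmt = $pdo->prepare($sql);
//   $stmt->execute($params);
//   $rows = $stmt->fetchAll(PDO::FETCH_COLUMN);
//   count($rows) === 0  -> no such code
//   count($rows) === 1  -> unambiguous, proceed with $rows[0]
//   count($rows) > 1    -> ambiguous, ask the user to re-check
```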

Answer 2:

One solution: convert "confusing" characters into a regular expression matching the possible alternates, then match the expanded regular expression to the input. Example: if the input is "AIX", the regular expression expansion would be "A[I1]X".

Code:

$input = 'S1G6AB'; // given this
$store = '5I6GAB'; // need to match this

// convert each confusing character to a regular expression character class
$regex = implode('', array_map(function ($c) {
    $map = ['S'=>'[S5]','5'=>'[S5]','1'=>'[1I]','I'=>'[1I]','G'=>'[6G]','6'=>'[6G]'];
    return (array_key_exists($c, $map) ? $map[$c] : $c);
}, str_split($input)));

// match the regex representing the input against the stored value;
// the ^...$ anchors stop a short input from matching part of a longer code
echo (preg_match('/^' . $regex . '$/', $store) === 1 ? 'Match' : 'No match');

Obviously, this assumes that the permutations of any given input never appear in more than one record. If user X has "ABCDE1" and user Y has "ABCDEI", this won't work.

Edit, building on @beporter's answer:

If your database supports regular expressions (like MySQL), you can ask it if there are collisions:

SELECT COUNT(*) FROM Table WHERE token REGEXP '^$regex$'

If that is 2 or more, you have a collision and you can ask the user to check the letters and try again. Or maybe ask them to enter some other part of their information, like their last name? That would be a good question to take to the UX people.
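Because the pattern is derived from user input, it is safer to validate the input and bind the pattern as a query parameter than to interpolate it into the SQL string. The sketch below is one way to do that; `buildLookalikePattern` is a hypothetical name, and the PDO calls in the trailing comment assume an existing connection and the same placeholder table and column names as the query above.

```php
<?php
// Hypothetical helper: turns user input into an anchored pattern such as
// ^A[1I]X$. Returns null for anything other than letters and digits, so no
// regex or SQL metacharacters can reach the database.
function buildLookalikePattern($input) {
    $input = strtoupper($input);
    if (preg_match('/^[A-Z0-9]+$/', $input) !== 1) {
        return null;
    }
    $classes = [
        'S' => '[S5]', '5' => '[S5]',
        'I' => '[1I]', '1' => '[1I]',
        'G' => '[6G]', '6' => '[6G]',
    ];
    $pattern = '';
    foreach (str_split($input) as $char) {
        $pattern .= isset($classes[$char]) ? $classes[$char] : $char;
    }
    return '^' . $pattern . '$';
}

echo buildLookalikePattern('aix'), "\n"; // ^A[1I]X$

// With an existing PDO connection in $pdo, the collision check could be:
//   $stmt = $pdo->prepare('SELECT COUNT(*) FROM `Table` WHERE token REGEXP ?');
//   $stmt->execute([buildLookalikePattern($userInput)]);
//   $count = (int) $stmt->fetchColumn();
```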

Answer 3:

Have you looked at Hamming Distance yet?

Although you have letters and numbers, you could convert everything to binary (ASCII values) and compare them using the Hamming distance. If the distance is greater than some threshold value, reject it. Otherwise, you are essentially looking for a string metric that meets your need to identify your "mis-recognized" characters. You are right: you may have to build one yourself.
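As a rough illustration of what a home-built metric could look like, the sketch below computes a Hamming distance over characters rather than bits and treats the look-alike pairs from the question (plus O/0) as identical. This is an assumption on my part, not a standard metric: a bit-level distance over ASCII codes rates 'S' versus '5' (4 differing bits) as further apart than 'A' versus 'C' (1 differing bit), which is the opposite of what is wanted here. The name `lookalikeDistance` is hypothetical.

```php
<?php
// Hypothetical metric: number of positions at which two codes differ,
// ignoring differences between look-alike characters.
function lookalikeDistance($a, $b) {
    if (strlen($a) !== strlen($b)) {
        return null; // Hamming distance is only defined for equal lengths
    }
    // Fold look-alikes to one form: O -> 0, 5 -> S, 6 -> G, 1 -> I.
    $fold = function ($s) {
        return strtr(strtoupper($s), 'O561', '0SGI');
    };
    $a = $fold($a);
    $b = $fold($b);
    $distance = 0;
    for ($i = 0, $n = strlen($a); $i < $n; $i++) {
        if ($a[$i] !== $b[$i]) {
            $distance++;
        }
    }
    return $distance;
}

var_dump(lookalikeDistance('S1G6AB', '5I6GAB')); // int(0)
var_dump(lookalikeDistance('ABCDEF', 'ABCDEX')); // int(1)
```

A distance of 0 means the input matches the stored code up to look-alike substitutions; anything greater is a genuine mismatch and should be rejected for authentication purposes.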

Source: https://stackoverflow.com/questions/25277365/check-for-commonly-mis-recognized-characters-in-a-string-against-a-list-of-known

Tags

php

php-5.5