I am writing a simple OCR solution for a finite set of characters. That is, I know exactly what all 26 letters of the alphabet will look like. I am using C# and am able to easi
Why not just treat the image as a 25-bit integer? A 32-bit int is enough to hold it. For example, the letter 'I' can be treated as the integer 14815374 in decimal, because its binary representation is 0111000100001000010001110. That makes it convenient to compare two images with the '==' operator, as two integers.
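A minimal sketch of that packing, in Python rather than C# just to keep it short. The helper name `pack_glyph` and the row-string layout are my own assumptions; the only thing taken from the answer is that the top-left pixel ends up as the most significant bit.

```python
def pack_glyph(rows):
    """Pack rows of '0'/'1' characters into one integer.

    The rows are concatenated top to bottom, so the top-left pixel
    becomes the most significant bit.
    """
    return int("".join(rows), 2)


# The 5x5 letter 'I' from the example above.
LETTER_I = [
    "01110",
    "00100",
    "00100",
    "00100",
    "01110",
]

# Hypothetical lookup table: packed glyph -> character.
KNOWN = {pack_glyph(LETTER_I): "I"}

key = pack_glyph(LETTER_I)
print(key, KNOWN.get(key))
```

Once every known letter is packed this way, recognising a 5x5 block is a single integer comparison or dictionary lookup.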
I don't have an algorithm to give you the key features, but here are some things that might help.
First, I wouldn't worry too much about looking for a characteristic pixel for each character because, on average, checking whether a given character matches a given swath (5x5) of the binary image shouldn't take more than 5-7 checks to tell that there isn't a match. Why? Probability. For 7 binary pixels, there are 2**7=128 different possibilities. If the pixels were independent and equally likely to be black or white, that means there is a 1/128 < 1% chance of a character matching even up to 7 pixels. Just make sure that you stop the comparisons right when you find a mismatch.
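Here is a rough sketch of the early-exit comparison, in Python for brevity. The function name `matches_at`, the string-per-row image layout and the returned check counter are assumptions made for illustration only.

```python
def matches_at(image, glyph, top, left):
    """Compare glyph against image at (top, left), stopping at the first mismatch.

    Returns (matched, checks), where checks is the number of pixels compared.
    """
    checks = 0
    for dy, row in enumerate(glyph):
        for dx, expected in enumerate(row):
            checks += 1
            if image[top + dy][left + dx] != expected:
                return False, checks  # bail out immediately
    return True, checks


LETTER_I = ["01110", "00100", "00100", "00100", "01110"]
BLANK = ["00000"] * 5

print(matches_at(LETTER_I, LETTER_I, 0, 0))  # full match needs all 25 checks
print(matches_at(BLANK, LETTER_I, 0, 0))     # mismatch found on the 2nd pixel
```

The counter is only there to show how few comparisons a non-match costs; in real code you would just return the boolean.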
Second, if you don't want to use a hash table, then you might consider using a trie to store all of your character data. It will use less memory, and you'll be checking all of the characters at once. It won't be quite as fast to search through as a hash table, but you also won't have to convert to a string. Each node in the trie can have at most 2 descendants. For instance, if you have two 2x2 characters (let's call them A and B):
A B
01 00
10 11
Your trie would have only one descendant at the first node: the left one (the 0 branch). We proceed to this next node. It has two descendants: the left (0) branch leads to the rest of B and the right (1) branch leads to the rest of A. You get the picture. Let me know if this part isn't clear.
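To make the A/B example concrete, here is a small sketch of such a binary trie in Python. Nested dicts stand in for nodes, and the names `build_trie`, `lookup` and the `"label"` key are hypothetical choices, not anything prescribed above.

```python
def build_trie(glyphs):
    """Build a binary trie from {name: bit string}; each node is a dict keyed by '0'/'1'."""
    root = {}
    for name, bits in glyphs.items():
        node = root
        for bit in bits:
            node = node.setdefault(bit, {})
        node["label"] = name  # mark the end of a complete glyph
    return root


def lookup(trie, bits):
    """Walk the trie pixel by pixel; return the glyph name, or None if there is no match."""
    node = trie
    for bit in bits:
        node = node.get(bit)
        if node is None:
            return None  # no stored glyph continues this way
    return node.get("label")


# The two 2x2 characters from the example, read row by row.
GLYPHS = {"A": "0110", "B": "0011"}
TRIE = build_trie(GLYPHS)

print(sorted(TRIE))       # only the '0' branch exists at the root
print(sorted(TRIE["0"]))  # the second node branches to both '0' (B) and '1' (A)
print(lookup(TRIE, "0110"), lookup(TRIE, "0011"), lookup(TRIE, "1111"))
```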
One way would be to identify a pixel that's black in roughly half of the letters and white in the other half. This can then be used to split the letters into two groups, using the same algorithm on both halves recursively, until you have reached individual characters.
If you can't find a single pixel that splits the sets into two, you may have to go to a group of two or more pixels, but hopefully using single pixels ought to be good enough.
To find the pixel, start with an array of integers the same size as your letters and initialize all elements to 0. Then, for each letter, increment the elements whose corresponding pixel is (say) black. The ones you're interested in are the ones in the (roughly) 10≤sum≤16 range (for the top level; lower levels would need to use other bounds).
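A small sketch of that counting step, in Python. The four 2x2 glyphs are made-up toy data (not real letters), and instead of a fixed 10..16 window the hypothetical `best_split_pixel` simply picks the pixel whose black count is closest to half the number of letters.

```python
def black_counts(glyphs):
    """Count, for every pixel position, how many glyphs have that pixel black ('1')."""
    size = len(next(iter(glyphs.values())))
    counts = [0] * size
    for bits in glyphs.values():
        for i, bit in enumerate(bits):
            if bit == "1":
                counts[i] += 1
    return counts


def best_split_pixel(glyphs):
    """Return the pixel index whose black count is closest to half the glyph count."""
    counts = black_counts(glyphs)
    half = len(glyphs) / 2
    return min(range(len(counts)), key=lambda i: abs(counts[i] - half))


# Toy 2x2 glyphs, flattened row by row (illustrative data only).
GLYPHS = {"A": "0110", "B": "0011", "C": "1110", "D": "0010"}

print(black_counts(GLYPHS))      # per-pixel black counts
print(best_split_pixel(GLYPHS))  # index of the most balanced pixel
```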
I don't have an answer, but here are some bounds on your eventual solution:
If you want a straight up "use X pixels as a key" then you'll need at least ceiling(log2(number of characters)) pixels. You won't be able to disambiguate letters with fewer bits. In your case, trying to find the 5 pixels is equivalent to finding 5 pixels that split the letters into independent partitions. It probably isn't that easy.
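To illustrate the bound, here is a brute-force sketch in Python. All function names and the toy 2x2 glyphs are assumptions for illustration; the exhaustive search over pixel combinations is only practical for small glyphs and is not something the answer proposes as a real solution.

```python
from itertools import combinations


def min_key_pixels(n_chars):
    """Lower bound on key size: equals ceiling(log2(n_chars)) for n_chars >= 1."""
    return (n_chars - 1).bit_length()


def distinguishes(glyphs, pixels):
    """True if reading only the given pixel positions gives every glyph a unique key."""
    keys = {tuple(bits[i] for i in pixels) for bits in glyphs.values()}
    return len(keys) == len(glyphs)


def smallest_key(glyphs):
    """Try pixel subsets of increasing size, starting at the lower bound."""
    size = len(next(iter(glyphs.values())))
    for k in range(min_key_pixels(len(glyphs)), size + 1):
        for pixels in combinations(range(size), k):
            if distinguishes(glyphs, pixels):
                return pixels
    return None  # two glyphs are identical


# Toy 2x2 glyphs, flattened row by row (illustrative data only).
GLYPHS = {"A": "0110", "B": "0011", "C": "1110", "D": "0010"}

print(min_key_pixels(26))    # bound for a 26-letter alphabet
print(smallest_key(GLYPHS))  # smallest distinguishing pixel set for the toy data
```

For these four toy glyphs the bound is 2 pixels, but no pair of pixels actually separates them, so the search has to go up to 3. That is the "it probably isn't that easy" part in miniature.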
You can also use Moron's (heheh) suggestion and build a tree based on the letter frequencies of the language you are scanning, similar to Huffman coding. That would take more than 5 bits for some letters, but the average would probably be smaller, assuming a power-law distribution of letter usage. I would go with this approach as it allows you to search for a specific partition for each node rather than searching for a set of partitions.
I am going down a similar track trying to invent an algorithm that will give me a minimal number of tests I can use to match an image to one I've seen previously. My application is OCR but in a limited domain of recognising an image from a fixed set of images as fast as possible.
My basic assumption (which I think is the same as yours, or was the same) is that if we can identify one unique pixel (where a pixel is defined as a point within an image plus a color) then we have found the perfect (fastest) test for that image. In your case you want to find letters.
If we cannot find one such pixel then we (grudgingly) look for two pixels that in combination are unique. Or three. And so on, until we have a minimal test for each of the images.
I should note that I have a strong feeling that in my particular domain I will be able to find such unique pixels. It might not be the same for your application where you seem to have a lot of "overlap".
After considering comments in this other question (where I'm just starting to get a feel for the problem) and comments here I think I might have come up with a workable algorithm.
Here is what I've got so far. The method I describe below is written in the abstract but in my application each "test" is a pixel identified by a point plus a color, and a "result" represents the identity of an image. Identification of these images is my end goal.
Consider the following tests numbered T1 to T4.
This list of tests can be interpreted as follows;
For each individual result A, B, C, D, we want to find a combination of tests (ideally just one test) that will allow us to test for an unambiguous result.
Applying intuition and with a bit of squinting at the screen we can fumble our way to the following arrangement of tests.
For A we can test for a combination of T4 (either A or D) AND T1 (A but not D)
B is easy since there is a test T2 that gives result B and nothing else.
C is a bit harder, but eventually we can see that a combination of T3 (A or C or D) and NOT T4 (not A and not D) gives the desired result.
And similarly, D can be found with a combination of T4 and (not T1).
In summary
A <- T4 && T1
B <- T2
C <- T3 && ¬T4
D <- T4 && ¬T1
(where <- should be read as 'can be found if the following tests evaluate to true')
Intuition and squinting are fine, but we probably won't get these techniques built into the language until at least C# 5.0, so here is an attempt at formalising the method for implementation in lesser languages.
To find a result R,
1. Find the test Tr that gives the desired result R and the fewest unwanted results (ideally no others).
2. If the test gives R and nothing else we are finished. We can match for R where Tr is true.
3. For each unwanted result X in the test Tr;
(a) Find a test Tn that gives R but not X. If we find such a test we can then match for R where (Tr && Tn).
(b) If no test matches condition (a), find a test Tx that includes X but does not include R. (Such a test would eliminate X as a result from test Tr.) We can then test for R where (Tr && ¬Tx).
Now I will try to follow these rules for each of the desired results, A, B, C, D.
Here are the tests again for reference;
According to rule (1) we start with T4 since it is the simplest test that gives result A. But it also gives result 'D' which is an unwanted result. According to rule (3a) we can use test T1 since it includes 'A' but does not include 'D'.
Therefore we can test for A with
A <- T4 && T1
To find 'B' we quickly find test T2 which is the shortest test for 'B' and since it gives only result 'B' we are finished.
B <- T2
To find 'C' we start with T1 and T3. Since the results of these tests are equally short we arbitrarily choose T1 as the starting point.
Now according to (3a) we need to find a test that includes 'C' but not 'A'. No test satisfies this condition, so (3a) gives us nothing to combine with T1. T3 has the same problem.
Being unable to find a test that satisfies (3a) we now look for a test that satisfies condition (3b). We look for a test that gives 'A' but not 'C'. We can see that test T4 satisfies this condition, so therefore we can test for C with
C <- T1 && ¬T4
To find D we start with T4. T4 includes unwanted result A. There are no other tests that give the result D but not A so we look for a test that gives A but not D. Test T1 satisfies this condition so therefore we can test for D with
D <- T4 && ¬T1
These results are good but I don't think I've quite debugged this algorithm enough to have 100% confidence. I'm going to think about it a bit more and maybe code up some tests to see how it holds up. Unfortunately the algorithm is just complex enough that it will take more than a few minutes to implement carefully. It might be days before I conclude anything further.
Update
I found that it is optimal to simultaneously look for tests that satisfy (a) OR (b) rather than look for (a) and then (b). If we look first for (a) we might get a long list of tests when we might have got a shorter list by allowing some (b) tests.
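For what it's worth, here is a rough Python sketch of rules (1) to (3), trying (a) before (b) rather than the simultaneous search from the update. The contents of the `TESTS` table are an assumption: I filled them in from what the walkthrough says about each test (T2 gives only B, T3 gives A, C or D, T4 gives A or D, T1 gives A and C but not D), so treat them as illustrative. The function names are hypothetical too.

```python
# Assumed test table, inferred from the walkthrough (not authoritative).
TESTS = {
    "T1": {"A", "C"},
    "T2": {"B"},
    "T3": {"A", "C", "D"},
    "T4": {"A", "D"},
}


def find_expression(target, tests):
    """Return [(test_name, must_be_true), ...] that isolates target, or None."""
    by_size = sorted(tests, key=lambda name: (len(tests[name]), name))
    # Rule 1: start from the smallest test that gives the target.
    start = next((n for n in by_size if target in tests[n]), None)
    if start is None:
        return None
    terms = [(start, True)]
    unwanted = set(tests[start]) - {target}
    # Rule 2 is the case where unwanted is already empty.
    while unwanted:
        x = min(unwanted)
        # Rule 3a: a test that gives the target but not x.
        pick = next(((n, True) for n in by_size
                     if target in tests[n] and x not in tests[n]), None)
        if pick is None:
            # Rule 3b: a test that gives x but not the target, used negated.
            pick = next(((n, False) for n in by_size
                         if x in tests[n] and target not in tests[n]), None)
        if pick is None:
            return None  # x cannot be separated from the target
        terms.append(pick)
        name, positive = pick
        unwanted = unwanted & tests[name] if positive else unwanted - tests[name]
    return terms


def show(target, tests):
    """Format an expression in the same notation as the summary above."""
    terms = find_expression(target, tests)
    return target + " <- " + " && ".join(n if pos else "¬" + n for n, pos in terms)


for result in "ABCD":
    print(show(result, TESTS))
```

With this table the sketch reproduces the expressions for B, C and D from the walkthrough; for A it picks the same two tests but starts from T1 instead of T4, because ties between equally short tests are broken by name.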
You could create a tree.
Pick a pixel, and divide the letters into two buckets, based on the pixel being white or black. Then pick a second pixel, split the buckets into two buckets each based on that pixel and so on.
You could try to optimize the depth of the tree by choosing pixels which give buckets which are approximately equal in size.
Creating the tree is a one time preprocess step. You should not have to do it multiple times.
Now when you get a letter to match, follow the tree based on the pixels set/not set and get your letter.
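A minimal sketch of such a tree in Python, under a few assumptions of mine: glyphs are flattened '0'/'1' strings, a node is a `(pixel, white_subtree, black_subtree)` tuple, a leaf is just the letter's name, and the split pixel is the one that gives the most balanced buckets. The toy 2x2 glyphs are illustrative data, not real letters.

```python
def build_tree(glyphs):
    """Recursively split {name: bits} on the most balanced pixel; leaves are names."""
    if len(glyphs) == 1:
        return next(iter(glyphs))
    size = len(next(iter(glyphs.values())))
    black_count = [sum(bits[i] == "1" for bits in glyphs.values()) for i in range(size)]
    # Only pixels that actually separate the set are useful.
    candidates = [i for i in range(size) if 0 < black_count[i] < len(glyphs)]
    if not candidates:
        raise ValueError("two glyphs are identical")
    half = len(glyphs) / 2
    pixel = min(candidates, key=lambda i: abs(black_count[i] - half))
    white = {n: b for n, b in glyphs.items() if b[pixel] == "0"}
    black = {n: b for n, b in glyphs.items() if b[pixel] == "1"}
    return (pixel, build_tree(white), build_tree(black))


def classify(tree, bits):
    """Follow the tree using only the pixels it asks about."""
    while isinstance(tree, tuple):
        pixel, white, black = tree
        tree = black if bits[pixel] == "1" else white
    return tree


# Toy 2x2 glyphs, flattened row by row (illustrative data only).
GLYPHS = {"A": "0110", "B": "0011", "C": "1110", "D": "0010"}
TREE = build_tree(GLYPHS)  # one-time preprocessing step

print(TREE)
print([classify(TREE, bits) for bits in GLYPHS.values()])
```

Note that `classify` only looks at the pixels on one path, so it will happily return a letter for an image that isn't any of the known glyphs. If that matters, do a full comparison against the returned letter afterwards.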