I'm trying to write an application to find the numbers inside an image and add them up.
How can I identify the written number in an image?
In most image processing problems you want to leverage as much of the information you have as possible. Given the image, there are assumptions we can make (and possibly more):
Then we can simplify the problem using those assumptions:
The gist, though, is to use any assumptions you can to reduce the problem into smaller, simpler subproblems. Then look to see what tools are available to solve each of those subproblems individually.
Assumptions are also harder to make once you have to start worrying about the real world: if these forms will be scanned in, you'll need to consider skew or rotation of the "template" or of the numbers.
Neural networks are a typical approach to this kind of problem.
In this scenario, you can consider each handwritten number a matrix of pixels. You may get better results if you train the neural network with images of the same size as the image you want to recognize.
You can train the neural network with different images of handwritten numbers. Once trained, if you pass the image of the handwritten number to identify, it will return the most similar number.
Of course, the quality of training images is a key factor to get good results.
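As a rough sketch of that idea in Python (using scikit-learn's small built-in 8x8 digits dataset rather than your own cropped scans, so the dataset and network size here are illustrative only):

from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

# Illustrative only: trains a small network on scikit-learn's 8x8 digit images
# instead of images cropped from your own form.
digits = load_digits()                              # 1797 images, 8x8 pixels each
X = digits.images.reshape(len(digits.images), -1)   # flatten to 64-value vectors
y = digits.target                                   # labels 0-9

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = MLPClassifier(hidden_layer_sizes=(64,), max_iter=500, random_state=0)
clf.fit(X_train, y_train)                           # training phase

print(clf.score(X_test, y_test))                    # accuracy on unseen digits
print(clf.predict(X_test[:1]))                      # predicted digit for one image

To apply it to your form, you would crop each handwritten cell, resize it to the same dimensions as the training images, and flatten it the same way before calling predict.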
Give it up. Really. Even as a human I cannot say for sure whether the third character is a '1' or a '7'. Humans are better at deciphering handwriting, so a computer will fail at this. '1' vs. '7' is only one problematic case; '8' vs. '6' and '3' vs. '9' are also hard to distinguish. Your error rate will be >10%. If all the handwriting is from the same person you could try to train an OCR engine for that person, but even then you will still see roughly 3% errors. Your use case may be special, but that level of error usually rules out any kind of automated processing. If I really had to automate this, I would look into Mechanical Turk.
Here's a simple approach:
Obtain binary image. Load the image, convert to grayscale, then apply Otsu's threshold to get a 1-channel binary image whose pixels are either 0 or 255.
Detect horizontal and vertical lines. Create horizontal and vertical structuring elements then draw lines onto a mask by performing morphological operations.
Remove horizontal and vertical lines. Combine the horizontal and vertical masks with a bitwise_or operation, then remove the lines by inverting the masked pixels of the original image with bitwise_not.
Perform OCR. Apply a slight Gaussian blur then OCR using Pytesseract.
Here's a visualization of each step:
Input image -> Binary image -> Horizontal mask -> Vertical mask
Combined masks -> Result -> Applied slight blur

Result from OCR:

38
18
78
I implemented it in Python, but you can adapt a similar approach using Java.
import cv2
import pytesseract
pytesseract.pytesseract.tesseract_cmd = r"C:\Program Files\Tesseract-OCR\tesseract.exe"
# Load image, grayscale, Otsu's threshold
image = cv2.imread('1.png')
gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
thresh = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)[1]
# Detect horizontal lines
horizontal_kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (25,1))
horizontal = cv2.morphologyEx(thresh, cv2.MORPH_OPEN, horizontal_kernel, iterations=1)
# Detect vertical lines
vertical_kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (1,25))
vertical = cv2.morphologyEx(thresh, cv2.MORPH_OPEN, vertical_kernel, iterations=1)
# Remove horizontal and vertical lines
lines = cv2.bitwise_or(horizontal, vertical)
result = cv2.bitwise_not(image, image, mask=lines)
# Perform OCR with Pytesseract
result = cv2.GaussianBlur(result, (3,3), 0)
data = pytesseract.image_to_string(result, lang='eng', config='--psm 6')
print(data)
# Display
cv2.imshow('thresh', thresh)
cv2.imshow('horizontal', horizontal)
cv2.imshow('vertical', vertical)
cv2.imshow('lines', lines)
cv2.imshow('result', result)
cv2.waitKey()
I would recommend combining two basic neural network components: a perceptron and a Self Organizing Map (SOM).
A perceptron is a very simple neural network component. It takes multiple inputs and produces one output. You need to train it by feeding it both inputs and outputs; it's a self-learning component.
Internally it has a collection of weight factors, which are used to calculate the output. These weight factors are refined during training. The beautiful thing about a perceptron is that (with proper training) it can handle data it has never seen before.
You can make a perceptron more powerful by arranging it in a multi-layer network, meaning that the output of one perceptron acts as the input of another perceptron.
In your case you should use 10 perceptron networks, one for each numeric value (0-9).
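As a rough illustration of a single perceptron (a minimal NumPy sketch; the learning rate and epoch count are purely illustrative, not a ready-made digit recognizer):

import numpy as np

# Minimal single perceptron: learns a binary "is this my digit?" decision
# from flattened pixel vectors. Learning rate and epochs are illustrative.
def train_perceptron(X, y, epochs=20, lr=0.1):
    w = np.zeros(X.shape[1])               # one weight per input pixel
    b = 0.0                                # bias term
    for _ in range(epochs):
        for xi, target in zip(X, y):
            pred = 1 if xi.dot(w) + b > 0 else 0
            update = lr * (target - pred)  # zero when the prediction is correct
            w += update * xi
            b += update
    return w, b

def predict(x, w, b):
    return 1 if x.dot(w) + b > 0 else 0

Training ten of these, one per digit, gives you the networks mentioned above; the digit whose perceptron fires (or scores highest) is the recognized value.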
But in order to use perceptrons you will need an array of numeric inputs. So first you need something to convert your visual image to numeric values. A Self Organizing Map (SOM) uses a grid of inter-connected points. The points should be attracted to the pixels of your image (see below).
The 2 components work well together. The SOM has a fixed number of grid-nodes, and your perceptron needs a fixed number of inputs.
Both components are really popular and are available in educational software packages such as MATLAB.
This video tutorial demonstrates how it can be done in Python using Google's TensorFlow framework (a written version of the tutorial is also available).
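If you want a sketch of the idea without watching the video, a minimal TensorFlow/Keras digit classifier trained on MNIST looks roughly like this (layer sizes and epoch count are placeholders, not taken from the tutorial):

import tensorflow as tf

# Illustrative MNIST digit classifier; layer sizes and epochs are placeholders.
(x_train, y_train), (x_test, y_test) = tf.keras.datasets.mnist.load_data()
x_train, x_test = x_train / 255.0, x_test / 255.0     # scale pixels to [0, 1]

model = tf.keras.Sequential([
    tf.keras.layers.Flatten(input_shape=(28, 28)),    # 28x28 image -> 784 inputs
    tf.keras.layers.Dense(128, activation='relu'),
    tf.keras.layers.Dense(10, activation='softmax'),  # one output per digit
])
model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])

model.fit(x_train, y_train, epochs=5)
print(model.evaluate(x_test, y_test))                 # [loss, accuracy]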
You will most likely need to do the following:
Apply the Hough Transform algorithm to the entire page; this should yield a series of page sections.
For each section you get, apply it again. If the current section yields two elements, you should be dealing with a rectangle similar to the one above.
Once you are done, you can use OCR to extract the numeric value.
In this case, I would recommend you take a look at JavaCV (OpenCV Java Wrapper) which should allow you to tackle the Hough Transform part. You would then need something akin to Tess4j (Tesseract Java Wrapper) which should allow you to extract the numbers you are after.
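A rough sketch of the line-detection part in Python/OpenCV (the equivalent calls exist in JavaCV; the file name and thresholds below are placeholders you would need to tune):

import cv2
import numpy as np

# Placeholder file name and thresholds; tune these for your scans.
image = cv2.imread('page.png')
gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
edges = cv2.Canny(gray, 50, 150)

# Probabilistic Hough Transform: returns line segments as (x1, y1, x2, y2)
lines = cv2.HoughLinesP(edges, 1, np.pi / 180, threshold=100,
                        minLineLength=50, maxLineGap=10)

if lines is not None:
    for line in lines:
        x1, y1, x2, y2 = line[0]
        cv2.line(image, (x1, y1), (x2, y2), (0, 0, 255), 2)

cv2.imshow('detected lines', image)
cv2.waitKey()

Long horizontal and vertical segments found this way mark the rectangle borders; grouping them gives you each cell's bounding box, which you can crop and hand to Tesseract.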
As an extra note, to reduce the amount of false positives, you might want to do the following:
Crop the image if you are certain that particular regions will never contain the data you are after. This should give you a smaller picture to work with.
It might be wise to change the image to grey scale (assuming you are working with a colour image). Colours can have a negative impact on the OCR's ability to resolve the image.
EDIT: As per your comment, given something like this:
+------------------------------+
| +---+---+ |
| | | | |
| +---+---+ |
| +---+---+ |
| | | | |
| +---+---+ |
| +---+---+ |
| | | | |
| +---+---+ |
| +---+---+ |
| | | | |
| +---+---+ |
+------------------------------+
You would crop the image to remove the area which does not contain relevant data (the part on the left); after cropping, you would get something like this:
+-------------+
|+---+---+ |
|| | | |
|+---+---+ |
|+---+---+ |
|| | | |
|+---+---+ |
|+---+---+ |
|| | | |
|+---+---+ |
|+---+---+ |
|| | | |
|+---+---+ |
+-------------+
The idea would be to run the Hough Transform so that you can get segments of the page which contain rectangles like so:
+---+---+
| | |
+---+---+
You would then apply the Hough Transform again, end up with two segments, and take the left one.
Once you have the left segment, you would then apply the OCR.
You can try to apply the OCR beforehand, but at best it will recognize both numeric values, the handwritten one and the typed one, which from what I gather is not what you are after.
Also, the extra lines which depict the rectangles might throw the OCR off track, and make it yield bad results.