Tesseract 3 is able to perform page layout analysis. However, I couldn't find any sample code or documentation on how to use the library for such purposes. I hope someone h
This may not answer your question exactly, but I landed here looking for a way to get bounding-box coordinates (and, optionally, the text recognised inside each box) for an input image. This is now possible with tesseract:
$> tesseract test.tiff test.txt -l eng -psm 1 tsv
The parameters to notice in the command above are 'psm' and 'tsv'. 'psm' selects the page segmentation mode, and 'tsv' generates a tabular output file with all the information you need about the text in the image (page, block and line numbers, bbox coordinates, confidence, predicted text), as shown below:
level page_num block_num par_num line_num word_num left top width height conf text
1 1 0 0 0 0 0 0 5500 4250 -1
2 1 1 0 0 0 327 285 2218 53 -1
3 1 1 1 0 0 327 285 2218 53 -1
4 1 1 1 1 0 327 285 2218 53 -1
5 1 1 1 1 1 327 285 246 38 87 INFOPAC
5 1 1 1 1 2 620 287 165 38 87 PAGE
5 1 1 1 1 3 952 290 100 37 95 NAME
5 1 1 1 1 4 1173 292 1082 45 39 ENTRYDATE
5 1 1 1 1 5 2333 302 212 36 48 EMAIL
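If you want to consume that table from code, here is a minimal sketch of a parser. It assumes the file on disk is tab-separated (the sample above is shown with spaces) and uses the column order from the header row. The names `TsvRow` and `parse_tsv_line` are made up for this example and are not part of Tesseract. The `right()` and `bottom()` helpers convert the left/top/width/height columns into corner coordinates.

```cpp
#include <exception>
#include <optional>
#include <sstream>
#include <string>
#include <vector>

// One data row of the TSV output; field names follow the header row.
// (TsvRow is a hypothetical helper type, not a Tesseract class.)
struct TsvRow {
    int level = 0, page_num = 0, block_num = 0, par_num = 0;
    int line_num = 0, word_num = 0;
    int left = 0, top = 0, width = 0, height = 0;
    double conf = -1;   // parsed as double so integer and fractional values both work
    std::string text;   // empty for rows that carry no recognised word

    int right() const { return left + width; }
    int bottom() const { return top + height; }
};

// Parses one tab-separated line. Returns std::nullopt for the header row
// or for any line whose numeric columns cannot be parsed.
std::optional<TsvRow> parse_tsv_line(const std::string& line) {
    std::vector<std::string> f;
    std::string field;
    std::istringstream in(line);
    while (std::getline(in, field, '\t')) f.push_back(field);
    if (f.size() < 11) return std::nullopt;  // need at least level..conf

    try {
        TsvRow r;
        r.level = std::stoi(f[0]);
        r.page_num = std::stoi(f[1]);
        r.block_num = std::stoi(f[2]);
        r.par_num = std::stoi(f[3]);
        r.line_num = std::stoi(f[4]);
        r.word_num = std::stoi(f[5]);
        r.left = std::stoi(f[6]);
        r.top = std::stoi(f[7]);
        r.width = std::stoi(f[8]);
        r.height = std::stoi(f[9]);
        r.conf = std::stod(f[10]);
        if (f.size() > 11) r.text = f[11];  // text column may be absent
        return r;
    } catch (const std::exception&) {
        return std::nullopt;  // e.g. the header line, where "level" is not a number
    }
}
```

To use it, read the output file line by line with `std::getline`, call `parse_tsv_line` on each line, and keep the rows you care about. In the sample above, the rows with `level` 5 are the ones that carry words.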
First, initialize a TessBaseAPI instance. You can use either Init() (if you want to perform further text recognition) or InitForAnalysePage() (if you're interested just in text boxes).
Second, set the image using SetImage().
And finally, call AnalyseLayout() to get a PageIterator, which provides you with text boxes.
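#include <tesseract/baseapi.h>

tesseract::TessBaseAPI tessApi;
tessApi.InitForAnalysePage();
// tessApi.SetImage(...);
tesseract::PageIterator *iter = tessApi.AnalyseLayout();
if (iter != nullptr) {  // AnalyseLayout() returns null if no image is set or on error
    // Instead of RIL_WORD you can use any other PageIteratorLevel
    // (RIL_BLOCK, RIL_PARA, RIL_TEXTLINE, RIL_SYMBOL).
    // do/while, because the iterator already points at the first
    // element; calling Next() first would skip it.
    do {
        int left, top, right, bottom;
        if (iter->BoundingBox(tesseract::RIL_WORD,
                              &left, &top, &right, &bottom)) {
            // use the box coordinates here
        }
    } while (iter->Next(tesseract::RIL_WORD));
    delete iter;  // the caller owns the returned iterator
}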
Tesseract can be given a page segmentation mode parameter (-psm), which can have the following values:
0 = Orientation and script detection (OSD) only.
1 = Automatic page segmentation with OSD.
2 = Automatic page segmentation, but no OSD, or OCR.
3 = Fully automatic page segmentation, but no OSD. (Default)
4 = Assume a single column of text of variable sizes.
5 = Assume a single uniform block of vertically aligned text.
6 = Assume a single uniform block of text.
7 = Treat the image as a single text line.
8 = Treat the image as a single word.
9 = Treat the image as a single word in a circle.
10 = Treat the image as a single character.
Example:
tesseract image.tif image.txt -l eng -psm 0
However, I am not sure that it is possible to use the layout analysis in standalone mode.
Since 3.04 there is an option that preserves the spacing between words in the output:
tesseract -c preserve_interword_spaces=1 test.tif test
Here is a reference to what looks like the related development thread.