Question
OCR-generated texts sometimes come with artifacts, such as this one:
Diese grundsätzliche V e r b o r g e n h e i t Gottes, die sich n u r dem N a c h f o l g e r ö f f n e t , ist m i t d e m Messiasgeheimnis gemeint
While it is not unusual for spacing between letters to be used as emphasis (probably due to early printing press limitations), it is unfavorable for retrieval tasks.
How can one turn the above text into a more canonical form, like:
Diese grundsätzliche Verborgenheit Gottes, die sich nur dem Nachfolger öffnet, ist mit dem Messiasgeheimnis gemeint
Can this be done efficiently for large amounts of text?
One idea would be to concatenate the whole string (to skip guessing where the word boundaries are) and then run a text segmentation algorithm on it, perhaps something similar to this: http://norvig.com/ngrams/
Answer 1:
If you have a dictionary for the target language, and every spaced-out run corresponds to a single word, then it's easy: just scan through the text looking for maximal-length runs of spaced-out single letters, and replace each run with the corresponding dictionary word if it exists (and otherwise leave it unchanged).
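A minimal sketch of that scan-and-replace pass in Python might look like this; the toy dictionary, the regular expression, and the join_runs helper are illustrative assumptions, not code from the answer:

    import re

    dictionary = {"verborgenheit", "nur", "nachfolger", "öffnet"}  # toy dictionary

    # A maximal run of single letters separated by single spaces,
    # e.g. "V e r b o r g e n h e i t" (\w also covers umlauts in Python 3).
    RUN = re.compile(r"\b\w(?: \w)+\b")

    def join_runs(text):
        def repl(m):
            candidate = m.group(0).replace(" ", "")
            # Only join the run if the result is a dictionary word.
            return candidate if candidate.lower() in dictionary else m.group(0)
        return RUN.sub(repl, text)

    print(join_runs("Diese grundsätzliche V e r b o r g e n h e i t Gottes"))
    # -> Diese grundsätzliche Verborgenheit Gottes

Note that a multi-word run like "m i t d e m" joins to "mitdem", which is not in the dictionary and is therefore left unchanged here; that is exactly the difficulty addressed next.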
The only real difficulty is with strings like "m i t d e m" that correspond to two or more separate words. A simple way would be to greedily "nibble off" prefixes that appear in the dictionary, but this can lead to suboptimal results, and in particular to a leftover suffix that doesn't correspond to any dictionary string even though a different choice of breakpoints would have worked (e.g. "b e i m A r z t" won't work if you greedily grab "bei" instead of "beim" from the front). Fortunately there's a simple DP approach -- linear-time if the maximum word length is bounded -- that will do a better job, and can even incorporate weights on words, which helps to pick the most likely decomposition when there is more than one. Given a string S[1 .. n] (with spaces removed), we compute f(i), the score of the best decomposition of the length-i prefix of S, for all 1 <= i <= n:
f(0) = 0
f(i) = max over all 0 <= j < i of f(j) + dictScore(S[j+1 .. i])
f(n) will then be the score of the best possible decomposition of the entire string. If you set dictScore(T) to 1 for words that exist in the dictionary and 0 for words that don't, you will get a decomposition containing as many dictionary words as possible; if you set dictScore(T) to, e.g., -1 for words that exist in the dictionary and -2 for words that don't, you'll get a decomposition into as few words as possible, breaking ties in favor of dictionary words. You can also choose to award higher scores for more "likely" words.
After computing these scores, you can walk back through the DP table to reconstruct a decomposition that achieves the maximal score.
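A compact Python sketch of this DP, including the walk-back, might look as follows; the function name segment, the dict_score parameter, and the toy word set are assumptions for illustration, while the recurrence is the one given above:

    def segment(s, dict_score):
        # s is the input with all spaces removed; dict_score(word) returns the
        # score of a candidate word, e.g. 1 for dictionary words, 0 otherwise.
        n = len(s)
        f = [float("-inf")] * (n + 1)  # f[i] = best score of the prefix s[:i]
        f[0] = 0.0
        back = [0] * (n + 1)  # back[i] = start of the last word in the best split of s[:i]
        for i in range(1, n + 1):
            for j in range(i):  # O(n^2); cap i - j at the longest dictionary word for linear time
                score = f[j] + dict_score(s[j:i])
                if score > f[i]:
                    f[i] = score
                    back[i] = j
        # Walk back through the table to recover the decomposition itself.
        words = []
        i = n
        while i > 0:
            words.append(s[back[i]:i])
            i = back[i]
        return list(reversed(words))

    words = {"mit", "dem", "bei", "beim", "arzt"}
    print(segment("mitdem", lambda w: 1 if w.lower() in words else 0))    # ['mit', 'dem']
    print(segment("beimarzt", lambda w: 1 if w.lower() in words else 0))  # ['beim', 'arzt']

The same routine works unchanged with the other scoring schemes mentioned above.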
Source: https://stackoverflow.com/questions/31136692/how-to-remove-ocr-artifacts-from-text