OCRopus is an open-source optical character recognition (OCR) system that makes it easy for researchers and companies to evaluate and reuse its OCR components. OCRopus uses the Apache 2 license, which allows contributions to be used commercially.
OCRopus is designed for multi-script and multilingual recognition. It uses Unicode throughout and relies on the HTML and CSS standards to represent typographic phenomena in many scripts and languages. The system makes it easy to integrate both existing and new algorithms.
The architecture of the OCRopus system has three major components:
1. Physical layout analysis, which enables the identification of text columns, text blocks, and text lines.
2. Text line recognition, which enables the recognition of the text in each line (lines may also be vertical or right-to-left). It represents the possible recognition alternatives as a hypothesis graph.
3. Statistical language modeling, which allows the integration of the recognition alternatives with knowledge about language, grammar, vocabulary, as well as the domain of the document.
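The interplay between components 2 and 3 can be illustrated with a small sketch. It is not OCRopus code: the lattice, the vocabulary, the probabilities, and the names best_reading and UNKNOWN_WORD_PROB are all assumptions made for illustration. The hypothesis graph is also simplified to one list of character alternatives per position, whereas a real graph can encode alternative segmentations as well.

```python
import itertools
import math

# Hypothetical recognizer output: one list of (character, probability)
# alternatives per position. This linear lattice is a simplified stand-in
# for a hypothesis graph.
LATTICE = [
    [("c", 0.9), ("e", 0.1)],
    [("1", 0.6), ("l", 0.4)],
    [("e", 0.9), ("c", 0.1)],
    [("a", 0.8), ("o", 0.2)],
    [("r", 0.9), ("n", 0.1)],
]

# Hypothetical language model: a tiny dictionary with word probabilities.
VOCABULARY = {"clear": 0.7, "clean": 0.3}
UNKNOWN_WORD_PROB = 1e-6  # assumed floor for strings not in the dictionary


def recognizer_log_prob(path):
    """Sum of log probabilities the recognizer assigned to each character."""
    return sum(math.log(prob) for _, prob in path)


def language_log_prob(text):
    """Log probability of the whole string under the dictionary model."""
    return math.log(VOCABULARY.get(text, UNKNOWN_WORD_PROB))


def best_reading(lattice, use_language_model=True):
    """Enumerate every path through the lattice and return the best string."""
    best_text, best_score = None, -math.inf
    for path in itertools.product(*lattice):
        text = "".join(char for char, _ in path)
        score = recognizer_log_prob(path)
        if use_language_model:
            score += language_log_prob(text)
        if score > best_score:
            best_text, best_score = text, score
    return best_text


if __name__ == "__main__":
    print(best_reading(LATTICE, use_language_model=False))
    print(best_reading(LATTICE))
```

On this toy input the recognizer alone prefers the digit "1" in the second position, while adding the dictionary score favors a reading that is an actual word. Exhaustive enumeration is used only to keep the sketch short; a practical system would search the graph with dynamic programming instead.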
The OCRopus system attempts to approximate well-defined statistical measures at each of the processing steps. The processing steps include:
Preprocessing and Cleanup - these steps reduce overall OCR error rates. OCRopus provides a scriptable toolbox which allows users to construct preprocessing and cleanup pipelines to suit their specific needs.
Layout Analysis - raw input images are divided into text and non-text regions. Each region is classified as text, line drawing, grayscale image, ruling, etc.
Text Line Recognition - each text line is passed to a text line recognizer for the input document’s language.
Statistical Language Modeling - statistical language models, such as dictionaries and stochastic grammars, associate probabilities with strings. These probabilities are used to resolve missing or ambiguous characters to their most likely interpretation.
hOCR Output - the system generates a final result, which is suitable for reading, further processing, editing, indexing, etc. hOCR documents are standards-compliant HTML or XHTML documents, so they can be viewed, edited, or indexed using existing HTML tools.
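Because hOCR is ordinary HTML, generic HTML tooling is enough to read it. The following is a minimal sketch using only the Python standard library. It assumes the common hOCR conventions of marking text lines with class "ocr_line" and storing the bounding box in the title attribute as "bbox x0 y0 x1 y1". The sample document and the names parse_bbox, LineExtractor, and extract_lines are made up for illustration and are not part of OCRopus.

```python
from html.parser import HTMLParser

# Hypothetical hOCR fragment, written by hand for this example.
SAMPLE_HOCR = """<html><body>
<div class="ocr_page" title="bbox 0 0 800 600">
<span class="ocr_line" title="bbox 10 20 300 45"><span class="ocrx_word" title="bbox 10 20 120 45">Hello</span> <span class="ocrx_word" title="bbox 130 20 300 45">world</span></span>
<span class="ocr_line" title="bbox 10 50 280 75">second line</span>
</div>
</body></html>"""


def parse_bbox(title):
    """Return (x0, y0, x1, y1) from a title such as 'bbox 10 20 300 45; ...'."""
    for prop in title.split(";"):
        parts = prop.split()
        if len(parts) == 5 and parts[0] == "bbox":
            return tuple(int(value) for value in parts[1:])
    return None


class LineExtractor(HTMLParser):
    """Collect (bbox, text) for every element whose class includes ocr_line.

    Simplification: void tags without a closing slash (e.g. <br>) inside a
    line are not handled, since they would unbalance the depth counter.
    """

    def __init__(self):
        super().__init__()
        self.lines = []
        self._depth = 0  # nesting depth inside the current ocr_line, 0 if none
        self._bbox = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if self._depth:
            self._depth += 1  # nested element, e.g. an ocrx_word span
            return
        attributes = dict(attrs)
        if "ocr_line" in (attributes.get("class") or "").split():
            self._depth = 1
            self._bbox = parse_bbox(attributes.get("title") or "")
            self._text = []

    def handle_data(self, data):
        if self._depth:
            self._text.append(data)

    def handle_endtag(self, tag):
        if self._depth:
            self._depth -= 1
            if self._depth == 0:  # the ocr_line element itself just closed
                self.lines.append((self._bbox, "".join(self._text).strip()))


def extract_lines(hocr):
    extractor = LineExtractor()
    extractor.feed(hocr)
    return extractor.lines


if __name__ == "__main__":
    for bbox, text in extract_lines(SAMPLE_HOCR):
        print(bbox, text)
```

The same approach extends to other hOCR element classes, such as pages or words, by matching a different class name.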