The OCRopus Open Source System
Developers - In Depth
Written by Jenny Curtis   
Friday, 09 April 2010

           OCRopus is an open source Optical Character Recognition system which allows easy evaluation as well as reuse of the OCR components by researchers and companies. OCRopus uses the Apache 2 license, which allows contributions to be used commercially.

           OCRopus is designed for multi-script and multilingual recognition. It uses Unicode throughout and relies on the HTML and CSS standards to represent typographic phenomena in many scripts and languages. The system makes it easy to integrate both existing and new algorithms.
           The architecture of the OCRopus system has three major components:
               1. Physical layout analysis, which enables the identification of text columns, text blocks, and text lines.
               2. Text line recognition, which recognizes the text in each line (lines may also be vertical or right-to-left) and represents the possible recognition alternatives as a hypothesis graph.
               3. Statistical language modeling, which allows the integration of the recognition alternatives with knowledge about language, grammar, vocabulary, as well as the domain of the document.
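
The interplay between text line recognition and language modeling can be illustrated with a minimal Python sketch. It is not OCRopus code: the hypothesis graph is reduced to a simple chain of per-character alternatives, the language model is a tiny word-probability table, and the names `recognizer_only`, `best_reading`, `LATTICE` and `WORD_PROBS`, as well as all probabilities, are assumptions made up for illustration.

```python
import itertools
import math


def recognizer_only(lattice):
    """Pick the most probable alternative at each position, ignoring language."""
    return "".join(max(alts, key=lambda alt: alt[1])[0] for alts in lattice)


def best_reading(lattice, word_probs, unknown_prob=1e-6):
    """Return the reading with the best combined recognizer + language score.

    Scores are summed log-probabilities. Readings missing from the word
    table receive a small floor probability instead of zero.
    """
    best, best_score = None, -math.inf
    # Brute-force enumeration of every path; fine for a toy lattice only.
    for path in itertools.product(*lattice):
        text = "".join(ch for ch, _ in path)
        recog_score = sum(math.log(p) for _, p in path)
        lang_score = math.log(word_probs.get(text, unknown_prob))
        score = recog_score + lang_score
        if score > best_score:
            best, best_score = text, score
    return best


# Hypothetical recognizer output: (character, probability) per position.
LATTICE = [
    [("c", 0.6), ("e", 0.4)],
    [("l", 0.4), ("1", 0.6)],
    [("o", 0.45), ("0", 0.55)],
    [("c", 0.7), ("e", 0.3)],
    [("k", 0.9), ("h", 0.1)],
]

# Hypothetical language model: a very small word-probability table.
WORD_PROBS = {"clock": 0.01, "block": 0.01}

if __name__ == "__main__":
    print(recognizer_only(LATTICE))
    print(best_reading(LATTICE, WORD_PROBS))
```

With these made-up numbers the recognizer alone prefers "c10ck", while the combined score selects "clock". A real hypothesis graph would also encode alternative segmentations, and a real system would search it efficiently rather than enumerate every path.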

        The OCRopus system attempts to approximate well-defined statistical measures at each of the processing steps. The processing steps include:

  • Preprocessing and Cleanup - these steps reduce overall Optical Character Recognition error rates. OCRopus provides a scriptable toolbox that lets users construct preprocessing and cleanup pipelines to suit their specific needs.
  • Layout Analysis - raw input images are divided into text and non-text regions. Each region is classified as text, line drawing, grayscale image, ruling, etc.
  • Text Line Recognition - each text line is passed to a text line recognizer for the input document’s language.
  • Statistical Language Modeling - statistical language models, such as dictionaries and stochastic grammars, associate probabilities with strings. This resolves missing or ambiguous characters to their most likely interpretation.
  • hOCR Output - the system generates a final result that is suitable for reading, further processing, editing, indexing, etc. hOCR documents are standards-compliant HTML or XHTML documents, so they can be viewed, edited, or indexed using existing HTML tools.
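
To make the hOCR output step more concrete, here is a minimal Python sketch that wraps recognized lines in hOCR-style HTML. It is not the OCRopus writer: the function name `to_hocr` and the sample data are hypothetical, and the markup assumes the commonly used hOCR conventions of `ocr_page` and `ocr_line` class names with a `bbox x0 y0 x1 y1` entry in the `title` attribute.

```python
from html import escape


def to_hocr(lines, page_bbox):
    """Build a small hOCR-style HTML document.

    lines     -- list of (text, (x0, y0, x1, y1)) tuples, one per text line
    page_bbox -- (x0, y0, x1, y1) bounding box of the whole page
    """
    px0, py0, px1, py1 = page_bbox
    parts = [
        "<!DOCTYPE html>",
        "<html>",
        "<head>",
        '<meta charset="utf-8">',
        "<title>OCR result</title>",
        "</head>",
        "<body>",
        f'<div class="ocr_page" title="bbox {px0} {py0} {px1} {py1}">',
    ]
    for text, (x0, y0, x1, y1) in lines:
        # Escape the recognized text so characters like & and < stay valid HTML.
        parts.append(
            f'<span class="ocr_line" title="bbox {x0} {y0} {x1} {y1}">'
            f"{escape(text)}</span>"
        )
    parts += ["</div>", "</body>", "</html>"]
    return "\n".join(parts)


if __name__ == "__main__":
    sample = [
        ("Tom & Jerry", (10, 20, 300, 50)),
        ("Second line", (10, 60, 280, 90)),
    ]
    print(to_hocr(sample, (0, 0, 640, 480)))
```

Because the result is ordinary HTML, it opens in any browser and can be processed with standard HTML parsers, while the class names and bounding boxes preserve the layout information found during analysis.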