Written by Jenny Curtis   
Friday, 15 January 2010

         Tesseract is a free optical character recognition (OCR) engine, originally developed as proprietary software by Hewlett-Packard and later revived by Google. First developed between 1985 and 1994, it is one of the oldest engines of its kind.

          In 1995 it was one of the top three engines in the UNLV accuracy test. In 2005, after almost no work had been done on it for ten years, HP and UNLV decided to release it as open source. Google specialists then started work on reviving the two-decade-old OCR engine. The project was part of Google's overall goal of organizing and indexing the world’s information. With Tesseract freely available, other institutions and engineers would also be able to help digitize information held on paper.
          Tesseract is currently released under the Apache License, Version 2.0. It processes TIFF images of single-column text and produces plain text. Images in other formats need to be converted to TIFF before being submitted to Tesseract. It is still considered one of the most accurate OCR engines available. As a raw OCR engine, Tesseract has no graphical user interface, no output formatting, and no document layout analysis, which means that it cannot interpret multi-column text or equations. This makes it suitable for use as a backend. Although the developers actively test only Windows and Ubuntu Linux, Tesseract can also be used successfully on Mac OS X. The supported languages are English, Spanish, French, German, Italian, Dutch and Brazilian Portuguese, and it can be trained to work with other languages.
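
          Because Tesseract has no interface of its own, it is usually driven from a script. The following is a minimal sketch in Python of the workflow described above: convert an image to TIFF, then call the tesseract command-line program. It assumes that the tesseract executable is on the PATH and that the third-party Pillow library is installed for the conversion. The helper names (to_tiff, build_command, recognize) are made up for illustration, and "eng" is assumed to be the language code for English.

```python
import subprocess
from pathlib import Path


def to_tiff(image_path):
    """Return a TIFF version of the image, converting it if necessary."""
    path = Path(image_path)
    if path.suffix.lower() in (".tif", ".tiff"):
        return path  # already TIFF, nothing to convert
    # Assumption: Pillow is installed; imported here so the other
    # helpers remain usable without it.
    from PIL import Image

    tiff_path = path.with_suffix(".tif")
    Image.open(path).save(tiff_path, format="TIFF")
    return tiff_path


def build_command(tiff_path, output_base, lang="eng"):
    """Build the argument list: tesseract <image> <output base> -l <lang>."""
    return ["tesseract", str(tiff_path), str(output_base), "-l", lang]


def recognize(image_path, lang="eng"):
    """Run Tesseract on a single-column image and return the recognized text."""
    tiff_path = to_tiff(image_path)
    output_base = tiff_path.with_suffix("")  # tesseract appends ".txt" itself
    subprocess.run(build_command(tiff_path, output_base, lang), check=True)
    return Path(str(output_base) + ".txt").read_text(encoding="utf-8")
```

          Calling recognize("page.png") would then write page.tif and page.txt next to the original image and return the text. Since there is no layout analysis, a multi-column page would have to be cut into single-column images before this step.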

          Tesseract can be downloaded from http://code.google.com/p/tesseract-ocr

