This article explores OCR (Optical Character Recognition), the technology of extracting data from images. Tools such as Google Cloud Vision, Amazon Textract, and Tesseract are compared. Cloud solutions dominate; Tesseract, although free, may be less efficient for handwritten text or low-quality documents.
Optical Character Recognition (OCR) technologies are methods for extracting data from unstructured image-type documents. They are part of the large family of Computer Vision algorithms and are present all around us on a daily basis. For example, OCR makes it possible to read credit cards, bank checks, invoices, or expense reports automatically.
OCR tools have become very efficient and now save time while improving quality. As a result, their use is increasing in all economic sectors.
In this article, we compare three OCR tools that we were able to test during missions at Aqsone: Google Cloud Vision, Amazon Textract, and Tesseract.
Optical Character Recognition is not a recent subject: the first applications emerged at the beginning of the 20th century. For example, early OCR use cases were aimed at detecting Morse code, Braille, or typed characters.
Over the years, and with the appearance of the first CPUs, larger-scale projects emerged in areas such as postal services, customs services, and the military.
In 2005, Hewlett Packard and the University of Nevada released Tesseract as an open-source OCR engine, thus expanding the use of these technologies.
Another milestone was the MNIST database, released in the late 1990s, which contains 60,000 training images of handwritten digits in black and white. It remains a data set widely used in Machine Learning on Computer Vision topics.
Finally, the last few years have seen the emergence of Cloud OCR models such as Google Cloud Vision or Amazon Textract.
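The current state of the art is thus composed mainly of cloud OCR services, such as Google Cloud Vision and Amazon Textract, and open-source engines such as Tesseract.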
Unless you need a 100% free solution, cloud providers are probably the leaders in the current market.
The table below summarizes the comparison of the various OCR solutions. The criteria we put forward are based on the business use cases we have encountered, and have generally been decisive for the selection of one OCR solution over another.
Examples of use:
In conclusion, Amazon Textract and Google Cloud Vision are two solutions that offer very similar possibilities and performance. Their prices are also very similar: the basic OCR feature costs about $1.50 per 1,000 units with either provider, apart from the discounts that apply to large volumes.
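To make the pricing concrete, here is a minimal sketch of the arithmetic, assuming the flat rate of $1.50 per 1,000 units quoted above. The function name `estimate_ocr_cost` is ours, and the sketch deliberately ignores volume discounts and any free tier, so real invoices may be lower.

```python
def estimate_ocr_cost(units: int, price_per_1000: float = 1.5) -> float:
    """Estimate the cost in USD of running basic OCR on `units` pages or images.

    Assumption: a flat price per 1,000 units, with no volume discount
    and no free tier applied.
    """
    if units < 0:
        raise ValueError("units must be non-negative")
    return units * price_per_1000 / 1000


if __name__ == "__main__":
    # Print a rough monthly budget for a few illustrative volumes.
    for monthly_units in (1_000, 50_000, 1_000_000):
        cost = estimate_ocr_cost(monthly_units)
        print(f"{monthly_units:>9,} units -> ${cost:,.2f}")
```

At this rate, a workload of 50,000 documents per month comes to roughly $75 before discounts, which is worth weighing against the engineering time a free tool may require.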
Google Cloud Vision will be particularly easy to use for GCP users and also has the advantage of being able to integrate with other Google Cloud services. However, its configuration can be complex for those new to the platform. Amazon Textract provides a Drag & Drop interface that makes it even easier to use.
In terms of performance, GCV seems to be more efficient than Textract for the detection of handwritten text. On the other hand, GCV falls short on table extraction, since the text is detected as ordinary text and not in table form.
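To illustrate what output "in table form" looks like, the sketch below rebuilds rows and columns from a Textract response. It assumes the `Blocks` structure returned by `analyze_document` when called with `FeatureTypes=["TABLES"]` (blocks of type `TABLE`, `CELL`, and `WORD` linked by `CHILD` relationships). The helper names `tables_from_blocks` and `analyze_tables` are ours, merged cells are not handled, and the call itself requires AWS credentials.

```python
def tables_from_blocks(blocks):
    """Rebuild every TABLE block as a list of rows, each a list of cell strings."""
    by_id = {block["Id"]: block for block in blocks}

    def child_ids(block):
        # Collect the ids listed under CHILD relationships (may be absent).
        return [
            child_id
            for rel in block.get("Relationships", [])
            if rel["Type"] == "CHILD"
            for child_id in rel["Ids"]
        ]

    tables = []
    for table in (b for b in blocks if b["BlockType"] == "TABLE"):
        cells = {}
        for cell_id in child_ids(table):
            cell = by_id[cell_id]
            if cell["BlockType"] != "CELL":
                continue
            words = [
                by_id[word_id]["Text"]
                for word_id in child_ids(cell)
                if by_id[word_id]["BlockType"] == "WORD"
            ]
            # RowIndex and ColumnIndex are 1-based in the response.
            cells[(cell["RowIndex"], cell["ColumnIndex"])] = " ".join(words)
        n_rows = max((row for row, _ in cells), default=0)
        n_cols = max((col for _, col in cells), default=0)
        tables.append(
            [[cells.get((r, c), "") for c in range(1, n_cols + 1)]
             for r in range(1, n_rows + 1)]
        )
    return tables


def analyze_tables(image_path):
    """Send one image to Textract and return its tables (needs AWS credentials)."""
    import boto3  # imported lazily so the parser above works without boto3

    with open(image_path, "rb") as handle:
        response = boto3.client("textract").analyze_document(
            Document={"Bytes": handle.read()},
            FeatureTypes=["TABLES"],
        )
    return tables_from_blocks(response["Blocks"])
```

With plain text detection, the same document would come back as a flat sequence of words, and the row and column structure would have to be inferred from bounding boxes.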
Tesseract has the advantage of being free and of not requiring any special configuration. It is easy to use, either from the command line or via the pytesseract Python library.
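As a rough illustration of both entry points, the sketch below builds a `tesseract` command line and wraps `pytesseract.image_to_string`. The helper names and the file name `invoice.png` are hypothetical. The sketch assumes the Tesseract binary, the English language data, `pytesseract`, and Pillow are installed, and that the Tesseract version is 4 or later (older versions spell the flag `-psm`).

```python
def tesseract_command(image_path, output_base="stdout", lang="eng", psm=3):
    """Build the argument list for the `tesseract` command-line tool.

    `stdout` as output base prints the text instead of writing a file;
    page segmentation mode 3 is the fully automatic default.
    """
    return ["tesseract", str(image_path), str(output_base),
            "-l", lang, "--psm", str(psm)]


def ocr_image(image_path, lang="eng"):
    """Return the text recognized in an image, using pytesseract."""
    # Imported lazily so the command builder works without these packages.
    from PIL import Image
    import pytesseract

    with Image.open(image_path) as image:
        return pytesseract.image_to_string(image, lang=lang)


if __name__ == "__main__":
    # Show the equivalent shell command for a hypothetical file.
    print(" ".join(tesseract_command("invoice.png")))
```

Calling `ocr_image("invoice.png")` returns the recognized text as a single string, while the command list can be passed to `subprocess.run` when a Python dependency is not wanted.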
In terms of performance, Tesseract is effective on high-resolution documents. Its performance drops on low-quality documents or handwritten text. Table extraction is possible, but can be complex to implement and will also be very sensitive to the quality of the documents.
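One way to observe this sensitivity to quality is to look at the per-word confidence that Tesseract reports. The sketch below is an illustration rather than a recommended pipeline: it assumes the dictionary returned by `pytesseract.image_to_data` with `output_type=Output.DICT` (parallel `text` and `conf` lists), and the threshold of 60 is an arbitrary choice of ours that should be tuned per document type.

```python
def confident_words(data, min_conf=60.0):
    """Keep (word, confidence) pairs whose confidence is at least `min_conf`.

    `data` is expected to hold parallel "text" and "conf" lists. Entries
    that are not words carry a confidence of -1 and empty text.
    """
    kept = []
    for text, conf in zip(data["text"], data["conf"]):
        conf = float(conf)  # depending on the version, values may be strings
        if text.strip() and conf >= min_conf:
            kept.append((text, conf))
    return kept


def ocr_with_confidence(image_path, min_conf=60.0):
    """Run Tesseract on an image and drop the low-confidence words."""
    # Imported lazily so the filter above can be used on saved results.
    from PIL import Image
    import pytesseract
    from pytesseract import Output

    with Image.open(image_path) as image:
        data = pytesseract.image_to_data(image, output_type=Output.DICT)
    return confident_words(data, min_conf)
```

On a blurry scan or handwritten note, many words typically fall below the threshold, which gives a quick signal that the document may need preprocessing or a different OCR tool.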
It is a solution that remains interesting because of its accessibility and its effectiveness on certain specific data formats.