We use OCR, or optical character recognition, to convert scanned image files into readable and editable text. It’s currently the best solution we have for turning the paper office into a digital office. While we do acknowledge this is a great tool to have, the technology still has some limitations.
I always say that we need to work together with the operators to get the most out of OCR technology. We see hands-on experience with optical character recognition software as the best way to achieve good results. With every new version of the software, we see improvements in the accuracy of the recognition results. OCR accuracy keeps improving, and OCR software for Windows in particular is developing at really high speed.
It also allows you to create Word, RTF and even the well-known searchable PDF files. These are all standard features in just about any OCR application you can currently get. Having the option to convert paper files into digital format with the added bonus of text recognition is great nonetheless.
|Software|File Formats Output|Gothic and Fraktur|
|---|---|---|
|ABBYY FineReader PDF|PDF, PDF/A, Word (DOC, DOCX), RTF, XLS, CSV etc.|Included|
|Adobe Acrobat Pro|PDF, DOC, DOCX| |
What does OCR stand for?
OCR stands for optical character recognition. We understand this type of solution as the electronic or mechanical conversion of images containing printed, typed or even handwritten text into machine-encoded text. The source for this process can be a scanned document, a photo of a book page or a document, or even a picture that includes text. A familiar example is the Google CAPTCHA that asks you to identify billboards with different writing on them.
As this technology evolves, we tend to implement it more often, in different business sectors and even in day-to-day life. We see it as the result of the evolution of pattern recognition, whether through computer vision or artificial intelligence. As time passes, we think OCR and pattern recognition will become even more widespread.
There are now different online OCR apps for you to choose from. Most of them extract text from PDF files and convert it to different text formats or PDF variants. Some include a text editor where you can modify the characters, the layout or even the formatting.
Who uses OCR?
The first and main implementation of OCR is in the data entry sector and in most document imaging applications. We use optical character recognition to automate a lot of data entry tasks, whether at small or at mass scale. The best examples are business sectors that require a lot of form processing. In the financial business, for example, paper statements are converted to digital data. Medical forms and records are no strangers to OCR either, as they contain a lot of typed data that needs to be stored in databases.
We also see a lot of users implementing advanced OCR technology for book publishing. Especially for out-of-print books, I think that OCR helps a lot with reprinting. It practically does all the work for us; all we have to do is a soft-proof correction of the text. Sometimes we may have to correct the layout a bit. But just imagine doing this by hand: we would need weeks to get a final result.
Last but not least, I’ve seen OCR in different data mining and text-to-speech applications. These are practically the latest solutions to integrate OCR. I really think we can do wonderful things with OCR in new media apps. As we see it, we could go as far as creating automated processes in a lot of fields, with OCR at the front of these new implementations.
What is OCR and how is it used in digitization?
So how exactly does optical character recognition work? In a few words, this technology looks at your scanned documents and searches for the patterns that best resemble a character, whether this is a letter, a number or a conventional sign. Most languages are included in the recognition libraries, including Chinese, Japanese and even Arabic.
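To make the idea of pattern matching more concrete, here is a minimal sketch in Python. It is a toy illustration only, not how any commercial OCR engine works: the 3x5 glyph templates, the `similarity` and `recognize` functions and the noisy sample are all assumptions made up for this example. It compares a scanned glyph pixel by pixel with each known template and picks the closest one.

```python
# Toy 3x5 glyph templates; '#' is ink, '.' is background (hypothetical data).
TEMPLATES = {
    "0": ("###", "#.#", "#.#", "#.#", "###"),
    "1": (".#.", "##.", ".#.", ".#.", "###"),
    "7": ("###", "..#", ".#.", ".#.", ".#."),
}


def similarity(a, b):
    """Fraction of pixels that agree between two equally sized glyphs."""
    total = sum(len(row) for row in a)
    same = sum(pa == pb for ra, rb in zip(a, b) for pa, pb in zip(ra, rb))
    return same / total


def recognize(glyph):
    """Return the best-matching template label together with its score."""
    best = max(TEMPLATES, key=lambda label: similarity(glyph, TEMPLATES[label]))
    return best, similarity(glyph, TEMPLATES[best])


# A "7" with one stray ink pixel, as a scanner might produce.
noisy_seven = ("###", "..#", ".#.", "##.", ".#.")
label, score = recognize(noisy_seven)
print(label, round(score, 2))
```

Real engines add layout analysis, dictionaries and trained models on top of this, but the core question stays the same: which known pattern does this shape resemble most?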
Most OCR software companies have taken this simple approach and expanded it. We now see support for different languages and alphabets, and even really impressive recognition of conventional signs. Automatically recognizing road signs, for example, is a big help, and at its basis it is still pattern recognition. Still, this technology is used mainly for PDF documents, in areas such as invoice processing, forms processing and other digitized office documents.
Let’s take books for example. We use OCR to determine the words and layout of a scanned book. After we run it through the automated recognition process, we move on to proofreading. The better the technology, the more accurate the results. Some developers focus on improving their technology by integrating different layout patterns. This in turn helps the software achieve better recognition accuracy.
When processing forms, OCR is great for creating a CSV or XML file automatically. We then take these CSV or XML files and import them into our database. I really like using such automation, as it saves a lot of time and resources. Not to mention that we can create huge databases really quickly from data written on paper.
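As a rough illustration of the forms workflow, the sketch below writes recognized form fields to CSV using Python’s standard `csv` module. The field names, the sample records and the `to_csv` helper are hypothetical; in a real project the records would come from your OCR software’s export or API rather than being typed in by hand.

```python
import csv
import io

# Hypothetical fields, as an OCR engine might return them for two scanned forms.
records = [
    {"patient_id": "1001", "name": "Jane Doe", "date": "2021-03-04"},
    {"patient_id": "1002", "name": "John Roe", "date": "2021-03-05"},
]


def to_csv(rows, fieldnames):
    """Serialize recognized form fields to CSV text, header row first."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


csv_text = to_csv(records, ["patient_id", "name", "date"])
print(csv_text)
```

The resulting text can be saved to a file and loaded with the import tool of most databases.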
What is an example of OCR?
The basic output file formats for OCR are PDF, Word or RTF files, and XML or CSV. The PDF delivered by OCR software can be read and modified in any PDF editor. You can change the words, characters and layout, or even create a new PDF based on the old file. There are various types of PDF that can be created with OCR software, depending on your exact needs.
Word or RTF files are great to have when you want to modify and edit text and layouts. In the book publishing industry, for example, OCR is a usual step of the publishing process. As we mentioned before, when we want to edit out-of-print books, optical character recognition will speed up the process significantly. When we want to create e-books, we also prefer using OCR if we don’t have the text in an editable file format. The process itself is more or less like the reprinting process, but the final product is digital.
Last but not least, we also create XML or CSV files from the data we OCR. The main application for these file formats is database creation. When we want to transfer paper archives to digital archives, optical recognition tools are a must. Just think of the applications in the medical industry: when a lot of paper data is created each day, in the end we prefer to have it in digital form. This is where XML and CSV come into play.
How do I OCR a PDF?
Most of the questions regarding OCR revolve around how to recognize the characters in a scanned PDF. Well, it’s pretty simple. We will take it step by step, as this will be much easier to understand.
- Take a piece of paper with printed characters and scan it. We recommend a resolution of at least 300 dpi for best results. Also, take a look at the image quality settings. The better the overall quality of the image, the more accurate the recognition process.
- Once you have the digital file, load it into the recognition software. You can use ABBYY FineReader, Adobe Acrobat Pro or Omniscan for that matter. There are also other solutions out there; it depends on your preference or on what you have available.
- Once the automated recognition has finished, you can analyze the result. The software will show you which parts of the page were detected as text, tables, images or other special formats. Make sure the detection is accurate. If not, you can correct it manually in the text editor and reprocess. For most documents, the accuracy should be over 95%.
- Last but not least, you have to choose the file format you want to save it in. If you want to go with PDF, you can choose between a standard PDF and a PDF/A, the ISO-standardized variant intended for long-term archiving. You might also want to save it as a Word file, a CSV or one of the other available output formats.
That’s about it. We have used OCR on our first file. Of course, in practice, things will not always go that smoothly. It depends on a lot of factors, such as the complexity of the layout, the language and even the fonts. Not to mention the scanning quality, which can sometimes be decisive. You can also use batch processing, which automates a lot of the tasks.
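If you want to check the 95% accuracy figure from the steps above on your own documents, one common approach is to compare the OCR output with a manually corrected reference text. The sketch below is a minimal example under that assumption: it computes character accuracy as one minus the edit distance divided by the length of the reference. The function names and the sample strings are made up for illustration, and OCR vendors may define accuracy differently.

```python
def edit_distance(a, b):
    """Levenshtein distance: minimum insertions, deletions and substitutions."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete a character from a
                current[j - 1] + 1,      # insert a character from b
                previous[j - 1] + cost,  # substitute (or keep when equal)
            ))
        previous = current
    return previous[-1]


def character_accuracy(reference, recognized):
    """Share of the reference text recovered correctly, between 0 and 1."""
    if not reference:
        return 1.0 if not recognized else 0.0
    errors = edit_distance(reference, recognized)
    return max(0.0, 1 - errors / len(reference))


reference = "Optical character recognition"
recognized = "Optica1 character recogniti0n"  # two typical OCR confusions
accuracy = character_accuracy(reference, recognized)
print(f"{accuracy:.1%}")
```

Two wrong characters in a 29-character line already push the result below 95%, which shows why proofreading still matters for short texts.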
Which is the best OCR software?
The best OCR software currently on the market is without doubt ABBYY FineReader. There are several variants of this software, depending on your application. There is, for example, FlexiCapture, which is probably the best form recognition solution currently available. It automates a lot of tasks, from capturing to sorting, and even the manual correction can be automated.
For more advanced users with larger volumes, there is ABBYY Recognition Server. It is focused mainly on productivity and is intended for mass digitization. It can be used on smaller projects, but its high price does not make sense unless you are processing huge amounts of data. Alternatively, you can use an OCR API, which lets you create text formats or extract data from within your own software.
Other software is also available, such as OmniPage Ultimate or Adobe Acrobat Pro. Both have good OCR accuracy and intelligent character recognition, and both can convert scanned PDF files to recognized text. Most software out there also includes image preprocessing, which is good to have.
Image preprocessing is a step that you either include in your own workflow for scanned images before the OCR, or leave to the OCR software to do its magic. Image files are not created equal, so you will have to clean them before OCR to improve the recognition accuracy. This is especially important for handwriting recognition.
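One common cleaning step is binarization, which turns a grayscale scan into pure black and white. Here is a minimal sketch, assuming the image is already available as rows of gray values from 0 to 255. The `binarize` function and the sample data are hypothetical, and using the mean gray value as the threshold is a simplification; real tools usually rely on more robust methods such as Otsu thresholding.

```python
def binarize(image, threshold=None):
    """Map a grayscale image (rows of 0-255 ints) to black (0) and white (255)."""
    pixels = [p for row in image for p in row]
    if threshold is None:
        # Assumption: the mean gray value separates ink from paper well enough.
        threshold = sum(pixels) / len(pixels)
    return [[255 if p > threshold else 0 for p in row] for row in image]


# Tiny made-up scan: light paper (around 200) with a few dark ink pixels.
scan = [
    [200, 210, 190, 205],
    [195,  60,  70, 200],
    [205,  55, 220, 198],
]
clean = binarize(scan)
for row in clean:
    print(row)
```

After this step the uneven paper tones are gone and only the ink pixels remain dark, which gives the recognition engine a cleaner pattern to work with.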
ABBYY FineReader PDF
This is the starting point for OCR software. It’s the cheapest solution from ABBYY, but it does a lot of interesting things. It lets you do basic OCR and create different file formats from your scanned data. Actually, ABBYY has integrated even more features, and the solution can now easily compete with Adobe Acrobat Pro. We like that it combines a very powerful OCR technology with the basic tools for PDF editing.
What we like
- It does not cost much. It’s in the same ballpark as Adobe Acrobat Pro.
- The OCR engine is by far the most advanced on the market. It really does not get better than this.
- You can save to various file formats, such as numerous types of PDF, Word, RTF, XML, CSV and also e-book file formats.
- A lot of routines can be automated with minimal training. You will find yourself using these automated routines very quickly.
- It now includes Gothic and Fraktur characters, which in the past you had to pay extra for.
- You can use the software on both Mac and Windows.
What we don’t like
- The biggest drawback is that the software has a restriction on the computing power it uses. This means the software is not running at full capacity, and it takes more time than necessary.
- To get higher processing speeds and use all the cores of your computer, you have to buy the subscription-based ABBYY Recognition Server or, in some markets, ABBYY FineReader Server.
How much does it cost?
To learn how much ABBYY FineReader costs and which edition you should get, check the current prices on the vendor’s website.
Adobe Acrobat Pro
Everyone has heard about Adobe Acrobat Reader DC. It’s a free PDF tool that offers minimal PDF manipulation but with enough versatility for most users out there. Adobe Acrobat Pro is the premium paid version of the Reader DC.
Compared to the free version, Pro allows you to convert PDF files to recognized text, or to text-based files such as Word, RTF and many other output formats. What I also like is the batch processing of PDF files when you want to recognize the text. Another aspect you might find interesting is the photo editing, but I would not really recommend this product for that.
Just like ABBYY, Adobe has also launched an SDK library and an OCR API. To be more precise, it’s called PDF Tools API and it converts scans to digital files that contain text, just like any other OCR software for Windows. The difference is that this powerful OCR tool can be integrated into your own application.
Anyone who scans text can turn it into digital data, whether the characters are printed or handwritten. Actually, handwriting is still a bit off limits nowadays, although it’s the same with OmniPage Ultimate or ABBYY FineReader Pro. Some online OCR tools may perform better on both printed and handwritten text, but Adobe Acrobat Pro will not work for handwriting. The online alternatives are ABBYY Cloud, Microsoft Azure OCR or even Google Cloud Vision.