Book scanning is the process of converting physical bound books to digital file formats. Once converted, they can be read on PCs, tablets and dedicated digital viewers. To digitize a bound book or document, you have to take certain aspects into account.
It involves using professional book scanners or document scanners to capture each page individually and convert it to an image file format or a text based format. For just about any book digitization project, this article will be of great help.
The book scanning steps are the following:
- Preparation of the bound books according to the chosen book scanning method.
- Scanning each page individually and converting it to an image file.
- Post processing of each image and delivery of accurate images.
- Optical Character Recognition (OCR) for converting image files to text files
- Quality Control of the digitized version of the book or document
Preparation of the book before scanning
Before you start the actual scanning process, your bound books must be prepared accordingly. The first thing to check is the condition of the book. Check its integrity: whether the pages are held firmly in the binding and whether any pages are missing. If everything is in order, you can move on to the specific preparations.
We generally separate the book scanning process into two categories: destructive book scanning and non destructive book scanning. The first is a good choice when the books you are scanning are not of very high value, or when you have more than one copy. Non destructive book scanning, by contrast, is ideal for heritage books or books of high value to their owner.
Preparing for destructive book scanning
Destructive book scanning means that the book is cut before scanning. Before you do that, inspect the book for dust, dirt or lint, and clean it before cutting the book’s spine.
Once the book is clean, you can go ahead and remove the binding. While there isn’t one specific method for this, we recommend using a high quality ream or stack cutter (guillotine). If the book has a lot of pages, separate it into smaller batches before cutting.
Before making the actual cut, check how deep the text goes into the gutter. Not all books are made equal: in some cases the content runs well into the gutter, so you have to be really careful where you cut. Once the binding is removed, make sure each individual page is loose. Otherwise you risk multifeeds during scanning, where the scanner pulls several sheets at once and misses pages.
Preparing books for non destructive scanning
It’s much easier to prepare a book for non destructive scanning. The advice about checking for dirt, lint and dust still applies, and it is at least as important here. Most professional book scanners use a flattening glass, and this glass has to be cleaned after repeated scanning.
The better the book is cleaned before scanning, the higher your productivity while scanning. You will have to stop fewer times to clean the glass or remove dust, and you will have fewer rescans to do. Another thing you must check is the integrity of the book and its pages. Some automatic book scanners use a vacuum to turn the pages, so the pages need to be held firmly in the binding.
If you are using a portable document camera, you might want to stretch the book open first to get an even scan. The same goes for DIY book scanners that can’t capture books that lie open unevenly. We recommend a touch screen monitor in these cases, as it is easier to hold down the book with one hand and just touch the screen to capture.
Scanning each page of the book
This is probably the best-known step of the entire book scanning process. In this phase, you use either a specialized professional book scanner, an ADF (automatic document feeder) scanner or your average multifunction device. Page by page, you create a digital copy of the book, outputting the images to PDF, TIFF or JPEG file formats.
Of course, there are different settings to choose from, such as resolution, color or black and white, book size and other various characteristics based on the original book or desired output form.
Non destructive book scanning
For non destructive book scanning, you should use a professional book scanner. This can be either a manual commercial book scanner or a robotic book scanner. While it is possible to use a multifunction device and go page by page, you risk overstretching the book when scanning it. That method is similar to using a flatbed scanner, and we don’t recommend it for most books or documents. We will continue with the professional book scanner, as this is the procedure we recommend.
First of all, inspect the book size. If the book fits the scanner you use, go ahead and lay it on the book cradles, face up. Then choose the appropriate resolution settings. Use 200 dpi when you need a small file size, for example so you can easily send the file to your colleagues. For the best results we recommend a resolution of 300 dpi or above, and 300 dpi is a very good compromise between quality and file size.
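To make the file size trade-off concrete, here is a rough sketch of how resolution affects the raw size of one page image. The 8 x 10 inch page, the 24-bit color depth and the function name are assumptions chosen for illustration. Real files are usually much smaller once they are compressed.

```python
def uncompressed_size_bytes(width_in, height_in, dpi, bits_per_pixel=24):
    """Estimate the raw (uncompressed) size of one scanned page in bytes."""
    width_px = round(width_in * dpi)
    height_px = round(height_in * dpi)
    return width_px * height_px * bits_per_pixel // 8


# Hypothetical 8 x 10 inch page scanned in 24-bit color
for dpi in (200, 300, 600):
    size_mb = uncompressed_size_bytes(8, 10, dpi) / 1_000_000
    print(f"{dpi} dpi: about {size_mb:.1f} MB per page before compression")
```

Because the pixel count grows with the square of the resolution, going from 200 dpi to 300 dpi multiplies the raw data by 2.25.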
Then look at the layout of the book. Does it have colored images and graphics? If so, scan it in color to preserve the original appearance. When all the settings are done and dusted, go ahead and carefully scan each individual page.
If you use an automatic book scanner, keep a close eye on the process as it goes along. If you need to turn the page manually, do it carefully, applying as little pressure on the book as possible. Depending on the speed of the scanner and the size of the book, it might take anywhere from 10 minutes for short books to hours for really large books.
Some aspects you should consider for non destructive book scanning are the type of scanner and the general scanning size. To do large format scanning of books, you will need quite a large book scanner. Just imagine the size of the glass plate that has to flatten an A1 book. You will surely not fit that kind of device in an ordinary classroom, office, library or bank. For these kinds of projects, special premises have to be created.
The good part: Books remain intact and you preserve their original condition.
The bad part: It’s quite labor intensive and it takes a while to scan an entire book. Sometimes the scan quality might be lower.
Destructive book scanning
Destructive book scanning is a faster way of scanning a book, but it requires you to cut the gutter of the book before scanning. In practice, all the pages have to be removed from the binding before they are scanned. There are also additional steps to take before scanning.
One of them is to make sure all pages are loose. If they are not, there is a high risk that two pages are fed through the scanner at once. Another risk is damaging stuck pages. This can happen when you are not careful and the scanner rollers separate the stuck pages, creating a paper jam. This is less and less of a problem with newer scanners, which have paper jam protection.
As for the working process, the stack of paper is first loaded in the paper chute. From there, the ADF scanner pulls the pages one by one until it finishes the batch. If the process goes well, you can visually inspect the images on the monitor while they are scanned. This allows for greater scanning accuracy, as you can easily notice any problems with the images.
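Visual inspection can be backed up by a simple count check for multifeeds. The sketch below assumes you know how many sheets were loaded in the batch and how many images the scanner produced. The function name and the messages are hypothetical and not part of any scanner software.

```python
def check_scan_count(sheets_loaded, images_scanned, duplex=True):
    """Compare the number of images produced with the number expected from the batch."""
    # A duplex scan produces two images per sheet (front and back)
    expected = sheets_loaded * (2 if duplex else 1)
    missing = expected - images_scanned
    if missing == 0:
        return "ok"
    if missing > 0:
        return f"{missing} image(s) missing, possible multifeed"
    return f"{-missing} extra image(s), check for duplicate scans"


# Hypothetical batch: 120 loose sheets scanned on both sides
print(check_scan_count(sheets_loaded=120, images_scanned=238))
```

A matching count does not prove the batch is perfect, but a mismatch is a cheap early signal that some sheets should be rescanned.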
The good part: The process is faster and less labor intensive, making it one of the fastest scanning methods for books. With a really fast document scanner, the scanning itself should take no more than 2 minutes.
The bad part: To scan the book you must first remove the binding, so the book is damaged before it is even scanned. You can’t use this method on valuable books.
Post processing of each image and delivery of accurate images
Once a book has been scanned, it must go through post processing and image enhancement. Let’s be clear: most book scanning is done for OCR purposes. OCR stands for optical character recognition, which means transforming images into editable text.
Some want searchable features, others want to edit the text, others just want to copy and quote parts of those books. The character recognition process is the way to do it. But before jumping into this, you must understand that such a technology has its limitations.
The first limitation is the accuracy of the recognized characters. Not all characters are converted to the correct text. To get close to the maximum accuracy possible, we need the best image quality possible. Usually, the higher the resolution, the better. But we have seen in practice that above 300 dpi the gains are very small, so we use other tricks as well.
The most important one is background removal. In practice, we want the background to be as clean as possible and the characters to be as close as we can get to pure black. While this is not always achievable, we consider it the gold standard. The closer you get to it, the better the OCR result will be.
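One common way to approach background removal is thresholding: every pixel darker than a cut-off becomes pure black and everything else becomes pure white. The sketch below picks the cut-off with Otsu's method. That choice is ours for illustration and is not necessarily what any particular scanning software does. It assumes the page is already an 8-bit grayscale image held as a list of rows, and the function names are hypothetical.

```python
def otsu_threshold(pixels):
    """Return the gray level (0-255) that best separates dark text from a light background."""
    hist = [0] * 256
    for row in pixels:
        for value in row:
            hist[value] += 1
    total = sum(hist)
    sum_all = sum(level * count for level, count in enumerate(hist))

    best_t, best_var = 0, -1.0
    w_dark = 0    # number of pixels at or below the candidate threshold
    sum_dark = 0  # sum of their gray levels
    for t in range(256):
        w_dark += hist[t]
        if w_dark == 0:
            continue
        w_light = total - w_dark
        if w_light == 0:
            break
        sum_dark += t * hist[t]
        mean_dark = sum_dark / w_dark
        mean_light = (sum_all - sum_dark) / w_light
        # Between-class variance (up to a constant factor)
        var_between = w_dark * w_light * (mean_dark - mean_light) ** 2
        if var_between > best_var:
            best_var, best_t = var_between, t
    return best_t


def binarize(pixels):
    """Map text pixels to pure black (0) and background pixels to pure white (255)."""
    t = otsu_threshold(pixels)
    return [[0 if value <= t else 255 for value in row] for row in pixels]


# Tiny made-up "page": light gray paper with three dark text pixels
page = [[200, 210, 40, 205],
        [205, 35, 215, 38]]
print(binarize(page))
```

A single global cut-off works best on evenly lit scans. Pages with shadows near the gutter may need a threshold that varies across the image.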
The second thing we look for is very good image sharpness. Crisp characters give you a much better OCR conversion, because it is easier to separate the tones of the characters from those of the background.
Other elements to take into account are rotation of images, deskew and correct cropping. These also play a part in the ultimate goal of OCR. While the rotation of images is clear to just about anyone, deskew is all about straightening images whose content sits at a slight angle, either because of skewed scanning or just bad print. Such skew is easy to notice on a screen, and correcting it helps the OCR software follow the lines of text. Last but not least, cropping also improves OCR, but it is not a major factor.
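As a rough illustration of the deskew idea, the sketch below estimates the skew angle of a single line of text. It assumes that some earlier step has already found a few (x, y) points along the baseline of that line, which is the hard part in practice and is not shown here. The function name and the sample points are hypothetical.

```python
import math


def skew_angle_degrees(baseline_points):
    """Fit a straight line to (x, y) baseline points and return its angle in degrees."""
    n = len(baseline_points)
    if n < 2:
        raise ValueError("need at least two points")
    mean_x = sum(x for x, _ in baseline_points) / n
    mean_y = sum(y for _, y in baseline_points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in baseline_points)
    if sxx == 0:
        raise ValueError("points must not all share the same x")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in baseline_points)
    slope = sxy / sxx  # least-squares slope of the baseline
    return math.degrees(math.atan(slope))


# Made-up baseline that drifts 12 pixels over a 600 pixel wide line
points = [(0, 100), (200, 104), (400, 108), (600, 112)]
print(f"Estimated skew: {skew_angle_degrees(points):.2f} degrees")
```

Rotating the image by the opposite of this angle straightens the line. Averaging the estimate over several lines makes it less sensitive to a single badly detected baseline.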
To get really crisp images, the type of scanner you use is critical. Whether it uses a digital camera or a line camera, a lot depends on even lighting. Add to that a really good line camera and the image quality should be great. A standard digital camera will not yield the same kind of results, but it does have some advantages, such as being compatible with both Mac and PC.
Optical Character Recognition (OCR) for converting image files to text files
As mentioned in the previous section, we prepare for this step during the post processing of the images. I won’t go over what OCR is again. To keep it brief, optical character recognition converts the text in a scanned image into machine-readable characters, from which editable text is generated.
Let’s look at what we can do once the actual recognition is performed. First of all, we have the editable file formats. You can create a Word file or other editable formats such as RTF or plain TXT. Word files are great for editors or authors who want to modify books and reprint them. We have also seen people use Word files as a step before launching ebook versions of their paper books.
The second major file format is PDF or, more precisely, a fully searchable PDF. There are numerous PDF variants, such as PDF/A, the ISO-standardized variant intended for long-term archiving. It is a good choice for the archival digital copy of any scan. You can select just about any type of PDF file, as long as the OCR software allows it.
There are also other file formats, depending on the OCR software we use. For example, some apps allow you to create ebooks directly from your scan. Mobi and EPUB (the format used by Nook and many other readers) are possible output formats.
For more advanced users, OCR software can be used as a solution to convert scanned documents to structured data. The OCR results are converted to XML or CSV, and then uploaded to different document management platforms.
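As a minimal sketch of this idea, the code below turns per-page OCR results into CSV and XML using only the Python standard library. The field names ("page", "text"), the XML element names and the sample data are assumptions for illustration. A real document management platform will define its own schema.

```python
import csv
import io
import xml.etree.ElementTree as ET


def pages_to_csv(pages):
    """Serialize a list of {"page": ..., "text": ...} records to CSV text."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=["page", "text"])
    writer.writeheader()
    writer.writerows(pages)
    return buffer.getvalue()


def pages_to_xml(pages, title):
    """Serialize the same records to a simple XML document."""
    root = ET.Element("book", attrib={"title": title})
    for page in pages:
        node = ET.SubElement(root, "page", attrib={"number": str(page["page"])})
        node.text = page["text"]
    return ET.tostring(root, encoding="unicode")


# Made-up OCR output for two pages
pages = [
    {"page": 1, "text": "Chapter One"},
    {"page": 2, "text": "It was a quiet morning."},
]
print(pages_to_csv(pages))
print(pages_to_xml(pages, title="Sample Book"))
```

Using the csv and xml modules rather than string concatenation means commas, quotes and angle brackets inside the recognized text are escaped correctly.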
Quality Control of the scans
The last step of this process is the quality control of the results. This includes checking all of the file formats generated and determining if they are fit for purpose.
Let’s start with Word files and other text documents or books. Check and analyze the overall accuracy of the character recognition process. At the same time, make sure text blocks have not moved from one page to another. Last but not least, make sure the layout of the Word file matches the layout of the physical book.
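One way to put a number on recognition accuracy is to retype a short sample by hand and compare it with the OCR output character by character. The sketch below does this with the Levenshtein edit distance. The metric, the function names and the sample strings are our choices for illustration, not a fixed industry procedure.

```python
def edit_distance(a, b):
    """Levenshtein distance: the fewest single-character edits that turn a into b."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def character_accuracy(reference, ocr_text):
    """Share of reference characters recognized correctly, between 0.0 and 1.0."""
    if not reference:
        raise ValueError("reference text must not be empty")
    errors = edit_distance(reference, ocr_text)
    return max(0.0, 1 - errors / len(reference))


# Made-up sample: manually typed reference versus OCR output
reference = "The quick brown fox"
ocr_text = "The qu1ck brown f0x"
print(f"Character accuracy: {character_accuracy(reference, ocr_text):.1%}")
```

Checking a few sample pages from the beginning, middle and end of the book is usually enough to decide whether the whole batch needs a rescan or better post processing.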
For PDF files, the first thing to check is whether the PDF contains errors. This can happen quite often, given that the OCR function is quite resource intensive. Sometimes errors occur in the conversion phase and are carried over into the final PDF.
For other file formats, you must check the specific things that matter to you. One example is making sure XML or CSV files meet the formatting specs required by the software they will be uploaded to. To digitize books or documents and upload them successfully to a digital platform, the XML and CSV files have to match all the requirements exactly. Otherwise, you risk search terms not pointing to the correct documents.
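A check like this is easy to automate before uploading. The sketch below validates a CSV export against a list of required columns. The column names are a hypothetical spec invented for this example, so replace them with whatever your target platform actually requires.

```python
import csv
import io

REQUIRED_COLUMNS = ["document_id", "title", "page", "text"]  # hypothetical spec


def validate_csv(csv_text, required=REQUIRED_COLUMNS):
    """Return a list of problems found; an empty list means the file passed."""
    reader = csv.DictReader(io.StringIO(csv_text))
    header = reader.fieldnames or []
    problems = [f"missing column: {name}" for name in required if name not in header]
    if problems:
        return problems
    # Header is complete, so check that every required field has a value
    for row_no, row in enumerate(reader, start=1):
        for name in required:
            if not (row[name] or "").strip():
                problems.append(f"row {row_no}: empty value for {name}")
    return problems


# Made-up export in which the title of the first record is empty
export = "document_id,title,page,text\nB1,,1,Hello\n"
print(validate_csv(export))
```

Running the same check on every export catches formatting problems before the platform rejects the upload or indexes the documents incorrectly.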