Book scanning is the process of converting physical bound books into digital file formats. Once converted, they can be read on PCs, tablets and dedicated digital viewers. To digitize a bound book or a set of documents, you have to take certain aspects into account.
It involves using either professional book scanners or document scanners to capture each page individually and convert it to a text-based format or an image file format. For just about any book digitization project, this article will be of great help.
We will focus on bound book scanning, both for historical documents and for other types of current bound documents, whether it’s an important archive book, bound legal documents, or another bound document digitizing project. We will try to give an overview of what book scanning equipment you should choose based on your task, how many pages per hour you should aim for, and what quality the end result should be.
The most important book scanning steps are the following:
- Preparation of the bound books according to the chosen book scanning method.
- Scanning each page individually and converting it to an image file.
- Post processing of each image and delivery of accurate images.
- Optical Character Recognition (OCR) for converting image files to text files
- Quality Control of the digitized version of the book and document
This video of the process will help you understand how book scanning is performed.
Preparation of the book before scanning
Before you start the actual scanning process, your bound books must be prepared accordingly. The first thing to check is the condition of the book. Check the integrity of the book: whether the pages are held firmly in the binding and whether any pages are missing. If everything seems to be in order, you can move on to the specific preparations.
Top tip: when digitizing valuable archives, please make sure you can rebind any documents you unbind. If you can’t remove the binding, or are not allowed to, make sure you use overhead scanners.
We generally separate the book scanning process into two specific categories: destructive book scanning and non destructive book scanning. The first one is a good choice when you are scanning books that don’t have a very high value or when you have more than one copy of the book. By contrast, the non destructive book scanning process is ideal for heritage books or books with high value to the owner. This includes historical documents, precious research notebooks or even large format document scanning materials.
Preparing for destructive book scanning
As you are probably aware, destructive book scanning means that the book is cut before scanning. Before you do that, please check whether the book contains any dust, dirt or lint. You should clean it before cutting the book’s spine. Find a reliable source from which you can get all the materials you need for removing the gutters. In the case of a lab notebook, please make sure the metal staples are removed, as they can damage the cutter and the scanner. The same goes for magazine scanning or even legal documents that are stapled.
After you make sure the book is clean, you can go ahead and remove the binding. While there isn’t a specific method to do this, we do recommend you use a high quality ream or stack cutter (guillotine). If the book has a lot of pages, it’s recommended to separate it into smaller batches before cutting. This way, you can scan your documents at sizes much closer to the original size of the paper document.
Before making the actual cut, please check how deep the writing goes into the gutter. Not all books are made equal: in some cases the content may go well into the gutter, so you have to be really careful before you cut. Once the binding is removed, make sure each individual page is loose. Otherwise you risk multifeeds during the scanning process or, to be more precise, the scanner might miss pages. Remember that each stop is another hurdle when you are scanning thousands of pages every day.
Preparing books for non destructive scanning
It’s much easier to prepare a book for non destructive scanning. The advice about checking for dirt, lint and dust still applies, and it’s probably just as important, if not more so. Because most professional book scanners use a flattening glass, this glass will have to be cleaned after repeated scanning. A professional book document scanner can do wonders, but it still can’t remove stains or other elements that affect books.
The better the book is cleaned before the scanning process, the higher the productivity while scanning. You will have to stop fewer times to clean the glass or remove dust, and you will have fewer rescans to do. Another thing you must check is the integrity of the book and its pages, as some automatic book scanners use a vacuum to turn the pages, so the pages need to be held firmly in the binding.
If you are using a portable document camera, you might want to try to stretch the book first to get an even scan. The same goes for DIY book scanners that don’t allow for capturing unevenly opened books. We can recommend a touch screen monitor in these cases, as it is easier to hold down the book with one hand and just touch the screen to capture.
Scanning each page of the book
This is probably the best-known step of the entire book scanning process. In this phase, you use either a specialized professional book scanner, an ADF scanner or your average multifunction device. Page by page, you create a digital copy of each book, outputting the images to PDF, TIFF or JPEG file formats.
Scanning bound material requires extra time and attention compared to cut sheet paper document scanning. You have to convert the book without damaging the gutter or the pages. You also need secure facilities for proper storage before the actual book conversion. A bound volume needs to be stored at a suitable relative humidity and temperature. Not to mention that nondestructive bound book scanning can’t be done with your average image scanners. It will require a proper imaging system, sometimes even high end art scanners. Last but not least, you will require special scanning software that is specifically designed for bound volumes.
Of course, there are different settings to choose from, such as resolution, color or black and white, book size and other various characteristics based on the original book or desired output form. There are numerous blog posts on the web looking in depth at each of these varieties of settings.
Non destructive book scanning
To do non destructive book scanning, you must use a professional book scanner. These can be either manual commercial book scanners or even a robotic book scanner. While it is possible to use a multifunction device and go page by page, you still risk overstretching the book when scanning it. That method is similar to using flatbed scanners and we don’t recommend it for most books or documents. We will continue on the assumption that you are using a professional book scanner, as this is probably the correct procedure.
First of all, check the book size. If the book fits the scanner you use, go ahead and lay the bound book on the book cradles, face up. Then choose the appropriate resolution settings. 200 dpi should be used when you need a small file size, so you can easily send it to your colleagues. We recommend a resolution of 300 dpi or above for the best results; 300 dpi is probably a very good compromise.
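To get a feel for what these resolutions mean in practice, here is a minimal Python sketch that converts a page size and a dpi setting into pixel dimensions and an uncompressed file size. The A4 page size (210 x 297 mm), the 24-bit color depth and the function names are our own assumptions for illustration. Real files will be smaller once they are compressed to JPEG, TIFF or PDF.

```python
MM_PER_INCH = 25.4


def pixel_dimensions(width_mm, height_mm, dpi):
    """Return (width_px, height_px) for a page scanned at the given dpi."""
    width_px = round(width_mm / MM_PER_INCH * dpi)
    height_px = round(height_mm / MM_PER_INCH * dpi)
    return width_px, height_px


def uncompressed_bytes(width_mm, height_mm, dpi, bits_per_pixel=24):
    """Return the raw image size in bytes, before any compression."""
    width_px, height_px = pixel_dimensions(width_mm, height_mm, dpi)
    return width_px * height_px * bits_per_pixel // 8


# Assumed example: one A4 page (210 x 297 mm) in 24-bit color.
for dpi in (200, 300):
    size_mb = uncompressed_bytes(210, 297, dpi) / 1_000_000
    print(dpi, "dpi:", pixel_dimensions(210, 297, dpi), f"{size_mb:.1f} MB raw")
```

Under these assumptions a page at 300 dpi holds a little more than twice the raw data of the same page at 200 dpi, which is why the lower setting is easier to send around.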
Then look at the layout of the book. Does it have colored images and graphics? In such cases, it is recommended to scan it in color, to preserve the original aspect. When all the settings are done and dusted, go ahead and carefully scan each individual page.
Scanning the Bound Document
If you use an automatic book scanner, keep a close eye on the process as it goes along. If you need to turn the page manually, do it carefully, applying as little pressure on the book as possible. Depending on the speed of the scanner and the size of the book, it might take anywhere from 10 minutes for short books to hours for really large books.
Some aspects you should consider for nondestructive book scanning are the type of scanner and the general scanning size. To do large format scanning of books, you will need quite a large book scanner. Just imagine the size of the glass plate that has to flatten an A1 book. You will surely not fit that kind of device in general classrooms, offices, a library, a bank, you name it. For these kinds of projects, special premises have to be created.
The good part: Books remain intact and you will preserve their original condition, making it great for vital records. The operator scans your book without affecting it in the process.
The bad part: It’s quite labor intensive and it takes a while to scan an entire book. Sometimes the scan quality might be lower. Commercial book scanners cost a lot of money, and you can’t really do a proper job using a ScanSnap SV600.
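A rough time estimate can help you plan a scanning session. The Python sketch below simply divides the page count by a throughput in pages per hour. The throughput figures are hypothetical, so measure your own scanner and workflow before relying on the result.

```python
import math


def estimated_scan_minutes(page_count, pages_per_hour):
    """Return the scanning time in whole minutes, rounded up."""
    if pages_per_hour <= 0:
        raise ValueError("pages_per_hour must be positive")
    return math.ceil(page_count * 60 / pages_per_hour)


# Hypothetical throughput values, for illustration only.
print(estimated_scan_minutes(100, 600))   # a short book on a fast scanner
print(estimated_scan_minutes(1200, 400))  # a large book with manual page turning
```

The estimate covers capture only. Preparation, glass cleaning and rescans come on top of it.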
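Checking whether a book fits a given scanner is simple arithmetic. The Python sketch below uses the standard ISO 216 A-series sheet sizes in millimeters; the function name and the example scanner limits are hypothetical, and an open book needs roughly twice the width of a single page.

```python
# ISO 216 A-series sheet sizes in millimeters (short side, long side).
ISO_A_SIZES_MM = {
    "A0": (841, 1189),
    "A1": (594, 841),
    "A2": (420, 594),
    "A3": (297, 420),
    "A4": (210, 297),
}


def fits_scanner(page_mm, scanner_max_mm):
    """Return True if the page fits the scan area in either orientation."""
    page_short, page_long = sorted(page_mm)
    max_short, max_long = sorted(scanner_max_mm)
    return page_short <= max_short and page_long <= max_long


# Hypothetical example: a scanner whose scan area is limited to A3.
scanner_limit = ISO_A_SIZES_MM["A3"]
for name, size in ISO_A_SIZES_MM.items():
    print(name, "fits" if fits_scanner(size, scanner_limit) else "too large")
```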
Destructive book scanning
Destructive book scanning is a faster way of scanning a book. One good thing is that it doesn’t require a professional book document scanner. You can use a high speed ADF scanner, which gives you the freedom to use it for both book and document scanning applications. In comparison to overhead scanners or a robotic book scanner, the initial investment is therefore lower. Another good thing is that book digitizing can be done more quickly than with commercial book scanners or flatbed scanners.
But it will require you to cut the gutter of the book before scanning. Practically, all the pages have to be removed from the binding before scanning them. Please make sure the scanning size is possible with your scanner. For nondestructive book scanning there are even A0 scanners. For ADF scanning, you can’t really go above A3. There will also be additional steps to be taken before scanning.
One of them is to make sure all pages are loose. If this does not happen, there is a high risk you might have 2 pages scanned at once. Another risk is to damage stuck pages. This can happen when you are not careful, and the scanner rollers separate the stuck pages creating a paper jam. This is less and less of a problem with newer scanners, which have paper jam protection.
Book or magazine scanning when they can be cut
In the working process, the stack of paper is first loaded into the paper chute. From there, the ADF scanner will pull pages one by one, until it finishes the paper batch. If the process goes well, you can visually inspect the images on the monitor while they are scanned.
This allows for greater scanning accuracy and you can easily notice if there are any problems with the images. When digitizing archives which are very large, you have to make sure you get it right from the start. You don’t want to go back and scan thousands of pages again.
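Alongside the visual inspection, a simple count can flag missed pages early. The Python sketch below compares the number of sheets loaded with the number of images produced. It assumes you counted the sheets in the batch beforehand and that the scanner keeps blank sides; the function name is hypothetical.

```python
def missing_images(sheet_count, image_count, duplex=True):
    """Return how many images are missing compared to the expected count.

    A positive result suggests a multifeed (two sheets pulled at once).
    A negative result suggests that something was scanned twice.
    """
    sides_per_sheet = 2 if duplex else 1
    return sheet_count * sides_per_sheet - image_count


# Hypothetical batch: 150 sheets scanned on both sides, 298 images produced.
print(missing_images(150, 298))
```

In this made-up batch two images are missing, which would correspond to one sheet that went through together with another.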
The good part: The process is faster and less labor intensive, being one of the fastest scanning methods for books. With a really fast document scanner, the scanning itself should last no more than 2 minutes.
The bad part: To scan the book you must first remove the binding. The book will be damaged prior to scanning. You can’t use this method on valuable books.
Post processing of each image and delivery of accurate images.
Once any book has been scanned, it must go through post processing and image enhancement. Let’s be clear, most of the book scanning is done for OCR purposes. That means optical character recognition or transforming images into editable text.
Some want searchable features, others want to edit the text, others just want to copy and quote parts of those books. The character recognition process is the way to do it. But before jumping into this, you must understand that such a technology has its limitations.
First of all, there is the accuracy of the recognized characters. Not all characters are converted to the correct text. To get close to the maximum accuracy possible, we need the best image quality possible. Usually the higher the resolution, the better it is. But we have seen in practice that above 300 dpi the gains are very small. So we have other tricks up our sleeves that we use.
The most important one is background removal. In practice, we want to get as close as possible to a perfectly clean background, and as close as we can to pure black for the characters. While this is not always possible, this is what we think is the gold standard. The closer you get to it, the better the result will be.
Other important aspects of post processing when scanning books or paper documents
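As an illustration of the idea, here is a minimal Python sketch that pushes every pixel of a grayscale image to either pure white or pure black using a fixed threshold. The threshold value and the list-of-rows image representation are assumptions made for this example. Imaging software typically uses more refined, adaptive methods, so treat this as a sketch of the principle rather than as the method any particular product uses.

```python
def binarize(gray_rows, threshold=128):
    """Turn a grayscale image into pure black (0) and pure white (255).

    gray_rows is a list of rows, each a list of values from 0 to 255.
    Pixels at or above the threshold are treated as background.
    """
    return [
        [255 if pixel >= threshold else 0 for pixel in row]
        for row in gray_rows
    ]


# Assumed example: a yellowed background (around 230 to 245) with dark ink.
page = [
    [240, 235, 40],
    [245, 60, 230],
]
print(binarize(page))
```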
Another thing we look for is to have very good sharpness of the image. If you have crisp characters this will always provide you with a very good OCR conversion. It practically makes for easier separation of the color tones between the characters and the background.
Other elements to take into account are rotation of images, deskew and correct cropping. These also play a factor for the ultimate goal of OCR. While the rotation of images is clear to just about anyone, the deskew process is all about straightening images. Either because of skewed scanning or just bad print, it’s much easier to notice skewed content on a screen. Therefore, the deskew process is all about correcting such issues. Last but not least, cropping also allows for improved OCR, but does not play a major factor.
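Of these steps, cropping is the easiest to sketch in a few lines. The Python example below finds the bounding box of everything darker than an assumed "white" level and cuts the image down to it. The white level, the list-of-rows image representation and the function names are assumptions for illustration; deskew needs an angle estimate and is not shown here.

```python
def content_bounds(gray_rows, white_level=200):
    """Return (top, left, bottom, right) of the non-white content, or None."""
    rows = [
        y for y, row in enumerate(gray_rows)
        if any(pixel < white_level for pixel in row)
    ]
    if not rows:
        return None  # blank page, nothing to crop to
    cols = [
        x for row in gray_rows
        for x, pixel in enumerate(row) if pixel < white_level
    ]
    return rows[0], min(cols), rows[-1], max(cols)


def crop(gray_rows, bounds):
    """Cut the image down to the given inclusive bounds."""
    top, left, bottom, right = bounds
    return [row[left:right + 1] for row in gray_rows[top:bottom + 1]]


page = [
    [255, 255, 255, 255],
    [255, 20, 255, 255],
    [255, 255, 30, 255],
    [255, 255, 255, 255],
]
bounds = content_bounds(page)
print(bounds, crop(page, bounds))
```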
To get really crisp images, the type of scanner you use is critical. Whether it uses a digital camera or a line camera, it all depends on even lighting. Add to that a really good line camera for capturing and the image quality should be great. A standard digital camera will not yield the same kind of results, but it does have some advantages, such as being compatible with both Mac and PC.
Remember that DIY book scanners use standard digital cameras. The end results might not be what you expect. They might be OK for accounts payable scanning, but not good enough for higher quality scanning, such as colored manuscripts or artwork.
Optical Character Recognition OCR for converting image files to text files
As mentioned in the previous section, we prepare for this process during the post processing of the images. We won’t go over what OCR is again. To keep it brief, optical character recognition analyzes the scanned images and generates editable text from them.
Let’s look at what we can do once the actual recognition is performed. First of all we have the digital file formats. You can create a Word file or other editable file formats such as RTF or plain TXT. Word file formats are great for editors or authors who want to modify books and reprint them. We have also seen people using Word files before launching ebook versions of their paper books.
The second major file format is PDF, more precisely a fully searchable PDF. Of course, there are numerous PDF variants, such as PDF/A for archiving. This is considered to be a high quality PDF file that should be kept as an archival digital copy of any scan. You can select just about any type of PDF file, as long as your OCR software supports it.
We also have other file formats, depending on the OCR software we use. For example, some apps allow you to create ebooks in different formats directly from your scan. Mobi, Nook or even EPUB are possible formats for the output file.
For more advanced users, OCR software can be used as a solution to convert scanned documents to structured data. The OCR results are converted to XML or CSV, and then uploaded to different document management platforms.
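As a minimal sketch of such an export, the Python example below writes OCR results to CSV using only the standard library. The column names (file, page, text) and the sample record are hypothetical. The real layout depends on what your document management platform expects.

```python
import csv
import io


def ocr_results_to_csv(records, fieldnames):
    """Serialize a list of OCR result dictionaries to CSV text."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()


# Hypothetical OCR output for one scanned page.
records = [
    {"file": "book_0001.tif", "page": 1, "text": "Chapter One"},
]
print(ocr_results_to_csv(records, ["file", "page", "text"]))
```

The csv module takes care of quoting, so recognized text that contains commas or line breaks still ends up in a single field.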
Quality Control of the scans
The last step of this process is the quality control of the results. We check all the file formats generated and then we determine if they’re fit for purpose.
First of all, let’s look at Word files and general text documents or books. Check and analyze the overall accuracy of the character recognition process. At the same time, make sure text blocks have not moved from one page to another. Last but not least, make sure the layout of the Word files matches the layout of the physical book.
Regarding the PDF files, the first thing to check is whether the PDF contains errors. These can occur quite often, given that the OCR function is quite resource intensive. Sometimes errors occur in the conversion phase and are carried over to the final PDF.
For other file formats, you must check the specific things that are of interest to you. One example is checking that XML or CSV files meet the formatting specs required by the software where they will be uploaded. To digitize books or documents and upload them successfully to a digital platform, the XML and CSV files have to match all the requirements perfectly. Otherwise, you risk search terms not pointing to the correct documents.
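One way to put a number on recognition accuracy is to retype a short sample by hand and compare it with the OCR output. The Python sketch below uses difflib from the standard library for the comparison. Its ratio is a similarity score between 0 and 1, not a formal character error rate, and the sample strings are made up.

```python
from difflib import SequenceMatcher


def character_similarity(reference_text, ocr_text):
    """Return a similarity score between 0.0 and 1.0 for two strings."""
    return SequenceMatcher(None, reference_text, ocr_text).ratio()


# Hypothetical sample: the OCR engine confused the letter "o" with a zero.
reference = "book scanning"
recognized = "b00k scanning"
print(f"{character_similarity(reference, recognized):.3f}")
```

A sample of a few pages per book is usually enough to spot a systematic problem, such as one character that is consistently misread.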
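A very cheap first check is whether the file has the markers every PDF should have: a "%PDF-" header at the start and an "%%EOF" marker near the end. The Python sketch below only catches truncated or obviously broken files. It does not validate the content, so a PDF viewer or a dedicated validation tool is still needed; the function name is hypothetical.

```python
def looks_like_complete_pdf(data):
    """Return True if the bytes start with a PDF header and end with %%EOF."""
    has_header = data.startswith(b"%PDF-")
    has_trailer = b"%%EOF" in data[-1024:]
    return has_header and has_trailer


# Made-up byte strings standing in for real files.
complete = b"%PDF-1.7\n(document content)\n%%EOF\n"
truncated = b"%PDF-1.7\n(document cont"
print(looks_like_complete_pdf(complete), looks_like_complete_pdf(truncated))
```

For a real file you would read the bytes first, for example with `open(path, "rb").read()`.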
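Part of this check can be automated before upload. The Python sketch below reads the header row of a CSV file and reports which required columns are missing. The required column names are hypothetical and should come from the specification of the platform you upload to.

```python
import csv
import io


def missing_columns(csv_text, required_columns):
    """Return the required column names that are absent from the CSV header."""
    reader = csv.DictReader(io.StringIO(csv_text))
    present = set(reader.fieldnames or [])
    return [name for name in required_columns if name not in present]


# Hypothetical export that lacks the "text" column the platform requires.
export = "file,page\nbook_0001.tif,1\n"
print(missing_columns(export, ["file", "page", "text"]))
```

An empty list means the header matches the spec. The values in each row still need their own checks.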
Frequently asked questions about book scanning
We want to include an FAQ in our guide, so you can move through this subject a bit quicker than usual. These frequently asked questions will answer most of your doubts or curiosities about this field of work. We are sure that it will be a useful tool for anybody interested in learning more about scanning books.
Should I scan the books at home?
In short, yes, if you can, do it. But the issue is a bit more complicated than you would think. Unlike a standard document scan, scanning a book properly requires you to understand the basics. First of all, we recommend you use a special book scanner that comes with an appropriate book cradle. When you scan loose documents, a standard multifunction device might do just fine. For a bound document, you must have at least a document camera or an entry-level scanner for books, such as the Fujitsu ScanSnap SV600.
Some people have managed using a simple portable scanner, but trust us, professional book scanning machines will yield superior results. There are various types of book scanners, ranging from different sizes, resolutions or even mechanics involved. While a mobile document scanner will be good for some bound documents, a special book with a larger thickness might prove a hassle to scan.
As long as you stick to the basic mechanics of book scanning machines, you can turn out really good results even at home. We encourage you to try, especially if you can find a good scanner app.
Manual book scanning or automatic book scanning?
Some have asked us why we don’t scan every book on automatic book scanning machines. The answer is that while this would be really good in terms of productivity, sometimes the end result would not be up to scratch. Manual book scanning is recommended when you have a special book or books that are too fragile for an automatic book scanner. Unlike medical records scanning, where the actual scanning process is more automated, digital scanning and conversion of books to PDF can be a bit trickier.
Automated book scanners are great when you have books that are pretty homogeneous in terms of physical characteristics. They will provide really good quality, more or less on par with a special photo scanner. With a manual book scanner, the operator has good control of the process and the least amount of pressure is put on the actual book. The end results will be superior to those of a portable scanner or a mobile document scanner, so for such books choosing a manual book scanner is pretty much a no-brainer. Scanning books to PDF is also much easier than with standard scanners or scanning devices.
Both of these devices should allow for professional nondestructive scanning of books. They allow you to scan images at high resolution with improvements in terms of productivity. What we have seen over the years is that a special book scanner is better for digital scanning of books than conventional devices.
What file formats should I use for nondestructive scanning?
First of all, you should know that a scanning supplier initially scans your book into a raster format, JPEG or TIFF in most cases. Whether it’s small or large format scanning, they will fine tune the process to output high quality image files. After scanning, they will convert the images into various file formats.
For example, the most common file format for digital scanning is PDF, whether it’s standard PDF, single or multi page, PDF/A for archiving or searchable PDF with OCR. The other file formats available are Word, RTF and TXT. These are great when you want to do further editing on the files.
Then there are the well-known ebook formats. These can be EPUB or formats for Kindle or Nook readers and, lately, audiobook formats such as MP3 or other sound-based file formats. To digitize your books into an ebook format successfully, you will have to take certain things into account. Editing is important, because the file will be viewed on various devices. So fine tune the process to get the most out of the scanning and digitization of your bound documents.
Which scanner for vital records or out-of-print books?
For vital records scanning or very rare (out-of-print) books in particular, we recommend being really careful during the scanning process. Using nondestructive scanning is mandatory. It’s not really that important whether it’s an automatic or manual book scanner, as long as you adapt the scanning process to the condition of the book.
For really fragile books, we mostly recommend a manual book scanner. Other books, even though out of print, are still in good condition, so an automated book scanner can be used without any issues.