We ran a test project for newspaper scanning so that you know what to expect on your next project. I chose two separate scenarios: one in which the newspapers are delivered bound, just like books, and one in which they arrive unbound, just as you would buy them off the shelf.
For this test we will look at three things: the mechanics of scanning, that is, how to scan bound versus unbound material; the image quality and how it affects OCR; and, last but not least, how to work with the data after scanning.
|Model||Scanning Size||Resolution||Speed||Special Features||Price|
|1||91 cm or 36″ width by unlimited length||up to 600 dpi||200 dpi: 18.9 m/min (12.4 inch/s); 300 dpi: 12.6 m/min (8.3 inch/s); 400 dpi: 4.7 m/min (3.1 inch/s)|| ||Check the price|
|2 Bound Newspaper Scanner||635 x 850 mm (25 x 33.5 inches), 8% > DIN A1||up to 600 dpi||DIN A1+ @ 200 dpi: 1.5 s; DIN A1+ @ 300 dpi: 2.1 s; DIN A1+ @ 400 dpi: 2.8 s|| ||Check the price|
Scanning bound newspapers
As I already mentioned, we prepared two scenarios. The first test involved a really large book in which around 500 newspaper pages were bound together. I must say, it is one of the biggest books I have seen in my life: when opened, it is almost 1 m wide and more than 50 cm high. So we ran around the office to find the right device for it. Since our Vshape A1 scanner was not in the office, we tried the A1 book scanner with a 180-degree flattening glass.
I had to use the self-leveling book cradle, given that the gutter was around 6 cm thick. To understand this better, think of a very thick book and how difficult it is to flatten. The compensation adjustment in the cradle improves the contact with the glass, so the document comes out relatively well pressed. I must admit, the Vshape is much better for such an application, as it would have gone much deeper into the gutter of the book. In my case I had to test on a really tightly bound volume, and could have done with a bit more space, to be honest. All in all, the results were OK, sometimes with a bit of a wave toward the gutter, but I guess this is to be expected.
Scanning unbound newspapers
This test was a bit more straightforward. I wish we still had the duplex large format scanner, a great device that captures both sides at once. Instead we had to use the simplex one, so we scanned the middle sheet first, on both sides, and then worked through the rest, all the way out to the first sheet. We have a small algorithm that calculates the total number of pages and, as long as the sheets are always scanned in the same order, crops each page, arranges the pages in their logical order and renumbers them.
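The page arithmetic behind such a reordering step can be sketched as follows. This is not our production algorithm, only a minimal illustration under stated assumptions: every sheet is a single folded sheet carrying four pages (no loose inserts), each scan shows one full side of a sheet with two pages side by side, and for every sheet the inner side is scanned before the outer side, starting from the innermost sheet. The function names are hypothetical.

```python
def page_numbers_for_scans(num_scans):
    """Return [(left_page, right_page), ...] for each scan, in scanning order.

    Assumes two scans per sheet (inner side first, then outer side),
    starting with the innermost sheet and ending with the outermost one.
    """
    if num_scans <= 0 or num_scans % 2:
        raise ValueError("expected two scans (inner and outer side) per sheet")
    sheets = num_scans // 2
    total_pages = 4 * sheets  # every folded sheet carries four pages
    layout = []
    for k in range(sheets, 0, -1):  # k = 1 is the outermost sheet
        inner = (2 * k, total_pages - 2 * k + 1)
        outer = (total_pages - 2 * k + 2, 2 * k - 1)
        layout.extend([inner, outer])
    return layout


def reading_order(num_scans):
    """List (scan_index, half) for page 1, page 2, ... in reading order."""
    placement = {}
    for scan_index, (left, right) in enumerate(page_numbers_for_scans(num_scans)):
        placement[left] = (scan_index, "left")
        placement[right] = (scan_index, "right")
    return [placement[page] for page in sorted(placement)]
```

For a two-sheet, eight-page newspaper, `page_numbers_for_scans(4)` gives `[(4, 5), (6, 3), (2, 7), (8, 1)]`: the centre spread comes first and the front page sits on the right half of the last scan. `reading_order(4)` then tells the cropping step which half of which scan to export as page 1, page 2 and so on.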
What I have noticed is that, even with the duplex scanner, this process seems to take a bit longer than scanning bound books. Because our Vshape book scanner uses mirrorless cameras, capture there is significantly quicker than on the large format scanner. Pages are also always scanned in their logical order, so there are fewer risks involved. Probably the only downside is that the large format scanner has a pressure roller, which flattens the document perfectly, so its image quality is better in the end.
Image quality and OCR
Now to image quality. After scanning, we processed the images to get the best sharpness we could. To make that possible, we scanned the test batches at 400 dpi. We have noticed that, especially for newspapers, there is a difference in sharpness, and of course in OCR results, between 400 dpi and 300 dpi scanning. It might not seem like a big difference, and for most books it is not. But I have seen a lot of people in this business not realizing that the fonts in a newspaper are smaller than those in regular books.
Another factor is the paper used in the printing process. In most cases newsprint turns a bit gray with time. Even the newspapers we scanned, which were only 2 years old, were already a bit gray. As a result, the gray of the paper is closer in tone to the dark characters, so contrast is lower, and the higher the resolution, the easier it is to tell the two apart.
On the same note, what we usually do, and what we did with the test batches, is run a background cleaning procedure after scanning. We did this with software we developed ourselves, and the results were quite good. Even on the color newspapers, we did not reduce the black levels as we cleaned the background. The background turned white, and the difference between the characters and the background was then clear enough to get good OCR results.
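A quick calculation shows why the resolution matters more for small type. A typographic point is 1/72 inch, so the nominal height of a font in pixels is its point size divided by 72 and multiplied by the scan resolution. The point sizes below (8 pt for newspaper body text, 11 pt for book body text) are illustrative assumptions, not measurements from our test batches.

```python
POINTS_PER_INCH = 72  # one typographic point is 1/72 inch


def em_height_px(font_size_pt, dpi):
    """Nominal font height in pixels at a given scan resolution."""
    return font_size_pt / POINTS_PER_INCH * dpi


# Assumed, illustrative font sizes.
samples = [("newspaper body, 8 pt", 8), ("book body, 11 pt", 11)]

for label, size_pt in samples:
    for dpi in (300, 400):
        print(f"{label} at {dpi} dpi: {em_height_px(size_pt, dpi):.1f} px")
```

Under these assumptions, 8 pt newspaper type needs roughly 400 dpi to get about as many pixels per character as 11 pt book type already has at 300 dpi.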
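Our cleaning software is not shown here, but the general idea of whitening the paper while leaving the dark ink untouched can be sketched with a simple level adjustment. This is a minimal illustration for a grayscale image stored as rows of 0 to 255 values. The function names, the use of the median as the paper level and the `black_point` threshold of 60 are all assumptions made for the example.

```python
def estimate_paper_level(pixels, percentile=0.5):
    """Estimate the paper gray level as a percentile of all pixel values.

    The median is used by default, assuming most of the page is paper.
    """
    values = sorted(v for row in pixels for v in row)
    index = min(len(values) - 1, int(percentile * len(values)))
    return values[index]


def clean_background(pixels, black_point=60, percentile=0.5):
    """Push the paper to white while keeping dark pixels unchanged."""
    paper = estimate_paper_level(pixels, percentile)
    if paper <= black_point:
        return [row[:] for row in pixels]  # page too dark to stretch safely
    scale = (255 - black_point) / (paper - black_point)
    cleaned = []
    for row in pixels:
        new_row = []
        for v in row:
            if v <= black_point:
                new_row.append(v)    # ink: black level is preserved
            elif v >= paper:
                new_row.append(255)  # paper: becomes pure white
            else:
                # mid-tones: stretched linearly between ink and paper
                new_row.append(round(black_point + (v - black_point) * scale))
        cleaned.append(new_row)
    return cleaned
```

For example, a tiny page such as `[[190, 190, 40, 190], [190, 120, 190, 195]]` has an estimated paper level of 190. After cleaning, the paper pixels become 255 and the ink pixel stays at 40, so the gap between characters and background widens.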
Speaking of which, the accuracy we got was close to 99.5%, or rather, the confidence level was close to that. We did not actually find mistakes, but I must be honest that we did not check every page and every character. All in all, what is clear is that enhancing the image after scanning improves the overall accuracy of the OCR process.
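Confidence and accuracy are not the same thing. Confidence is what the OCR engine reports about itself, while accuracy has to be measured against a manually checked transcription. One common way to measure it on a sample page, shown here as a minimal sketch and not as the method used in this test, is the character accuracy based on edit distance. The function names are hypothetical.

```python
def edit_distance(a, b):
    """Levenshtein distance: insertions, deletions and substitutions."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def character_accuracy(reference, ocr_output):
    """Share of reference characters the OCR got right, between 0 and 1."""
    if not reference:
        raise ValueError("reference text must not be empty")
    errors = edit_distance(reference, ocr_output)
    return max(0.0, 1 - errors / len(reference))
```

Checking a few sample pages this way gives a measured figure to set beside the confidence level the engine reports.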
Data extraction from newspapers
Some people mistake data extraction for the OCR process. The two are in fact intertwined: we use the OCR process to extract the data in a structured manner. So what can you expect here? The first step is to automate the recognition of the cover page and the last page. Nowadays we use algorithms to do this and to separate large scanned batches of newspapers into individual issues automatically.
The second step is to extract the main data of the newspaper, such as the title, the date, the issue number and other specific data that is usually found on the cover page. Next come the articles, which we again want to collect in a structured manner. Each article has a title, possibly a subtitle, and sometimes different paragraph layouts. We usually extract these into databases and create a separate field for each of them. Last but not least, from the test batches we extracted the content itself as one of the fields. This means that in the end we recreated the scanned newspaper in digital form, where the content can be searched or browsed, and also indexed according to every stakeholder's needs.
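Once cover pages can be recognized, splitting a batch is straightforward. The sketch below assumes a hypothetical `is_cover` function that decides whether a page is a cover page. Here it simply looks for a made-up masthead in the OCR text, whereas a real detector would be more involved.

```python
def split_into_issues(pages, is_cover):
    """Group a flat list of scanned pages into issues.

    A new issue starts at every page for which is_cover(page) is true.
    Pages before the first detected cover are kept as their own group.
    """
    issues = []
    for page in pages:
        if is_cover(page) or not issues:
            issues.append([])
        issues[-1].append(page)
    return issues


# Hypothetical example: pages are dicts holding the OCR text of each scan.
MASTHEAD = "THE DAILY EXAMPLE"  # made-up newspaper name


def has_masthead(page):
    return MASTHEAD in page["text"]


batch = [
    {"text": "THE DAILY EXAMPLE No. 1"},
    {"text": "local news"},
    {"text": "THE DAILY EXAMPLE No. 2"},
    {"text": "sports"},
    {"text": "classifieds"},
]
issues = split_into_issues(batch, has_masthead)
print([len(issue) for issue in issues])
```

With this batch, the first issue holds two pages and the second holds three.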
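The structure described above can be illustrated with a small database, here using SQLite from the Python standard library. The table and column names are assumptions chosen for the example and not the schema we use in production.

```python
import sqlite3


def create_archive():
    """Create an in-memory database with one table per level of structure."""
    db = sqlite3.connect(":memory:")
    db.executescript("""
        CREATE TABLE issue (
            id INTEGER PRIMARY KEY,
            newspaper_title TEXT, issue_date TEXT, issue_number TEXT
        );
        CREATE TABLE article (
            id INTEGER PRIMARY KEY,
            issue_id INTEGER REFERENCES issue(id),
            title TEXT, subtitle TEXT, content TEXT
        );
    """)
    return db


def add_issue(db, newspaper_title, issue_date, issue_number):
    cursor = db.execute(
        "INSERT INTO issue (newspaper_title, issue_date, issue_number) "
        "VALUES (?, ?, ?)",
        (newspaper_title, issue_date, issue_number))
    return cursor.lastrowid


def add_article(db, issue_id, title, subtitle, content):
    db.execute(
        "INSERT INTO article (issue_id, title, subtitle, content) "
        "VALUES (?, ?, ?, ?)",
        (issue_id, title, subtitle, content))


def search(db, term):
    """Return (issue_date, article_title) for articles mentioning term."""
    rows = db.execute(
        "SELECT issue.issue_date, article.title FROM article "
        "JOIN issue ON issue.id = article.issue_id "
        "WHERE article.content LIKE ? ORDER BY issue.issue_date",
        (f"%{term}%",))
    return rows.fetchall()
```

Because the content is stored as its own field next to the title and the date, a full-text query such as `search(db, "harbour")` returns the matching articles together with the issue they appeared in.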