Chronicles of The Chronicle, 5
When last we met, The Chronicle digitization project was wending its merry way along the path to success. With an end in sight, I solicited and drafted a merry band of volunteers to help with the work of compiling JPEG images into PDF files. Stepping up to the proverbial plate were Jim Esten, Peter Evans, Michael Miller, Bill Kasper, Kirk Eppler, Matthew Groves, Mike Darling, Michael Rogen, Paul Morin and Kari (aka The Village Carpenter).
In due course a whole mess of scanned JPEG images became a smaller mess of PDF files. According to their level of comfort and expertise with Acrobat, the volunteers returned the files to me with or without OCR. No big deal, as I had already set up a Batch Process to crop, reduce file size and OCR the lot of them. By 'the lot of them' I refer to the PDF files, not the volunteers. In a previous post I referred to the new Adobe Clearscan process of OCR. It's an interesting development for Acrobat. Developed by a high-end commercial imaging company in Boston, Clearscan produces itsy bitsy files, roughly a quarter of the size they would have been with standard OCR.
In 'old fashioned' OCR, a second layer is created that holds the recognized data. The data is linked back to the original, or visual scan layer. That's how you can search a document, jump to the highlighted text and find what you seek. Accuracy is dependent upon the quality of the text, the quality of the scan and the brains of the OCR engine. In other words, it's a bit of a crapshoot.
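For the tinkerers among you, here is a rough sketch in Python of that two-layer idea. It is only an illustration of the concept, not how Acrobat actually stores things: the `Word` record, the `find` function and the sample coordinates are all made up for the example. Each recognized word carries the spot on the scanned page it came from, and a search simply looks through the recognized words and hands back those spots for highlighting.

```python
from dataclasses import dataclass


@dataclass
class Word:
    """One recognized word in the hidden text layer (hypothetical structure)."""
    text: str    # what the OCR engine thinks the word says
    page: int    # page of the scan it came from
    box: tuple   # (left, top, right, bottom) position on the scanned image


def find(layer, query):
    """Return every recognized word that matches the query, ignoring case."""
    wanted = query.lower()
    return [word for word in layer if word.text.lower() == wanted]


# Invented sample data: a few words from two scanned pages.
layer = [
    Word("Early", 1, (72, 90, 118, 104)),
    Word("American", 1, (122, 90, 190, 104)),
    Word("Industries", 1, (194, 90, 268, 104)),
    # A misread: the engine saw a lowercase "l" where the page has a capital "I".
    Word("lndustries", 2, (72, 300, 146, 314)),
]

hits = find(layer, "industries")
for word in hits:
    print(f"page {word.page}: highlight region {word.box}")
```

The search finds the word on page 1 but sails right past the misread copy on page 2, even though the scan looks perfectly fine to the eye. That is the crapshoot in a nutshell: the search only ever sees the recognized layer, never the picture.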
Clearscan maps out the actual text of the scanned image, creates a new image of each character and, in a sense, replaces the character with a real bit of font. A scanned PDF is basically a document made up of images. No text, just images. You have to turn those images into something that resembles text so that we can read it and search for it. In Clearscan, the original text character is replaced by either an exact copy or, if the font is peculiar or degraded, as close a character as the software can approximate. Does that all make sense?
In addition, Clearscan doesn't mess with any graphics held within the PDF. So graphics are left untouched, which matters because they are a particular problem with regular OCR. Graphics are often degraded during the OCR process for various arcane reasons known only to people other than myself. The end result is a small file (no bloated extra layer of text), better graphics and happy people.
Back to The Chronicle. I ran the whole mess through a variety of Batches and now comes the fun part: vetting each and every page. I'll be checking for clarity, orientation, messed up scans, poor originals, bad OCR, missing pages or missing issues. Whatever needs to be redone or fixed, I'll pull the original page, scan it and replace the offending villain. I expect more problems with the early issues and fewer with the later ones.
Soon, very soon, The Chronicle of The Early American Industries Association will be ready for publication as a digital edition on DVD. Needless to say, I'll mention something about it here, on my website and wherever else I can think of.
Till next, Gary
