OCRopus Garden
Ars reviews Google’s OCRopus scanning software. We may play with this a bit internally; everybody seems to use Abbyy, but everyone also seems to think that OCR pretty universally sucks, based on the anecdotal evidence I have heard. What I found especially interesting in this review was the huge difference in results from sans-serif rather than serif text:
The following examples show the typical output quality of OCRopus:
Tpo’ much is takgn, much abjdegi qngi tlpugh we arg not pow Wat strength whipl} in old days Moved earth and heaven; that which we are, We are; QpeAequal_tgmper of hqoic hgarts, E/[ade Qeak by Eirpe ang fqte, lgut strong will To strive, to Seek, to hnd, and not to y{eld.
Tho’ much is taken, much abides; and though We are not now that strength which in old days Moved earth and heaven; that which we are, we are; One equal temper of heroic hearts, Made weak by time and fate, but strong in will To strive, to seek, to find, and not to yield
Night and day. Of course almost everything we would possibly be hoping to OCR would be serif text. Ain’t it allus the way.
Technorati Tags: digital libraries, digital_libraries, digitization, google, google books, libraries, linux, ocr, scanning, ubuntu
The Ars review mentions Aspell and language data files, which suggests that OCRopus should be able to at least recognize that its output is wrong: if OCR’d words don’t match a dictionary file, it should flag them as problems. I’ve no OCR experience; is that how it normally works?
Not sure; we don’t really have any yet either. My guess would be that it probably does in some way given the recent distributed Captcha stuff to help improve IA’s OCR. Anyone else want to chime in here?