blog.mignault.net

OCRopus Garden

Ars reviews the Google-sponsored OCRopus scanning software. We may play with this a bit internally; everybody seems to use ABBYY, but everyone also seems to think that OCR pretty universally sucks, at least going by the anecdotal evidence I have heard. What I found especially interesting in this review was the huge difference in results between serif and sans-serif text:

The following examples show the typical output quality of OCRopus:

Tpo’ much is takgn, much abjdegi qngi tlpugh we arg not pow Wat strength whipl} in old days Moved earth and heaven; that which we are, We are; QpeAequal_tgmper of hqoic hgarts, E/[ade Qeak by Eirpe ang fqte, lgut strong will To strive, to Seek, to hnd, and not to y{eld.

Tho’ much is taken, much abides; and though We are not now that strength which in old days Moved earth and heaven; that which we are, we are; One equal temper of heroic hearts, Made weak by time and fate, but strong in will To strive, to seek, to find, and not to yield

Night and day. Of course almost everything we would possibly be hoping to OCR would be serif text. Ain’t it allus the way.
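
To put a rough number on "night and day", one could measure a character error rate: the edit distance between the OCR output and a known-good transcription, divided by the length of that transcription. The Python below is only a sketch of that idea, not anything taken from the Ars review or from OCRopus itself. The function names `levenshtein` and `character_error_rate` are made up for illustration, and the final clause of the cleaner sample above stands in for ground truth, which is an assumption.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions,
    and substitutions needed to turn string a into string b."""
    # Keep the shorter string on the inner loop so each row stays small.
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # drop a character from a
                current[j - 1] + 1,      # add a character from b
                previous[j - 1] + cost,  # substitute, or keep a match
            ))
        previous = current
    return previous[-1]


def character_error_rate(ocr_output: str, reference: str) -> float:
    """Edit distance normalized by the length of the reference text."""
    if not reference:
        raise ValueError("reference text must not be empty")
    return levenshtein(ocr_output, reference) / len(reference)


# Last clause of each sample quoted in the post. Treating the cleaner
# output as the reference is an assumption made for this sketch.
GARBLED = "To strive, to Seek, to hnd, and not to y{eld."
CLEAN = "To strive, to seek, to find, and not to yield"

if __name__ == "__main__":
    rate = character_error_rate(GARBLED, CLEAN)
    print(f"character error rate: {rate:.1%}")
```

Run over a whole scanned page against a hand-checked transcription, the same ratio would give a quick way to compare OCRopus with whatever else we end up trying.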

