OCRopus Garden
October 25th, 2007
Ars reviews Google’s OCRopus scanning software. We may play with this a bit internally; everybody seems to use ABBYY, but everyone also seems to think that OCR pretty universally sucks, based on the anecdotal evidence I have heard. What I found especially interesting in this review was the huge difference in results between serif and sans-serif text:
The following examples show the typical output quality of OCRopus:
Tpo’ much is takgn, much abjdegi qngi tlpugh we arg not pow Wat strength whipl} in old days Moved earth and heaven; that which we are, We are; QpeAequal_tgmper of hqoic hgarts, E/[ade Qeak by Eirpe ang fqte, lgut strong will To strive, to Seek, to hnd, and not to y{eld.
Tho’ much is taken, much abides; and though We are not now that strength which in old days Moved earth and heaven; that which we are, we are; One equal temper of heroic hearts, Made weak by time and fate, but strong in will To strive, to seek, to find, and not to yield
Night and day. Of course almost everything we would possibly be hoping to OCR would be serif text. Ain’t it allus the way.
Technorati Tags: digital libraries, digital_libraries, digitization, google, google books, libraries, linux, ocr, scanning, ubuntu
Gmail IMAP watch
October 25th, 2007
Not what we had in mind
September 27th, 2007
Hmmm, might have to get rid of the Google ads, too:

Updated: Now this is more like it:

Technorati Tags: adsense, google, vegan, vegetarian
Slip of the hand
September 27th, 2007
Seen this morning whilst goofing round in Google Book Search:

Technorati Tags: digital libraries, digitization, google books, libraries

