Still no E-Z book ripper
February 11th, 2008
Levy: Rip This Book? Not Yet. | Newsweek Voices - Steven Levy | Newsweek.com:
“Then I tested a BookSnap for myself. Short verdict: not a revolution. More a thud than a snap, the device—an ominous three-foot high construction draped with a thick black darkroom-style shade—looks like a Goth puppet theater and weighs 44 pounds. Under the shade is an angled cradle for a book and a glass platen to hold the pages down during scanning. You turn the pages yourself. It costs $1,600, not including the two Canon digital cameras (about $500 each) necessary to capture the page images and send them to your computer, where software transforms the pictures into files that can be read on a screen or an e-book reader. It takes considerable fiddling to get images set up properly. Supposedly, once you get started you can digitize 500 pages per hour, much faster and at higher quality than with flatbed scanners (which are much cheaper but not optimized for book scanning). I never got that far, but I imagine such a feat would require considerable caffeination.”
It’s almost impossible to sell self-digitization to the iPod generation, because - as Levy points out here - it’s so much more labor-intensive than ripping a CD. Even ripping vinyl albums to MP3 is much easier and can also be started and then run mostly unattended. Scanning a book is a tedious process and you can’t really do anything else (well, maybe rip CDs) while you’re doing it. Atiz is commendably trying to get to an appliance model for book scanners, but the BookSnap isn’t it. You’d really need something along the lines of the Kirtas technology for that.
(Via Digitization 101.)
Technorati Tags: libraries, digitization, e-books
Monster truck info
December 20th, 2007
We have recently begun sending Biodiversity Heritage Library materials to the Internet Archive scanning pod at NYPL. We’re currently trying to get the workflow in place, and so we recently purchased one of these Samson Book Carts to send stuff down. It’s perfect in a lot of ways: rugged, collapsible, huge capacity. Unfortunately, it’s also too tall (by about 4″) to fit in the van we’re using to transport books. I’ve been researching big book carts to no avail - if anyone knows of one similar to the Samson, but a little shorter, I’d appreciate knowing about it. Thanks. Isn’t it interesting how 90% of digitization works out to be logistics?
Technorati Tags: digitization, libraries, mpow
Social metadata
November 26th, 2007
What I Learned Today… » Blog Archive » The Return of Everything is Miscellaneous:
…Weinberger touches on the future of the ebook. He talked about how we could collect data from how people read books, the passages they highlight, where people read books and so much more using wireless enabled ebook readers (p.222) - and while it sounds like science fiction - we’re almost there. Kindle has the power of wireless technology - meaning that in theory, Amazon could connect to our readers and collect data. While this sounds scary and like a huge invasion of privacy - imagine the power that this data could provide. Some examples Weinberger has is that you could create a list of books that people most often read at the beach or a list of books people stopped reading 1/2 way through - how cool would that be?
Well, because the only people I can think of who would find that data valuable would be marketers. So I don’t think it would be that cool. And it is scary and a huge invasion of privacy. When the government starts asking Amazon for tracking data on where you and your Kindle were last Tuesday, you probably won’t think it’s very cool either. Especially if you can’t turn it off.
Technorati Tags: amazon, digitization, kindle, ebooks, writing
OCRopus Garden
October 25th, 2007
Ars reviews Google’s OCRopus scanning software. We may play with this a bit internally; everybody seems to use Abbyy, but everyone also seems to think that OCR pretty universally sucks, based on the anecdotal evidence I have heard. What I found especially interesting in this review was the huge difference in results between serif and sans-serif text:
The following examples show the typical output quality of OCRopus:
Tpo’ much is takgn, much abjdegi qngi tlpugh we arg not pow Wat strength whipl} in old days Moved earth and heaven; that which we are, We are; QpeAequal_tgmper of hqoic hgarts, E/[ade Qeak by Eirpe ang fqte, lgut strong will To strive, to Seek, to hnd, and not to y{eld.
Tho’ much is taken, much abides; and though We are not now that strength which in old days Moved earth and heaven; that which we are, we are; One equal temper of heroic hearts, Made weak by time and fate, but strong in will To strive, to seek, to find, and not to yield
Night and day. Of course almost everything we would possibly be hoping to OCR would be serif text. Ain’t it allus the way.
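For anyone who wants to put a number on "night and day," here is a rough sketch of a character error rate calculation in Python. It uses plain Levenshtein edit distance, which is my choice of metric and not necessarily what the Ars review used. It also assumes the second sample is clean enough to stand in as the reference text, and it only compares the opening of each sample as quoted above. The function names are made up for illustration.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions,
    or substitutions needed to turn string a into string b."""
    # Keep the shorter string in the inner loop to save memory.
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution (or match)
            ))
        previous = current
    return previous[-1]


def character_error_rate(ocr_output: str, reference: str) -> float:
    """Edit distance divided by the length of the reference text."""
    if not reference:
        raise ValueError("reference text must not be empty")
    return levenshtein(ocr_output, reference) / len(reference)


# Openings of the two samples quoted above. Treating the second one as
# ground truth is an assumption: it is itself OCR output.
garbled = "Tpo’ much is takgn, much abjdegi qngi tlpugh we arg not pow"
clean = "Tho’ much is taken, much abides; and though We are not now"

print(f"Character error rate: {character_error_rate(garbled, clean):.1%}")
```

A real evaluation would compare against a hand-corrected transcription of the scanned page, not against another OCR run.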
Technorati Tags: digital libraries, digital_libraries, digitization, google, google books, libraries, linux, ocr, scanning, ubuntu
Wrighting the rong
October 25th, 2007
While reading a Kevin Kelly post about an HG Wells novel that actually was credited in real scientific work, I saw this graphic:

And thought “Cool! A link to the book in the Internet Archive!” Alas, I was wrong. Not only was the image not linked to the IA copy - the image wasn’t linked to anything - the link later in the post was your standard Amazon Associate link. Disappointing. So I’ll right that wrong here:
Go forth and read freely.
Technorati Tags: archive.org, ia, libraries, oca, openlibrary
OCR services?
October 9th, 2007
As part of an IMLS grant we’re working on, I need to find a company that will OCR and double-key about 165k entries from the Index to American Botanical Literature. The entries are spread over a number of volumes. I already know about Digital Divide Data - they were the company we had originally approached about this project, but that was a while ago, and if there are any other companies people know of, I’d appreciate hearing from you. Thanks!
Now software questions
June 20th, 2007
Last time it was scanners; this time it’s presentation software. Or is that digital library software? Collection management software? Our original pilot project went up on a very old version of Greenstone, and again I am having trouble turning up anything more than Greenstone and CONTENTdm. (Perhaps the google-fu is weak in this one.) Our Herbarium uses KE Software’s EMu for its collections, and while it seems strong in some areas, I have some reservations about its use for digital library collections, mainly that I can’t find a whole lot of libraries using it. (Also, it doesn’t appear to have any MARC support.) Again, is there something I am missing? Are people just using LAMP stacks for this? Are most installations just homegrown? Lots to learn…
After all that moaning
June 20th, 2007
Well, after complaining about ALA, it turns out that I am going to be giving a brief talk on Monday at the Smithsonian. NYBG is a member of the Biodiversity Heritage Library consortium, and Monday morning there will be a brief program about the consortium at the Smithsonian. I am going to be speaking about NYBG’s digitization planning and some of the issues and challenges we are facing. More info is available on the BHL blog.
Book scanners
June 5th, 2007
MPOW is struggling towards getting digitization off the ground, and one of the things I’ve been looking at is book scanners. We often scan rare or fragile (Italian) material, so smooshing down a book onto a flatbed isn’t acceptable. I was surprised at how few vendors there are to choose from. There’s Kirtas, which makes a high-end machine that can do up to 2400 pages an hour. I saw one demonstrated at the BookExpo at Javits last week, and it’s very cool. The book is held in a cradle, and the pages are turned by means of a puff of air. It works quite well, and it scores very high on the Neat-O Scale. It’s very expensive, though, and we don’t have the necessary volume of material to be scanned to justify buying one of these. We’ve done some outsourcing to Kirtas, and been pleased with the results, but it’s overkill for us.
Then there’s the Atiz BookDrive DIY. Most book scanners have the same basic setup: a scaffolding encloses a platen for the book along with mounts for 2 digital cameras pointed at either page of the book. Atiz sells you the scaffolding and lets you pick the cameras yourself, thus the DIY. Atiz also makes something called the BookDrive, which supposedly enables fully unattended scanning. It’s a fully enclosed unit (reminded me of a toaster oven) that turns the pages of the book via an arm with a mild adhesive on it. It gives me the willies to even consider that.
I love the Scribe scanners that the Internet Archive is using, at least in part because I agree so strongly with the ideology and goals of the project, but again, we don’t have the volume to qualify for an on-site Scribe, and we will probably be doing some outsourcing to NYPL’s Scribe station later this year.
We already use a Minolta book scanner, and the Indus (ours are branded BookEye), so I know about those already. But I haven’t really been able to find anything else, and you’d think there’d be more out there. Anyone know of any others?

