Archived entries for digitization

Free the teeming millions (of bits)

Free the Linked Data 4:

[I should have blogged about this general thought before I jumped ahead in my previous post with a URI pattern proposal. It is more important for people to embrace these principles than it is to mindlessly buy into various constraint models.]

In Linked Data, Tim Berners-Lee points out that “It is the unexpected re-use of information which is the value added by the web.” Four rules are given to facilitate unexpected re-use:

  1. Use URIs as names for things
  2. Use HTTP URIs so that people can look up those names
  3. When someone looks up a URI, provide useful information
  4. Include links to other URIs, so that they can discover more things

Despite the “Linked Data” analysis, the principle of unexpected re-use and these four rules can be applied to HTTP in general without an RDF basis.
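To make that concrete, here is a minimal sketch of the four rules with no RDF involved. Everything in it is made up for illustration: the example.org URIs, the `RESOURCES` table, and the `look_up` and `discover` helpers are assumptions, and a plain dictionary stands in for an HTTP server so the sketch runs offline.

```python
from collections import deque

# Hypothetical data: each HTTP URI names a thing (rules 1 and 2), maps to
# useful information about it (rule 3), and links to other URIs (rule 4).
RESOURCES = {
    "http://example.org/book/1": {
        "title": "An example book",
        "links": [
            "http://example.org/author/7",
            "http://example.org/subject/botany",
        ],
    },
    "http://example.org/author/7": {
        "title": "An example author",
        "links": ["http://example.org/book/1"],
    },
    "http://example.org/subject/botany": {
        "title": "Botany",
        "links": [],
    },
}


def look_up(uri):
    """Stand-in for an HTTP GET: return the description for a URI, or None."""
    return RESOURCES.get(uri)


def discover(start_uri):
    """Follow links breadth-first from one URI and return every URI reached."""
    seen = [start_uri]
    queue = deque([start_uri])
    while queue:
        description = look_up(queue.popleft())
        if description is None:
            continue  # nothing useful behind this URI, so nothing to follow
        for linked in description["links"]:
            if linked not in seen:
                seen.append(linked)
                queue.append(linked)
    return seen
```

Starting from the book URI, `discover` reaches the author and the subject without being told about either in advance, which is the whole point of rule 4: someone who only knows one name can find the rest.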

I try to keep the ‘unexpected re-use of information’ in mind. You can’t even begin to anticipate every possible use of your data people can come up with. So the best thing to do is get out of their way as much as possible and give them the access to create something new and undreamed-of. Be generous in what you provide and unfettered in what you expect.

(Via Q6.)

Still no E-Z book ripper

Levy: Rip This Book? Not Yet. | Newsweek Voices – Steven Levy | Newsweek.com:

“Then I tested a BookSnap for myself. Short verdict: not a revolution. More a thud than a snap, the device—an ominous three-foot high construction draped with a thick black darkroom-style shade—looks like a Goth puppet theater and weighs 44 pounds. Under the shade is an angled cradle for a book and a glass platen to hold the pages down during scanning. You turn the pages yourself. It costs $1,600, not including the two Canon digital cameras (about $500 each) necessary to capture the page images and send them to your computer, where software transforms the pictures into files that can be read on a screen or an e-book reader. It takes considerable fiddling to get images set up properly. Supposedly, once you get started you can digitize 500 pages per hour, much faster and at higher quality than with flatbed scanners (which are much cheaper but not optimized for book scanning). I never got that far, but I imagine such a feat would require considerable caffeination.”

It’s almost impossible to sell self-digitization to the iPod generation, because – as Levy points out here – it’s so much more labor-intensive than ripping a CD. Even ripping vinyl albums to MP3 is much easier and can also be started and then run mostly unattended. Scanning a book is a tedious process and you can’t really do anything else (well, maybe rip CDs) while you’re doing it. Atiz is commendably trying to get to an appliance model for book scanners, but the BookSnap isn’t it. You’d really need something along the lines of the Kirtas technology for that.

(Via Digitization 101.)


Monster truck info

We have recently begun sending Biodiversity Heritage Library materials to the Internet Archive scanning pod at NYPL. We’re still getting the workflow in place, so we purchased one of these Samson Book Carts to send stuff down. It’s perfect in a lot of ways: rugged, collapsible, huge capacity. Unfortunately, it’s also too tall (by about 4″) to fit in the van we’re using to transport books. I’ve been researching big book carts to no avail. If anyone knows of one that is similar to the Samson but a little shorter, I’d appreciate knowing about it. Thanks. Isn’t it interesting how 90% of digitization works out to be logistics?


Social metadata

What I Learned Today… » Blog Archive » The Return of Everything is Miscellaneous:

…Weinberger touches on the future of the ebook. He talked about how we could collect data from how people read books, the passages they highlight, where people read books and so much more using wireless enabled ebook readers (p.222) – and while it sounds like science fiction – we’re almost there. Kindle has the power of wireless technology – meaning that in theory, Amazon could connect to our readers and collect data. While this sounds scary and like a huge invasion of privacy – imagine the power that this data could provide. Some examples Weinberger has is that you could create a list of books that people most often read at the beach or a list of books people stopped reading 1/2 way through – how cool would that be?

Well, because the only people I can think of who would find that data valuable would be marketers. So I don’t think it would be that cool. And it is scary and a huge invasion of privacy. When the government starts asking Amazon for tracking data on where you and your Kindle were last Tuesday, you probably won’t think it’s very cool either. Especially if you can’t turn it off.


OCRopus Garden

Ars reviews Google’s OCRopus OCR software. We may play with this a bit internally; everybody seems to use Abbyy, but everyone also seems to think that OCR pretty universally sucks, based on the anecdotal evidence I have heard. What I found especially interesting in this review was the huge difference in results between sans-serif and serif text:

The following examples show the typical output quality of OCRopus:


Tpo’ much is takgn, much abjdegi qngi tlpugh we arg not pow Wat strength whipl} in old days Moved earth and heaven; that which we are, We are; QpeAequal_tgmper of hqoic hgarts, E/[ade Qeak by Eirpe ang fqte, lgut strong will To strive, to Seek, to hnd, and not to y{eld.


Tho’ much is taken, much abides; and though We are not now that strength which in old days Moved earth and heaven; that which we are, we are; One equal temper of heroic hearts, Made weak by time and fate, but strong in will To strive, to seek, to find, and not to yield

Night and day. Of course almost everything we would possibly be hoping to OCR would be serif text. Ain’t it allus the way.
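If we do play with it, one way to put a number on "night and day" is a character error rate: the edit distance between the OCR output and a trusted transcription, divided by the length of the transcription. The sketch below is my own illustration, not anything from the Ars review or from OCRopus; the function names are hypothetical, and the two strings are just the last phrase of each example above.

```python
def edit_distance(a, b):
    """Levenshtein distance: fewest single-character edits turning a into b."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute, or keep a match
            ))
        previous = current
    return previous[-1]


def char_error_rate(ocr_text, reference):
    """Edit distance divided by reference length (0.0 means a perfect match)."""
    if not reference:
        raise ValueError("reference text must not be empty")
    return edit_distance(ocr_text, reference) / len(reference)


# Last phrase of each example above: garbled OCR output vs. the clean text.
garbled = "To strive, to Seek, to hnd, and not to y{eld."
reference = "To strive, to seek, to find, and not to yield"

print(f"character error rate: {char_error_rate(garbled, reference):.1%}")
```

Running the same comparison over a few pages of serif and sans-serif samples would give something firmer than anecdote to weigh OCR packages against each other.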


Wrighting the rong

While reading a Kevin Kelly post about an HG Wells novel that actually was credited in real scientific work, I saw this graphic:

World Set Free

And thought “Cool! A link to the book in the Internet Archive!” Alas, I was wrong. Not only was the image not linked to the IA copy – the image wasn’t linked to anything – the link later in the post was your standard Amazon Associate link. Disappointing. So I’ll right that wrong here:

World Set Free

Go forth and read freely.


OCR services?

As part of an IMLS grant we’re working on, I need to find a company that will OCR and double-key about 165k entries from the Index to American Botanical Literature. The entries are spread over a number of volumes. I already know about Digital Divide Data, the company we originally approached about this project, but that was a while ago, and if there are any other companies people know of, I’d appreciate hearing from you. Thanks!

Now software questions

Scanners the last time, this time it’s presentation software. Or is that digital library software? Collection management software? Our original pilot project went up on a very old version of Greenstone, and again I am having trouble turning up anything more than Greenstone and CONTENTdm. (Perhaps the google-fu is weak in this one.) Our Herbarium uses KE Software’s kEMu for its collections, and while it seems strong in some areas, I have some reservations about its use for digital library collections, mainly that I can’t find a whole lot of libraries using it. (Also, it doesn’t appear to have any MARC support.) Again, is there something I am missing? Are people just using LAMP stacks for this? Are most installations just homegrown? Lots to learn…

After all that moaning

Well, after complaining about ALA, it turns out that I am going to be giving a brief talk on Monday at the Smithsonian. NYBG is a member of the Biodiversity Heritage Library consortium, and Monday morning there will be a brief program about the consortium at the Smithsonian. I am going to be speaking about NYBG’s digitization planning and some of the issues and challenges we are facing. More info is available on the BHL blog.

Book scanners

MPOW is struggling towards getting digitization off the ground, and one of the things I’ve been looking at is book scanners. We often scan rare or fragile (Italian) material, so smooshing down a book onto a flatbed isn’t acceptable. I was surprised at how few vendors there are to choose from. There’s Kirtas, which makes a high-end machine that can do up to 2400 pages an hour. I saw one demonstrated at the BookExpo at Javits last week, and it’s very cool. The book is held in a cradle, and the pages are turned by means of a puff of air. It works quite well, and it scores very high on the Neat-O Scale. It’s very expensive, though, and we don’t have the necessary volume of material to be scanned to justify buying one of these. We’ve done some outsourcing to Kirtas, and been pleased with the results, but it’s overkill for us.

Then there’s the Atiz BookDrive DIY. Most book scanners have the same basic setup: a scaffolding encloses a platen for the book along with mounts for 2 digital cameras pointed at either page of the book. Atiz sells you the scaffolding and lets you pick the cameras yourself, thus the DIY. Atiz also makes something called the BookDrive, which supposedly enables fully unattended scanning. It’s a fully enclosed unit (reminded me of a toaster oven) that turns the pages of the book via an arm with a mild adhesive on it. It gives me the willies to even consider that.

I love the Scribe scanners that the Internet Archive is using, at least in part because I agree so strongly with the ideology and goals of the project, but again, we don’t have the volume to qualify for an on-site Scribe, and we will probably be doing some outsourcing to NYPL’s Scribe station later this year.

We already use a Minolta book scanner and the Indus (ours are branded BookEye), so I know about those already. But I haven’t really been able to find anything else, and you’d think there’d be more out there. Anyone know of any others?



Copyright © 2004–2009. All rights reserved.
