Poster: molly Date: Mar 3, 2005 7:28am
Forum: toronto Subject: Re: 'processed images?'

What package are you using to do your OCR? Our unproofed OCR comes from making the DJVU derivatives, and admittedly isn't the best around. We do like that it outputs a piece of XML that records the location of the text on each page with bounding boxes.
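For anyone curious what working with that kind of output might look like, here is a minimal Python sketch. The `WORD` element name, the `coords` attribute holding four comma-separated integers, and the sample values are all assumptions for illustration, not necessarily the exact layout the DjVu tools emit.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample page; element names, attribute names, and values
# are assumptions made up for this sketch.
SAMPLE = """<PAGE>
  <LINE>
    <WORD coords="100,220,180,200">Hello</WORD>
    <WORD coords="190,220,260,200">world</WORD>
  </LINE>
</PAGE>"""


def word_boxes(xml_text):
    """Return a list of (text, (a, b, c, d)) tuples, one per WORD element."""
    root = ET.fromstring(xml_text)
    boxes = []
    for word in root.iter("WORD"):
        raw = word.get("coords", "")
        coords = tuple(int(v) for v in raw.split(",") if v.strip())
        if len(coords) != 4:
            # Skip words whose bounding box is missing or malformed.
            continue
        boxes.append((word.text or "", coords))
    return boxes


if __name__ == "__main__":
    for text, box in word_boxes(SAMPLE):
        print(text, box)
```

Keeping the text and its box together like this is what would make it possible to highlight search hits on the page image.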

I'm curious whether you are getting better results because you are using a better OCR package, or because you are using the higher-resolution image.

HP Labs is kindly working on getting us automatically indexed, searchable PDFs, and they are using ABBYY FineReader. Our OCR should improve when those start to stream into the collection.

At some point, we'd like to create a module to take volunteers' hand-corrected OCR and work it back into the XML that DJVU puts out. But all of these tools will come in good time!

Thanks for doing such great work!


Reply

Poster: Greg Lindahl Date: Apr 16, 2005 2:36pm
Forum: toronto Subject: Re: 'processed images?'

Hopefully you'll use the FineReader OCR in your DJVU files, too -- the PG Distributed Proofreaders community has experience with FineReader, and it really rocks. Having only PDFs with the good OCR would be a good start, though. Imagine clustering books using the raw OCR...