[texhax] search for text in a pdf file
Karl Berry
karl at freefriends.org
Sat Aug 7 03:12:05 CEST 2004
> so now i'm back where i started, only just a bit smarter. so what else do
> y'all use to pull text out of a pdf such as this one?
In general, pdftotext from xpdf can be better than pdf2ps | ps2ascii.
But if the text search in xpdf or acrobat doesn't find anything, it
won't help, and OCR is your only hope.
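A minimal sketch of that check, assuming pdftotext is on your PATH and accepts "-" as the output file to write to stdout. The function names and the 20-character threshold are made up for illustration, not part of xpdf:

```python
import shutil
import subprocess


def has_text_layer(extracted: str, min_chars: int = 20) -> bool:
    """True if the extracted text has at least min_chars non-whitespace characters.

    min_chars is an arbitrary illustrative threshold, not an xpdf setting.
    """
    return sum(1 for ch in extracted if not ch.isspace()) >= min_chars


def pdf_text_or_none(pdf_path: str):
    """Run `pdftotext pdf_path -`; return the text, or None if OCR looks necessary."""
    if shutil.which("pdftotext") is None:
        raise RuntimeError("pdftotext not found on PATH")
    result = subprocess.run(
        ["pdftotext", pdf_path, "-"],  # "-" sends the text to stdout
        capture_output=True,
        text=True,
        errors="replace",  # tolerate bytes that don't decode cleanly
        check=True,
    )
    return result.stdout if has_text_layer(result.stdout) else None
```

If pdf_text_or_none() returns None, the file is most likely scanned page images, and you'd have to render the pages to bitmaps and hand them to an OCR program.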
Fortunately there is at least one open source project:
http://jocr.sourceforge.net/
Yep, that's a big one. There are others.
Another one is OCRAD, which was offered to GNU, and eventually accepted:
http://www.gnu.org/software/ocrad/ocrad.html
I found these links while evaluating ocrad about a year ago. I don't know
if they're still valid, but FWIW:
http://www.claraocr.org
http://lem.eui.upm.es/ocre.html
http://www.pattern-lab.de/index_e.html
http://www.math.nwu.edu/~mlerma/locr/
http://http.cs.berkeley.edu/~fateman/kathey/ocrchie.html
I've never tried any of them personally.
Good luck,
k