[sane-devel] Creating searchable PDF with ExactImage 0.6

Rene Rebe rene at exactcode.de
Tue Sep 16 16:59:24 UTC 2008


Hi all,

ExactImage 0.6(.0) now comes with a revamped PDF writer and an hocr2pdf
front-end. Together with a patch to cuneiform that annotates each
recognized glyph with a hOCR-like bounding box, it allows the creation
of quite precisely positioned, searchable PDF files:

ExactImage:
    http://www.exactcode.de/site/open_source/exactimage/

Cuneiform for Linux:
    https://launchpad.net/cuneiform-linux

Cuneiform annotated HTML patch (includes the already committed <>& fix),
which is not yet conditional. For merging, it should probably only output
the additional formatting when requested by some additional command line
switch, e.g. --hocr instead of --html or so, but that probably requires
changing some 20+ files to pass the information down to the point where
the HTML is written:

   http://t2-project.org/packages/cuneiform.html
   http://svn.exactcode.de/t2/trunk/package/graphic/cuneiform/html-hocr.patch

ExactImage hocr2pdf page with some basic information:
   http://www.exactcode.de/site/open_source/exactimage/hocr2pdf/

Basically, hocr2pdf reads the annotated HTML from STDIN (we could also
add a -h/--html option to read it from a file) and the image from
the filename passed to -i/--input. The resulting PDF filename is
specified with -o/--output.

Additionally, -s/--sloppy-text groups the words on a line, which
sometimes improves search and cut'n'paste results with older PDF
viewers, and -n/--no-image skips the image shadowing the text, either
to save storage space or to take a look at how exactly the glyphs are
positioned.
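
For scripting, the options above could be wrapped roughly like this.
It is only a sketch: it assumes the executable is installed as
"hocr2pdf" on the PATH, and the helper names and file names are made
up for illustration. Only the switches described in this mail are used.

```python
import subprocess
from typing import List


def build_hocr2pdf_command(image_path: str, pdf_path: str,
                           sloppy_text: bool = False,
                           no_image: bool = False) -> List[str]:
    """Assemble an hocr2pdf argument list (hypothetical helper)."""
    # Assumption: the binary is called "hocr2pdf" and is on the PATH.
    cmd = ["hocr2pdf", "--input", image_path, "--output", pdf_path]
    if sloppy_text:
        # Group the words on a line for older PDF viewers.
        cmd.append("--sloppy-text")
    if no_image:
        # Leave out the image, so only the positioned text remains.
        cmd.append("--no-image")
    return cmd


def run_hocr2pdf(html_path: str, image_path: str, pdf_path: str,
                 **options: bool) -> None:
    """Feed the annotated HTML to hocr2pdf on STDIN (hypothetical helper)."""
    cmd = build_hocr2pdf_command(image_path, pdf_path, **options)
    with open(html_path, "rb") as html_file:
        # check=True raises CalledProcessError on a non-zero exit status.
        subprocess.run(cmd, stdin=html_file, check=True)
```

With placeholder file names, run_hocr2pdf("page.html", "page.tif",
"page.pdf", sloppy_text=True) would then write page.pdf from the
cuneiform output in page.html and the scan in page.tif.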

Have fun, patches and inspiration welcome,
   René

-- 
   René Rebe - ExactCODE GmbH - Europe, Germany, Berlin
   http://exactcode.de | http://t2-project.org | http://rene.rebe.name



