[odf-discuss] Mars: XMLisation of PDF - opportunity for ODF?

Jerry Askew jerry at askew.net
Thu Nov 9 13:52:22 EST 2006


Hi all,

I have been lurking and haven't contributed in a while, but I thought I might have something of value to add here...

Disclaimer:  This is to the best of my knowledge.  Please correct me if something is inaccurate.  I'll argue with you only if it's something I know is right :)

PDFs tend to come in two flavors:  Those generated by a rendering engine (like Adobe Distiller or Ghostscript), and those created by a scanner.

The former type is the "classic" PDF and is generally the result of a print operation or the conversion of PostScript content.  These appear to consist of text and images, but in reality the PDF consists of glyphs, images and positioning information.  The glyphs are individual characters in whatever language you are using, but there is not necessarily a connection between the glyph map and a character set.  AFAIK, there is no "glyph" for a space; spaces are achieved through the position of characters relative to each other.

These two factors make it quite challenging to extract meaningful text from a PDF:  word separations must be inferred, and glyph numbers may not map to a character set.  Fortunately, most PDFs do arrange the glyphs so that they correspond to an ISO character set.  There are many that do not, though.  If you try to copy/paste from these, you will just get garbage, and text search will generally not work in them either.  Using fonts other than the Adobe Type 1 (or is it Type A?) fonts will typically trigger this behavior.
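To make the two problems concrete, here is a rough Python sketch of what a text extractor has to do with positioned glyphs.  It is not how any particular PDF tool works; the Glyph record, the gap_ratio threshold and the glyph-to-character table are all assumptions made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class Glyph:
    code: int     # glyph number as stored in the page (hypothetical)
    x: float      # left edge of the glyph on the baseline
    width: float  # advance width of the glyph


def extract_text(glyphs, glyph_to_char, gap_ratio=0.3):
    """Rebuild one line of text from positioned glyphs.

    A space is guessed whenever the horizontal gap to the previous
    glyph exceeds gap_ratio times that glyph's width (an assumed
    heuristic).  Glyph codes missing from glyph_to_char come out as
    U+FFFD, which is the "garbage" case.
    """
    out = []
    prev = None
    for g in sorted(glyphs, key=lambda g: g.x):
        if prev is not None:
            gap = g.x - (prev.x + prev.width)
            if gap > gap_ratio * prev.width:
                out.append(" ")  # inferred word break, not a stored glyph
        out.append(glyph_to_char.get(g.code, "\ufffd"))
        prev = g
    return "".join(out)


# Made-up glyphs spelling "Hi all"; note there is no glyph for the space.
line = [
    Glyph(1, 0.0, 10.0),   # H
    Glyph(2, 10.0, 4.0),   # i
    Glyph(3, 20.0, 8.0),   # a (starts 6 units after "i" ends)
    Glyph(4, 28.0, 4.0),   # l
    Glyph(4, 32.0, 4.0),   # l
]
mapping = {1: "H", 2: "i", 3: "a", 4: "l"}

print(extract_text(line, mapping))  # glyph codes map to characters
print(extract_text(line, {}))       # no usable map: replacement characters
```

With the mapping table the line comes back as readable text; with an empty table the word break is still inferred, but every character is a placeholder, which is roughly what a failed copy/paste looks like.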

The latter type (those created by a scanner) is simply a big image encapsulated in a PDF wrapper.  There are a number of programs (Acrobat Full, OmniPage, etc.) that will OCR such a scan and then create an invisible overlay that contains the text.  This allows full text search and copy/paste from a scanned PDF.

I tend to think of PDF as a paper equivalent:  if you are lucky, you might be able to index and/or extract information from it (an OCR step may be required, though - even in some instances of the first type of PDF).

-Jerry

Jerry Askew
Askew Network Solutions
www.askew.net



> -----Original Message-----
> From: odf-discuss-bounces at opendocumentfellowship.org 
> [mailto:odf-discuss-bounces at opendocumentfellowship.org] On 
> Behalf Of J David Eisenberg
> Sent: Thursday, November 09, 2006 7:39 AM
> To: ODF Discussion List
> Subject: Re: [odf-discuss] Mars: XMLisation of PDF - 
> opportunity for ODF?
> 
> On Thu, 9 Nov 2006, Lars D. Noodén wrote:
> 
> > On Thu, 9 Nov 2006, Arend van Beelen wrote:
> > > There are certainly advantages thinkable.  For example, a big
> > > advantage PDF has over its paper counterpart is that it can be
> > > searched by a computer and PDF's can be indexed into search engines.
> > > Now extracting searchable text from an XML-based format is
> > > incredibly more easy than it is to write a PDF-parser and search
> > > the text you can get out of that.
> > 
> > Yes it is probably easier to parse text out of an XML document, but
> > PDF is just a wrapper (if I understand correctly) and often holds a
> > bitmapped image or other, from a searching perspective, useless
> > material.
> 
> Some PDFs are bitmapped images; others contain text, which is 
> in some compressed form.  The ps2ascii tool on Linux will 
> extract such text from a PDF quite nicely.
> 
> --
> J. David Eisenberg  http://catcode.com/
> 


