[odf-discuss] A story of TIFF

Peter Vandenabeele peter at vandenabeele.com
Sun Feb 4 23:54:01 EST 2007


On 2/5/07, marbux <marbux at gmail.com> wrote:
> On 2/3/07, Daniel Carrera <daniel.carrera at zmsl.com> wrote:
> > On Sat, 2007-02-03 at 04:06 -0800, marbux wrote:
...
> > > My
> > > understanding is that the current version using the proposed ODF 1.2
> > > changes already achieves near-perfect interop in both directions
> >
> > How is that possible considering that ODF 1.2 doesn't exist yet, much
> > less it's implemented in OOo. Are we using the same definition of
> > "interoperability"?
>
> I was clear in what I said. ...

So that I understand, could you please describe the specific experiments
behind the phrase "achieves near-perfect interop in both directions"? Which
data format(s) and application(s) were involved, how was the interop tested,
and where was the result imperfect enough to warrant the word
"near-perfect"?

> > Are you saying that the da Vinci plugin does *not* insert unknown binary
> > blogs ("dark objects") into ODF files which cannot be understood by
> > other applications?
>
> Darn few. Most have been cracked. With five proposed extensions to ODF
> in v.1.2, the remainder can be cracked over time, ...

What I would like to understand are the separate and the combined
advantages of:

* the use of the RTF API and/or the in-memory representation
* the use of "dark objects", accompanied by descriptions in RDF

Could you please correct the following scenario? It is quite a long
description, a thought experiment in which I try to understand what exactly
happens. Thanks for checking it.

<tought experiment, not exact, just speculation>

* RTF API:
   the plug-in uses an internal RTF API of MS Office to obtain more detailed
   information about the in-memory representation of the document. I would
   assume that the in-memory representation is complete, but it may not be
   possible to fully interpret or reverse engineer all the info, and it may
   (or may not) be possible to extract all information through the RTF API.
   Or is it fully documented or reverse engineered, and fully addressable
   through the RTF API?

   The assumed advantage is that by hooking "live" into the MS Office
   application (while it is running), you can obtain more information (e.g. by
   querying over the RTF API, inspecting the in-memory representation) than
   by only analyzing "post mortem" what the MS Office application eventually
   saved (in .doc, .docx or whatever).

   [but on the other hand, the file that is saved in the EOOXML format is also
    largely specified, except for some legacy tags on page 2199 etc.;
    and it certainly allows a full-fidelity representation (in the sense that if
    MS Office opens the same file later, it should see the same document).
    Why would this "live" approach over an older "RTF" interface be better?
    The Excel 2007 binary infoset story below might be an argument?]

* dark objects / RDF:
   the plug-in will:
   * store "dark objects" as a baseline for guaranteed fidelity
      [which could also be achieved with straight EOOXML, but anyway]
   * add additional information (the RDF stuff), so that other applications
      have a better chance (but not guaranteed 100% certainty) at
      "decoding" the information in the binary blob
   * as time passes and the reverse engineering of the "dark objects"
      improves, the amount of required "dark matter" will shrink and the
      quality of the RDF description will improve, increasing the chance of
      decoding the "dark matter".

Now the interesting part is the combination of the two (the multiplicative
effect). First, you have this "live" query of the MS Office application,
giving you more info than you could obtain with a "post mortem" on the
saved file. But what do you do with that additional information? One thing
you could do is simply save it out as-is (actually, is that the ECMA file
format??). Another thing you could do is use it to Save As in an ODF
format, great! The problem seems to be that not everything obtained in
step 1 is guaranteed to be interpreted exactly, so as a fall-back the
plug-in makes a "backup copy" of the info in a binary blob, so that if all
else fails, at least the document can be recovered from there. But there is
also a "primary copy", the RDF description, which should describe the
content of the blob increasingly well.

One particular problem that I see is that an external application could "read"
the RDF descriptions of the dark matter and try to render it, but could never be
allowed to change it, since then the match between the RDF and the
"backup copy" in the binary blob would be broken. Others have argued here that
that is exactly the way an XML extension must be handled: only the native
application is allowed to change it, since other applications might break it if
they edit it.
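
To make that consistency problem concrete, here is a minimal sketch in
Python. Everything in it is an assumption for illustration only: the
DarkObject class, its field names, and the idea of storing a hash of the
description next to the blob are hypothetical and are not taken from the
da Vinci plug-in or from any ODF proposal. A plain string stands in for the
RDF description.

```python
import hashlib
from dataclasses import dataclass


def digest(text: str) -> str:
    """Return a SHA-256 hex digest of the description text."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


@dataclass
class DarkObject:
    # Hypothetical container: an opaque blob plus a readable description.
    blob: bytes               # the "backup copy" only the native app understands
    description: str          # the "primary copy" (stand-in for the RDF)
    description_digest: str   # hash of the description when the blob was written

    @classmethod
    def create(cls, blob: bytes, description: str) -> "DarkObject":
        return cls(blob, description, digest(description))

    def blob_is_current(self) -> bool:
        """True if the description is unchanged since the blob was written."""
        return digest(self.description) == self.description_digest


obj = DarkObject.create(b"\x00\x01binary", "table with 3 columns")
in_sync_before = obj.blob_is_current()

# An external application edits the description but cannot update the blob.
obj.description = "table with 4 columns"
in_sync_after = obj.blob_is_current()

print(in_sync_before, in_sync_after)
```

With a check like this the native application could at least detect that
the blob no longer matches its description, but it still could not decide
which of the two copies is the right one.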

What is the advantage of the two combined? I don't see it clearly yet. Indeed,
what Daniel proposes as a "thought experiment" (taking EOOXML as the starting
point, translating everything that is possible to ODF and wrapping everything
that could not be translated into "foreign data") seems quite similar. Why is
this combination of a "live" connection to the RTF API and using ODF + dark
matter to store the result better than that?
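
As I read Daniel's alternative, it amounts to something like the sketch
below (Python). The mapping table, the source element names and the
convert() function are all made up for illustration; only the ODF element
names text:p, text:h and table:table are real. Nothing here reflects how an
actual converter is implemented.

```python
# Hypothetical mapping from source element names to ODF element names.
ODF_EQUIVALENT = {
    "paragraph": "text:p",
    "heading": "text:h",
    "table": "table:table",
}


def convert(elements):
    """Split (name, payload) pairs into translated and foreign data.

    Elements with a known ODF equivalent are translated; everything else
    is kept untouched, to be wrapped as "foreign data".
    """
    translated, foreign = [], []
    for name, payload in elements:
        target = ODF_EQUIVALENT.get(name)
        if target is None:
            foreign.append((name, payload))       # wrap, do not interpret
        else:
            translated.append((target, payload))  # plain ODF
    return translated, foreign


translated, foreign = convert([
    ("paragraph", "Hello"),
    ("legacy_layout_tag", b"\x01\x02"),  # made-up untranslatable element
    ("table", "3x2"),
])
print(translated)
print(foreign)
```

The difference from the dark-object approach would then lie only in what
ends up in the second list and in how well that remainder is specified.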

</tought experiment, not exact, just speculation>

...
> The new Excel 2007 binary infoset format is strong testimony
> that the in-memory binary representations of the Office file formats
> and their dumps to binary files are the most stable interop targets
> for development.

Could you please point me to more details on the Excel 2007 binary infoset
format? Are you saying that MS Excel 2007 is saving data in a
binary format that is not documented in detail in EOOXML?

> A blob is a blob is a blob, whether wrapped with MOOXML tags or ODF
> tags. But targeting the in-memory binary representations and their
> dumps to file is in my mind the only practical approach to the
> problem. I still haven't heard any reason to believe otherwise.

Daniel's reasoning was that EOOXML at least has by and large all features
specified (except for some legacy tags on page 2199 etc.), whereas the
RTF representation seems to need a larger number of binary blobs.

Now you seem to indicate that the Excel 2007 binary infoset format already
creates files in which parts of the information are not fully specified
in EOOXML?

> You do in your proposed solution, implicitly, as discussed above. The
> Foundation folk are the only people I know of who even have a goal of
> full interop that addresses the problem of the dark objects. You have
> overlooked that your own proposed solution involves dark objects too.

But Daniel's claim is that there would be very few "dark objects",
since EOOXML largely specifies the document. And even if not everything
can be translated to ODF, those elements _are_ specified (in EOOXML),
so an application that understands EOOXML could actually do something
with those blobs.

The similarity is clear with: "an application that understands the RDF
descriptions in ODFX could do something with them."

Do I understand correctly that you claim that the amount of data that needs
to be stored in "dark objects" in the Foundation approach will shrink over
time as the reverse engineering of the binary blobs improves, while in the
other approach the amount of data that cannot be translated faithfully between
EOOXML and ODF will remain at least constant (because the specs are frozen)
or even go up, as MS adds binary formats to it?

> They address the problem through the ODF foreign metadata
> tags, a proposed MS Office ODF interop subset, and five proposed ODF
> extensions that allow adequate description of the blobs' content as
> more is learned about them.

The MS Office ODF interop subset is the third technology here.

Do I understand correctly that you want to specify in ODF 1.2 a subset that
reduces ODF a little and avoids those features that are hard to map onto
the MS Office application and its native data formats, so that if all ODF
processing apps stuck to the subset, it would be a lot easier to
render these ODF documents in an MS Office application?

> Do believe Clever Age has solved that problem?

Their detailed spreadsheet of translated and non-translated features
clearly admits that not everything is translated. I think there is no dispute
that both formats (ODF and EOOXML) have features that are not easy
(or not possible) to represent in the other format. Or do you see an
asymmetry?

Thanks,

Peter


