[odf-discuss] A story of TIFF

marbux marbux at gmail.com
Sun Feb 11 18:23:16 EST 2007


Before I launch, everything I say below is either my own speculation
or is based on what I've been told by the Foundation developers.
Microsoft Office has never sullied my system's reputation and never
will. :-)

On 2/4/07, Peter Vandenabeele <peter at vandenabeele.com> wrote:
> For my understanding, could you please explain the specific experiments
> that you refer to in the sentence "achieves near-perfect interop in both
> directions". Which data format(s), which application(s), how was the interop
> tested, where was the result imperfect that it warrants the word
> "near-perfect" ?
>
You'd be far better off talking to the real experts on that. I've
given some additional information I'm reasonably sure of in response
to Daniel's post that should clear things up a bit.

> What I would like to understand is the separate and the combined
> advantages of :
>
> * the use of the RTF API and/or the in memory representation
> * the use of "dark objects", accompanied by descriptions in RDF
>
 I think I've covered that in the response to Daniel.

> Could you please correct the following scenario. It is quite a long description,
> a thought experiment in trying to understand what exactly happens. Thanks
> for checking it.
>
> <tought experiment, not exact, just speculation>
>
> * RTF API:
>    the plug-in uses an internal RTF API of MS Office to obtain more detailed
>    information about the in memory representation of the document. I would
>    assume that the in-memory representation is complete, but it may not be
>    possible to fully interpret or reverse engineer all the info and it may (or
>    not) be possible to extract all information through the RTF API ?
>    Or is it fully documented or reverse engineered and fully addressable
>    through the RTF API ?
>

The RTF API is involved only with MS Word. The other major Office apps
do something similar but not using RTF. Their documentation is far
less complete than the API for Word, which hasn't been updated since
before 1999. The RTF used is a modified form of RTF that completely
expresses the document, far richer than plain vanilla RTF.

I'm told that it is possible to extract and fully interpret or reverse
engineer all of the metadata and document data, but that time will be
required to complete the task for the metadata. That is the reason for
the five extensions to describe what is known about dark objects in a
way that ODF apps can take their best shot at correctly rendering the
data affected by the dark objects. So in summary, it's all fully
addressable through the RTF API, mostly documented or reverse
engineered, and ODF 1.2 will be expected to put mechanisms in place so
the dark objects can be better described in useful ways as more is
learned.

>    The assumed advantage is that by hooking up "live" with the MS Office
>    application (while it is running), you can obtain more information (e.g. by
>    querying over the RTF API, inspecting the in memory representation), than
>    by only "post mortem" analyzing what the MS Office application eventually
>    saved (in .doc , .docx or whatever).

Yes, but the reasons for that advantage deserve discussion. First,
there is no DOC format as such. Rather, it is a series of file formats
with no forward compatibility, only backward compatibility. Here is
how it worked before Office 2007 was released. The binary formats are
dumps to file of the in-memory binary representations (IMBR). Newer
versions of Word import binaries generated by older versions by using
the same RTF native file support
API and file conversion facility we have been discussing. All older
versions wind up being converted by Word to the version of the IMBR
supported by the particular version doing the importing. So if you
address IMBR directly in Word rather than using an external process
(e.g., OOo's DOC conversion filters), you have way fewer file formats
to address.

Microsoft changed file formats frequently, and because the file
formats were only backward compatible with the app, the format
changes were almost certainly attributable, at least in substantial
part, to Microsoft's Upgrade Treadmill strategy. (New Word features
were no doubt another part of the story.) Folks using Word could
only open all DOC files that might come their way if they were using
the latest and greatest version of Word. Older versions could not open
the binaries generated by the newer versions. So Microsoft managed to
put enormous pressure on users to upgrade their software by studiously
avoiding implementation of forward compatibility in the DOC file
formats.

The story became a bit more murky on January 31 of this year. But we
know from Microsoft blogs, etc. that when Microsoft added MOOXML
support to Office 2007, it componentized the relevant API(s), put a
wrapper around the new API that exposes the same native file support
API as used in the older versions (the one we have been discussing),
backported the module to all versions of Office back through Office
2000, and is now working on adapting Office 2004 for the Mac to work
with the module. That's a rather stunning change
in marketing strategy for a company that has historically forced
upgrades with every new version of Office, offering only backward
compatibility in its file formats and making untold billions of
dollars from its Upgrade Treadmill strategy. From Microsoft's
standpoint, the free MOOXML retrofits for the older versions are
tantamount to giving away Microsoft Office for free; i.e., every
retrofit installed subtracts from the number of MS Office 2007 license
sales. So Microsoft joins the free software movement, but why?

The answer, in my opinion, is that Microsoft is moving the lock-in
point up the stack, e.g., to its Sharepoint and Exchange Server hubs.
And if you study those documents from the Combs v. Microsoft
litigation that were published on Groklaw a few days ago, you'll see
that Office is now sharing undocumented APIs with Sharepoint, IE 7,
and with several other Microsoft platforms and apps. My working
assumption is that the undocumented APIs Office shares with other
Microsoft software likely have some relationship to the yet-to-be
cracked dark objects in the MOOXML formats.

Now comes the perplexing part. Rob Weir discovered that in Office
2007, if you save a Word document to DOC and then convert it to
MOOXML, you get different results than if you don't save to DOC. Data
gets dropped with some features in the DOC version and the rendering
is different. So apparently, the new APIs enable two different sets of
file filters, with one being selected automatically if a DOC file is
opened but the other being used if a new document is created. This has
led to some speculation that Microsoft may be putting a mechanism in
place to drop DOC support in a later version, or at least to focus its
new Word feature work on DOCx rather than attempting to maintain DOC
in parallel.

Deprecating DOC makes some sense for a company with a long
history of forcing upgrades on its customers. And to me it's the
best explanation so far for the "our XML formats have to be compatible
with the 'legacy' binary file formats" party line coming out of
Microsoft. I.e., backward compatibility only; no forward
compatibility. Microsoft's repeated references to the binary formats
as "legacy" formats had, until this information surfaced, been a
mystery to me.

But the fact that data gets lost in the Word conversion from DOC files
to MOOXML files also underscores for me that the IMBR are the only
trustworthy target in MS Office for migrating the binary files to ODF.
Microsoft may well have created new APIs for generating new files in
MOOXML formats, which argues for enabling MOOXML-ODF migrations until
the new APIs are reverse engineered or their UIs are documented by
Microsoft, but if the goal is migrating the binary formats to ODF, the
IMBR seem to me to be the only trustworthy interop target for such
migrations.

The greatest danger of targeting the IMBR, to me, is the possibility
that Microsoft will add roadblocks to the Foundation's plug-in in
Office. One example is the new feature, first seen in the release
version of Office 2007, that blocks opening of ODF files in Office by
treating all files in Zip format as EOOXML/MOOXML files. It just
might be intended to block 2-way interop via the Foundation plug-in
and force ODF import into the more lossy external processors like the
Sun and Clever Age tools.
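
For illustration only, here is a minimal Python sketch of how a reader
could tell the two Zip-based package types apart instead of assuming
every Zip is EOOXML/MOOXML. It relies on two packaging conventions: an
ODF package carries a "mimetype" entry whose content starts with
"application/vnd.oasis.opendocument.", and an EOOXML package carries a
"[Content_Types].xml" entry. The function names are hypothetical and
this says nothing about how Office or the Foundation plug-in actually
do it.

```python
import io
import zipfile

ODF_MIME_PREFIX = "application/vnd.oasis.opendocument."


def sniff_zip_package(data: bytes) -> str:
    """Classify file bytes as 'odf', 'ooxml', 'zip' (other) or 'not-zip'."""
    if not zipfile.is_zipfile(io.BytesIO(data)):
        return "not-zip"
    with zipfile.ZipFile(io.BytesIO(data)) as zf:
        names = zf.namelist()
        # ODF packages announce their media type in a 'mimetype' entry.
        if "mimetype" in names:
            mime = zf.read("mimetype").decode("ascii", "replace").strip()
            if mime.startswith(ODF_MIME_PREFIX):
                return "odf"
        # EOOXML packages list their part types in '[Content_Types].xml'.
        if "[Content_Types].xml" in names:
            return "ooxml"
    return "zip"


def make_zip(entries: dict) -> bytes:
    """Build a throwaway in-memory Zip (demo helper, not a real document)."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        for name, content in entries.items():
            zf.writestr(name, content)
    return buf.getvalue()
```

A caller would dispatch on the returned label, so an ODF package is
never handed to the EOOXML import path just because it is a Zip.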

>
>    [but on the other hand, the file that is saved in the EOOXML format is also
>     quite largely specified, except for some legacy tags on page 2199 etc. ;
>     and it certainly allows a full fidelity representation (in the sense that if
>     MS Office opens the same file back later, it should see the same document).
>     Why would this "live" approach over an older "RTF" interface be better ?
>     The Excel 2007 binary infopath story below might be an argument ?]

Mostly covered above, I think. But yes, the Excel 2007 binary Infoset
experience argues strongly for the IMBR approach.

>
> * dark objects / RDF:
>    the plug-in will:
>    * store "dark objects" as a baseline for guaranteed fidelity
>       [which could also be achieved with straight EOOXML, but anyway]

As discussed above, we really don't know how well EOOXML does in
capturing the dark objects, but there is evidence that it is lossy
when migrating from the binary formats to XML. And there are no
barriers to Microsoft adding additional dark objects in MOOXML that
are not preserved in EOOXML-conformant apps. In fact, Microsoft has
incentives to do so. It remains to be established whether EOOXML has a
single full-featured reference app. I.e., I haven't heard anything
about an EOOXML compatibility mode in Office to ensure two-way
fidelity and full interop with apps that implement EOOXML rather than
MOOXML.

>    * add additional information (the RDF stuff), so that other applications
>       have a better chance (but not guaranteed 100% certainty) at
>       "decoding" the information in the binary blob
>    * as time passes on and the reverse engineering of the "dark objects"
>       improves, the amount of required "dark matter" will reduce and the
>       quality of the RDF description will improve, improving the chance of
>       decoding the "dark matter".
>
> Now the interesting part is the combination of the two (the multiplicative
> effect). First, you do have this "live" query of the MS Office application,
> giving you more info than you could obtain with a "post mortem" on the
> saved file. But what do you do with that additional information ? One thing
> you could do is just straight save it out (actually, is that the ECMA file
> format ??).

Your last question is a good one deserving of study. My sniff is that
the answer will likely be no, that the EOOXML formats don't preserve
all the dark objects. There are signs aplenty that Microsoft intended
EOOXML to provide full fidelity and interop only on the MS Office
import leg of the round-trip. The very existence of the dark objects
is proof enough of that in itself.

It's all too easy to recognize that Microsoft has huge incentives to
generate dark objects that are not captured by EOOXML. You have to
keep in mind at all times that Microsoft has not implemented EOOXML as
the native file formats for its Office/business line software stack.
It has implemented MOOXML. EOOXML seems to be more a subset than a
complete expression of MOOXML. Microsoft wants to sell licenses for
its software, not to sell licenses for other vendors. It has every
incentive to ensure that MOOXML is more featureful than EOOXML. As I
discussed above, EOOXML's role in Microsoft's game plan looks to me a
lot more like RTF's old role than DOC's old role.


> But one other thing you could do, is use it to Save As in an
> ODF format, great! Problem seems to be that not everything that is obtained
> from step 1 is guaranteed to be really interpreted exactly, so as a
> fall-back, the plug-in makes a "backup copy" in a binary blob of the info, so if
> all else fails, at least the document can be recovered from there. But there is
> a  "primary copy" that is the RDF description and should increasingly better
> describe the content of the blob.
>
> One particular problem that I see is that an external application could "read"
> the RDF descriptions of the dark matter and try to render it, but could never be
> allowed to change it, since then the match between the RDF and the
> "backup copy" in the binary blob would be broken. Others have argued here that
> that is exactly the way an XML extension must be handled: only the native
> application is allowed to change it, since other applications might break it if
> they edit it.
>
> What is the advantage of the 2 combined ? I don't see it clearly yet. Indeed, as
> Daniel proposes as a "thought experiment": taking EOOXML as the starting
> point, translating everything that is possible to ODF and wrapping everything
> that could not be translated into "foreign data", seems quite similar. Why is
> this combination of "live" connection to the RTF API and using ODF + dark
> matter to store the result better than that ?
>
> </tought experiment, not exact, just speculation>
>

I think this is largely covered in the discussion above. I've just
joined OASIS so I am far from up to speed on where the RDF work is
actually at. But one thought that occurred to me is that it might be
feasible for all concerned to collaborate on development and
maintenance of a web site offering constantly updated and standardized
downloads of the relevant RDF information that has been accumulated to
date, freeing conformant application developers and users from needing
to await revisions of the standard and new application releases in
order to process the state-of-the-art RDF data store as best the
application can.

I.e., if the application already recognizes how to handle one RDF
triple, it might be designed to apply a data string from that triple
to another triple if the same string is discovered to be applicable to
the other triple previously not known to have a relationship to the
particular string.
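
A toy Python sketch of that idea, with every name made up for
illustration (the "blob:" identifiers, the "ex:describedBy" predicate,
and the description string are assumptions, not anything from the
proposed ODF extensions): the application ships a handler keyed by a
description string, and a downloaded update that links a previously
unknown dark object to the same string lets the existing handler be
reused without a new application release.

```python
from typing import Callable, Dict, List, NamedTuple, Optional


class Triple(NamedTuple):
    subject: str    # hypothetical dark-object identifier
    predicate: str  # hypothetical relationship name
    obj: str        # the "data string" describing the behaviour


# Handlers the application already ships, keyed by description string.
HANDLERS: Dict[str, Callable[[bytes], str]] = {
    "layout:keep-with-next": lambda blob: "rendered keep-with-next",
}


def find_handler(subject: str,
                 triples: List[Triple]) -> Optional[Callable[[bytes], str]]:
    """Return a shipped handler if any known triple ties the subject to one."""
    for t in triples:
        if t.subject == subject and t.obj in HANDLERS:
            return HANDLERS[t.obj]
    return None


# Triples known when the application was released.
shipped = [Triple("blob:A", "ex:describedBy", "layout:keep-with-next")]
# A later download discovers that the same string also applies to blob:B.
update = [Triple("blob:B", "ex:describedBy", "layout:keep-with-next")]
```

With only the shipped triples, "blob:B" has no handler; once the update
is merged in, the lookup finds the handler already written for "blob:A".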

> ...
> > The new Excel 2007 binary infoset format is strong testimony
> > that the in-memory binary representations of the Office file formats
> > and their dumps to binary files are the most stable interop targets
> > for development.
>
> Could you point me to more details on the Excel 2007 binary infoset
> format please. Are you saying that MS Excel 2007 is saving data in a
> binary format that is not documented in detail in  EOOXML ?
>
Yes, it's the new .xlsb BIFF12 Excel Workbook format. Rob Weir
addressed the subject here,
<http://www.robweir.com/blog/2007/01/formats-of-excel-2007.html>.
It's a pretty undeniable example of how EOOXML is not designed to
capture everything that flows through IMBR.
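
As a rough way to see this for yourself: an .xlsb file is a Zip package
whose workbook and sheet parts are binary (".bin") rather than XML. The
Python sketch below simply lists such parts. The helper names are
hypothetical, and the ".bin" suffix test is a simplifying assumption
(other binary parts, such as printer settings, can also use it).

```python
import io
import zipfile


def binary_parts(package: bytes) -> list:
    """List the parts of a Zip-based package whose names end in .bin."""
    with zipfile.ZipFile(io.BytesIO(package)) as zf:
        return sorted(n for n in zf.namelist() if n.lower().endswith(".bin"))


def make_package(names) -> bytes:
    """Build a throwaway in-memory Zip with empty parts (demo helper only)."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        for name in names:
            zf.writestr(name, b"")
    return buf.getvalue()
```

Run against a real .xlsb (read with open(path, "rb").read()), the list
shows which parts a consumer working only from the XML markup could not
interpret.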

> > A blob is a blob is a blob, whether wrapped with MOOXML tags or ODF
> > tags. But targeting the in-memory binary representations and their
> > dumps to file is in my mind the only practical approach to the
> > problem. I still haven't heard any reason to believe otherwise.
>
> The reasoning of Daniel was that at least EOOXML by large has all features
> specified (except for some legacy tags on page 2199 etc.), whereas the
> RTF representation seems to need a larger amount of binary blobs.
>

Covered by the discussion above, I think.

> But the claim of Daniel is that there would be very little "dark objects",
> since EOOXML largely specifies the document. And, even if not everything
> can be translated to ODF, those elements _are_ specified (in EOOXML),
> so an application that understands EOOXML could actually do something
> with those blobs.
>

Again, we know that not all the dark objects are captured by EOOXML.
The blobs that are identified by the EOOXML spec are not the barrier
for the most part; it's the ones that are not. The EOOXML spec and the
documentation of Microsoft's Office 2003 XML Reference Schemas have
been extremely helpful in cracking the blobs they identify.

> The similarity is clear with: "an application that understands the RDF
> descriptions in ODFX could do something with them."
>
> Do I understand correctly that you claim that the amount of data that needs
> to be stored in "dark objects" in the Foundation approach will reduce over
> time as the reverse engineering of the binary blobs improves, while the
> amount of data that cannot be translated faithfully between EOOXML and ODF
> will remain at least constant (because the specs are frozen) or even go up, as
> MS adds binary formats to it ?
>

Not because the specs are frozen, but because MS Office only reads
EOOXML; it writes to MOOXML only. It has no compatibility mode for
generating pure EOOXML. We already know that MOOXML uses dark objects
not present in EOOXML and there is every reason to suspect that
the gap between EOOXML and MOOXML + blobs will continue to widen.
EOOXML is an import format for Office, not an export format.

> > They address the problem through the ODF foreign metadata
> > tags, a proposed MS Office ODF interop subset, and five proposed ODF
> > extensions that allow adequate description of the blobs' content as
> > more is learned about them.
>
Yes, and through directly addressing IMBR, avoiding the EOOXML honey
trap. See <http://en.wikipedia.org/wiki/Honey_trap>.

> The MS Office ODF interop subset is the third technology here.
>
> Do I understand correctly, you want to specify in ODF 1.2 a subset that
> will reduce ODF a little and avoid those features that are hard to map to
> in the MS Office application and native data formats, so that if all ODF
> processing apps would stick to the subset, it would be a lot easier to
> render these ODF's in an MS Office application ?
>
Where the Foundation and hopefully OASIS, OOo, KWord, etc., are headed
with this is a compatibility mode that could be set in the
application. I.e., if a system administrator wants high interop with
Microsoft Office, the apps are set to compatibility mode. If that
isn't a concern, the apps can be set to employ the full range of ODF
features supported by the app. I presume app developers could make it
easy for users to switch back and forth, depending on whether they
need the superset or subset for a particular use.

E.g., if a user loads a file generated in ODF by MS Office and
containing metadata identifying it as an interop subset document, the
application could be set to automatically perform any edits on the
document in compatibility mode, but revert to the superset on any
other documents unless the compatibility mode were instantiated
manually by the user.

There would need to be a set of permissions for controlling such
features, though. For example, a system admin may want to exclude
users' ability to use anything other than the interop subset.
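
The mode selection described above can be sketched in a few lines of
Python. Everything here is hypothetical: the class and field names are
invented for illustration, and the precedence (admin lock first, then
an explicit user choice, then the document's own metadata) is one
plausible reading, not a design taken from the Foundation or OASIS.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Mode(Enum):
    FULL_ODF = "full"          # full range of ODF features
    INTEROP_SUBSET = "subset"  # MS Office compatibility mode


@dataclass
class Policy:
    force_subset: bool = False  # set by the system admin


@dataclass
class Document:
    declares_interop_subset: bool = False  # read from document metadata


def effective_mode(policy: Policy, doc: Document,
                   user_choice: Optional[Mode] = None) -> Mode:
    """Pick the editing mode for one document."""
    if policy.force_subset:
        # Admin lock: users cannot leave the interop subset.
        return Mode.INTEROP_SUBSET
    if user_choice is not None:
        # Mode set manually by the user.
        return user_choice
    if doc.declares_interop_subset:
        # Document identifies itself as an interop subset document.
        return Mode.INTEROP_SUBSET
    return Mode.FULL_ODF
```

For example, effective_mode(Policy(), Document(declares_interop_subset=True))
selects the subset automatically, while a locked-down Policy(force_subset=True)
overrides whatever the user asks for.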

It's all about choice, as Microsoft so often reminds us. :-)

> > Do you believe Clever Age has solved that problem?
>

No. The Foundation's lead developer is the guy doing the work for
Novell on the Clever Age converter. Given that there is no version of
OOo publicly available that implements the proposed interop subset,
the OOWriter --> MS Word trip has to be lossy. It would be less lossy
were the changes proposed for ODF 1.2 implemented in the major ODF
word processors, although for all of the plugins the problem will
remain for pre-ODF 1.2 documents unless they are first saved in an ODF
1.2 app set for compatibility mode.

But the Clever Age approach should presently be less lossy in the MS
Word --> ODF direction, at least insofar as it is affected by the
inability to map features from one format to the other.

> Their detailed spread sheet of translated and non-translated features
> clearly admits that not everything is translated. I think there is no discussion
> that both formats (ODF and EOOXML) have features that are not easily
> (or not possible) to represent into the other format. Or do you see an
> asymmetry ?
>
I haven't studied this personally, but the Foundation folk say that
for the word processors, the serious problems are in the ODF --> MS
Office direction, because the page layout tag set of OOWriter (and
hence of ODF) is far richer than Word's.

I hope that fairly answers your questions. I'll remind you that I am
not a developer, so I can easily misunderstand involved technical info,
and I am relying on what I've been told and have learned through
checking things out in various ways, rather than direct personal
experience with the Foundation's plug-in.

Please let me know if what I have written leaves something unanswered
or raises further questions.

Best regards,

Marbux


