[odf-discuss] ODF and UTF-8/16/32-HEX/DEC/DECHTML

Damon Anderson damon at corigo.com
Wed Feb 21 12:27:09 EST 2007


My apologies for not being more thorough in my explanation. A wonderful  
job of clarification.

Where I take issue is with using &lt; and &gt;. HTML supports UTF-8
Decimal fully, so should XML. &lt; and &gt; should be XML extensions, not
defaults. If our standards don't interact with each other, what good are
they? These same characters can be properly 'identified' as special
characters using UTF-8 (a full and international standard) as follows:
< = &lt; = &#60; | > = &gt; = &#62; | & = &amp; = &#38; (this is UTF-8
Decimal in HTML).
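
To make the equivalences above concrete, here is a minimal sketch using only
the Python standard library. It feeds the named, decimal and hexadecimal
spellings of each character to an XML parser and to an HTML unescaper and
checks what comes back. The helper names decode_xml and decode_html and the
throwaway <t> element are assumptions made for illustration; they are not
part of ODF or of any tool discussed in this thread.

```python
import html
import xml.etree.ElementTree as ET

# Named entity, decimal reference and hexadecimal reference for each character.
SPELLINGS = {
    "<": ["&lt;", "&#60;", "&#x3C;"],
    ">": ["&gt;", "&#62;", "&#x3E;"],
    "&": ["&amp;", "&#38;", "&#x26;"],
}


def decode_xml(ref):
    # Hypothetical helper: wrap the reference in a throwaway element
    # and let the XML parser resolve it.
    return ET.fromstring("<t>" + ref + "</t>").text


def decode_html(ref):
    # Hypothetical helper: resolve the reference the way an HTML reader would.
    return html.unescape(ref)


for char, refs in SPELLINGS.items():
    for ref in refs:
        # Report what each reader makes of each spelling.
        print(ref, "xml:", repr(decode_xml(ref)), "html:", repr(decode_html(ref)))
```

If the loop prints the same character for every spelling on a row, the reader
in question treats those spellings as interchangeable on input.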

Also, I see almost all vendors are implementing UTF-16HEX (including OOo).
I understand that at its root ODF is an XML standard, but its long-term
goal is far different from that of XML, i.e. it is to provide the ability
to create a single document standard. Given that 90% of the world's
languages are non-ASCII, this standard seems to be UTF, and UTF-8 probably
isn't enough (given that Chinese and Japanese are already pushing into
requirements of UTF-32), so it seems like UTF-16HEX should be the de facto
standard.
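
As a reference point for the size question, a small sketch (Python standard
library only) that reports how many bytes a few sample characters take in
UTF-8, UTF-16 and UTF-32, and whether each one survives a round trip. The
sample characters and the helper name encoded_sizes are arbitrary choices
made for illustration.

```python
# Sample characters: one ASCII letter, two common CJK characters, and one
# ideograph from outside the Basic Multilingual Plane.
SAMPLES = {
    "A": "Latin capital A (U+0041)",
    "\u4e2d": "CJK ideograph (U+4E2D)",
    "\u3042": "Hiragana A (U+3042)",
    "\U00020000": "CJK Extension B ideograph (U+20000)",
}

# Explicit little-endian codecs are used so no byte order mark is counted.
ENCODINGS = ("utf-8", "utf-16-le", "utf-32-le")


def encoded_sizes(ch):
    # Hypothetical helper: byte length of one character in each encoding.
    return {enc: len(ch.encode(enc)) for enc in ENCODINGS}


def round_trips(ch):
    # True if the character decodes back to itself in every encoding.
    return all(ch.encode(enc).decode(enc) == ch for enc in ENCODINGS)


for ch, label in SAMPLES.items():
    print(label, encoded_sizes(ch), "round trip ok:", round_trips(ch))
```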

I live, work and develop in Asia, and the lack of proper UTF support from  
databases to OOo has been staggeringly difficult to overcome. I may be
localising into 13 Asian languages this year, and I can tell you right now  
that handling little annoying things like XML using a non-standard  
denomination of &lt; instead of &#60; coming out of my keyboard driver  
means that much more work for me to conform to the other international  
standards I need to conform to beyond ODF.

I understand that UTF is complex (heck it's 9 standards not 1), but it is
the only truly viable solution available to digitally encode the diversity
of human language, and it is an ISO standard! Shouldn't these two
standards be linked somehow, instead of us continuously having to recreate
an interface layer between a partially defined standard, where a tiny
subset of characters are given non-standard denominations like &lt; in XML,
and a full and complete standard like UTF that is being implemented the
world over, from hardware to software?

In other words, since the goal of ODF is Documents, and not XML (XML has
other goals), shouldn't the overriding standard for character encoding be
related to documents (Unicode) and not XML? (who wrote this poorly
internationalized XML standard anyway? -jk)

-Damon


On Wed, 21 Feb 2007 21:47:14 +0700, Daniel Carrera  
<daniel.carrera at zmsl.com> wrote:

> On Wed, 2007-21-02 at 21:22 +0700, Damon Anderson wrote:
>> A technical question about ODF. How does the specification define the
>> transition between ASCII, extended ASCII, and UTF-16HEX? I see OOo
>> making mistakes, for example where they convert the ampersand (&) to
>> HTML (&amp;) rather than UTF-16HEX; that can't be allowed in the ODF
>> spec, surely?
>
> Yes, it's allowed, and it's actually mandated. The ampersand thing is
> part of XML itself, not HTML. In XML, the characters & < and > have
> special meaning; and so you need "identities" for when you want to
> denote those characters in your document.
>
> You know that XML uses <tags> <like> <this>. What if your document
> actually contains a '>'? You have to call it &gt; (and < becomes &lt;).
> But to make that work, ampersand (&) has to be a special character too,
> so you need an identity for ampersand too, and so you get &amp;.
>
> All other entities you know (&euro; &pound; etc) are HTML. But &lt; &gt;
> and &amp; are XML.
>
> This issue has nothing to do with ASCII. The ampersand is in ASCII. It
> is character 38. And ODF is not limited to ASCII. ODF documents are
> generally UTF-8, but AFAIK even that is not mandated.
>
>
> Cheers,
> Daniel.
> -- Catalan is essentially Spanish and French spoken at the same time.
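
The escaping rule described in the quoted reply can be sketched in a few
lines of Python. This is only an illustration under stated assumptions: it
uses escape and unescape from the standard library's xml.sax.saxutils
module, the sample strings and the <p> wrapper element are made up, and it
says nothing about how OOo itself writes ODF files.

```python
from xml.sax.saxutils import escape, unescape
import xml.etree.ElementTree as ET

# Made-up sample text containing all three characters that are special in XML.
text = "if a < b && b > c: print('R&D')"

# escape() rewrites &, < and > so the text can sit safely between tags.
escaped = escape(text)
print(escaped)

# An XML parser turns the escaped form back into the original text.
parsed = ET.fromstring("<p>" + escaped + "</p>").text
print(parsed == text)

# unescape() is the direct inverse for these three characters.
print(unescape(escaped) == text)

# Non-ASCII characters are left alone; only the ampersand is rewritten here.
mixed = "\u65e5\u672c\u8a9e & \u4e2d\u6587"
print(escape(mixed))
```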





