[odf-discuss] ODF and UTF-8/16/32-HEX/DEC/DECHTML
Damon Anderson
damon at corigo.com
Wed Feb 21 12:27:09 EST 2007
My apologies for not being more thorough in my explanation. A wonderful
job of clarification.
Where I take issue is with using &lt; and &gt;. HTML supports UTF-8
Decimal fully, so should XML. &lt; and &gt; should be XML extensions, not
defaults. If our standards don't interact with each other, what good are they?
These same characters can be properly 'identified' as special characters
using UTF-8 (a full and international standard) as follows: < = &#60; =
&lt; | > = &#62; = &gt; | & = &#38; = &amp; (this is UTF-8 Decimal in
HTML).
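To make the three notations concrete, here is a minimal Python sketch that
feeds the named, decimal and hexadecimal forms of the same three characters
to the XML parser in Python's standard library and compares what comes out.
The sample strings and the names `forms` and `decoded` are made up for this
illustration; this only shows how one parser behaves, not what the ODF
specification requires.

```python
import xml.etree.ElementTree as ET

# Three spellings of the same text "a < b & c > d" (illustrative samples).
forms = {
    "named": "<p>a &lt; b &amp; c &gt; d</p>",
    "decimal": "<p>a &#60; b &#38; c &#62; d</p>",
    "hex": "<p>a &#x3C; b &#x26; c &#x3E; d</p>",
}

# Parse each spelling and keep the decoded text content of the <p> element.
decoded = {name: ET.fromstring(xml).text for name, xml in forms.items()}

for name, text in decoded.items():
    print(f"{name:8} -> {text}")
```

With this parser, all three spellings decode to the same text, so the
choice between them affects how the file is written rather than what a
reader gets back.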
Also I see almost all vendors are implementing UTF-16HEX (including OOo).
I understand that at the root ODF is an XML standard, but its long-term
goal is far different than that of XML, i.e. it is to provide the ability
to create a single document standard. Given that 90% of the world's
languages are non-ASCII, this standard seems to be UTF, and UTF-8 probably
isn't enough (given that Chinese and Japanese are already pushing into
requirements of UTF-32), so it seems like UTF-16HEX should be the de facto
standard.
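For reference when comparing the encodings, here is a small Python sketch
that reports how many bytes one character occupies in UTF-8, UTF-16 and
UTF-32. The three sample characters and the helper name `encoded_sizes`
are my own picks for illustration; the little-endian codec names are used
only so that no byte-order mark is counted.

```python
# Sample characters chosen for illustration only.
samples = {
    "LATIN CAPITAL A (U+0041)": "A",
    "CJK ideograph (U+4E2D)": "\u4e2d",
    "CJK Extension B (U+20000)": "\U00020000",
}

def encoded_sizes(ch):
    """Return the byte length of one character in each encoding form."""
    # The "-le" codecs do not prepend a byte-order mark.
    return {enc: len(ch.encode(enc)) for enc in ("utf-8", "utf-16-le", "utf-32-le")}

for label, ch in samples.items():
    print(label, encoded_sizes(ch))
```

The byte counts differ per character and per encoding, which is the
trade-off to weigh when picking one for a document format.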
I live, work and develop in Asia, and the lack of proper UTF support from
databases to OOo has been staggeringly difficult to overcome. I may be
localising into 13 Asian languages this year, and I can tell you right now
that handling little annoying things like XML using a non-standard
denomination of &lt; instead of &#60; coming out of my keyboard driver
means that much more work for me to conform to the other international
standards I need to conform to beyond ODF.
I understand that UTF is complex (heck, it's 9 standards, not 1), but it is
the only truly viable solution available to digitally encode the diversity
of human language, and it is an ISO standard! Shouldn't these two
standards be linked somehow, instead of continuously having to recreate an
interface layer between a partially defined standard, where a tiny subset
of characters are given non-standard denominations like &lt; in XML, and a
full and complete standard that is being implemented the world over from
hardware to software like UTF?
In other words, since the goal of ODF is Documents, and not XML (XML has
other goals) shouldn't the overriding standard for character encoding be
related to documents (Unicode) and not XML? (who wrote this poorly
internationalized XML standard anyway? -jk)
-Damon
On Wed, 21 Feb 2007 21:47:14 +0700, Daniel Carrera
<daniel.carrera at zmsl.com> wrote:
> On Wed, 2007-21-02 at 21:22 +0700, Damon Anderson wrote:
>> A technical question about ODF. How does the specification define the
>> transition between ASCII, extended ASCII, and UTF-16HEX? I see OOo
>> making
>> mistakes for example where they convert the ampersand (&) to HTML
>> (&amp;)
>> rather than UTF-16HEX, that can't be allowed in the ODF spec surely?
>
> Yes, it's allowed, and it's actually mandated. The ampersand thing is
> part of XML itself, not HTML. In XML, the characters & < and > have
> special meaning; and so you need "identities" for when you want to
> denote those characters in your document.
>
> You know that XML uses <tags> <like> <this>. What if your document
> actually contains a '>'? You have to call it &gt; (and < becomes &lt;).
> But to make that work, ampersand (&) has to be a special character too,
> so you need an identity for ampersand too, and so you get &amp;.
>
> All other entities you know (&euro; &pound; etc) are HTML. But &lt; &gt;
> and &amp; are XML.
>
> This issue has nothing to do with ASCII. The ampersand is in ASCII. It
> is character 38. And ODF is not limited to ASCII. ODF documents are
> generally UTF-8, but AFAIK even that is not mandated.
>
>
> Cheers,
> Daniel.
> -- Catalan is essentially Spanish and French spoken at the same time.
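
The escaping rule described in the quoted reply can be sketched in a few
lines of Python. The function name `escape_xml_text` is hypothetical and
written only for this illustration; `escape` and `unescape` from
`xml.sax.saxutils` are the standard-library helpers it is compared with.
This is a sketch of the general XML rule, not of what OOo does internally.

```python
from xml.sax.saxutils import escape, unescape

def escape_xml_text(text):
    """Escape the three characters that are special in XML text content."""
    # The ampersand must be replaced first; otherwise the "&" that starts
    # "&lt;" and "&gt;" would itself be escaped a second time.
    text = text.replace("&", "&amp;")
    text = text.replace("<", "&lt;")
    text = text.replace(">", "&gt;")
    return text

sample = "1 < 2 & 3 > 2"
escaped = escape_xml_text(sample)

print(escaped)
print(escaped == escape(sample))    # matches the standard-library helper
print(unescape(escaped) == sample)  # unescaping restores the original text
```

Doing the ampersand replacement first is the reason & needs an entity of
its own: once & introduces the other entities, a literal & can no longer
be written bare.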
--
Using Opera's revolutionary e-mail client: http://www.opera.com/mail/