Character Model for the World Wide Web 1.0
W3C Working Draft 20 February 2002
This version:
    (available in XML, HTML, and as a Zip archive)
Latest version:
Previous versions:
Editors:
    Martin J. Dürst (W3C)
    François Yergeau (Alis Technologies)
    Richard Ishida (Xerox, GKLS)
    Misha Wolf (Reuters Ltd.)
    Asmus Freytag (ASMUS, Inc.)
    Tex Texin (Progress Software Corp.)
Copyright © 2002 W3C (MIT, INRIA, Keio), All Rights Reserved. W3C liability, trademark, document use, and software licensing rules apply.
Abstract
This Architectural Specification provides authors of specifications,
software developers, and content developers with a common reference for
interoperable text manipulation on the World Wide Web. Topics addressed include
encoding identification, early uniform normalization, string identity matching,
string indexing, and URI conventions, building on the Universal Character Set,
defined jointly by Unicode and ISO/IEC 10646. Some introductory material on
characters and character encodings is also provided.
Status of this Document
This section describes the status of this document at the time
of its publication. Other documents may supersede this document. The latest
status of this series of documents is maintained at the W3C.
This is a W3C Working Draft published between the first Last Call Working Draft of 26 January 2001 and a planned second Last Call. This interim publication is used to document further progress made on addressing the comments received during the first Last Call. A list of last call comments with their status can be found in the disposition of comments (Members only).
Work is still ongoing on addressing the comments received during the
first Last Call. We do not encourage comments on this Working Draft; instead we
ask reviewers to wait for the second Last Call. We will announce the second
Last Call on the W3C Internationalization public mailing list (www-international@w3.org). Comments from the public and from organizations outside the W3C may be sent to www-i18n-comments@w3.org (archive).
Comments from W3C Working Groups may be sent directly to the
Internationalization Interest Group (w3c-i18n-ig@w3.org), with cross-posting to
the originating Group, to facilitate discussion and resolution.
Due to the architectural nature of this document, it affects a large
number of W3C Working Groups, but also software developers, content developers,
and writers and users of specifications outside the W3C that have to interface
with W3C specifications.
This document is published as part of the W3C Internationalization Activity by the Internationalization Working Group (Members only), with the help of the Internationalization Interest Group. The Internationalization Working Group will not allow early implementation to constrain its ability to make changes to this specification prior to final release. Publication as a Working Draft does not imply endorsement by the W3C Membership. It is inappropriate to use W3C Working Drafts as reference material or to cite them as other than "work in progress". A list of current W3C Recommendations and other technical documents can be found at the W3C Web site.
For information about the requirements that informed the development of important parts of this specification, see Requirements for String Identity Matching and String Indexing [CharReq].
Table of Contents
1 Introduction
    1.1 Goals and Scope
    1.2 Background
    1.3 Terminology and Notation
2 Conformance
3 Characters
    3.1 Perceptions of Characters
        3.1.1 Introduction
        3.1.2 Units of Aural Rendering
        3.1.3 Units of Visual Rendering
        3.1.4 Units of Input
        3.1.5 Units of Collation
        3.1.6 Units of Storage
        3.1.7 Summary
    3.2 Digital Encoding of Characters
    3.3 Transcoding
    3.4 Strings
    3.5 Reference Processing Model
    3.6 Choice and Identification of Character Encodings
        3.6.1 Mandating a unique character encoding
        3.6.2 Character Encoding Identification
        3.6.3 Private Use Code Points
    3.7 Character Escaping
4 Early Uniform Normalization
    4.1 Motivation
        4.1.1 Why do we need character normalization?
        4.1.2 The choice of early uniform normalization
    4.2 Definitions for W3C Text Normalization
        4.2.1 Unicode-normalized Text
        4.2.2 Include-normalized Text
        4.2.3 Fully Normalized Text
        4.2.4 Examples
    4.3 Responsibility for Normalization
5 Compatibility and Formatting Characters
6 String Identity Matching
7 String Indexing
8 Character Encoding in URI References
9 Referencing the Unicode Standard and ISO/IEC 10646
Appendices
A Examples of Characters, Keystrokes and Glyphs
B Acknowledgements
C References
    C.1 Normative References
    C.2 Other References
D Change Log (Non-Normative)
    D.1 Changes since
    D.2 Changes since
1 Introduction
1.1 Goals and Scope
The goal of this document is to facilitate use of the Web by all
people, regardless of their language, script, writing system, and cultural
conventions, in accordance with the
W3C goal of universal
access
. One basic prerequisite to achieve this goal is to be able to
transmit and process the characters used around the world in a well-defined and
well-understood way.
The main target audience of this document is W3C specification
developers. This document defines conformance requirements for other W3C
specifications. This document and parts of it can also be referenced from other
W3C specifications.
Other audiences of this document include software developers,
content developers, and authors of specifications outside the W3C. Software
developers and content developers implement and use W3C specifications. This
document defines some conformance requirements for software developers and
content developers that implement and use W3C specifications. It also helps
software developers and content developers to understand the character-related
provisions in other W3C specifications.
The character model described in this document provides authors of
specifications, software developers, and content developers with a common
reference for consistent, interoperable text manipulation on the World Wide
Web. Working together, these three groups can build a more international
Web.
Topics addressed include encoding identification, early uniform
normalization, string identity matching, string indexing, and URI conventions.
Some introductory material on characters and character encodings is also
provided.
Topics not addressed or barely touched include collation (sorting),
fuzzy matching and language tagging. Some of these topics may be addressed in a
future version of this specification.
At the core of the model is the Universal Character Set (UCS),
defined jointly by The Unicode Standard
[Unicode]
and ISO/IEC
10646
[ISO/IEC 10646]
. In this document,
Unicode
is used
as a synonym for the Universal Character Set. The model will allow Web
documents authored in the world's scripts (and on different platforms) to be
exchanged, read, and searched by Web users around the world.
All W3C specifications must conform to this document (see section
2 Conformance
). Authors of other specifications (for
example, IETF specifications) are strongly encouraged to take guidance from
it.
Since other W3C specifications will be based on some of the
provisions of this document, without repeating them, software developers
implementing W3C specifications must conform to these provisions.
1.2 Background
This section provides some historical background on the topics
addressed in this document.
Starting with
Internationalization of the Hypertext Markup
Language
[RFC 2070]
, the Web community has recognized
the need for a character model for the World Wide Web. The first step towards
building this model was the adoption of Unicode as the document character set
for HTML.
The choice of Unicode was motivated by the fact that Unicode:
is the only universal character repertoire available,
covers the widest possible range,
provides a way of referencing characters independent of the
encoding of a resource,
is being updated/completed carefully,
is widely accepted and implemented by industry.
W3C adopted Unicode as the document character set for HTML in
[HTML 4.0]
. The same approach was later used for specifications
such as XML 1.0
[XML 1.0]
and CSS2
[CSS2]
. Unicode
now serves as a common reference for W3C specifications and applications.
The IETF has adopted some policies on the use of character sets on
the Internet (see
[RFC 2277]
).
When data transfer on the Web remained mostly unidirectional (from
server to browser), and where the main purpose was to render documents, the use
of Unicode without specifying additional details was sufficient. However, the
Web has grown:
Data transfers among servers, proxies, and clients, in all
directions, have increased.
Non-ASCII characters
[MIME]
are being used in
more and more places.
Data transfers between different protocol/format elements
(such as element/attribute names, URI components, and textual content) have
increased.
More and more APIs are defined, not just protocols and
formats.
In short, the Web may be seen as a single, very large application
(see
[Nicol]
), rather than as a collection of small independent
applications.
While these developments strengthen the requirement that Unicode be
the basis of a character model for the Web, they also create the need for
additional specifications on the application of Unicode to the Web. Some
aspects of Unicode that require additional specification for the Web include:
Choice of encoding forms (UTF-8, UTF-16, UTF-32).
Counting characters and measuring string length (in the presence
of variable-length encodings and combining characters).
Duplicate encodings (e.g. precomposed vs decomposed).
Use of control codes for various purposes (e.g.
bidirectionality control, symmetric swapping, etc.).
It should be noted that such properties also exist in legacy
encodings (where
legacy encoding
is taken to mean any character
encoding not based on Unicode), and in many cases have been inherited by
Unicode in one way or another from such legacy encodings.
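As an illustration of the "duplicate encodings" issue above (not part of this specification), the following Python sketch shows that precomposed and decomposed forms of the same user-perceived character are distinct code point sequences, and that normalization maps them onto a single form:

```python
import unicodedata

precomposed = "\u00E9"   # 'é' as a single code point, LATIN SMALL LETTER E WITH ACUTE
decomposed = "e\u0301"   # 'e' followed by U+0301 COMBINING ACUTE ACCENT

# The two strings render identically but differ in length and content.
print(len(precomposed), len(decomposed))   # 1 2
print(precomposed == decomposed)           # False

# Unicode Normalization Form C collapses the duplicate encodings.
print(unicodedata.normalize("NFC", decomposed) == precomposed)  # True
```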
The remainder of this document presents additional specifications
and requirements to ensure an interoperable character model for the Web, taking
into account earlier work (from W3C, ISO and IETF).
1.3 Terminology and Notation
For the purpose of this specification, the
producer
of
text data is the sender of the data in the case of protocols, and the tool that
produces the data in the case of formats. The
recipient
of text
data is the software module that receives the data.
NOTE:
A software module may be both a recipient and a producer.
Unicode code points are denoted as U+hhhh, where "hhhh" is a
sequence of at least four, and at most six hexadecimal digits.
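This notation can be produced mechanically from a code point value. The following Python sketch (illustrative only; the function name is our own, not part of this specification) formats code points in the U+hhhh style:

```python
def u_notation(code_point: int) -> str:
    """Format a code point as U+hhhh: at least four,
    at most six uppercase hexadecimal digits."""
    return f"U+{code_point:04X}"

print(u_notation(ord("A")))   # U+0041
print(u_notation(0x233B4))    # U+233B4 (five digits, no padding needed)
```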
2 Conformance
In this document, requirements are expressed using the key words "MUST", "MUST NOT", "REQUIRED", "SHALL" and "SHALL NOT". Recommendations are expressed using the key words "SHOULD", "SHOULD NOT" and "RECOMMENDED" (see the note below). "MAY" and "OPTIONAL" are used to indicate optional features or behaviour. These keywords are used in accordance with RFC 2119 [RFC 2119].
NOTE: RFC 2119 makes it clear that requirements that use SHOULD are not optional and should be complied with unless there are specific reasons not to: "This word, or the adjective "RECOMMENDED", mean that there may exist valid reasons in particular circumstances to ignore a particular item, but the full implications must be understood and carefully weighed before choosing a different course."
This specification places conformance requirements on specifications, on software and on Web content. To aid the reader, all requirements are preceded by '[X]' where 'X' is one of 'S' for specifications, 'I' for software implementations, and 'C' for Web content. These markers indicate the relevance of the requirement and allow the reader to quickly locate relevant requirements using the browser's search function.
[S] [I] [C] In order to conform to this document, specifications MUST NOT violate any requirements preceded by [S], software MUST NOT violate any requirements preceded by [I], and content MUST NOT violate any requirements preceded by [C].
[S]
Every W3C specification
MUST
conform to the requirements applicable to specifications,
specify that implementations
MUST
conform to
the requirements applicable to software, and
specify that content created according to that specification
MUST
conform to the requirements applicable to content.
[S]
If an existing W3C specification
does not conform to the requirements in this document, then the next version of
that specification
SHOULD
be modified in order to
conform.
[I]
Where this specification contains
a procedural description, it
MUST
be understood as a way to
specify the desired external behavior. Implementations
MAY
use other ways of achieving the same results, as long as observable behavior is
not affected.
3 Characters
3.1 Perceptions of Characters
3.1.1 Introduction
The glossary entry in [Unicode 3.0] gives: "Character. (1) The smallest component of written language that has semantic value; refers to the abstract meaning and/or shape ..."
The word '
character
' is used in many contexts, with
different meanings. Human cultures have radically differing writing systems,
leading to radically differing concepts of a character. Such wide variation in
end user experience can, and often does, result in misunderstanding. This
variation is sometimes mistakenly seen as the consequence of imperfect
technology. Instead, it derives from the great flexibility and creativity of
the human mind and the long tradition of writing as an important part of the
human cultural heritage. The alphabetic approach used by scripts such as Latin,
Cyrillic and Greek is only one of several possibilities.
EXAMPLE:
Japanese
hiragana and katakana are syllabaries. A character in these scripts corresponds
to a syllable (usually a combination of consonant plus vowel).
EXAMPLE:
Korean Hangul is a featural syllabary that combines symbols for
individual sounds of the language into square syllabic blocks. Depending on the
user and the application, either the individual symbols or the syllabic
clusters can be considered to be characters.
EXAMPLE:
Indic scripts
are abugidas. Each consonant letter carries an inherent vowel that is
eliminated or replaced using semi-regular or irregular ways to combine
consonants and vowels into clusters. Depending on the user and the application,
either individual consonants or vowels, or the consonant or consonant-vowel
clusters can be perceived as characters.
EXAMPLE:
Arabic script is
an example of an abjad. Short vowel sounds are typically not written at all.
When they are written they are indicated by the use of combining marks placed
above and below the consonantal letters.
The developers of W3C specifications, and the developers of
software based on those specifications, are likely to be more familiar with
usages they have experienced and less familiar with the wide variety of usages
in an international context. Furthermore, within a computing context,
characters are often confused with related concepts, resulting in incomplete or
inappropriate specifications and software.
This section examines some of these contexts, meanings and
confusions.
3.1.2 Units of Aural Rendering
In some scripts, characters have a close relationship to phonemes
(a
phoneme
is a minimally distinct sound in the context of a
particular spoken language), while in others they are closely related to
meanings. Even when characters (loosely) correspond to phonemes, this
relationship may not be simple, and there is rarely a one-to-one correspondence
between character and phoneme.
EXAMPLE: In the English sentence "They were too close to the door to close it." the same character 's' is used to represent both /s/ and /z/ phonemes.
EXAMPLE:
In many scripts a single character may represent a sequence of
phonemes, such as the syllabic characters of Japanese hiragana.
EXAMPLE: In many writing systems a sequence of characters may represent a single phoneme, for example 'wr' and 'ng' in "writing".
[S]
[I]
Specifications
and software
MUST NOT
assume that there is a one-to-one
correspondence between characters and the sounds of a
language.
3.1.3 Units of Visual Rendering
Visual rendering introduces the notion of a glyph. Glyphs are defined by ISO/IEC 9541-1 [ISO/IEC 9541-1] as "a recognizable abstract graphic symbol which is independent of a specific design". There is not a one-to-one correspondence between characters and glyphs:
A single character can be represented by multiple glyphs
(each glyph is then part of the representation of that character). These glyphs
may be physically separated from one another.
A single glyph may represent a sequence of characters (this
is the case with ligatures, among others).
A character may be rendered with very different glyphs
depending on the context.
A single glyph may represent different characters (e.g.
capital Latin A, capital Greek A and capital Cyrillic A).
Each glyph can be represented by a number of different glyph
images; a set of glyph images makes up a
font
. Glyphs can be
construed as the basic units of organization of the visual rendering of text,
just as characters are the basic unit of organization of encoded text.
[S]
[I]
Specifications
and software
MUST NOT
assume a one-to-one mapping between
character codes and units of displayed text.
See the appendix
A Examples of Characters, Keystrokes and
Glyphs
for examples of the
complexities of character to glyph mapping.
Some scripts, in particular Arabic and Hebrew, are written from
right to left. Text including characters from these scripts can run in both
directions and is therefore called bidirectional text (see example
A.6
in Appendix A). The Unicode
Standard
[Unicode]
requires that characters be stored and
interchanged in logical order.
[S]
Protocols,
data formats and APIs
MUST
store, interchange or process
text data in logical order.
In the presence of bidirectional text, two possible
selection modes must be considered. The first is
logical selection
mode
, which selects all the characters
logically
located
between the end-points of the user's mouse gesture. Here the user selects from
between the first and second letters of the second word to the middle of the
number. Logical selection looks like this:
[Figure: the selection shown in memory and on screen]
It is a consequence of the bidirectionality of the text that a
single, continuous logical selection in memory results in a
discontinuous
selection appearing on the screen
. This discontinuity, as well as the
somewhat unintuitive behavior of the cursor, makes some users prefer a
visual selection mode
, which selects all the characters
visually
located between the end-points of the user's mouse
gesture. With the same mouse gesture as before, we now obtain:
[Figure: the selection shown in memory and on screen]
In this mode, a single visual selection range results in
two
logical ranges, which have to be accommodated by protocols,
APIs and implementations.
[S]
Specifications of protocols
and APIs that involve selection of ranges
SHOULD
provide for
discontiguous selections, at least to the extent necessary to support
implementation of visual selection on screen on top of those protocols and
APIs.
3.1.4 Units of Input
In keyboard input, it is
not
always the case that
keystrokes and input characters correspond one-to-one. A limited number of keys
can fit on a keyboard. Some keyboards will generate multiple characters from a
single keypress. In other cases ('
dead keys
') a key will generate
no characters, but affect the results of subsequent keypresses. Many writing
systems have far too many characters to fit on a keyboard and must rely on more
complex
input methods
, which transform keystroke sequences into
character sequences. Other languages may make it necessary to input some
characters with special modifier keys. See
A Examples of Characters, Keystrokes and
Glyphs
for examples of non-trivial input.
[S]
[I]
Specifications
and software
MUST NOT
assume that a single keystroke results
in a single character, nor that a single character can be input with a single
keystroke (even with modifiers), nor that keyboards are the same all over the
world.
3.1.5 Units of Collation
String comparison as used in sorting and searching is based on
units which do not in general have a one-to-one relationship to encoded
characters. Such string comparison can aggregate a character sequence into a
single
collation unit
with its own position in the sorting order,
can separate a single character into multiple collation units, and can
distinguish various aspects of a character (case, presence of diacritics, etc.)
to be sorted separately (multi-level sorting).
In addition, a certain amount of pre-processing may also be
required, and in some languages (such as Japanese and Arabic) sort order may be
governed by higher order factors such as phonetics or word roots. Collation
methods may also vary by application.
EXAMPLE: In traditional Spanish sorting, the letter sequences 'ch' and 'll' are treated as atomic collation units. Although Spanish sorting, and to some extent Spanish everyday use, treat 'ch' as a single unit, current digital encodings treat it as two letters, and keyboards do the same (the user types 'c', then 'h').
EXAMPLE: In most languages, the letter 'æ' is sorted as two consecutive collation units: 'a' and 'e'.
EXAMPLE:
The sorting of text written in a
bicameral script (i.e. a script which has distinct upper and lower case
letters) is usually required to ignore case differences in a first pass; case
is then used to break ties in a later pass.
EXAMPLE: Treatment of accented letters in sorting is dependent on the script or language in question. The letter 'ö' is treated as a modified 'o' in French, but as a letter completely independent from 'o' (and sorting after 'z') in Swedish. In German certain applications treat the letter 'ö' as if it were the sequence 'oe'.
EXAMPLE:
In Thai the sequence U+0E44 U+0E01 must
be sorted as if it was written U+0E01 U+0E44. Reordering is typically done
during an initial pre-processing stage.
EXAMPLE: German dictionaries typically sort 'ä', 'ö' and 'ü' together with 'a', 'o' and 'u' respectively. On the other hand, German telephone books typically sort 'ä', 'ö' and 'ü' as if they were spelled 'ae', 'oe' and 'ue'. Here the application is affecting the collation algorithm used.
[S]
[I]
Software
that sorts or searches text for users
MUST
do so on the
basis of appropriate collation units and ordering rules for the relevant
language and/or application.
Note that, where searching or sorting is done dynamically, particularly in
a multilingual environment, the 'relevant language' should be determined to be that of
the current user, and may thus differ from user to user.
[S]
[I]
Software
that allows users to sort or search text
SHOULD
allow the user to select
alternative rules for collation units and ordering.
[S]
[I]
When sorting and searching in the context of a particular language, it
MUST
be possible to deal gracefully with strings
being compared that contain Unicode characters not normally associated with that language.
A default collation order for all Unicode characters can be obtained
from ISO/IEC 14651
[ISO/IEC 14651]
or from Unicode Technical Report #10, the Unicode Collation Algorithm
[UTR #10]
. This default ordering can be used in conjunction with rules tailored for a particular locale
to ensure a predictable ordering and comparison of strings, whatever
characters they include.
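As a deliberately simplistic illustration (not a conforming collation implementation; real software should use tailored collation data such as the Unicode Collation Algorithm), the following Python sketch shows why raw code point comparison matches no language's rules, and how a tailored key changes the order. The key function is our own hypothetical example:

```python
# Naive sorting compares code points: 'ö' (U+00F6) sorts after
# every ASCII letter, which matches no language's expectations.
words = ["öl", "of", "order"]
print(sorted(words))   # ['of', 'order', 'öl']

# A toy German-dictionary-style key that, as described above,
# treats 'ä', 'ö', 'ü' like 'a', 'o', 'u' (illustration only).
def german_dictionary_key(word: str) -> str:
    return word.replace("ä", "a").replace("ö", "o").replace("ü", "u")

print(sorted(words, key=german_dictionary_key))   # ['of', 'öl', 'order']
```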
3.1.6 Units of Storage
Computer storage and communication rely on units of physical
storage and information interchange, such as bits and bytes (also known as
octets, as nowadays the word bytes is generally considered to mean 8-bit
bytes). A frequent error in specifications and implementations is the equating
of characters with units of physical storage. The mapping between characters
and such units of storage is actually quite complex, and is discussed in the next section, 3.2 Digital Encoding of Characters.
[S]
[I]
Specifications
and software
MUST NOT
assume a one-to-one relationship
between characters and units of physical storage.
3.1.7 Summary
The term
character
is used differently in a variety
of contexts and often leads to confusion when used outside of these contexts.
In the context of the digital representations of text, a character can be
defined informally as a small logical unit of text.
Text
is then
defined as sequences of characters. While such an informal definition is
sufficient to create or capture a common understanding in many cases, it is
also sufficiently open to create misunderstandings as soon as details start to
matter. In order to write effective specifications, protocol implementations,
and software for end users, it is very important to understand that these
misunderstandings can occur.
[S]
When specifications use the
term '
character
' it
MUST
be clear which of the
possible meanings they intend.
[S]
Specifications
SHOULD
avoid the use of the term '
character
' if a more specific term is
available.
3.2 Digital Encoding of Characters
To be of any use in computers, in computer communications and in
particular on the World Wide Web, characters must be encoded. In fact, much of
the information processed by computers over the last few decades has been
encoded text, exceptions being images, audio, video and numeric data. To
achieve text encoding, a large variety of encoding schemes have been devised,
which can loosely be defined as mappings between the character sequences that
users manipulate and the sequences of bits that computers manipulate.
Given the complexity of text encoding and the large variety of
schemes for character encoding invented throughout the computer age, a more
formal description of the encoding process is useful. The process of defining a
text encoding can be described as follows (see
[UTR #17]
for a more
detailed description):
A set of characters to be encoded is identified. The
characters are pragmatically chosen to express text and to efficiently allow
various text processes in one or more target languages. They may not correspond
precisely to what users perceive as letters and other characters. The set of
characters is called a repertoire.
Each character in the repertoire is then associated with a
(mathematical, abstract) non-negative integer, the
code point
(also known as a
character number
or
code position
).
The result, a mapping from the repertoire to the set of non-negative integers,
is called a coded character set (CCS).
To enable use in computers, a suitable base datatype is
identified (such as a byte, a 16-bit unit of storage or other) and a
character encoding form (CEF)
is used, which encodes the abstract
integers of a
CCS
into sequences
of the
code units
of the base datatype. The encoding form can be
extremely simple (for instance, one which encodes the integers of the
CCS
into the natural
representation of integers of the chosen datatype of the computing platform) or
arbitrarily complex (a variable number of code units, where the value of each
unit is a non-trivial function of the encoded integer).
To enable transmission or storage using byte-oriented devices, a serialization scheme or character encoding scheme (CES) is next used. A CES is a mapping of the code units
CES
is a mapping of the code units
of a
CEF
into well-defined
sequences of bytes, taking into account the necessary specification of
byte-order for multi-byte base datatypes and including in some cases switching
schemes between the code units of multiple
CES
es (an example is ISO
2022). A
CES
, together
with the
CCS
es it is used with,
is identified by an
IANA
charset identifier.
Given a sequence of bytes representing text and a
charset
identifier,
one can in principle unambiguously recover the sequence of characters of the
text.
See 3.6.2 Character Encoding Identification for a discussion of the term 'charset'.
NOTE:
The term '
character encoding
' is somewhat ambiguous,
as it is sometimes used to describe the actual process of encoding characters
and sometimes to denote a particular way to perform that process (as in "this file is in the X character encoding"). Context normally allows the distinction of those uses, once one is aware of the ambiguity.
In very simple cases, the whole encoding process can be collapsed to
a single step, a trivial one-to-one mapping from characters to bytes; this is
the case, for instance, for US-ASCII
[MIME]
and ISO-8859-1.
Text data is said to be in a
Unicode encoding form
if it is encoded in UTF-8, UTF-16 or
UTF-32.
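The layering of CCS, CEF and CES described above can be observed directly. The following Python sketch (an illustration, not part of this specification) encodes the same abstract characters in the three Unicode encoding forms; `encode()` applies both the encoding form and a big-endian byte serialization:

```python
euro = "\u20AC"   # EURO SIGN, one code point

# One code point, different numbers of code units and bytes per form.
print(euro.encode("utf-8"))          # b'\xe2\x82\xac' (three 8-bit code units)
print(len(euro.encode("utf-16-be")))  # 2 bytes: one 16-bit code unit
print(len(euro.encode("utf-32-be")))  # 4 bytes: one 32-bit code unit

# A supplementary character needs two 16-bit code units (a surrogate pair).
print(len("\U000233B4".encode("utf-16-be")))  # 4 bytes: two code units
```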
3.3 Transcoding
Transcoding
is the process of converting text data from
one
Character Encoding Form
to another.
Transcoders work only at the level of character encoding and do not parse the
text; consequently, they do not deal with character escapes such as numeric
character references (see
3.7 Character Escaping
) and do not adjust
embedded character encoding information (for instance in an XML declaration or
in an HTML
meta
element).
NOTE:
Transcoding may involve one-to-one, many-to-one, one-to-many or
many-to-many mappings. In addition, the storage order of characters varies
between encodings: some, such as Unicode, prescribe logical ordering while
others use visual ordering; among encodings that have separate diacritics, some
prescribe that they be placed before the base character, some after. Because of
these differences in sequencing characters, transcoding may involve reordering:
thus XYZ may map to yxz.
A normalizing transcoder is a transcoder that converts from a legacy encoding to a Unicode encoding form and ensures that the result is in Unicode
Normalization Form C (see
4.2.1 Unicode-normalized Text
). For most
legacy encodings, it is possible to construct a normalizing transcoder; it is
not possible to do so if the encoding's repertoire contains characters not in
Unicode.
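A normalizing transcoder can be sketched in a few lines of Python (an illustration only; the function name is our own, and a production transcoder would also handle errors and unmappable bytes):

```python
import unicodedata

def normalizing_transcode(data: bytes, legacy_encoding: str) -> str:
    """Decode from a legacy encoding, then guarantee that the
    result is in Unicode Normalization Form C."""
    return unicodedata.normalize("NFC", data.decode(legacy_encoding))

# ISO-8859-1 byte 0xE9 is 'é'; the output is guaranteed to be in NFC.
print(normalizing_transcode(b"caf\xe9", "iso-8859-1"))   # café
```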
3.4 Strings
Various specifications use the notion of a '
string
',
sometimes without defining precisely what is meant and sometimes defining it
differently from other specifications. The reason for this variability is that
there are in fact multiple reasonable definitions for a string, depending on
one's intended use of the notion; the term '
string
' is used for
all these different notions because these are actually just different views of
the same reality: a piece of text stored inside a computer. This section
provides specific definitions for different notions of 'string', which may be reused elsewhere.
Byte string
: A string viewed as a
sequence of bytes representing characters in a particular encoding. This
corresponds to a
CES
. As a definition for a
string, this definition is most often useless, except when the textual nature
is unimportant and the string is considered only as a piece of opaque data with
a length in bytes.
[S]
Specifications in
general
SHOULD NOT
define a string as a '
byte
string
'.
Code unit string
: A string
viewed as a sequence of code units representing characters in a particular
encoding. This corresponds to a
CEF
. This
definition is useful in APIs that expose a physical representation of string
data. Example: For the DOM
[DOM Level 1]
, UTF-16 was chosen based on
widespread implementation practice.
Character string
: A string
viewed as a sequence of characters, each represented by a code point in Unicode
[Unicode]
. This is usually what programmers consider to be a
string, although it may not match exactly what most users perceive as
characters. This is the highest layer of abstraction that ensures
interoperability with very low implementation effort.
[S]
The '
character string
definition of a string is generally the most useful and
SHOULD
be used by most specifications, following the
examples of Production [2] of XML 1.0
[XML 1.0], the SGML declaration of HTML 4.0 [HTML 4.01], and the character model of RFC 2070 [RFC 2070].
EXAMPLE:
Consider the
string
comprising the characters U+233B4 (a Chinese character meaning 'stump of tree'), U+2260
NOT EQUAL TO
and U+0030
DIGIT ZERO
, encoded in
UTF-16 in big-endian byte order. The rows of the following table show the
string viewed as a character string, code unit string and byte string,
respectively:
Character string:   U+233B4      U+2260   U+0030
Code unit string:   D84C DFB4    2260     0030
Byte string:        D8 4C DF B4  22 60    00 30
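The three views in the table above can be reproduced programmatically. The following Python sketch is illustrative only and not part of this specification; it derives each view from the same string:

```python
# Illustrative reproduction of the example above: the same string viewed
# as a character string, a code unit string (UTF-16) and a byte string
# (UTF-16BE).
s = "\U000233B4\u2260\u0030"

# Character string: a sequence of Unicode code points.
char_string = ["U+%04X" % ord(c) for c in s]

# Code unit string: UTF-16 code units; U+233B4 requires a surrogate pair.
utf16be = s.encode("utf-16-be")
code_units = ["%04X" % int.from_bytes(utf16be[i:i + 2], "big")
              for i in range(0, len(utf16be), 2)]

# Byte string: the raw bytes of the UTF-16BE encoding.
byte_string = ["%02X" % b for b in utf16be]

assert char_string == ["U+233B4", "U+2260", "U+0030"]
assert code_units == ["D84C", "DFB4", "2260", "0030"]
assert byte_string == ["D8", "4C", "DF", "B4", "22", "60", "00", "30"]
```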
NOTE:
It is also possible to view a string as a sequence of
graphemes
. In this case the string is divided into text units that
correspond to the user's perception of where character boundaries occur in a
visually rendered text. However, there is no standard rule for the segmentation
of text in this way, and the segmentation will vary from language to language
and even from user to user. Examples of possible approaches can be found in
sections 5.12 and 5.15 of the Unicode Standard
[Unicode 3.0]
.
3.5 Reference Processing Model
Many Internet protocols and data formats, most notably the very
important Web formats HTML, CSS and XML, are based on text. In those formats,
everything is text but the relevant specifications impose a structure on the
text, giving meaning to certain constructs so as to obtain functionality in
addition to that provided by plain text. HTML and XML are
markup
languages
, defining entities entirely composed of text but with
conventions allowing the separation of this text into
markup
and
character data
. Citing from the XML 1.0 specification
[XML 1.0]
section
2.4
Text consists of intermingled character data and markup.
[...] All text that is not markup constitutes the
character data
of the document.
For the purposes of this section, the important aspect is that
everything is text, that is, a sequence of characters.
Since its early days, the Web has seen the development of a
Reference Processing Model
, first described for HTML in RFC 2070
[RFC 2070]
. This model was later embraced by XML and CSS. It is
applicable to any data format or protocol that is text-based as described
above. The essence of the Reference Processing Model is the use of Unicode as a
common reference. Use of the Reference Processing Model by a specification does
not, however, require that implementations actually use Unicode. The
requirement is only that the implementations behave as if the processing took
place as described by the Model.
A specification conforms to the Reference Processing Model if all of
the following apply:
[S]
Specifications
MUST
be defined in terms of Unicode characters, not bytes or
glyphs.
[S]
Specifications
SHOULD
allow the use of the full range of Unicode code
points from U+0000 to U+10FFFF inclusive; any exceptions
SHOULD
be listed and justified; code points above U+10FFFF
MUST NOT
be used.
[S]
Specifications
MAY
allow use of any character encoding which can be
transcoded to Unicode for its text entities.
[S]
Specifications
MAY
choose to disallow or deprecate some encodings and to
make others mandatory. Independent of the actual encoding, the specified
behavior
MUST
be the same
as if
the processing
happened as follows:
The encoding of any text entity received by the
application implementing the specification
MUST
be
determined and the text entity
MUST
be interpreted as a
sequence of Unicode characters - this
MUST
be equivalent to
transcoding the entity to some Unicode encoding form, adjusting any character
encoding label if necessary, and receiving it in that Unicode encoding
form.
All processing
MUST
take place on this
sequence of Unicode characters.
If text is output by the application, the sequence of
Unicode characters
MUST
be encoded using an encoding chosen
among those allowed by the specification.
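As an illustration only, the "as if" processing steps above can be sketched as follows. The function and parameter names here are assumptions made for this example, not part of any specification:

```python
# A minimal, non-normative sketch of the Reference Processing Model.
# The function name and parameters are illustrative assumptions.

def process_text_entity(data: bytes, declared_encoding: str,
                        output_encoding: str = "utf-8") -> bytes:
    # 1. Determine the encoding and interpret the entity as a sequence
    #    of Unicode characters (equivalent to transcoding it to a
    #    Unicode encoding form and receiving it in that form).
    text = data.decode(declared_encoding)

    # 2. All processing takes place on the sequence of Unicode
    #    characters; a trivial transformation stands in for it here.
    text = text.upper()

    # 3. Output is encoded using an encoding chosen among those the
    #    specification allows.
    return text.encode(output_encoding)
```

An implementation need not actually transcode internally; it only has to behave as if it performed these steps.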
[S]
If a specification is such
that multiple text entities are involved (such as an XML document referring to
external parsed entities), it
MAY
choose to allow these
entities to be in different character encodings. In all cases, the
Reference Processing Model
MUST
be applied to all entities.
[S]
All specifications that involve
text
MUST
specify processing according to the
Reference Processing
Model
.
NOTE:
All specifications that derive from the XML 1.0 specification
[XML 1.0]
automatically inherit this Reference Processing Model.
XML is entirely defined in terms of Unicode characters and mandates the UTF-8
and UTF-16 encodings while allowing any other encoding for parsed entities.
NOTE:
When specifications choose to allow encodings other than Unicode
encodings, implementers should be aware that the correspondence between the
characters of a legacy encoding and Unicode characters may in practice depend
on the software used for transcoding. See the Japanese XML Profile
[XML Japanese Profile]
for examples of such inconsistencies.
3.6 Choice and Identification of Character
Encodings
Because encoded text
cannot
be interpreted and
processed without knowing the encoding, it is vitally important that the
character encoding (see
3.2 Digital Encoding of Characters
) is known at all
times and places where text is exchanged or processed.
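A small, purely illustrative demonstration of why the encoding must be known: the same byte sequence decodes to different text under different character encodings.

```python
# Illustrative only: one byte sequence, two encodings, two different
# interpretations. Without knowing the encoding, the text cannot be
# recovered reliably.
data = b"caf\xe9"

assert data.decode("latin-1") == "caf\u00e9"   # 'café': 0xE9 is é in ISO 8859-1
assert data.decode("cp1251") == "caf\u0439"    # 'cafй': 0xE9 is й in Windows-1251
```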
In what follows we use '
character encoding
' to mean either CEF or CES depending on the context. When text transmitted as a byte stream is
involved, for instance in a protocol, specification of a CES is required
to ensure proper interpretation; in contexts such as an API, where the
environment (typically the processor architecture) specifies the byte
order of multibyte quantities, specification of a CEF suffices.
[S]
Specifications
MUST
either specify a unique encoding, or provide character encoding identification
mechanisms such that the encoding of text can always be reliably
identified.
[S]
When
designing a new protocol, format or API, specifications
SHOULD
mandate a unique character
encoding.
3.6.1 Mandating a unique character
encoding
Mandating a unique character encoding is simple, efficient, and
robust. There is no need for specifying, producing, transmitting, and
interpreting encoding tags. At the receiver, the encoding will always be
understood. There is also no ambiguity if data is transferred
non-electronically and later has to be converted back to a digital
representation. Even when there is a need for compatibility with existing data,
systems, protocols and applications, multiple encodings can often be dealt with
at the boundaries or outside a protocol, format, or API. The
DOM
[DOM Level 1]
is an
example of where this was done. The advantages of choosing a unique encoding
become more important the smaller the pieces of text used are and the closer to
actual processing the specification is.
[S]
When a unique encoding is
mandated, the encoding
MUST
be UTF-8, UTF-16 or
UTF-32.
[S]
If a unique
encoding is mandated and compatibility with US-ASCII is desired, UTF-8 (see
[RFC 2279]
) is
RECOMMENDED
. In
other situations, such as for APIs, UTF-16 or UTF-32 may be more appropriate.
Possible reasons for choosing one of these include efficiency of internal
processing and interoperability with other processes.
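The US-ASCII compatibility of UTF-8 mentioned above can be illustrated with a short, non-normative sketch: ASCII text is byte-for-byte identical in UTF-8, while UTF-16 and UTF-32 representations differ.

```python
# Illustrative: ASCII text is byte-for-byte identical under UTF-8,
# which is why UTF-8 is recommended when US-ASCII compatibility is
# desired; UTF-16 and UTF-32 representations differ.
ascii_text = "charset=utf-8"

assert ascii_text.encode("utf-8") == ascii_text.encode("ascii")
assert ascii_text.encode("utf-16-be") != ascii_text.encode("ascii")
assert len(ascii_text.encode("utf-32-be")) == 4 * len(ascii_text)
```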
NOTE:
The IETF Charset Policy
[RFC 2277]
specifies that
on the Internet "
Protocols MUST be able to use the UTF-8
charset
".
NOTE:
The XML 1.0 specification
[XML 1.0]
requires all
conforming XML processors to accept both UTF-16 and UTF-8.
3.6.2 Character Encoding
Identification
The MIME Internet specification
[MIME]
provides a
good example of a mechanism for character encoding identification. The MIME
charset
parameter definition is intended to supply sufficient
information to uniquely decode the sequence of bytes of the received data into
a sequence of characters. The values are drawn from the IANA charset registry
[IANA]
.
NOTE:
In practice there is wide variation among implementations, so
uniqueness cannot be depended upon. See the end of
3.5 Reference Processing Model
for more information.
NOTE:
The term
charset
derives from '
character
set
', an expression with a long and tortured history (see
[Connolly]
for a discussion).
[S]
Specifications
SHOULD
avoid using the terms '
character
set
' and '
charset
' to refer to a character
encoding, except when the latter is used to refer to the MIME
charset
parameter or its IANA-registered values. The terms '
character encoding
', '
character encoding form
' or '
character encoding scheme
' are
RECOMMENDED
.
NOTE:
In XML, the XML declaration or the text declaration contains a
pseudo-attribute called
encoding
which identifies the character
encoding using the IANA charset.
NOTE:
Unfortunately, some charset identifiers do not represent a single,
unique encoding scheme.
Instead, these identifiers denote a number of slight variations of an
encoding scheme.
Even though slight, the differences may be crucial and may vary over
time.
For these identifiers, recovery of the character sequence from a byte
sequence is ambiguous.
For example, the character encoded as 0x5C in the Shift-JIS encoding scheme is ambiguous.
The character sometimes represents a
YEN SIGN
and sometimes represents a
REVERSE SOLIDUS
. See
[XML Japanese Profile]
for more detail on this example and for
additional examples of such ambiguous charset identifiers.
The IANA charset registry is the official list of names and
aliases for character encodings on the Internet.
[S]
If the unique encoding
approach is not taken, specifications
SHOULD
mandate the use
of the IANA charset registry names, and in particular the names identified in
the registry as '
MIME preferred names
', to designate character
encodings in protocols, data formats and APIs.
[S]
The '
x-
' convention for
unregistered character encoding names
SHOULD NOT
be used,
having led to abuse in the past.
('
x-
' was used
for character encodings that were widely used, even long after there was an
official registration.)
[I]
[C]
Content and software
that label textual data
MUST
use one of the names mandated
by the appropriate specification (e.g. the XML specification when editing XML
text) and
SHOULD
use the MIME preferred name of an encoding
to label data in that encoding.
[I]
[C]
An IANA-registered
charset
name
MUST NOT
be used to label textual data
in an encoding other than the one identified in the IANA registration of that
name.
[S]
If the unique encoding
approach is not chosen, specifications
MUST
designate at
least one of the UTF-8 and UTF-16 encoding forms of Unicode as admissible
encodings and
SHOULD
choose at least one of UTF-8 or UTF-16
as mandated encoding forms (encoding forms that
MUST
be
supported by implementations of the specification).
[S]
Specifications
MAY
define either UTF-8 or UTF-16 as a default encoding form (or both if they
define suitable means of distinguishing them), but they
MUST
NOT
use any other character encoding as a default.
[S]
Specifications
MUST NOT
use heuristics to determine the encoding of data.
[I]
Receiving
software
MUST
determine the encoding of data from available
information according to appropriate specifications.
[I]
When an IANA-registered
charset
name is recognized, receiving software
MUST
interpret the
received data according to the encoding associated with the name in the IANA
registry.
[I]
When no charset
is provided receiving software
MUST
adhere to the default
encoding(s) specified in the specification.
[I]
Receiving software
MAY
recognize as many encodings (names and aliases) as
appropriate.
A field-upgradeable mechanism may be appropriate
for this purpose. Certain encodings are more or less associated with certain
languages (e.g. Shift-JIS with Japanese); trying to support a given language or
set of customers may mean that certain encodings have to be supported. The
encodings that need to be supported may change over time. This document does
not give any advice on which encoding may be appropriate or necessary for the
support of any given language.
[I]
Software
MUST
completely implement the mechanisms for character
encoding identification and
SHOULD
implement them in such a
way that they are easy to use (for instance in HTTP servers).
[I]
On interfaces to other protocols, software
SHOULD
support conversion between Unicode encoding forms as
well as any other necessary conversions.
[C]
Content
MUST
make use of available facilities for character encoding
identification by always indicating character encoding; where the facilities
offered for character encoding identification include defaults (e.g. in XML 1.0
[XML 1.0]
), relying on such defaults is sufficient to satisfy this
identification requirement.
Because of the layered Web architecture (e.g. formats used over
protocols), there may be multiple and at times conflicting information about
character encoding.
[S]
Specifications
MUST
define conflict-resolution mechanisms (e.g. priorities)
for cases where there is multiple or conflicting information about character
encoding.
[I]
[C]
Software and content
MUST
carefully follow conflict-resolution mechanisms where
there is multiple or conflicting information about character
encoding.
3.6.3 Private Use Code Points
Unicode designates certain ranges of code points for private use:
the Private Use Area (U+E000-F8FF) and planes 15 and 16 (U+F0000-FFFFD and
U+100000-10FFFD). These code points are guaranteed to never be allocated to
standard characters, and are available for use by private agreement between a
producer and a recipient. However, their use is strongly discouraged, since
private agreements do not scale on the Web. Code points from different private
agreements may collide, and a private agreement and therefore the meaning of
the code points can quickly get lost.
[S]
Specifications
MUST
NOT
define any assignments of private use code
points.
[S]
Conformance to a
specification
MUST NOT
require the use of private use area
characters.
[S]
Specifications
SHOULD
NOT
provide mechanisms for agreement on private use code points
between parties and
MUST NOT
require the use of such
mechanisms.
[S]
[I]
Specifications and
implementations
SHOULD
be designed in such a way as to not
disallow the use of private use code points by private
arrangement.
As an example, XML does not disallow the use of
private use code points.
[S]
Specifications
MAY
define markup to allow the transmission of symbols not
in Unicode or to identify specific variants of Unicode
characters.
EXAMPLE:
MathML (see
[MathML2]
section
3.2.9
) defines an element
mglyph
for mathematical symbols
not in Unicode.
EXAMPLE:
SVG (see
[SVG]
section
10.14
) defines an element
altglyph
which allows the
identification of specific display variants of Unicode characters.
3.7 Character Escaping
In text-based protocols or formats where characters can be either
part of character data or of markup (see
3.5 Reference Processing Model
), it
is often the case that certain characters are designated as having certain
specific protocol/format functions in certain contexts (e.g. '<' and '&'
serve as markup delimiters in HTML
and XML). These syntax-significant characters cannot be used to represent
themselves in text data in the same way as all other characters do. Also, often
formats are represented in an encoding that does not allow all characters
to be represented directly.
To express syntax-significant or unrepresentable characters, a
technique called
escaping
is used. This works by creating an
additional syntactic construct, defining additional characters or defining
character sequences that have special meaning. Escaping a character means
expressing it using such a construct, appropriate to the format or protocol in
which the character appears;
expanding an escape
(or
unescaping
) means replacing it with the character that it
represents.
Certain guidelines apply to the way specifications define character
escapes.
[S]
The guidelines in this document
relating to the
definition of character escapes
MUST
be followed when designing new W3C protocols and
formats and
SHOULD
be followed as much as possible when
revising existing protocols and formats.
[S]
Specifications
MUST NOT
invent a new escaping mechanism if an appropriate
one already exists.
[S]
The number of different
ways to escape a character
SHOULD
be minimized (ideally to
one).
[A well-known counter-example is that for historical
reasons, both HTML and XML have redundant decimal (&#ddddd;) and
hexadecimal (&#xhhhh;) escapes.]
[S]
Explicit end delimiters
MUST
be provided. Escapes such as \uABCD where the end
delimiter is a space or any character other than [01-9A-F]
SHOULD
be avoided.
These escapes are not
clear visually, and can cause an editor to insert spurious line-breaks when
word-wrapping on spaces. Forms like SPREAD's &UABCD;
[SPREAD]
or XML's &#xhhhh;, where the escape is explicitly terminated by a
semicolon, are much better.
[S]
Whenever specifications
define escapes that allow the representation of characters using a number, the
number
SHOULD
be in hexadecimal
notation.
[S]
Escaped characters
SHOULD
be acceptable wherever unescaped characters are; this
does not preclude that a syntax-significant character, when escaped, loses its
significance in the syntax. In particular, escaped characters
SHOULD
be acceptable in identifiers and
comments.
Certain guidelines apply to content developers, as well as to
software that generates content:
[I]
[C]
Escapes
SHOULD
be avoided when the characters to be expressed are
representable in the character encoding of the document.
[I]
[C]
Since
character set standards usually list character numbers as hexadecimal, content
SHOULD
use the hexadecimal form of escapes when there is
one.
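As an informal illustration of the explicitly ';'-terminated, numeric escape style discussed above, the following sketch expands XML/HTML-style numeric character references. It is not part of any specification, and the function name is an assumption for this example:

```python
import re

# A non-normative sketch of expanding XML/HTML-style numeric character
# references. Both forms end with an explicit ';', so the end of each
# escape is unambiguous (unlike \uABCD-style escapes).
def expand_charrefs(text):
    def repl(match):
        hex_digits, dec_digits = match.group(1), match.group(2)
        return chr(int(hex_digits, 16)) if hex_digits else chr(int(dec_digits))
    # Hexadecimal (&#xE7;) and decimal (&#231;) numeric character references.
    return re.sub(r"&#x([0-9A-Fa-f]+);|&#([0-9]+);", repl, text)
```

For example, both `expand_charrefs("su&#xE7;on")` and `expand_charrefs("su&#231;on")` yield 'suçon'.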
4 Early Uniform Normalization
This chapter discusses character data normalization for the Web.
4.1 Motivation
discusses the need for
normalization, and in particular early uniform normalization.
4.2 Definitions for W3C Text
Normalization
defines the various types of normalization and gives
examples.
4.3 Responsibility for
Normalization
assigns responsibilities
to various components and situations.
4.1 Motivation
4.1.1 Why do we need character normalization?
Text in computers can be encoded in one of many encodings. In addition, some encodings allow multiple representations for the '
same
' string, and Web languages have escape mechanisms that introduce even more equivalent representations. For instance, in ISO 8859-1 the letter 'ç' can only be represented as the single character E7; in a Unicode encoding it can be represented as the single character U+00E7 'ç' or as the sequence U+0063 'c' U+0327 (
COMBINING CEDILLA
), and in HTML it could additionally be represented as &#231; or &#xE7; or &ccedil;.
There are a number of fundamental operations that are sensitive to these multiple representations: string matching, indexing, searching, sorting, regular
expression matching, selection, etc. In particular, the proper functioning of the Web (and of much other software) depends to a large extent on
string matching. Examples of string
matching abound: parsing element and attribute names in Web documents, matching CSS selectors to
the nodes in a document, matching font names in a style sheet to the names known to the operating system, matching
URI pieces to the resources in a server, matching strings embedded in an ECMAscript program to strings typed in by
a Web form user, matching the parts of an XPath expression (element names, attribute names and values, content, etc.)
to what is found in an instance, etc.
String
matching is usually taken for granted and performed by comparing two strings byte for byte, but the
existence on the Web of multiple character representations means that it is actually non-trivial. Binary comparison
does not work
if the
strings are not in the same encoding (e.g. an EBCDIC style sheet being directly applied to an ASCII document, or a font
specification in a Shift-JIS style sheet directly used on a system that maintains font names in UTF-16) or if
they are in the same encoding but show variations allowed for the '
same
' string by the use of combining characters or by the constructs of the Web language.
Incorrect string matching can have far reaching
consequences, including the creation of security holes. Consider a contract,
encoded in XML, for buying goods: each item sold is described in an
artículo
element; unfortunately, "
artículo
" is subject to
different representations in the character encoding of the contract. Suppose
that the contract is viewed and signed by means of a user agent that looks for
artículo
elements, extracts them (matching on the element name),
presents them to the user and adds up their prices. If different instances of
the
artículo
element happen to be represented differently in a
particular contract, then the buyer and seller may see (and sign) different
contracts if their respective user agents perform string identity matching
differently, which is fairly likely in the absence of a well-defined
specification for string matching. The absence of a well-defined specification would also mean that
there would be no way to resolve the ensuing contractual dispute.
Solving the string matching problem involves normalization, which in a nutshell means bringing the two strings
to be compared to a common, canonical encoding prior to performing binary matching. (For additional steps involved in string matching see
6 String Identity Matching
.)
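A brief, non-normative illustration of the problem and the normalization step that solves it: the two Unicode representations of 'suçon' described above are binary-distinct, so naive comparison fails, but bringing both to a canonical form (here NFC) restores the expected match.

```python
import unicodedata

# Illustrative: binary comparison fails on canonically equivalent
# strings; normalizing both sides (here to NFC) before comparing
# restores the expected match.
precomposed = "su\u00e7on"    # 'suçon' with U+00E7
decomposed = "suc\u0327on"    # 'suçon' with U+0063 + U+0327 COMBINING CEDILLA

assert precomposed != decomposed
assert (unicodedata.normalize("NFC", precomposed)
        == unicodedata.normalize("NFC", decomposed))
```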
4.1.2 The choice of early uniform normalization
There are options in the exact way normalization can be used to achieve
correct behaviour of normalization-sensitive operations such as string
matching. These options lie along two axes:
The first axis is
a choice of
when
normalization occurs: early (when strings are created) or late (when strings are
compared). The former amounts to establishing a canonical encoding for all data that is transmitted or stored, so
that it doesn't need any normalization later, before being used. The latter is the equivalent of mandating
smart
' compare functions, which will take care of any encoding differences.
This document specifies
early
normalization. The reasons for that choice are manifold:
Almost all legacy data as well as data created by current
software is normalized (using NFC).
The number of Web components that generate or transform text
is considerably smaller than the number of components that receive text and
need to perform matching or other processes requiring normalized text.
Current receiving components (browsers, XML parsers, etc.)
implicitly assume early normalization by not performing or verifying normalization
themselves. This is a vast legacy.
Web components that generate and process text are in a much
better position to do normalization than other components; in particular, they
may be aware that they deal with a restricted repertoire only, which simplifies the process of normalization.
Not all components of the Web that implement functions such as
string matching can reasonably be expected to do normalization. This, in
particular, applies to very small components and components in the lower layers
of the architecture.
Forward-compatibility issues can be dealt with more easily:
less software needs to be updated, namely only the software that generates
newly introduced characters.
It improves matching in cases where the character encoding is
partly undefined, such as URIs
[RFC 2396]
in which non-ASCII bytes
have no defined meaning.
It is a prerequisite for comparison of encrypted strings (see
[CharReq]
section
2.7
).
The second axis is a choice of canonical encoding. This choice needs only be made if early normalization
is chosen. With late normalization, the canonical encoding would be an internal matter of the smart compare function,
which doesn't need any wide agreement or standardization.
4.2 Definitions for W3C Text
Normalization
The Unicode Consortium provides four standard normalization forms
(see
Unicode Normalization Forms
[UTR #15]
).
For use on the Web, this document defines Web-related text normalization forms by picking the
most appropriate of these, Unicode Normalization Form C (NFC), and additionally addressing the issues of
legacy encodings, character escapes, includes, and character and markup boundaries.
4.2.1 Unicode-normalized Text
Text data is, for the purposes of this specification,
Unicode-normalized
if it is in a
Unicode encoding form
and
is in Unicode Normalization Form C (according to version 3.1.0
of
[UTR #15]
).
Roughly speaking, NFC is defined such that each combining character
sequence (a base character followed by one or more combining characters) is
replaced, as far as possible, by a canonically equivalent precomposed character.
Text in a Unicode encoding form is said to be in NFC if it doesn't contain any
combining sequence that could be replaced and if any remaining combining
sequence is in canonical order.
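The rough description above can be illustrated (non-normatively) with two small cases: composition of a combining sequence into a precomposed character, and canonical ordering of remaining combining characters.

```python
import unicodedata

# Illustrative: NFC replaces a combining sequence by its canonically
# equivalent precomposed character where one exists...
assert unicodedata.normalize("NFC", "e\u0301") == "\u00e9"   # e + COMBINING ACUTE -> é

# ...and puts any remaining combining characters into canonical order,
# so differently ordered but canonically equivalent sequences converge.
assert (unicodedata.normalize("NFC", "c\u0301\u0327")
        == unicodedata.normalize("NFC", "c\u0327\u0301"))
```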
4.2.2 Include-normalized Text
Markup languages, style languages and programming languages often offer
facilities for including a piece of text inside another. An
include
is an instance of a syntactic device specified in a
language to include an
entity
at the position of the include,
replacing the include itself. Examples of includes are entity references in
XML and @import rules in CSS.
Character escapes are a special case of includes where the included entity
is a single character.
Text data is
include-normalized
if:
the data is
Unicode-normalized
and
does not contain any character escapes or
includes whose expansion would cause the data to become no longer
Unicode-normalized; or
the data is in a legacy encoding
and
, if it were transcoded to a Unicode encoding form by a
normalizing transcoder
, the
resulting data would satisfy clause 1 above.
NOTE:
A consequence of this definition is that legacy text (i.e. text in a legacy encoding) is always include-normalized unless i) a normalizing transcoder cannot exist for that encoding (e.g. because the repertoire contains characters not in Unicode) or ii) the text contains escapes or includes which, once expanded, result in un-normalized text.
NOTE:
Include-normalization is specified against the context of a (computer) language (or the absence thereof), which specifies the form of escapes and includes. For plain text (no escapes or includes) in a Unicode encoding form, include-normalization and Unicode-normalization are equivalent.
4.2.3 Fully Normalized Text
During the normal processing of an
include-normalized entity, various pieces of the
data may be moved, removed (e.g. removing comments) or
merged (e.g. merging consecutive runs of character data at an entity boundary
or the
string()
function of XPath), creating opportunities for
text to become denormalized. One way to avoid such denormalization is to make
sure that the various pieces never begin with a
composing character
, defined
here as any character which can combine with a previous character in
NFC.
NOTE:
Conceptually, composing characters are the same as combining characters
as defined by Unicode (characters of non-zero combining class in the Unicode
Character Database). However, Unicode includes a few class-zero characters that
do compose
with a previous character in NFC. Therefore, composing
characters as defined here include all combining characters plus the following (as of Unicode 3.1):
U+09BE bengali vowel sign aa
U+09D7 bengali au length mark
U+0B3E oriya vowel sign aa
U+0B56 oriya ai length mark
U+0B57 oriya au length mark
U+0BBE tamil vowel sign aa
U+0BD7 tamil au length mark
U+0CC2 kannada vowel sign uu
U+0CD5 kannada length mark
U+0CD6 kannada ai length mark
U+0D3E malayalam vowel sign aa
U+0D57 malayalam au length mark
U+0DCF sinhala vowel sign aela-pilla
U+0DDF sinhala vowel sign gayanukitta
U+0FB5 tibetan subjoined letter ssa
U+0FB7 tibetan subjoined letter ha
U+102E myanmar vowel sign ii
Formal languages define
constructs
which are identifiable pieces occurring in instances of the language such as
comments, element tags, processing instructions, runs of character data, etc.
Which of those constructs need to be constrained not to begin with a composing
character is language-dependent and depends on what processing the language
undergoes.
Text data is
fully normalized
if it is include-normalized and none of the constructs comprising the text begins with a composing character.
In the remainder of this specification,
normalized
is used to mean '
fully normalized
', unless
otherwise indicated.
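The composing-character test defined above can be sketched as follows; this is illustrative only, the exception set is the list given earlier (as of Unicode 3.1), and the function name is an assumption for this example.

```python
import unicodedata

# A sketch of the 'composing character' test: all Unicode combining
# characters (non-zero combining class), plus the class-zero characters
# listed above that nevertheless compose with a preceding character in
# NFC (as of Unicode 3.1).
CLASS_ZERO_COMPOSERS = {
    "\u09BE", "\u09D7", "\u0B3E", "\u0B56", "\u0B57", "\u0BBE", "\u0BD7",
    "\u0CC2", "\u0CD5", "\u0CD6", "\u0D3E", "\u0D57", "\u0DCF", "\u0DDF",
    "\u0FB5", "\u0FB7", "\u102E",
}

def begins_with_composing(construct):
    if not construct:
        return False
    first = construct[0]
    return unicodedata.combining(first) != 0 or first in CLASS_ZERO_COMPOSERS
```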
NOTE:
Full normalization is specified against the
context of a (computer) language (or the absence thereof), which specifies
the form of escapes and includes and the separation into
constructs. For plain text (no includes,
no separation), full normalization
and Unicode-normalization are equivalent.
As specified in
4.3 Responsibility for
Normalization
, it is the responsibility of the
specification for a language to specify exactly what constitutes a construct
for the purposes of the definition of full normalization. In general this will
be done by specifying important boundaries, the constructs being then defined
as the spans of text between the boundaries. At a minimum, for those languages
which have these notions, the important boundaries are entity (include)
boundaries as well as the boundaries between markup and character data. Many
languages will benefit from defining more boundaries and therefore
finer-grained full normalization constructs.
NOTE:
Full
normalization is closed under concatenation: the concatenation of two fully
normalized strings is also fully normalized. As a result, a side benefit of
including entity boundaries in the set of boundaries important for full
normalization is that the state of normalization of a document that includes
entities can be assessed
without
expanding the includes, if the
included entities are known to be fully normalized. If all the entities are
known to be include-normalized
and
not to start with a composing
character, then it can be concluded that including the entities would not
denormalize the document.
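NFC alone is not closed under concatenation, which is exactly why full normalization adds the leading-composing-character constraint. A small, non-normative illustration:

```python
import unicodedata

# Illustrative: NFC alone is not closed under concatenation. Each piece
# below is NFC-normalized on its own, yet the concatenation is not,
# because U+0338 composes with the preceding '>'. Full normalization
# rules this out by forbidding a leading composing character.
left, right = ">", "\u0338"

assert unicodedata.is_normalized("NFC", left)
assert unicodedata.is_normalized("NFC", right)
assert not unicodedata.is_normalized("NFC", left + right)
assert unicodedata.normalize("NFC", left + right) == "\u226f"   # NOT GREATER-THAN
```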
4.2.4 Examples
The string '
suçon
', expressed as the sequence of five characters U+0073
U+0075 U+00E7 U+006F U+006E and encoded in a Unicode encoding form, is
Unicode-normalized, include-normalized and fully
normalized. The same string encoded in a legacy encoding for which there exists
a normalizing transcoder would be both include-normalized and fully
normalized but not Unicode-normalized (since not in a Unicode encoding
form).
In an XML or HTML context, the string '
su&#xE7;on
' is also include-normalized, fully normalized and, if
encoded in a Unicode encoding form, Unicode-normalized. Expanding &#xE7;
yields '
suçon
' as above, which contains no replaceable combining
sequence.
The string '
suçon
', expressed as the sequence of
six
characters U+0073 U+0075
U+0063 U+0327
U+006F
U+006E (U+0327 is the
COMBINING CEDILLA
) and encoded in a
Unicode encoding form, is neither Unicode-normalized (since the combining
sequence U+0063 U+0327 is replaceable by the precomposed U+00E7
'ç') nor include-normalized (since in
a Unicode encoding form but not Unicode-normalized) nor fully normalized
(since not include-normalized).
In an XML or HTML context, the
string '
suc&#x327;on
' is not include-normalized, regardless of encoding form, because expanding &#x327;
yields the sequence '
suçon
' which is not Unicode-normalized
('
ç
' is replaceable by '
ç
'). Unicode-normalization,
however, is defined only for plain text, doesn't know that &#x327;
represents a character in XML or HTML and considers it just a sequence of
characters. Therefore, the string '
suc&#x327;on
' in a Unicode
encoding form
is
Unicode-normalized since it contains no
replaceable combining sequence. (The latter example does not imply that
Unicode-normalization is sufficient to meet the normalization requirements of
the Web; it just illustrates a case where Unicode-normalization and
include-normalization differ.)
In an XML
or HTML context, the strings '
̧on
' and
'
&#x327;on
' are not fully normalized, as they begin with a
composing character (after expansion of the escape for the second). However,
both are Unicode-normalized (if expressed in a Unicode encoding) and
include-normalized.
The string '
≯foobar
', where the '
̸
' immediately after the '
>
' stands for the character U+0338
COMBINING LONG
SOLIDUS OVERLAY
, is neither Unicode-normalized nor
include-normalized, since the U+0338 '
̸
' combines with the '
>
' (yielding U+226F
NOT GREATER-THAN
).
NOTE:
This is a special case because it potentially corrupts the markup; a more typical example may also be needed.
From this example, it follows that it is impossible to produce a
normalized XML or HTML document containing the character U+0338
COMBINING LONG SOLIDUS OVERLAY
immediately following an element
tag, comment, CDATA section or processing instruction. It is noteworthy that
U+0338
COMBINING LONG SOLIDUS OVERLAY
also combines with
'<', yielding U+226E
NOT LESS-THAN
. Consequently, U+0338
COMBINING LONG SOLIDUS OVERLAY
should
remain excluded from the initial character of XML identifiers.
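The composition behavior of U+0338 described above can be verified directly; the following Python sketch (illustrative only) shows that NFC composes it with both '>' and '<':

```python
import unicodedata

# '>' (U+003E) followed by U+0338 COMBINING LONG SOLIDUS OVERLAY
# composes under NFC into U+226F NOT GREATER-THAN:
assert unicodedata.normalize("NFC", ">\u0338") == "\u226F"

# Likewise '<' (U+003C) followed by U+0338 composes into
# U+226E NOT LESS-THAN:
assert unicodedata.normalize("NFC", "<\u0338") == "\u226E"
```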
[add example: may produce unnormalized after concatenation: non-starters]
[add example: Separable part of markup: identifier (element name, attribute name, attribute value, PI target, content of comment,..., start tag start character,...)]
[add example about an entity for a character without precomposed form]
[add example about separating out an accent for different styling: solution: SVG]
[add example about character properties: point to Mark's conversion format and to character collections NOTE]
[add example about display: use a (NB)space before]
[add example showing markup combining with following combining character (whether normalized out (combining slash) or not (just display problem)]
4.3 Responsibility for
Normalization
This section defines the responsibility for normalization, based on the goal of early uniform normalization.
Unless otherwise specified, the word '
normalization
' in this section may refer to '
include-normalization
' or '
full normalization
', depending on which is most appropriate for the specification or implementation under consideration.
An operation is
normalization-sensitive
if its output(s) are different depending on the state of normalization of the input(s); if the output(s) are textual, they are deemed different only if they would remain different were they to be normalized. These operations are any that involve comparison of characters or character counting, as well as some other operations such as ‘delete first character’ or ‘delete last character’.
A
text-processing component
is a component that recognizes data as text. This specification does not specify the boundaries of a text-processing component, which may be as small as one line of code or as large as a complete application. A text-processing component may receive text, produce text, or both.
Certified text
is text which satisfies at least one of the following conditions:
the text has been successfully validated for normalization, or
the source text-processing component is identified and is known to produce only normalized text.
Suspect text
is text which is not certified.
[C]
All text content on the Web
MUST
be in include-normalized form and
SHOULD
be in fully normalized form.
[S]
Specifications of text-based formats and protocols
MUST
, as part of their syntax definition, require that the text be in normalized form.
[S]
[I]
A text-processing component that receives suspect text
MUST NOT
perform any normalization-sensitive operations unless it has first successfully validated the text for normalization, and
MUST NOT
normalize the suspect text. Private agreements
MAY
, however, be created within private systems which are not subject to these rules, but any externally observable results
MUST
be the same as if the rules had been obeyed.
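The rule above can be sketched in Python; the function name and the certified flag are hypothetical, chosen only for illustration:

```python
import unicodedata

def accept_text(text: str, certified: bool) -> str:
    """Hypothetical receiving component: before performing any
    normalization-sensitive operation, suspect text must be validated
    for normalization; it must NOT be normalized by the recipient."""
    if certified:
        return text
    if unicodedata.is_normalized("NFC", text):   # validation succeeded
        return text
    # Validation failed: the recipient may not normalize suspect text,
    # so the only conforming options are rejection or error handling.
    raise ValueError("suspect text is not normalized")
```

A component following this sketch would, for example, accept the precomposed 'suçon' from an unknown source but reject its decomposed spelling.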
[I]
A text-processing component which modifies text and performs normalization-sensitive operations
MUST
behave
as if
normalization took place after each modification, so that any subsequent normalization-sensitive operations always behave
as if
they were dealing with normalized data.
EXAMPLE:
If the 'z' is deleted from the (normalized) string 'cz¸' (where '¸' represents a combining cedilla, U+0327), normalization is necessary to turn the denormalized result 'c¸' into the properly normalized 'ç'. Analogous cases exist for insertion and concatenation. If the software that deletes the 'z' later uses the string in a normalization-sensitive operation, it needs to normalize the string before this operation to ensure correctness; otherwise, normalization may be deferred until the data is exposed.
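The deletion case can be traced in Python: 'z' followed by U+0327 has no precomposed form, so the original string is normalized, and deleting the 'z' denormalizes it (an illustrative sketch):

```python
import unicodedata

s = "cz\u0327"                            # 'c', 'z', combining cedilla
assert unicodedata.is_normalized("NFC", s)

t = s[:1] + s[2:]                         # delete the 'z'
assert not unicodedata.is_normalized("NFC", t)   # now denormalized

# Normalizing after the modification yields the precomposed 'ç':
assert unicodedata.normalize("NFC", t) == "\u00E7"
```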
[S]
Specifications of text-based languages and protocols
SHOULD
define precisely the construct boundaries necessary to obtain a complete definition of full normalization. These definitions
MUST
include at least the boundaries between markup and character data as well as entity boundaries (if the language has any include mechanism) and
SHOULD
include any other boundary that may create denormalization when instances of the language are processed.
[S]
Specifications
MUST
document any security issues related to normalization.
[I]
Implementations which transcode text data from a legacy encoding to a Unicode encoding form
MUST
use a normalizing transcoder.
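A normalizing transcoder can be sketched as a decode step followed by normalization; the function below is illustrative, with ISO 8859-1 used only as an example legacy encoding:

```python
import unicodedata

def normalizing_transcode(data: bytes, legacy_encoding: str) -> str:
    """Decode from a legacy encoding, then ensure the result is in
    Unicode Normalization Form C, so the transcoder never emits
    unnormalized text even if the decoder maps to decomposed forms."""
    return unicodedata.normalize("NFC", data.decode(legacy_encoding))

# 0xE7 in ISO 8859-1 is 'ç'; the transcoded result is normalized:
assert normalizing_transcode(b"\xe7", "iso-8859-1") == "\u00E7"
```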
[S]
Specifications of API components (functions/methods) that perform operations that may produce unnormalized text output from normalized text input
MUST
define whether normalization is the responsibility of the caller or the callee. Specifications
MAY
make performing normalization optional for some API components; in this case the default
SHOULD
be that normalization is performed, and an explicit option
SHOULD
be used to switch normalization off. Specifications
MUST NOT
make the implementation of normalization optional.
[S]
Specifications that define a mechanism (for example an API or a defining language) for producing a document SHOULD require that the final output of this mechanism be normalized.
Examples: DOM (load and) save, XSLT.
NOTE:
As an optimization, it is perfectly acceptable for a
system
to define the producer to be the actual producer (e.g. a
small device) together with a remote component (e.g. a server serving as a kind
of proxy) to which normalization is delegated. In such a case, the
communications channel between the device and proxy server is considered to be
internal
to the system, not part of the Web. Only data normalized
by the proxy server is to be exposed to the Web at large, as shown in the
illustration below:
Illustration of a text producer defined as including a
proxy.
5 Compatibility and Formatting
Characters
This specification does not address the suitability of particular
characters for use in markup languages, in particular formatting characters and
compatibility equivalents. For detailed recommendations about the use of
compatibility and formatting characters, see
Unicode in XML and other
Markup Languages
[UXML]
[S]
Specifications
SHOULD
exclude compatibility characters in the syntactic
elements (markup, delimiters, identifiers) of the formats they
define.
6 String Identity Matching
One important operation that depends on early normalization is
string identity matching
[CharReq]
, which is a
subset of the more general problem of string matching. There are various
degrees of specificity for string matching, from approximate matching such as
regular expressions or phonetic matching, to more specific matches such as
case-insensitive or accent-insensitive matching and finally to identity
matching. In the Web environment, where multiple encodings are used to
represent strings, including some encodings which allow multiple
representations for the same thing,
identity
is defined to occur
if and only if the compared strings contain no user-identifiable distinctions.
This definition is such that strings do not match when they differ in case or
accentuation, but do match when they differ only in non-semantically
significant ways such as encoding, use of escapes (of potentially different
kinds), or use of precomposed vs. decomposed character sequences.
To avoid unnecessary conversions and, more importantly,
to ensure predictability and correctness, it is necessary for all components of
the Web to use the same identity testing mechanism. Conformance to the rule
that follows meets this requirement and supports the above definition of
identity.
[S]
[I]
String
identity matching
MUST
be performed as if the following
steps were followed:
1. Early uniform normalization to fully normalized form, as defined
in
4.2.3 Fully Normalized Text
. In accordance with section
4 Early Uniform Normalization
, this step
MUST
be
performed by the
producers
of the strings to be compared.
2. Conversion to a common encoding of UCS, if necessary.
3. Expansion of all escapes.
4. Testing for bit-by-bit identity.
Step 1 ensures 1) that the identity matching process can produce
correct results using the next three steps and 2) that a minimum of effort is
spent on solving the problem.
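The steps can be sketched in Python for an XML/HTML context, where numeric character references count as escapes. For simplicity the sketch normalizes after expanding escapes, which is equivalent for comparison purposes; the function is illustrative, not a normative procedure:

```python
import html
import unicodedata

def identity_match(a: str, b: str) -> bool:
    def prepare(s: str) -> str:
        s = html.unescape(s)                    # step 3: expand escapes
        return unicodedata.normalize("NFC", s)  # normalization (step 1)
    # Step 2 (common encoding) is implicit in Python's string type;
    # step 4 is the final bit-by-bit comparison:
    return prepare(a) == prepare(b)

# 'suçon' matches both its decomposed and its escaped spellings:
assert identity_match("su\u00E7on", "suc\u0327on")
assert identity_match("su\u00E7on", "su&#xE7;on")
```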
[S]
[I]
Forms of
string matching other than identity
SHOULD
be based on the
steps
specified in this document for
string identity matching.
Taking into account normalization
and escapes is necessary so that, for example, a case-insensitive match of
'suçon' against 'suc&#xE7;on' or against 'SUC¸ON' returns
TRUE.
NOTE:
The expansion of escapes (step 3 above) is dependent on context,
i.e. on which markup or programming language is considered to apply when the
string matching operation is performed. Consider a search for the string
'suçon' in an XML document containing 'suc&#xE7;on' but not 'suçon'.
If the search is performed in a plain text editor, the context is
plain text
(no markup or programming language applies), the
&#xE7; escape is not recognized, hence not expanded, and the search fails.
If the search is performed in an XML browser, the context is
XML
, the escape (defined by XML) is expanded, and the search succeeds.
An intermediate case would be an XML editor that
purposefully
provides a view of an XML document with entity
references left unexpanded. In that case, a search over that pseudo-XML view
will deliberately
not
expand entities: in that particular context,
entity references are not considered escapes and need not be expanded.
7 String Indexing
There are many situations where a software process needs to access a
substring or to point within a string and does so by the use of
indices
, i.e. numeric "
positions
" within a string.
Where such indices are exchanged between components of the Web, there is a need
for an agreed-upon definition of string indexing in order to ensure consistent
behavior. The requirements for string indexing are discussed in
Requirements for String Identity Matching
[CharReq]
section 4
. The two
main questions that arise are: "
What is the unit of counting?
" and
"Do we start counting at 0 or 1?
".
Depending on the particular requirements of a process, the unit of
counting may correspond to any of the definitions of a string provided in
section
3.4 Strings
. In particular:
[S]
[I]
The
character string
is
RECOMMENDED
as a basis for string indexing.
(Example: the XML Path Language
[XPath]
).
[S]
[I]
The
code unit string
MAY
be used as a basis for string indexing if this results
in a significant improvement in the efficiency of internal operations when
compared to the use of character string.
(Example: the use of
UTF-16 in
[DOM Level 1]
).
Counting
graphemes
will
become a good option where user interaction is the primary concern, once a
suitable definition is widely accepted.
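The difference between the counting units can be seen with a character outside the Basic Multilingual Plane (a Python sketch; U+10400 is chosen arbitrarily):

```python
s = "a\U00010400b"   # 'a', U+10400 DESERET CAPITAL LETTER LONG I, 'b'

assert len(s) == 3                            # 3 characters (code points)
assert len(s.encode("utf-16-be")) // 2 == 4   # 4 UTF-16 code units
assert len(s.encode("utf-8")) == 6            # 6 UTF-8 code units (bytes)
```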
It is noteworthy that there exist other, non-numeric ways of
identifying substrings which have favorable properties. For instance,
substrings based on string matching are quite robust against small edits;
substrings based on document structure (in structured formats such as XML) are
even more robust against edits and even against translation of a document from
one human language to another.
[S]
Specifications that need a way to identify
substrings or point within a string
SHOULD
provide ways
other than string indexing to perform this operation.
[I]
[C]
Users of
specifications (software developers, content developers)
SHOULD
whenever possible prefer ways other than string
indexing to identify substrings or point within a string.
Experience shows that more general, flexible and robust specifications
result when individual characters are understood and processed as substrings,
identified by a position before and a position after the substring.
Understanding indices as boundary positions
between
the counting
units also makes it easier to relate the indices resulting from the different
string definitions.
[S]
Specifications
SHOULD
understand and process single characters as
substrings, and treat indices as boundary positions
between
counting units, regardless of the choice of counting
units.
[S]
Specifications of APIs
SHOULD NOT
specify single character or single encoding-unit
arguments.
EXAMPLE:
uppercase('ß')
cannot return the proper result (the two-character string
'SS') if the return type of the
uppercase
function is defined to be a single character.
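In a language where string-valued returns are the norm, the case-mapping behaves as the example requires (Python shown for illustration):

```python
# Uppercasing U+00DF LATIN SMALL LETTER SHARP S yields the
# two-character string 'SS', which no single-character return
# type could hold:
assert "\u00DF".upper() == "SS"
assert len("\u00DF".upper()) == 2
```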
The issue of index origin, i.e. whether we count from 0 or 1, actually
arises only after a decision has been made on whether it is the units
themselves that are counted or the positions between the units.
[S]
When the positions between the units are
counted for string indexing, starting with an index of 0 for the position at
the start of the string is the
RECOMMENDED
solution, with
the last index then being equal to the number of counting units in the
string.
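Boundary-position indexing starting at 0 is the convention used by Python string slicing, which can serve as an illustration of the recommendation above:

```python
s = "su\u00E7on"   # 'suçon', five characters

# Indices name the positions *between* characters: 0 at the start,
# len(s) at the end; a substring is the span between two boundaries.
assert len(s) == 5
assert s[0:3] == "su\u00E7"   # between boundaries 0 and 3
assert s[3:5] == "on"
assert s[5:5] == ""           # the empty span at the final boundary
```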
8 Character Encoding in URI References
According to the definition in RFC 2396
[RFC 2396]
, URI
references are restricted to a subset of US-ASCII, with an escaping mechanism
to encode arbitrary byte values, using the %HH convention. However, the %HH
convention by itself is of limited use because there is no definitive mapping
from characters to bytes. Also, non-ASCII characters cannot be used directly.
Internationalized Resource Identifiers (IRI)
[I-D URI-I18N]
solves both problems with a uniform approach that
conforms to the
Reference Processing Model.
[S]
W3C specifications that define
protocol or format elements (e.g. HTTP headers, XML attributes, etc.) which are
to be interpreted as URI references (or specific subsets of URI references,
such as absolute URI references, URIs, etc.)
SHOULD
use
Internationalized Resource Identifiers (IRI)
[I-D URI-I18N]
(or an appropriate subset thereof).
[S]
W3C specifications
MUST
define when the conversion from IRI references to URI references (or subsets
thereof) takes place, in accordance with
Internationalized Resource
Identifiers (IRI)
[I-D URI-I18N]
NOTE:
Many current W3C specifications already contain provisions in
accordance with
Internationalized Resource Identifiers
(IRI)
[I-D URI-I18N]
. For XML 1.0
[XML 1.0]
see
Section
4.2.2, External Entities
, and
Erratum
E26
. XML Schema Part 2: Datatypes
[XML Schema-2]
provides the
anyURI
datatype (see
Section
3.2.17
). The XML Linking Language (XLink)
[XLink]
provides the href attribute (see
Section 5.4, Locator
Attribute
). Further information and links can be found at
Internationalization: URIs and other identifiers
[Info URI-I18N]
[S]
W3C specifications that define
new syntax for URIs, such as a new URI scheme or a new kind of fragment
identifier,
MUST
specify that characters outside the
US-ASCII repertoire are encoded using UTF-8 and %HH-escaping, in accordance
with
Guidelines for new URL Schemes
[RFC 2718]
, Section 2.2.5.
This will make sure that these
schemes or fragment identifiers can be used in IRIs in the natural way.
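The UTF-8 plus %HH-escaping convention can be illustrated with Python's urllib, which applies exactly this encoding by default (a sketch, not a normative procedure):

```python
from urllib.parse import quote, unquote

# Non-ASCII characters are first encoded as UTF-8 bytes, then each
# byte is %HH-escaped:
assert quote("\u00E9", safe="") == "%C3%A9"   # 'é' -> 0xC3 0xA9 in UTF-8
assert quote("日本語", safe="") == "%E6%97%A5%E6%9C%AC%E8%AA%9E"
assert unquote("%C3%A9") == "\u00E9"
```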
9 Referencing the Unicode Standard and
ISO/IEC 10646
Specifications often need to make references to the Unicode standard
or International Standard ISO/IEC 10646. Such references must be made with
care, especially when normative. The questions to be considered are:
Which standard should be referenced?
How to reference a particular version?
When to use versioned vs unversioned references?
ISO/IEC 10646 is developed and published jointly by
ISO
(the International
Organisation for Standardisation) and
IEC
(the International
Electrotechnical Commission). The Unicode Standard is developed and published
by the
Unicode Consortium
, an
organization of major computer corporations, software producers, database
vendors, national governments, research institutions, international agencies,
various user groups, and interested individuals. The Unicode Standard is
comparable in standing to W3C Recommendations.
ISO/IEC 10646 and Unicode define exactly the same CCS (same
repertoire, same code points) and encoding forms. They are actively maintained
in synchrony by liaisons and overlapping membership between the respective
technical committees. In addition to the jointly defined CCS and encoding
forms, the Unicode Standard adds normative and informative lists of character
properties, normative character equivalence and normalization specifications, a
normative algorithm for bidirectional text and a large amount of useful
implementation information. In short, Unicode adds semantics to the characters
that ISO/IEC 10646 merely enumerates. Conformance to Unicode implies
conformance to ISO/IEC 10646, see
[Unicode 3.0]
Appendix C.
[S]
Since specifications in general
need both a definition for their characters and the semantics associated with
these characters, specifications
SHOULD
include a reference
to the Unicode Standard, whether or not they include a reference to ISO/IEC
10646.
By providing a reference to The Unicode Standard
implementers can benefit from the wealth of information provided in the
standard and on the Unicode Consortium Web site.
The fact that both ISO/IEC 10646 and Unicode are evolving (in
synchrony) raises the issue of versioning: should a specification refer to a
specific version of the standard, or should it make a generic reference, so
that the normative reference is to the version current at the time of
reading
the specification? In general, the answer is
"both".
[S]
A generic reference to
the Unicode Standard
MUST
be made if it is desired that
characters allocated after a specification is published are usable with that
specification. A specific reference to the Unicode Standard
MAY
be included to ensure that functionality depending on a
particular version is available and will not change over time (an example would
be the set of characters acceptable as Name characters in XML 1.0
[XML 1.0]
, which is an enumerated list that parsers must implement
to validate names).
NOTE:
See
for guidance
on referring to specific versions of Unicode.
A generic reference can be formulated in two ways:
By explicitly including a
generic
entry in the
bibliography section of a specification and simply referring to that entry in
the body of the specification. Such a generic entry contains text such as
"... as it may from time to time be revised or amended
".
By including a
specific
entry in the bibliography
and adding text such as "
... as it may from time to time be revised or
amended
" at the point of reference in the body of the specification.
It is an editorial matter, best left to each specification, which of
these two formulations is used. Examples of the first formulation can be found
in the bibliography of this specification (see the entries for
[ISO/IEC 10646]
and
[Unicode]
). Examples of the latter,
as well as a discussion of the versioning issue with respect to MIME
charset
parameters for UCS encodings, can be found in
[RFC 2279]
and
[RFC 2781]
[S]
All
generic
references to Unicode
MUST
refer to Unicode 3.0
[Unicode 3.0]
or later.
[S]
Generic references to ISO/IEC 10646
MUST
be written such that they make allowance for the future
publication of additional
parts
of the standard. They
MUST
refer to ISO/IEC 10646-1:2000
[ISO/IEC 10646-1:2000]
or later, including any
amendments.
A Examples of Characters, Keystrokes and
Glyphs
A few examples will help make sense of all this complexity
of text in computers (which is mostly a reflection of the complexity of human
writing systems). Let us start with a very simple example: a user, equipped
with a US-English keyboard, types "
Foo
", which the computer
encodes as 16-bit values (the UTF-16 encoding of Unicode) and displays on the
screen.
Keystrokes
Shift-f
Input characters
Encoded characters (byte values
in hex)
0046
006F
006F
Display
Foo
Example A.1: Basic Latin
The only complexity here is the use of a modifier (Shift) to input the
capital 'F'.
A slightly more complex example is a user typing '
çé
' on
a traditional French-Canadian keyboard, which the computer again encodes in
UTF-16 and displays. We assume that this particular computer uses a fully
composed form of UTF-16.
Keystrokes
Input characters
Encoded characters (byte values
in hex)
00E7
00E9
Display
çé
Example A.2: Latin with diacritics
A few interesting things are happening here: when the user types the
cedilla ('
'), nothing happens except for a change of state of the
keyboard driver; the cedilla is a
dead key
. When the driver gets
the 'c' keystroke, it provides a complete 'ç' character to the
system, which represents it as a single 16-bit code unit and displays a
'ç' glyph. The user then presses the dedicated 'é'
key, which results in, again, a character represented by two bytes. Most
systems will display this as one glyph, but it is also possible to combine two
glyphs (the base letter and the accent) to obtain the same rendering.
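The encoded values in the example can be reproduced in Python; the decomposed alternative would encode differently, which is why the text assumes the fully composed form (illustrative sketch):

```python
composed = "\u00E7\u00E9"        # 'ç', 'é' as single characters
decomposed = "c\u0327e\u0301"    # base letters plus combining marks

# UTF-16BE code units, shown as hex, match the example table:
assert composed.encode("utf-16-be").hex() == "00e700e9"
assert decomposed.encode("utf-16-be").hex() == "0063032700650301"
```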
On to a Japanese example: our user employs a
romaji input
method
to type "日本語", which the
computer encodes in UTF-16 and displays.
Keystrokes
n i h o n g o
Input characters
Encoded characters (byte values
in hex)
65E5
672C
8A9E
Display
Example A.3: Japanese
The interesting aspect here is input: the user types Latin characters,
which are converted on the fly to kana (not shown here), and then to kanji when
the user requests conversion by pressing a conversion key. The kanji
are finally sent to the application when the user presses the Enter key. The
user has to type a total of nine keystrokes before the three characters are
produced, which are then encoded and displayed rather trivially.
An Arabic example will show different phenomena:
Keystrokes
Input characters
Encoded characters (byte
values in hex)
0644
0627
0644
0627
0639
0639
Display
Example A.4: Arabic
Here the first two keystrokes each produce an input character and an
encoded character, but the pair is displayed as a single glyph
(a lam-alef ligature). The next keystroke
is a lam-alef, which some Arabic keyboards have; it produces the same two
characters which are displayed similarly, but this second lam-alef is placed to
the
left
of the first one when displayed. The last two keystrokes
produce two identical characters which are rendered by two different glyphs (a
medial form followed to its left by a final form). We thus have 5 keystrokes
producing 6 characters and 4 glyphs laid out right-to-left.
A final example in Tamil, typed with an ISCII
keyboard, will illustrate some additional phenomena:
Keystrokes
Input characters
Encoded characters (byte values
in hex)
0B9F
0BBE
0B99
0BCD
0B95
0BCB
Display
Example A.5: Tamil
Here input is straightforward, but note that contrary to the preceding
accented Latin example, the diacritic '்' (
virama
, vowel killer) is entered
after
the 'ங' to which it applies. Rendering is
interesting for the last two characters. The last one ('ோ')
clearly consists of two glyphs which
surround
the glyph of the
next to last character ('க').
A number of operations routinely performed on
text can be impacted by the complexities of the world's writing systems. An
example is the operation of selecting text on screen by a pointing device in a
bidirectional (bidi) context (see
3.1.3 Units of Visual
Rendering
).
Let's have a look at some bidi text, in this case Arabic letters (written
right-to-left) mixed with Arabic-Hindi digits (left-to-right):
In memory
On screen
Example A.6: Bidirectional text
B Acknowledgements
Special thanks go to Ian Jacobs for ample help with editing. Tim
Berners-Lee and James Clark provided important details in the section on URIs.
The W3C I18N WG and IG, as well as others, provided many comments and
suggestions.
C References
C.1 Normative
References
IANA
Internet Assigned Numbers Authority,
Official
Names for Character Sets
. (See
ftp://ftp.isi.edu/in-notes/iana/assignments/character-sets
.)
ISO/IEC 10646
ISO/IEC 10646-1:2000,
Information
technology -- Universal Multiple-Octet Coded Character Set (UCS) -- Part 1:
Architecture and Basic Multilingual Plane
, as, from time to time,
amended, replaced by a new edition or expanded by the addition of new parts.
(See
for the
latest version.)
ISO/IEC 10646-1:2000
ISO/IEC
10646-1:2000,
Information
technology -- Universal Multiple-Octet Coded Character Set (UCS) -- Part 1:
Architecture and Basic Multilingual Plane
. (See
.)
MIME
Multipurpose Internet Mail
Extensions (MIME). Part One: Format of Internet Message Bodies
, N.
Freed, N. Borenstein, RFC 2045, November 1996,
Part Two: Media Types
, N. Freed, N. Borenstein, RFC 2046,
November 1996.
Part Three: Message Header Extensions for Non-ASCII
Text
, K. Moore, RFC 2047, November 1996.
Part Four:
Registration Procedures
, N. Freed, J. Klensin, J. Postel, RFC 2048,
November 1996.
Part Five: Conformance Criteria and
Examples
, N. Freed, N. Borenstein, RFC 2049, November 1996.
RFC 2070
F. Yergeau, G. Nicol, G. Adams, M.
Dürst,
Internationalization of the
Hypertext Markup Language
, IETF RFC 2070, January 1997. (See
.)
RFC 2119
S. Bradner,
Key words for use in RFCs
to Indicate Requirement Levels
, IETF RFC 2119. (See
.)
RFC 2396
T. Berners-Lee, R. Fielding, L.
Masinter,
Uniform Resource
Identifiers (URI): Generic Syntax
, IETF RFC 2396, August 1998. (See
.)
RFC 2732
R. Hinden, B. Carpenter, L.
Masinter,
Format for
Literal IPv6 Addresses in URL's
, IETF RFC 2732, 1999. (See
.)
Unicode
The Unicode Consortium,
The Unicode Standard -- Version 3.0
, ISBN 0-201-61633-5,
as updated from time to time by the publication of new versions. (See
for the latest version and additional information on versions of the standard
and of the Unicode Character Database).
Unicode 3.0
The Unicode Consortium,
The Unicode Standard -- Version 3.0
, ISBN 0-201-61633-5.
(See
.)
UTR #15
Mark Davis, Martin Dürst,
Unicode
Normalization Forms,
Unicode Standard Annex #15. (See
for the latest version).
Version
3.1.0
(March 2001) is at
C.2 Other References
CharReq
Martin J. Dürst,
Requirements for String
Identity and Character Indexing Definitions for the WWW
, W3C Working
Draft. (See
.)
Connolly
D. Connolly,
Character
Set Considered Harmful
, W3C Note. (See
.)
CSS2
Bert Bos, Håkon Wium Lie, Chris Lilley,
Ian Jacobs, Eds.,
Cascading
Style Sheets, level 2
(CSS2 Specification), W3C Recommendation. (See
.)
DOM Level 1
Vidur Apparao et al.,
Document Object Model
(DOM) Level 1 Specification
, W3C Recommendation. (See
.)
HTML 4.0
Dave Raggett, Arnaud Le Hors, Ian
Jacobs, Eds.,
HTML 4.0
Specification
, W3C Recommendation, 18-Dec-1997 (See
.)
HTML 4.01
Dave Raggett, Arnaud Le Hors, Ian
Jacobs, Eds.,
HTML 4.01
Specification
, W3C Recommendation, 24-Dec-1999. (See
.)
I-D URI-I18N
Larry Masinter, Martin Dürst,
Internationalized
Resource Identifiers (IRI)
, Internet-Draft, November 2001. (See
.)
Info URI-I18N
Internationalization:
URIs and other identifiers
. (See
.)
ISO/IEC 14651
ISO/IEC 14651:2000,
Information
technology -- International string
ordering and comparison -- Method for comparing character strings and
description of the common template tailorable ordering
as, from time
to time, amended, replaced by a new edition or expanded by the addition
of new parts. (See
for the latest version.)
ISO/IEC 9541-1
ISO/IEC 9541-1:1991,
Information
technology -- Font information interchange -- Part 1: Architecture
. (See
for the latest version.)
MathML2
David Carlisle, Patrick Ion, Robert
Miner, Nico Poppelier, Eds.,
Mathematical Markup Language (MathML)
Version 2.0
, W3C Recommendation, 21 February 2001. (See
.)
Nicol
Gavin Nicol,
The
Multilingual World Wide Web
, Chapter 2: The WWW As A Multilingual
Application. (See
.)
RFC 2277
H. Alvestrand,
IETF Policy on Character
Sets and Languages
, IETF RFC 2277, BCP 18, January 1998. (See
.)
RFC 2279
F. Yergeau,
UTF-8, a transformation
format of ISO 10646
, IETF RFC 2279, January 1998. (See
.)
RFC 2718
L. Masinter, H. Alvestrand, D.
Zigmond, R. Petke,
Guidelines for new URL
Schemes
, IETF RFC 2718, November 1999. (See
.)
RFC 2781
P. Hoffman, F. Yergeau,
UTF-16, an encoding of ISO
10646
, IETF RFC 2781, February 2000. (See
.)
SPREAD
SPREAD -
Standardization Project for East Asian Documents Universal Public Entity
Set
. (See
SVG
Jon Ferraiolo, Ed.,
Scalable Vector Graphics (SVG) 1.0
Specification
, W3C Recommendation, 4 September 2001. (See
.)
UTR #10
Mark Davis,
Ken Whistler,
Unicode Collation Algorithm
, Unicode Technical Report #10. (See
.)
UTR #17
Ken Whistler, Mark Davis,
Character
Encoding Model
, Unicode Technical Report #17. (See
.)
UXML
Martin Dürst and Asmus Freytag,
Unicode in XML and other
Markup Languages
, Unicode Technical Report #20 and W3C Note. (See
.)
XLink
Steve DeRose, Eve Maler, David Orchard,
Eds,
XML Linking Language (XLink)
Version 1.0
, W3C Recommendation, 27 June 2001. (See
.)
XML 1.0
Tim Bray, Jean Paoli, C. M.
Sperberg-McQueen, Eve Maler, Eds.,
Extensible Markup Language (XML)
1.0
, W3C Recommendation. (See
.)
XML Schema-2
Paul V. Biron, Ashok
Malhotra, Eds.,
XML Schema
Part 2: Datatypes
, W3C Recommendation. (See
.)
XML Japanese Profile
MURATA
Makoto Ed.,
XML Japanese
Profile
, W3C Note. (See
.)
XPath
James Clark, Steve DeRose, Eds,
XML Path Language (XPath) Version
1.0
, W3C Recommendation, 16 November 1999. (See
.)
XPointer
Steve DeRose, Eve Maler, Ron
Daniel Jr., Eds,
XML Pointer Language
(XPointer) Version 1.0
, W3C Candidate Recommendation, 11 September
2001. (See
.)
D Change Log (Non-Normative)
D.1 Changes since
Replaced much of chapter 8 content with references to
[I-D URI-I18N]
Made numerous additional changes listed in
Character
Model for the World Wide Web 1.0 Last Call Comments
(Members
only).
Converted to XHTML with UTF-8 encoding.
D.2 Changes since
Normalization: changed from "
recipients
MUST
NOT
normalize" to "recipients
MUST
check and
reject un-normalized data
".
Clarified conformance model, in particular introduced [S][I][C]
specifiers for requirements.
Made numerous other changes listed in
Character
Model for the World Wide Web 1.0 Last Call Comments
(Members
only).
Fixed countless typos and unclear/ambiguous sentences.
Updated references.