UTS #35: Unicode Locale Data Markup Language
Unicode Technical Standard #35
Unicode Locale Data Markup Language (LDML)
Version
36
Editors
Mark Davis (markdavis@google.com) and other CLDR committee members
Date
2019-10-02
This Version
Previous Version
Latest Version
Corrigenda
Latest Proposed Update
Namespace
DTDs
Revision
57
Summary
This document describes an XML format (vocabulary) for the
exchange of structured locale data. This format is used in the
Unicode Common Locale Data Repository.
Status
This document has been reviewed by Unicode members and
other interested parties, and has been approved for publication
by the Unicode Consortium. This is a stable document and may be
used as reference material or cited as a normative reference by
other specifications.
A Unicode Technical Standard (UTS) is an independent
specification. Conformance to the Unicode Standard does not
imply conformance to any UTS.
Please submit corrigenda and other comments with the CLDR
bug reporting form [Bugs]. Related information that is useful
in understanding this document is found in the References. For
the latest version of the Unicode Standard see [Unicode]. For a
list of current Unicode Technical Reports see [Reports]. For
more information about versions of the Unicode Standard, see
[Versions].
Parts
The LDML specification is divided into the following parts:
Part 1: Core (languages, locales, basic structure)
Part 2: General (display names & transforms, etc.)
Part 3: Numbers (number & currency formatting)
Part 4: Dates (date, time, time zone formatting)
Part 5: Collation (sorting, searching, grouping)
Part 6: Supplemental (supplemental data)
Part 7: Keyboards (keyboard mappings)
Contents of Part 1, Core
1 Introduction
1.1 Conformance
2 What is a Locale?
3 Unicode Language and Locale Identifiers
3.1 Unicode Language Identifier
3.2 Unicode Locale Identifier
3.2.1 Canonical Unicode Locale Identifiers
3.3 BCP 47 Conformance
3.3.1 BCP 47 Language Tag Conversion
3.4 Language Identifier Field Definitions
Table: Language Identifier Field Definitions
3.5 Special Codes
3.5.1 Unknown or Invalid Identifiers
3.5.2 Numeric Codes
3.5.3 Private Use Codes
Table: Private Use Codes in CLDR
3.6 Unicode BCP 47 U Extension
3.6.1 Key And Type Definitions
Table: Key/Type Definitions
3.6.2 Numbering System Data
3.6.3 Time Zone Identifiers
3.6.4 U Extension Data Files
3.6.5 Subdivision Codes
3.6.5.1 Validity
3.7 Unicode BCP 47 T Extension
3.7.1 Extension Data Files
3.8 Compatibility with Older Identifiers
3.8.1 Old Locale Extension Syntax
Table: Locale Extension Mappings
3.8.2 Legacy Variants
Table: Legacy Variant Mappings
3.8.3 Relation to OpenI18n
3.9 Transmitting Locale Information
3.9.1 Message Formatting and Exceptions
3.10 Unicode Language and Locale IDs
3.10.1 Written Language
3.10.2 Hybrid Locale Identifiers
3.11 Validity Data
4 Locale Inheritance and Matching
4.1 Lookup
4.1.1 Bundle vs Item Lookup
Table: Lookup Differences
4.1.2 Lateral Inheritance
Table: Count Fallback: normal
Table: Count Fallback: currency
4.1.3 Parent Locales
4.2 Inheritance and Validity
4.2.1 Definitions
4.2.2 Resolved Data File
4.2.3 Valid Data
4.2.4 Checking for Draft Status
4.2.5 Keyword and Default Resolution
4.2.6 Inheritance vs Related Information
4.3 Likely Subtags
4.4 Language Matching
4.4.1 Enhanced Language Matching
5 XML Format
5.1 Common Elements
5.1.1 Element special
5.1.1.1 Sample Special Elements
5.1.2 Element alias
Table: Inheritance with source="locale"
5.1.3 Element displayName
5.1.4 Escaping Characters
5.2 Common Attributes
5.2.1 Attribute type
5.2.2 Attribute draft
5.2.3 Attribute alt
5.3 Common Structures
5.3.1 Date and Date Ranges
5.3.2 Text Directionality
5.3.3 Unicode Sets
5.3.3.1 Lists of Code Points
5.3.3.2 Unicode Properties
5.3.3.3 Boolean Operations
5.3.3.4 UnicodeSet Examples
5.3.4 String Range
5.4 Identity Elements
5.5 Valid Attribute Values
5.6 Canonical Form
5.6.1 Content
5.6.2 Ordering
5.6.3 Comments
5.7 DTD Annotations
5.7.1 Attribute Value Constraints
6 Property Data
6.1 Script Metadata
6.2 Extended Pictographic
6.3 Labels.txt
6.4 Segmentation Tests
7 Issues in Formatting and Parsing
7.1 Lenient Parsing
7.1.1 Motivation
7.1.2 Loose Matching
7.2 Handling Invalid Patterns
Annex A Deprecated Structure
A.1 Element fallback
A.2 BCP 47 Keyword Mapping
A.3 Choice Patterns
A.4 Element default
A.5 Deprecated Common Attributes
A.5.1 Attribute standard
A.5.2 Attribute draft in non-leaf elements
A.6 Element base
A.7 Element rules
A.8 Deprecated subelements of
A.9 Deprecated subelements of
A.10 Deprecated subelements of
A.11 Deprecated subelements of
A.12 Renamed attribute values for element
A.13 Deprecated subelements of
A.14 Element cp
A.15 Attribute validSubLocales
A.16 Elements postalCodeData, postCodeRegex
A.17 Element telephoneCodeData
Annex B Links to Other Parts
Table: Part 2 Links: General (display names & transforms, etc.)
Table: Part 3 Links: Numbers (number & currency formatting)
Table: Part 4 Links: Dates (date, time, time zone formatting)
Table: Part 5 Links: Collation (sorting, searching, grouping)
Table: Part 6 Links: Supplemental (supplemental data)
Table: Part 7 Links: Keyboards (keyboard mappings)
References
Acknowledgments
Modifications
1 Introduction
Not long ago, computer systems were like separate worlds,
isolated from one another. The internet and related events have
changed all that. A single system can be built of many
different components, hardware and software, all needing to
work together. Many different technologies have been important
in bridging the gaps; in the internationalization arena,
Unicode has provided a lingua franca for communicating textual
data. However, there remain differences in the locale data used
by different systems.
The best practice for internationalization is to store and
communicate language-neutral data, and format that data for the
client. This formatting can take place on any of a number of
the components in a system; a server might format data based on
the user's locale, or it could be that a client machine does
the formatting. The same goes for parsing data, and
locale-sensitive analysis of data.
But there remain significant differences across systems and
applications in the locale-sensitive data used for such
formatting, parsing, and analysis. Many of those differences
are simply gratuitous; all within acceptable limits for human
beings, but yielding different results. In many other cases
there are outright errors. Whatever the cause, the differences
can cause discrepancies to creep into a heterogeneous system.
This is especially serious in the case of collation
(sort-order), where different collations cause not only
ordering differences, but also different results of queries!
That is, with a query of customers with names between "Abbot,
Cosmo" and "Arnold, James", if different systems have different
sort orders, different lists will be returned. (For comparisons
across systems formatted as HTML tables, see [Comparisons].)
Note:
There are many different equally
valid ways in which data can be judged to be "correct" for a
particular locale. The goal for the common locale data is to
make it as consistent as possible with existing locale data,
and acceptable to users in that locale.
This document specifies an XML format for the communication
of locale data: the Unicode Locale Data Markup Language (LDML).
This provides a common format for systems to interchange locale
data so that they can get the same results in the services
provided by internationalization libraries. It also provides a
standard format that can allow users to customize the behavior
of a system. With it, for example, collation (sorting) rules
can be exchanged, allowing two implementations to exchange a
specification of tailored collation rules. Using the same
specification, the two implementations will achieve the same
results in comparing strings. Unicode LDML can also be used to
let a user encapsulate specialized sorting behavior for a
specific domain, or create a customized locale for a minority
language. Unicode LDML is also used in the Unicode Common
Locale Data Repository (CLDR). CLDR uses an open process for
reconciling differences between the locale data used on
different systems and validating the data, to produce a
useful, common, consistent base of locale data.
For more information, see the Common Locale Data Repository
project page [LocaleProject].
As LDML is an interchange format, it was designed for ease
of maintenance and simplicity of transformation into other
formats, above efficiency of run-time lookup and use.
Implementations should consider converting LDML data into a
more compact format prior to use.
1.1 Conformance
There are many ways to use the Unicode LDML format and the
data in CLDR, and the Unicode Consortium does not restrict the
ways in which the format or data are used. However, an
implementation may also claim conformance to LDML or to CLDR,
as follows:
UAX35-C1. An implementation that claims conformance to this specification shall:
Identify the sections of the specification that it conforms to.
For example, an implementation might claim conformance to all LDML features except for transforms and segments.
Interpret the relevant elements and attributes of LDML documents in accordance with the descriptions in those sections.
For example, an implementation that claims conformance to the date format patterns must interpret the characters in such patterns according to the Date Field Symbol Table.
Declare which types of CLDR data it uses.
For example, an implementation might declare that it only uses language names, and those with a draft status of contributed or approved.
UAX35-C2. An implementation that claims conformance to Unicode locale or language identifiers shall:
Specify whether Unicode locale extensions are allowed.
Specify the canonical form used for identifiers in terms of casing and field separator characters.
External specifications may also reference particular
components of Unicode locale or language identifiers, such
as:
Field X can contain any Unicode region subtag values as
given in Unicode Technical Standard #35: Unicode Locale Data
Markup Language (LDML), excluding grouping codes.
2 What is a Locale?
Before diving into the XML structure, it is helpful to
describe the model behind the structure. People do not have to
subscribe to this model to use data in LDML, but they do need
to understand it so that the data can be correctly translated
into whatever model their implementation uses.
The first issue is basic: what is a locale? In this
model, a locale is an identifier (id) that refers to a set of
user preferences that tend to be shared across significant
swaths of the world. Traditionally, the data associated with
this id provides support for formatting and parsing of dates,
times, numbers, and currencies; for measurement units, for
sort-order (collation), plus translated names for time zones,
languages, countries, and scripts. The data can also include
support for text boundaries (character, word, line, and
sentence), text transformations (including transliterations),
and other services.
Locale data is not cast in stone: the data used on someone's
machine may generally reflect the US format, for example, but
preferences can typically be set to override particular items,
such as setting the date format for 2002.03.15, or using metric
or Imperial measurement units. In the abstract, locales are
simply one of many sets of preferences that, say, a website may
want to remember for a particular user. Depending on the
application, it may want to also remember the user's time zone,
preferred currency, preferred character set, smoker/non-smoker
preference, meal preference (vegetarian, kosher, and so on),
music preference, religion, party affiliation, favorite
charity, and so on.
Locale data in a system may also change over time: country
boundaries change; governments (and currencies) come and go;
committees impose new standards; bugs are found and fixed in
the source data; and so on. Thus the data needs to be versioned
for stability over time.
In general terms, the locale id is a parameter that is
supplied to a particular service (date formatting, sorting,
spell-checking, and so on). The format in this document does
not attempt to represent all the data that could conceivably be
used by all possible services. Instead, it collects together
data that is in common use in systems and internationalization
libraries for basic services. The main difference among locales
is in terms of language; there may also be some differences
according to different countries or regions. However, the line
between locales and languages, as commonly used in the
industry, is rather fuzzy. Note also that the vast majority of
the locale data in CLDR is in fact language data; all
non-linguistic data is separated out into a separate tree. For
more information, see Section 3.10 Language and Locale IDs.
We will speak of data as being "in locale X". That does not
imply that a locale is a collection of data; it is simply
shorthand for "the set of data associated with the locale id
X". Each individual piece of data is called a resource or
field, and a tag indicating the key of the resource is called a
resource tag.
3 Unicode Language and Locale Identifiers
Unicode LDML uses stable identifiers based on [BCP47] for
distinguishing among languages,
locales, regions, currencies, time zones, transforms, and so
on. There are many systems for identifiers for these entities.
The Unicode LDML identifiers may not match the identifiers used
on a particular target system. If so, some process of
identifier translation may be required when using LDML
data.
The BCP 47 extensions (-u- and -t-) are described in
Section 3.6 Unicode BCP 47 U Extension and
Section 3.7 Unicode BCP 47 T Extension.
3.1 Unicode Language Identifier
A Unicode language identifier has the following
structure (provided in EBNF (Perl-based)). The following table defines
syntactically well-formed identifiers: they are not necessarily
valid identifiers. For additional validity criteria, see the
links on the right.
EBNF
Validity / Comments
unicode_language_id
= "root"
| (unicode_language_subtag
(sep unicode_script_subtag)?
| unicode_script_subtag)
(sep unicode_region_subtag)?
(sep unicode_variant_subtag)* ;
"root" is treated as a special
unicode_language_subtag
unicode_language_subtag
= alpha{2,3} | alpha{5,8};
validity
latest-data
unicode_script_subtag
= alpha{4} ;
validity
latest-data
unicode_region_subtag
= (alpha{2} | digit{3}) ;
validity
latest-data
unicode_variant_subtag
= (alphanum{5,8}
| digit alphanum{3}) ;
validity
latest-data
sep
= [-_] ;
digit
= [0-9] ;
alpha
= [A-Z a-z] ;
alphanum
= [0-9 A-Z a-z] ;
The semantics of the various subtags is explained in
Section 3.4 Language Identifier Field Definitions; there are
also direct links from unicode_language_subtag etc. While
theoretically the unicode_language_subtag may have more than 3
letters through the IANA registration process, in practice that
has not occurred. The unicode_language_subtag "und" may be
omitted when there is a unicode_script_subtag; for that reason
unicode_language_subtag values with 4 letters are not
permitted. However, such unicode_language_id values are not
intended for general interchange, because they are not valid
BCP 47 tags. Instead, they are intended for certain protocols
such as the identification of transliterators or font
ScriptLangTag values. For more information on language subtags
with 4 letters, see BCP 47 Language Tag to Unicode BCP 47
Locale Identifier.
For example, "en-US" (American English), "en_GB" (British
English), "es-419" (Latin American Spanish), and "uz-Cyrl"
(Uzbek in Cyrillic) are all valid Unicode language
identifiers.
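The grammar above can be turned into a well-formedness check fairly directly. The following is a minimal sketch in Python, not a reference implementation: the names `LANGUAGE_ID` and `is_well_formed_language_id` are invented for this illustration, and the check covers syntax only, so a string it accepts is not necessarily a valid identifier.

```python
import re

# Character classes from the grammar: alpha, alphanum, sep.
ALPHA = "[A-Za-z]"
ALNUM = "[0-9A-Za-z]"
SEP = "[-_]"

# One sub-pattern per subtag production.
LANGUAGE = f"(?:{ALPHA}{{2,3}}|{ALPHA}{{5,8}})"
SCRIPT = f"{ALPHA}{{4}}"
REGION = f"(?:{ALPHA}{{2}}|[0-9]{{3}})"
VARIANT = f"(?:{ALNUM}{{5,8}}|[0-9]{ALNUM}{{3}})"

# unicode_language_id: "root", or a language (optionally followed by a
# script) or a bare script, then an optional region and any variants.
LANGUAGE_ID = re.compile(
    f"(?:root|(?:{LANGUAGE}(?:{SEP}{SCRIPT})?|{SCRIPT})"
    f"(?:{SEP}{REGION})?(?:{SEP}{VARIANT})*)",
    re.IGNORECASE,
)


def is_well_formed_language_id(tag):
    """Return True if tag matches the unicode_language_id syntax."""
    return LANGUAGE_ID.fullmatch(tag) is not None
```

For instance, `is_well_formed_language_id("en_GB")` is true, while `"en-USA"` is rejected because a three-letter subtag is neither a region nor a variant. Checking validity would additionally require the validity data referenced in the table.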
3.2 Unicode Locale Identifier
A Unicode locale identifier is composed of a Unicode
language identifier plus (optional) locale extensions. It has
the following structure. The semantics of the U and T
extensions are explained in Section 3.6 Unicode BCP 47 U
Extension and Section 3.7 Unicode BCP 47 T Extension. Other
extensions and private use extensions are supported for
pass-through. The following table defines syntactically
well-formed identifiers: they are not necessarily valid
identifiers. For additional validity criteria, see the links on
the right.
As is often the case, the complete syntactic constraints are not easily captured by EBNF, so there is a further condition: There cannot be more than one extension with the
same singleton (-a-, …, -t-, -u-, …). Note that the private use extension (-x-) must
come after all other extensions.
EBNF
Validity
unicode_locale_id
= unicode_language_id
extensions*
pu_extensions? ;
extensions
= unicode_locale_extensions
| transformed_extensions
| other_extensions ;
unicode_locale_extensions
= sep [uU]
((sep keyword)+
|(sep attribute)+ (sep keyword)*) ;
transformed_extensions
= sep [tT]
((sep tlang (sep tfield)*)
| (sep tfield)+) ;
pu_extensions
= sep [xX]
(sep alphanum{1,8})+ ;
other_extensions
= sep [alphanum-[tTuUxX]]
(sep alphanum{2,8})+ ;
keyword
(Also known as ufield)
= key (sep type)? ;
key
(Also known as ukey)
= alphanum alpha ;
(Note that this is narrower than in [RFC6067], so that it is disjoint with tkey.)
validity
latest-data
type
(Also known as uvalue)
= alphanum{3,8}
(sep alphanum{3,8})* ;
validity
latest-data
attribute
= alphanum{3,8} ;
unicode_subdivision_id
= unicode_region_subtag
unicode_subdivision_suffix ;
validity
latest-data
unicode_subdivision_suffix
= alphanum{1,4} ;
unicode_measure_unit
= alphanum{3,8}
(sep alphanum{3,8})* ;
validity
latest-data
tlang
= unicode_language_subtag
(sep unicode_script_subtag)?
(sep unicode_region_subtag)?
(sep unicode_variant_subtag)* ;
tfield
= tkey tvalue;
validity
latest-data
tkey
= alpha digit ;
tvalue
= (sep alphanum{3,8})+ ;
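The condition that no singleton may occur twice is not expressed by the grammar, so it needs a separate check. The sketch below is one possible way to do that in Python; the function name `duplicate_singletons` is hypothetical, and the input is assumed to be syntactically well formed already. Because every subtag after the private use singleton belongs to the private use extension, scanning stops at "x".

```python
def duplicate_singletons(locale_id):
    """Return the set of extension singletons that occur more than once.

    Assumes locale_id is already syntactically well formed.
    """
    subtags = locale_id.replace("_", "-").lower().split("-")
    seen, duplicates = set(), set()
    for subtag in subtags[1:]:      # the first subtag is never a singleton
        if len(subtag) != 1:
            continue
        if subtag == "x":
            break                   # the rest is private use content
        if subtag in seen:
            duplicates.add(subtag)
        seen.add(subtag)
    return duplicates
```

An empty result means the condition holds; `duplicate_singletons("en-u-ca-gregory-u-nu-thai")` reports the repeated "u".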
For historical reasons, this is called a Unicode locale
identifier. However, it really functions (with few exceptions)
as a language identifier, and accesses language-based data.
Except where it would be unclear, this document uses the term
"locale" data loosely to encompass both types of data: for more
information, see Section 3.10 Language and Locale IDs.
As of the release of this specification, there were no
other_extensions defined. The other_extensions are present in
the syntax to allow implementations to preserve that
information.
As for terminology, the term code may also be used
instead of "subtag", and "territory" instead of "region". The
primary language subtag is also called the base language code.
For example, the base language code for "en-US" (American
English) is "en" (English). The type may also be referred to as
a value or key-value.
The identifiers can vary in case and in the separator
characters. The "-" and "_" separators are treated as
equivalent, although "-" is preferred.
All identifier field values are case-insensitive. Although
case distinctions do not carry any special meaning, an
implementation of LDML should use the casing recommendations in
[BCP47], especially when a Unicode locale identifier is used
for locale data exchange in software protocols.
3.2.1 Canonical Unicode Locale Identifiers
A unicode_locale_id has canonical syntax when:
It starts with a language subtag (those beginning with a script subtag are only for specialized use)
Casing
Any script subtag is in title case (eg, Hant)
Any region subtag is in uppercase (eg, DE)
All other subtags are in lowercase (eg, en, fonipa)
Order
Any variants are in alphabetical order (eg, en-fonipa-scouse,
not en-scouse-fonipa)
Any extensions are in alphabetical order by their singleton
(eg, en-t-xxx-u-yyy, not en-u-yyy-t-xxx)
All attributes are sorted in alphabetical order.
All keywords and tfields are sorted by alphabetical order of their keys, within their respective extensions.
Any type or tfield value "true" is removed.
For example, the canonical form of
"en-u-foo-bar-nu-thai-ca-buddhist-kk-true" is
"en-u-bar-foo-ca-buddhist-kk-nu-thai". The attributes "foo" and
"bar" in this example are provided only for illustration; no
attribute subtags are defined by the current CLDR
specification.
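The casing and ordering rules can be illustrated with a short Python sketch. It is deliberately limited: it assumes a well-formed identifier that starts with a language subtag, contains at most one extension (the -u- extension), and has no repeated keys. The function names are invented for this example.

```python
def canonical_u_body(body):
    """Order the lowercase subtags that follow the 'u' singleton."""
    attributes, keywords, key = [], {}, None
    for subtag in body:
        if len(subtag) == 2:          # keys are exactly two characters long
            key = subtag
            keywords[key] = []
        elif key is None:             # before the first key: an attribute
            attributes.append(subtag)
        else:                         # part of the current key's type
            keywords[key].append(subtag)
    result = sorted(attributes)
    for key in sorted(keywords):
        types = keywords[key]
        result.append(key)
        if types != ["true"]:         # a type of "true" is removed
            result.extend(types)
    return result


def canonical_syntax_sketch(locale_id):
    """Apply the casing and ordering rules to a simple locale identifier."""
    subtags = locale_id.replace("_", "-").lower().split("-")
    cut = subtags.index("u") if "u" in subtags else len(subtags)
    head, body = subtags[:cut], subtags[cut + 1:]
    prefix, variants = [head[0]], []
    for subtag in head[1:]:
        if len(subtag) == 4 and subtag.isalpha():
            prefix.append(subtag.title())      # script: title case
        elif len(subtag) == 2 or (len(subtag) == 3 and subtag.isdigit()):
            prefix.append(subtag.upper())      # region: uppercase
        else:
            variants.append(subtag)            # variant: stays lowercase
    result = prefix + sorted(variants)
    if body:
        result += ["u"] + canonical_u_body(body)
    return "-".join(result)
```

Applied to the example above, `canonical_syntax_sketch("en-u-foo-bar-nu-thai-ca-buddhist-kk-true")` yields "en-u-bar-foo-ca-buddhist-kk-nu-thai". Handling the -t- extension, other extensions, and private use subtags would need additional branches.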
Note: The current version of CLDR data uses some
non-preferred syntax for backward compatibility. This might be
changed in future CLDR releases.
It uses uppercase letters for variant subtags, while the
preferred forms are all lowercase.
It uses "_" as the separator, while the preferred form of
the separator is "-".
It uses "root", while the preferred form is "und".
A unicode_locale_id is in canonical form when it has canonical
syntax and contains no aliased subtags. A unicode_locale_id can
be transformed into canonical form in the following way:
Use the bcp47 data to replace keys, types, tfields, and tvalues by their canonical forms. See Section 3.6.4 U Extension Data Files and Section 3.7.1 T Extension Data Files. The aliases are in the alias attribute value, while the canonical is in the name attribute value. For example, because the bcp47 data lists "imperial" as an alias of the type named "uksystem", we get the following transformation:
en-u-ms-imperial ⇒ en-u-ms-uksystem
Replace aliases in the unicode_language_id and tlang (if any) using the following process:
If the language subtag matches the type attribute of a languageAlias element in Supplemental Data, replace the language subtag with the replacement value.
If there are additional subtags in the replacement value, add them to the result, but only if there is no corresponding subtag already in the tag.
Five special deprecated grandfathered codes (such as i-default) are in type attributes, and are also replaced.
If the region subtag matches the type attribute of a territoryAlias element in Supplemental Data, replace the region subtag with the replacement value, as follows:
If there is a single territory in the replacement, use it.
If there are multiple territories:
Look up the most likely territory* for the base language code (and script, if there is one).
If that likely territory is in the list, use it.
Otherwise, use the first territory in the list.
If any variant subtag matches the type attribute of a variantAlias element in Supplemental Data, replace the variant subtag with the replacement value.
The replacement might not be a variant subtag. In that case, the variant subtag is removed, and the other tag is substituted. For example, hy-FR-arevmda ⇒ hyw-FR
Replace aliases in special key values:
If there is an 'sd' or 'rg' key, replace any subdivision alias in its value in the same way, using subdivisionAlias data.
* Formally, replacement of multiple territories uses Section 4.3 Likely Subtags. However, there are a small number of cases of multiple territories, so the mappings can be precomputed. This results in a faster lookup with a very small subset of the likely subtags data.
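The territory replacement rule, including the precomputed lookup mentioned in the footnote, might be sketched as follows in Python. The two tables are small hand-written stand-ins, not the CLDR supplemental data: their contents, their names, and the function name `replace_region` are assumptions made for this illustration.

```python
# Hypothetical excerpts; real values come from territoryAlias elements
# and from the likely subtags data.
TERRITORY_ALIASES = {
    "FX": ["FR"],
    "CS": ["RS", "ME"],
    "SU": ["RU", "AM", "AZ"],   # abbreviated, illustrative list
}
LIKELY_TERRITORY = {"sr": "RS", "hy": "AM"}   # precomputed per language


def replace_region(language, region):
    """Return the replacement for a region subtag, or the region itself."""
    replacements = TERRITORY_ALIASES.get(region)
    if replacements is None:
        return region                 # not an alias: keep as is
    if len(replacements) == 1:
        return replacements[0]        # single territory: use it
    likely = LIKELY_TERRITORY.get(language)
    if likely in replacements:
        return likely                 # likely territory is in the list
    return replacements[0]            # otherwise the first in the list
```

With these sample tables, `replace_region("hy", "SU")` picks "AM" because it is the likely territory for the language, whereas a language without an entry falls back to the first territory listed.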
A unicode_locale_id is maximal when the unicode_language_id and tlang (if any) have been transformed by the Add Likely Subtags operation in Section 4.3 Likely Subtags, excluding "und".
Example: the maximal form of ja-Kana-t-it is ja-Kana-JP-t-it-Latn-IT
Two unicode_locale_ids are equivalent when their maximal canonical forms are identical.
Example: "IW-HEBR-u-ms-imperial" ~ "he-u-ms-uksystem"
The equivalence relationship may change over time, such as when subtags are deprecated or likely subtag mappings change. For example, if two countries were to merge, then various subtags would become deprecated. These kinds of changes are generally very infrequent.
3.3 BCP 47 Conformance
Unicode language and locale identifiers inherit the design
and the repertoire of subtags from [BCP47] Language Tags. There
are some extensions and restrictions made for the use of the
Unicode locale identifier in CLDR:
It does not allow for the full syntax of [BCP47]:
No extlang subtags are allowed (as in the BCP 47 canonical form, see BCP 47 Section 4.5 and Section 3.1.7)
No irregular BCP 47 grandfathered tags are allowed (these are all deprecated in BCP 47)
A tag must not start with the subtag "x": thus a privateuse (eg x-abc) can only be after a language subtag, like "und"
It allows for certain semantic additions and constraints:
Certain codes that are private-use in BCP 47 and ISO are given semantics by LDML
Each macrolanguage has an identified primary encompassed language, which is treated as an alias for the macrolanguage, and thus is replaced when canonicalizing (as allowed by BCP 47, see Section 4.1.2)
It allows certain syntax for backwards compatibility (not BCP 47-compatible):
The "_" character for field separator characters, as well as the "-" used in [BCP47] (however, the canonical form is with "-")
The subtag "root" to indicate the generic locale used as the parent of all languages in the CLDR data model ("und" can be used instead)
The language tag may begin with a script subtag rather than a language subtag. This is specialized use only, and not required for CLDR conformance.
There are thus two subtypes of Unicode locale identifiers:
the term Unicode CLDR locale identifier applies where the backwards compatibility syntax is used.
the term Unicode BCP 47 locale identifier applies otherwise. A Unicode BCP 47 locale identifier is also a valid BCP 47 language tag.
3.3.1 BCP 47 Language Tag Conversion
The different identifiers can be converted to one another as
described in this section.
BCP 47 Language Tag to Unicode BCP 47 Locale Identifier
A valid [BCP47] language tag can be converted to a valid
Unicode BCP 47 locale identifier by performing the following
transformation.
1. Canonicalize the syntax of the language tag (afterwards, there will be no extlang subtags) as per 3.2.1 Canonical Unicode Locale Identifiers.
2. If the BCP 47 primary language subtag matches the type attribute of a languageAlias element in Supplemental Data, replace the language subtag with the replacement value.
2.1. If there are additional subtags in the replacement value, add them to the result, but only if there is no corresponding subtag already in the tag.
2.2. Five special deprecated grandfathered codes (such as i-default) are in type attributes, and are also replaced.
Note: there are currently no valid 4-letter primary language subtags. While it is extremely unlikely that BCP47 would ever register them, if so then languageAlias mappings will be supplied for them, mapping to defined CLDR language subtags (from the idStatus="reserved" set).
3. If the BCP 47 region subtag matches the type attribute of a territoryAlias element in Supplemental Data, replace the region subtag with the replacement value, as follows:
3.1. If there is a single territory in the replacement, use it.
3.2. If there are multiple territories:
3.2.1. Look up the most likely territory for the base language code (and script, if there is one).
3.2.2. If that likely territory is in the list, use it.
3.2.3. Otherwise, use the first territory in the list.
4. If the tag is a deprecated grandfathered tag that remains after step #1, prefix by "und-x-".
5. If the first subtag is "x", prefix by "und-".
The result is a Unicode BCP 47 locale identifier, in
canonical form. It is both a BCP 47 language tag and a Unicode
locale identifier. Because the process maps from all BCP 47
language tags into a subset of BCP 47 language tags, the format
changes are not reversible, much as a lowercase transformation
of the string “McGowan” is not reversible.
Examples
BCP 47 language tag   Unicode BCP 47 locale identifier   Comments
en-US                 en-US                              no changes
iw-FX                 he-FR                              BCP 47 canonicalization [1]
cmn-TW                zh-TW                              language alias [2]
zh-cmn-TW             zh-TW                              BCP 47 canonicalization [1], then language alias [2]
sr-CS                 sr-RS                              territory alias [3]
sh                    sr-Latn                            multiple replacement subtags [2.1]
sh-Cyrl               sr-Cyrl                            no replacement with multiple replacement subtags [2.1 doesn't apply]
hy-SU                 hy-AM                              multiple territory values [3.2]
i-enochian            und-x-i-enochian                   prefix any grandfathered tags with "und-x-" [4]
x-abc                 und-x-abc                          prefix with "und-", so that there is always a base language subtag [5]
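The language alias replacement, including the rule about additional replacement subtags, can be sketched in Python as below. The alias table holds only the three mappings that appear in the examples above; the function names are hypothetical, and the sketch assumes that any additional subtag in a replacement is a script.

```python
# Excerpt limited to aliases shown in the examples; the full mapping
# comes from the languageAlias elements in the supplemental data.
LANGUAGE_ALIASES = {"iw": "he", "cmn": "zh", "sh": "sr-Latn"}


def is_script(subtag):
    """A script subtag is exactly four letters."""
    return len(subtag) == 4 and subtag.isalpha()


def replace_language_alias(tag):
    """Replace an aliased primary language subtag in a hyphenated tag."""
    language, *rest = tag.split("-")
    replacement = LANGUAGE_ALIASES.get(language.lower())
    if replacement is None:
        return tag
    new_language, *extra = replacement.split("-")
    # An extra script from the replacement is added only when the tag
    # does not already carry a script of its own.
    has_script = bool(rest) and is_script(rest[0])
    kept = [s for s in extra if not (is_script(s) and has_script)]
    return "-".join([new_language, *kept, *rest])


def ensure_language_subtag(tag):
    """Prefix "und-" when the tag starts with the private use singleton."""
    return "und-" + tag if tag.lower().startswith("x-") else tag
```

This reproduces the "sh" and "sh-Cyrl" rows of the table: the first gains the script from the replacement, the second keeps its own. Region aliases and grandfathered tags are outside the scope of this sketch.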
Unicode Locale Identifier: CLDR to BCP 47
A Unicode CLDR locale identifier can be converted to a valid
[BCP47] language tag (which is also a Unicode BCP 47 locale
identifier) by performing the following transformation.
1. Replace the "_" separators with "-"
2. Replace the special language identifier "root" with the BCP 47 primary language tag "und"
3. Add an initial "und" primary language subtag if the first subtag is a script.
Examples:
Unicode CLDR locale identifier   BCP 47 language tag   Comments
en_US                            en-US                 change separator [1]
de_DE_u_co_phonebk               de-DE-u-co-phonebk    change separator [1]
root                             und                   change to "und" [2]
root_u_cu_usd                    und-u-cu-usd          change to "und" [1, 2]
Latn_DE                          und-Latn-DE           add "und" [1, 3]
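The three steps translate almost line for line into code. The following Python sketch uses a hypothetical function name, `cldr_to_bcp47`, and assumes the input is a well-formed Unicode CLDR locale identifier.

```python
def cldr_to_bcp47(locale_id):
    """Convert a Unicode CLDR locale identifier to a BCP 47 language tag."""
    subtags = locale_id.replace("_", "-").split("-")   # step 1: separators
    first = subtags[0]
    if first.lower() == "root":                        # step 2: root -> und
        subtags[0] = "und"
    elif len(first) == 4 and first.isalpha():          # step 3: leading script
        subtags.insert(0, "und")
    return "-".join(subtags)
```

The check for "root" has to come before the script check, because "root" is itself four letters long.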
Unicode Locale Identifier: BCP 47 to CLDR
A Unicode BCP 47 locale identifier can be transformed into a
Unicode CLDR locale identifier by performing the following
transformation.
1. the separator is changed to "_"
2. the primary language subtag "und" is replaced with "root" if no script, region, or variant subtags are present.
Examples:
BCP 47 language tag   Unicode CLDR locale identifier   Comments
en-US                 en_US                            changes separator [1]
und                   root                             changes to "root", because no script, region, or variant tag is present [2]
und-US                und_US                           no change to "und", because a region subtag is present [1]
und-u-cu-USD          root_u_cu_usd                    changes to "root", because no script, region, or variant tag is present [1, 2]
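The reverse direction can be sketched the same way. In this Python illustration the function name `bcp47_to_cldr` is hypothetical, the input is assumed to be a well-formed Unicode BCP 47 locale identifier, and casing is left untouched (the last example row above also lowercases the currency type, which this sketch does not attempt).

```python
def bcp47_to_cldr(tag):
    """Convert a Unicode BCP 47 locale identifier to CLDR syntax."""
    subtags = tag.split("-")
    if subtags[0].lower() == "und":
        # "und" has no script, region, or variant when it stands alone
        # or is followed directly by an extension singleton.
        if len(subtags) == 1 or len(subtags[1]) == 1:
            subtags[0] = "root"
    return "_".join(subtags)
```

The test on the second subtag relies on the syntax: script, region, and variant subtags are all longer than one character, so a one-character subtag right after "und" must start an extension.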
3.4 Language Identifier Field Definitions
Unicode language and locale identifier field values are
provided in the following table. Note that some private-use BCP
47 field values are given specific meanings in CLDR. While
field values are based on [BCP47] subtag values, their validity
status in CLDR is specified by means of machine-readable files
in the common/validity/ subdirectory, such as language.xml. For
the format of those files and more information, see
Section 3.11 Validity Data.
Language Identifier Field Definitions
Field
Valid values
unicode_language_subtag (also known as a Unicode base language code)
Subtags in the language.xml file (see Section 3.11 Validity Data). These are based on [BCP47] subtag values marked as Type: language
ISO 639-3 introduces the notion of "macrolanguages",
where certain ISO 639-1 or ISO 639-2 codes are given
broad semantics, and additional codes are given for the
narrower semantics. For backwards compatibility, Unicode
language identifiers retain use of the narrower semantics
for these codes. For example:
For                           Use   Not
Standard Chinese (Mandarin)   zh    cmn
Standard Arabic               ar    arb
Standard Malay                ms    zsm
Standard Swahili              sw    swh
Standard Uzbek                uz    uzn
Standard Konkani              kok   knn
Northern Kurdish              ku    kmr
If a language subtag matches the type attribute of a
languageAlias element, then the replacement value is used
instead. For example, because "swh" occurs as the type of a
languageAlias element, "sw" must be used instead of
"swh". Thus Unicode language identifiers use "ar-EG" for
Standard Arabic (Egypt), not "arb-EG"; they use "zh-TW"
for Mandarin Chinese (Taiwan), not "cmn-TW".
The private use codes listed as excluded in Section 3.5.3
Private Use Codes will never be given specific semantics in
Unicode identifiers, and are thus safe for use for other
purposes by other applications.
The CLDR provides data for normalizing language/locale
codes, including mapping overlong codes like "eng-840" or
"eng-USA" to the correct code "en-US"; see the Aliases Chart.
The following are special language subtags:
      Name                    Comment
mis   Uncoded languages       The content is in a language that doesn't yet have an ISO 639 code.
mul   Multiple languages      The content contains more than one language or text that is simultaneously in multiple languages (such as brand names).
zxx   No linguistic content   The content is not in any particular language (such as images, symbols, etc.)
unicode_script_subtag (also known as a Unicode script code)
Subtags in the script.xml file (see Section 3.11 Validity Data). These are based on [BCP47] subtag values marked as Type: script
In most cases the script is not necessary, since the
language is only customarily written in a single script.
Examples of cases where it is used are:
az_Arab   Azerbaijani in Arabic script
az_Cyrl   Azerbaijani in Cyrillic script
az_Latn   Azerbaijani in Latin script
zh_Hans   Chinese, in simplified script (=zh, zh-Hans, zh-CN, zh-Hans-CN)
zh_Hant   Chinese, in traditional script
Unicode identifiers give specific semantics to certain
Unicode Script values. For more information, see also [
UAX24
]:
Qaag
Zawgyi
Qaag is a special script code for
identifying the non-standard use of Myanmar
characters for display with the Zawgyi font. The
purpose of the code is to enable migration to
standard, interoperable use of Unicode by providing
an identifier for Zawgyi for tagging text,
applications, input methods, font tables,
transformations, and other mechanisms used for
migration.
Qaai
Inherited
deprecated
: the
canonicalized
form is Zinh
Zinh
Inherited
Zsye
Emoji Style
Prefer emoji style for characters
that have both text and emoji styles available.
Zsym
Text Style
Prefer text style for characters that
have both text and emoji styles available.
Zxxx
Unwritten
Indicates spoken or otherwise
unwritten content. For example:
Sample(s)
Description
uz
either written or spoken content
uz-Latn
or
uz-Arab
written-only content (particular script)
uz-Zyyy
written-only content (unspecified script)
uz-Zxxx
spoken-only content
uz-Latn, uz-Zxxx
both specific written and spoken content (using a
language list)
Zyyy
Common
Zzzz
Unknown
The private use subtags listed as
excluded
in
Section 3.5.3
Private Use Codes
will never be
given specific semantics in Unicode identifiers, and are
thus safe for use for other purposes by other
applications.
unicode_region_subtag
(also known as a
Unicode region code,
or
Unicode territory code)
Subtags in the region.xml file (see
Section 3.11
Validity Data
). These
are based on [
BCP47
] subtag values
marked as
Type: region
Unicode identifiers give specific semantics to the
following subtags:
Name
Comment
ISO 3166-1 status
QO
Outlying Oceania
countries in Oceania [009] that do not have a
subcontinent
private use
QU
European Union
deprecated
: the
canonicalized
form is EU
private use
UK
United Kingdom
deprecated
: the
canonicalized
form is GB
exceptionally reserved
XA
Pseudo-Accents
special code indicating derived testing locale
with English + added accents and lengthened
private use
XB
Pseudo-Bidi
special code indicating derived testing locale
with forced RTL English
private use
XK
Kosovo
industry practice
private use
ZZ
Unknown or Invalid Territory
used in APIs or as replacement for invalid
code
private use
The private use subtags listed as
excluded
in
Section 3.5.3
Private Use Codes
will normally
never be given specific semantics in Unicode identifiers,
and are thus safe for use for other purposes by other
applications. However, LDML may follow widespread
industry practice in the use of some of these codes, such
as for XK.
The CLDR provides data for normalizing
territory/region codes, including mapping overlong codes
like "eng-840" or "eng-USA" to the correct code
"en-US".
Special Codes:
The territory code 'UK' has a special status in
ISO, and is used for the domain name instead of GB. It
is thus recognized by CLDR as being an alternate
(unnormalized) form of 'GB'.
The territory code '001' (the World) is used to
indicate a standardized form, such as "ar-001" for
Modern Standard Arabic.
unicode_variant_subtag
(also known as a
Unicode language variant
code)
Subtags in the variant.xml file (see
Section 3.11
Validity Data
). These
are based on [
BCP47
] subtag values
marked as
Type: variant
CLDR provides data for normalizing variant codes.
For handling of the "POSIX" variant, see
Section
3.8.2,
Legacy
Variants
Examples:
en
fr_BE
zh-Hant-HK
Deprecated
codes—such as QU above—are valid, but
strongly discouraged.
A locale that only has a language subtag (and optionally a
script subtag) is called a
language locale
; one with
both language and territory subtag is called a
territory
locale
(or
country locale
).
3.5 Special Codes
3.5.1 Unknown or Invalid
Identifiers
The following identifiers are used to indicate an unknown or
invalid code in Unicode language and locale identifiers. For
Unicode identifiers, the region code uses a private use ISO
3166 code, and Time Zone code uses an additional code; the
others are defined by the relevant standards. When these codes
are used in APIs connected with Unicode identifiers, the
meaning is that either there was no identifier available, or
that at some point an input identifier value was determined to
be invalid or ill-formed.
Code Type    Value  Description in Referenced Standards
Language     und    Undetermined language, also used for “root”
Script       Zzzz   Code for uncoded script, Unknown [UAX24]
Region       ZZ     Unknown or Invalid Territory
Currency     XXX    The codes assigned for transactions where no currency is involved
Time Zone    unk    Unknown or Invalid Time Zone
Subdivision  zzzz   Unknown or Invalid Subdivision
When only the script or region are known, then a locale ID
will use "und" as the language subtag portion. Thus the locale
tag "und_Grek" represents the Greek script; "und_US" represents
the US territory.
3.5.2 Numeric Codes
For region codes, ISO and the UN establish a mapping to
three-letter codes and numeric codes. However, this does not
extend to the private use codes, which are the codes 900-999
(total: 100), and AAA, QMA-QZZ, XAA-XZZ, and ZZZ (total: 1092).
Unicode identifiers supply a standard mapping to these: for the
numeric codes, it uses the top of the numeric private use
range; for the 3-letter codes it doubles the final letter.
These are the resulting mappings for all of the private use
region codes:
Region  UN/ISO Numeric  ISO 3-Letter
AA      958             AAA
QM..QZ  959..972        QMM..QZZ
XA..XZ  973..998        XAA..XZZ
ZZ      999             ZZZ
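The mapping in this table can be computed rather than looked up. The Python sketch below is one way to do that, assuming the input is one of the two-letter private use region codes listed in the table; the function name is hypothetical.

```python
def private_use_region_mappings(region: str) -> tuple[int, str]:
    """Return (numeric code, 3-letter code) for a private use region code."""
    code = region.upper()
    if len(code) != 2 or not (code.isascii() and code.isalpha()):
        raise ValueError(f"not a two-letter region code: {region!r}")
    first, last = code
    alpha3 = code + last  # the 3-letter form doubles the final letter
    if code == "AA":
        return 958, alpha3
    if first == "Q" and "M" <= last <= "Z":
        return 959 + ord(last) - ord("M"), alpha3  # QM..QZ -> 959..972
    if first == "X":
        return 973 + ord(last) - ord("A"), alpha3  # XA..XZ -> 973..998
    if code == "ZZ":
        return 999, alpha3
    raise ValueError(f"not a private use region code: {region!r}")
```

For example, XK (Kosovo) maps to the numeric code 983 and the 3-letter code XKK under this scheme.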
For script codes, ISO 15924 supplies a mapping (however, the
numeric codes are not in common use):
Script      Numeric
Qaaa..Qabx  900..949
3.5.3
Private Use Codes
Private use codes fall into three groups.
defined:
those that are given particular
semantics currently in CLDR
reserved:
those that may be given
particular semantics in future versions of CLDR
excluded:
those that will never be given
particular CLDR semantics in the future, and thus can
normally be used by applications without worrying about
collisions. However, CLDR may follow widespread industry
practice in the use of some of these codes, such as for XA,
XB, and XK.
Private Use Codes in CLDR
category       status    codes
base language  defined   none
               reserved  qaa..qfy
               excluded  qfz..qtz
script         defined   Qaai (obsolete), Qaag
               reserved  Qaaa..Qaaf Qaah Qaaj..Qaap
               excluded  Qaaq..Qabx
region         defined   QO, QU, UK, XA, XB, XK, ZZ
               reserved  AA QM..QN QP..QT QV..QZ
               excluded  XC..XJ, XL..XZ
timezone       defined   IANA: Etc/Unknown; bcp47: as listed in bcp47/timezone.xml
               reserved  bcp47: all non-5 letter codes not starting with x
               excluded  bcp47: all non-5 letter codes starting with x
See also
Section 3.5.1
Unknown or Invalid
Identifiers
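To show how an application might consult the region rows of the table above before picking a private use code for its own purposes, here is a small Python sketch. It covers only the region category, and the function name and return values are illustrative choices rather than anything defined by CLDR.

```python
from typing import Optional

# Region codes that CLDR currently gives particular semantics.
DEFINED_REGIONS = {"QO", "QU", "UK", "XA", "XB", "XK", "ZZ"}


def region_private_use_status(region: str) -> Optional[str]:
    """Classify a region code as 'defined', 'reserved', or 'excluded'.

    Returns None for codes that are not in the table (for example "US").
    """
    code = region.upper()
    if code in DEFINED_REGIONS:
        return "defined"
    if len(code) != 2 or not (code.isascii() and code.isalpha()):
        return None
    first, last = code
    # Reserved: AA plus QM..QZ, minus the defined codes QO and QU.
    if code == "AA" or (first == "Q" and "M" <= last <= "Z"):
        return "reserved"
    # Excluded: XC..XJ and XL..XZ, that is X? minus the defined XA, XB, XK.
    if first == "X":
        return "excluded"
    return None
```

An application that needs a code of its own would choose one for which this returns "excluded", such as XC.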
3.6 Unicode BCP 47 U
Extension
[BCP47] Language Tags provides a
mechanism for extending language tags for use in various
applications by extension subtags. Each extension subtag is
identified by a single alphanumeric character subtag assigned
by IANA.
The Unicode Consortium has registered and is the maintaining
authority for two BCP 47 language tag extensions: the extension
'u' for Unicode locale extension [
RFC6067
] and extension 't' for transformed
content [
RFC6497
]. The Unicode BCP 47
extension data defines the complete list of valid subtags.
These subtags are all in lowercase (the canonical
casing for these subtags); however, subtags are
case-insensitive and casing does not carry any specific
meaning. All subtags within the Unicode extensions are
alphanumeric strings of two to eight characters that meet the
rule
extension
in [
BCP47
].
The -u- Extension.
The syntax of 'u'
extension subtags is defined by the rule
unicode_locale_extensions
in
Section 3.2 Unicode locale
identifier
, except that the subtag separator
sep
must always be a hyphen '-' when the extension
is used as part of a BCP 47 language tag.
A 'u' extension may contain multiple
attributes or keywords as defined in
Section 3.2 Unicode locale
identifier
. The canonical syntax is defined in
3.2.1 Canonical Unicode Locale Identifiers
.
See also
Unicode
Extensions for BCP 47
on the CLDR site.
3.6.1 Key And Type
Definitions
The following chart contains a set of U extension key values
that are currently available, with a description or sampling of
the U extension type values. Each category is associated with
an XML file in the bcp47 directory.
For the complete list of valid keys and types defined for
Unicode locale extensions, see
Section 3.6.4 U
Extension Data Files
. For information on the process for
adding a new key or type, see [
LocaleProject
].
Most type values are represented by a single subtag in the
current version of CLDR. There are exceptions, such as types
used for key "ca" (calendar) and "kr" (collation reordering).
If the type is not included, then the type value "true" is
assumed. Note that the default for a key with a possible "true"
value is often "false", but not always. Note also that
"true"/"True" is not a valid script code, since
the ISO 15924
Registration Authority has exceptionally reserved it,
which means that it will not be assigned for any purpose.
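The following Python sketch shows one way to pull the attributes and keywords out of the 'u' extension of a language tag, including multi-subtag types and the implied "true" for a key with no type. It assumes the grammar of Section 3.2 (two-character keys, attribute and type subtags of three to eight characters) without validating subtag lengths, and it does not guard against a "u" that appears inside a private use ("-x-") sequence; the function name is hypothetical.

```python
def parse_u_extension(tag: str) -> tuple[list[str], dict[str, str]]:
    """Return (attributes, keywords) found in the -u- extension of a tag."""
    subtags = tag.lower().split("-")
    try:
        start = subtags.index("u", 1) + 1  # skip the language subtag itself
    except ValueError:
        return [], {}  # no 'u' extension present

    attributes: list[str] = []
    keywords: dict[str, list[str]] = {}
    key = None
    for subtag in subtags[start:]:
        if len(subtag) == 1:      # another singleton starts the next extension
            break
        if len(subtag) == 2:      # a two-character subtag is a key
            key = subtag
            keywords.setdefault(key, [])
        elif key is None:         # longer subtags before any key are attributes
            attributes.append(subtag)
        else:                     # longer subtags after a key form its type
            keywords[key].append(subtag)

    # A key without a type is treated as having the type "true".
    return attributes, {k: "-".join(v) or "true" for k, v in keywords.items()}
```

For instance, "ar-u-ca-islamic-civil-nu-arab" yields the keywords ca=islamic-civil and nu=arab, and "en-u-kn" yields kn=true.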
The BCP 47 form for keys and types is the canonical form,
and recommended. Other aliases are included for backwards
compatibility.
Key/Type Definitions
key
(old key name)
key description
example type
(old type name)
type description
Unicode Calendar Identifier
defines a type of calendar. The valid values are those
name
attribute values in the
type
elements of key name="ca" in bcp47/
calendar.xml
"ca"
(calendar)
Calendar algorithm
(For information on the calendar algorithms associated
with the data used with these, see [
Calendars
].)
"buddhist"
Thai Buddhist calendar (same as Gregorian except for
the year)
"chinese"
Traditional Chinese calendar
"gregory"
(gregorian)
Gregorian calendar
"islamic"
Islamic calendar
"islamic-civil"
Islamic calendar, tabular (intercalary years
[2,5,7,10,13,16,18,21,24,26,29] - civil epoch)
"islamic-umalqura"
Islamic calendar, Umm al-Qura
Note:
Some calendar types are
represented by two subtags. In such cases, the first subtag
specifies a generic calendar type and the second subtag
specifies a calendar algorithm variant. The CLDR uses
generic calendar types (single subtag types) for tagging
data when calendar algorithm variations within a generic
calendar type are irrelevant. For example, type "islamic"
is used for specifying Islamic calendar formatting data for
all Islamic calendar types, including "islamic-civil" and
"islamic-umalqura".
Unicode Currency Format
Identifier
defines a style for currency formatting. The
valid values are those
name
attribute values in
the
type
elements of key name="cf" in
bcp47/
currency.xml
"cf"
Currency Format style
"standard"
Negative numbers use the minusSign symbol (the
default).
"account"
Negative numbers use parentheses or equivalent.
Unicode Collation
Identifier
defines a type of collation (sort order).
The valid values are those
name
attribute values
in the
type
elements of bcp47/
collation.xml
For information on each collation
setting parameter, from
ka
to
vt
, see
Setting
Options
"co"
(collation)
Collation type
"standard"
The default ordering for each language. For root it is
based on the [
DUCET
] (Default Unicode
Collation Element Table): see
Root
Collation
. Each other locale is based on that,
except for appropriate modifications to certain characters
for that language.
"search"
A special collation type dedicated for string search—it
is not used to determine the relative order of two strings,
but only to determine whether they should be considered
equivalent for the specified strength, using the string
search matching rules appropriate for the language.
Compared to the normal collator for the language, this may
add or remove primary equivalences, may make additional
characters ignorable or change secondary equivalences, and
may modify contractions to allow matching within them,
depending on the desired behavior. For example, in Czech,
the distinction between ‘a’ and ‘á’ is secondary for normal
collation, but primary for search; a search for ‘a’ should
never match ‘á’ and vice versa. A search collator is
normally used with strength set to PRIMARY or SECONDARY
(should be SECONDARY if using “asymmetric” search as
described in the [
UCA
section Asymmetric Search). The search collator in root
supplies matching rules that are appropriate for most
languages (and which are different than the root collation
behavior); language-specific search collators may be
provided to override the matching rules for a given
language as necessary.
Other keywords provide additional choices for certain
locales;
they only have effect in certain
locales.
"phonetic"
Requests a phonetic variant if available, where text is
sorted based on pronunciation. It may interleave different
scripts, if multiple scripts are in common use.
"pinyin"
Pinyin ordering for Latin and for CJK characters; that
is, an ordering for CJK characters based on a
character-by-character transliteration into pinyin (used
in Chinese).
"reformed"
Reformed collation (such as in Swedish)
"searchjl"
Special collation type for a modified string search in
which a pattern consisting of a sequence of Hangul initial
consonants (jamo lead consonants) will match a sequence of
Hangul syllable characters whose initial consonants match
the pattern. The jamo lead consonants can be represented
using conjoining or compatibility jamo. This search
collator is best used at SECONDARY strength with an
"asymmetric" search as described in the [
UCA
section Asymmetric Search and obtained, for example, using
ICU4C's usearch facility with attribute
USEARCH_ELEMENT_COMPARISON set to value
USEARCH_PATTERN_BASE_WEIGHT_IS_WILDCARD; this ensures that
a full Hangul syllable in the search pattern will only
match the same syllable in the searched text (instead of
matching any syllable with the same initial consonant),
while a Hangul initial consonant in the search pattern will
match any Hangul syllable in the searched text with the
same initial consonant.
Unicode Currency Identifier
defines a type of currency. The valid values are those
name
attribute values in the
type
elements of key name="cu" in bcp47/
currency.xml
"cu"
(currency)
Currency type
ISO 4217 code,
plus others in common use
Codes consisting of 3 ASCII letters that are or have
been valid in ISO 4217, plus certain additional codes
that are or have been in common use. The list of
countries and time periods associated with each currency
value is available in
Supplemental
Currency Data
, plus the default number of
decimals.
The XXX code is given a broader interpretation as
Unknown or Invalid Currency
Unicode Emoji
Presentation Style Identifier
specifies a request for
the preferred emoji presentation style. This can be used as
part of the value for an HTML lang attribute, for example
. The
valid values are those
name
attribute values in
the
type
elements of key name="em" in
bcp47/
variant.xml
"em"
Emoji presentation style
"emoji"
Use an emoji presentation for emoji characters if
possible.
"text"
Use a text presentation for emoji characters if
possible.
"default"
Use the default presentation for emoji characters as
specified in UTR #51 Section 4,
Presentation
Style
Unicode First Day
Identifier
defines the preferred first day of the week
for calendar display. Specifying "fw" in a locale
identifier overrides the default value specified by
supplemental week data (see Part 4 Dates, section 4.3
Week Data
). The
valid values are those
name
attribute values in
the
type
elements of key name="fw" in
bcp47/
calendar.xml
"fw"
First day of week
"sun"
Sunday
"mon"
Monday
"sat"
Saturday
Unicode Hour Cycle
Identifier
defines the preferred time cycle. Specifying
"hc" in a locale identifier overrides the default value
specified by supplemental time data (see Part 4 Dates,
section 4.4
Time
Data
). The valid values are those
name
attribute values in the
type
elements of key
name="hc" in bcp47/
calendar.xml
"hc"
Hour cycle
"h12"
Hour system using 1–12; corresponds to 'h' in
patterns
"h23"
Hour system using 0–23; corresponds to 'H' in
patterns
"h11"
Hour system using 0–11; corresponds to 'K' in
patterns
"h24"
Hour system using 1–24; corresponds to 'k' in
patterns
Unicode Line Break Style
Identifier
defines a preferred line break style
corresponding to the CSS level 3
line-break
option
. Specifying "lb" in a locale identifier
overrides the locale's default style (which may correspond
to "normal" or "strict"). The valid values are those
name
attribute values in the
type
elements of key name="lb" in bcp47/
segmentation.xml
"lb"
Line break style
"strict"
CSS level 3 line-break=strict, e.g. treat CJ as NS
"normal"
CSS level 3 line-break=normal, e.g. treat CJ as ID,
break before hyphens for ja,zh
"loose"
CSS level 3 line-break=loose
Unicode Line Break Word
Identifier
defines preferred line break word handling
behavior corresponding to the CSS level 3
word-break
option
. The valid values are those
name
attribute values in the
type
elements of key
name="lw" in bcp47/
segmentation.xml
"lw"
Line break word handling
"normal"
CSS level 3 word-break=normal, normal script/language
behavior for midword breaks
"breakall"
CSS level 3 word-break=break-all, allow midword breaks
unless forbidden by lb setting
"keepall"
CSS level 3 word-break=keep-all, prohibit midword
breaks except for dictionary breaks
Unicode Measurement
System Identifier
defines a preferred measurement
system. Specifying "ms" in a locale identifier overrides
the default value specified by supplemental measurement
system data (see Part 2 General, section 5
Measurement
System Data
). The valid values are those
name
attribute values in the
type
elements of key
name="ms" in bcp47/
measure.xml
"ms"
Measurement system
"metric"
Metric System
"ussystem"
US System of measurement: feet, pints, etc.; pints are
16oz
"uksystem"
UK System of measurement: feet, pints, etc.; pints are
20oz
Unicode Number System
Identifier
defines a type of number system. The valid
values are those
name
attribute values in the
type
elements of bcp47/
number.xml
"nu"
(numbers)
Numbering system
Unicode script subtag
Four-letter types indicating the primary numbering
system for the corresponding script represented in
Unicode. Unless otherwise specified, it is a decimal
numbering system using digits [:GeneralCategory=Nd:]. For
example, "latn" refers to the ASCII / Western digits 0-9,
while "taml" is an algorithmic (non-decimal) numbering
system. (The code "tamldec" indicates the "modern
Tamil decimal digits".)
For more information, see
Numbering
Systems
"arabext"
Extended Arabic-Indic digits ("arab" means the base
Arabic-Indic digits)
"armnlow"
Armenian lowercase numerals
"roman"
Roman numerals
"romanlow"
Roman lowercase numerals
"tamldec"
Modern Tamil decimal digits
Region Override
specifies an alternate region to use for obtaining certain
region-specific default values (those specified by the
element), instead of using the region specified by the
unicode_region_subtag
in the Unicode Language Identifier (or inferred from the
unicode_language_subtag
).
"rg"
Region Override
"uszzzz"
The value is a
unicode_subdivision_id
of type “unknown” or “regular”; this consists of a
unicode_region_subtag
for a
regular region (not a macroregion), suffixed
either by “zzzz” (case is not
significant) to designate the region
as a whole, or by a unicode_subdivision_suffix to provide
more specificity. For example, “en-GB-u-rg-uszzzz”
represents a locale for British English but with
region-specific defaults set to US for items such as
default currency, default calendar and week data, default
time cycle, and default measurement system and unit
preferences.
Unicode Subdivision
Identifier
defines a regional subdivision used for
locales. The valid values are based on the
subdivisionContainment
element as described in
Section
3.6.5
Subdivision Codes
"sd"
Regional Subdivision
"gbsct"
unicode_subdivision_id
, which
is a
unicode_region_subtag
concatenated with a unicode_subdivision_suffix.
For example,
gbsct
is “gb”+“sct” (where sct
represents the subdivision code for Scotland). Thus
“en-GB-u-sd-gbsct” represents the language variant “English
as used in Scotland”. And both “en-u-sd-usca” and
“en-US-u-sd-usca” represent “English as used in
California”. See
3.6.5 Subdivision
Codes
Unicode
Sentence Break Suppressions Identifier
defines a set of
data to be used for suppressing certain sentence breaks
that would otherwise be found by UAX #29 rules. The valid
values are those
name
attribute values in the
type
elements of key name="ss" in bcp47/
segmentation.xml
"ss"
Sentence break suppressions
"none"
Don’t use sentence break suppressions data (the
default).
"standard"
Use sentence break suppressions data of type
"standard"
Unicode Timezone Identifier
defines a timezone. The valid values are those name
attribute values in the
type
elements of
bcp47/
timezone.xml
"tz"
(timezone)
Time zone
Unicode short time zone IDs
Short identifiers defined in terms of a TZ time zone
database [
Olson
] identifier in the
file common/bcp47/timezone.xml, plus a few extra
values.
For more information, see
Section 3.7.1.2 Time Zone
Identifiers
CLDR provides data for normalizing timezone codes.
Unicode Variant
Identifier
defines a special variant used for locales.
The valid values are those name attribute values in the
type
elements of bcp47/
variant.xml
"va"
Common variant type
"posix"
POSIX style locale variant. For handling of the
"POSIX" variant, see
Section 3.8.2,
Legacy Variants
For more information on the allowed keys and types, see the
specific elements below, and
Section 3.6.4 U
Extension Data Files
Additional keys or types might be added in future versions.
Implementations of LDML should be robust enough to handle any
syntactically valid key or type values.
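As one concrete illustration of how a type value changes behavior, the four "hc" types in the chart above differ only in how they number the hours of the day. The Python sketch below converts an hour of the day (0 to 23) into the number displayed under each cycle; the function name is hypothetical, and the pattern characters are those given in the chart.

```python
# Pattern character associated with each "hc" type in the chart.
HOUR_CYCLE_PATTERN_CHAR = {"h12": "h", "h23": "H", "h11": "K", "h24": "k"}


def hour_in_cycle(hour_of_day: int, hour_cycle: str) -> int:
    """Map an hour of the day (0..23) to the value shown in the given cycle."""
    if not 0 <= hour_of_day <= 23:
        raise ValueError("hour_of_day must be in 0..23")
    if hour_cycle == "h23":               # 0..23
        return hour_of_day
    if hour_cycle == "h24":               # 1..24, midnight shown as 24
        return hour_of_day or 24
    if hour_cycle == "h11":               # 0..11
        return hour_of_day % 12
    if hour_cycle == "h12":               # 1..12, midnight and noon shown as 12
        return hour_of_day % 12 or 12
    raise ValueError(f"unknown hour cycle: {hour_cycle!r}")
```

Midnight is therefore 12 under "h12", 0 under "h11" and "h23", and 24 under "h24".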
3.6.2 Numbering System Data
LDML supports multiple numbering systems. The identifiers
for those numbering systems are defined in the file
bcp47/number.xml
. For example, for the 'trunk'
version of the data see
bcp47/number.xml
Details about those numbering systems are defined in
supplemental/numberingSystems.xml
. For
example, for the 'trunk' version of the data see
supplemental/numberingSystems.xml
LDML makes certain stability guarantees on this
data:
Like other BCP 47 identifiers, once a numeric identifier
is added to
bcp47/number.xml
or
numberingSystems.xml
, it will never be
removed from either of those files.
If an identifier has type="numeric" in
numberingSystems.xml, then
It is a decimal, positional numbering system with an
attribute digits=X, where X is a string with the 10
digits in order used by the numbering system.
The values of the type and digits will never
change.
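Because a numeric numbering system is decimal and positional, formatting an integer in it amounts to substituting digits one for one. The Python sketch below assumes the caller supplies the ten-digit string from the digits attribute; the sample value for "arab" (U+0660 through U+0669) is given for illustration, and the function name is hypothetical.

```python
# Illustrative digits value for the "arab" system: U+0660..U+0669.
ARAB_DIGITS = "".join(chr(0x0660 + i) for i in range(10))


def format_with_digits(value: int, digits: str) -> str:
    """Render a non-negative integer using a 10-digit numbering system."""
    if len(digits) != 10:
        raise ValueError("digits must contain exactly 10 characters")
    if value < 0:
        raise ValueError("this sketch handles non-negative integers only")
    # Each ASCII digit 0..9 indexes directly into the digits string.
    return "".join(digits[int(ch)] for ch in str(value))
```

Passing "0123456789" as the digits string reproduces the ordinary ASCII rendering, which is what the "latn" system uses.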
3.6.3 Time
Zone Identifiers
LDML inherits time zone IDs from the tz database [
Olson
]. Because these IDs from the tz database do
not satisfy the BCP 47 language subtag syntax requirements,
CLDR defines short identifiers for the use in the Unicode
locale extension. The short identifiers are defined in the file
common/bcp47/timezone.xml
The short identifiers use UN/LOCODE [
LOCODE
] (excluding a space character) codes where
possible. For example, the short identifier for
"America/Los_Angeles" is "uslax" (the LOCODE for Los Angeles,
US is "US LAX"). Identifiers of length not equal to 5 are used
where there is no corresponding UN/LOCODE, such as "usnavajo"
for "America/Shiprock", or "utcw01" for "Etc/GMT+1", so that
they do not overlap with future UN/LOCODE.
Although the first two letters of a short identifier may
match an ISO 3166 two-letter country code, a user should not
assume that the time zone belongs to the country. The first two
letters in an identifier of length not equal to 5 have no
meaning. Also, the identifiers are stabilized, meaning that
they will not change no matter what changes happen in the base
standard. So if Hawaii leaves the US and joins Canada as a new
province, the short time zone identifier "ushnl" would not
change in CLDR even if the UN/LOCODE changes to "cahnl" or
something else.
There is a special code "unk" for an Unknown or Invalid time
zone. This can be expressed in the tz database style ID
"Etc/Unknown", although it is not defined in the tz
database.
Stability of Time Zone Identifiers
Although the short time zone identifiers are guaranteed to
be stable, the preferred IDs in the tz database (such as those found
in the
zone.tab
file) might change from time to
time. For example, "Asia/Calcutta" was replaced with
"Asia/Kolkata" and moved to the
backward
file in
the tz database. Because CLDR contains locale data that uses a time zone ID
from the tz database as the key, stability of the IDs is
critical.
To maintain the stability of "long" IDs (for those inherited
from the tz database), a special rule applies to the
alias
attribute in the type element for time zones:
the first "long" ID is the CLDR canonical "long" time zone
ID.
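The rule above can be sketched in Python as follows. The sample element is hypothetical, assembled only from the IDs this section names for Kolkata; the actual entry in timezone.xml may carry additional attributes, and the function name is an illustrative choice.

```python
import xml.etree.ElementTree as ET

# Hypothetical element, reconstructed only from the IDs named in the text.
SAMPLE_TYPE = '<type name="inccu" alias="Asia/Calcutta Asia/Kolkata"/>'


def long_time_zone_ids(type_xml: str) -> tuple[str, str, list[str]]:
    """Return (short ID, canonical long ID, other long IDs) for a type element."""
    element = ET.fromstring(type_xml)
    long_ids = element.get("alias", "").split()  # names are space-delimited
    if not long_ids:
        raise ValueError("type element has no alias attribute")
    # The first long ID is the CLDR canonical one; the rest are aliases.
    return element.get("name"), long_ids[0], long_ids[1:]
```

Applied to the sample element, this yields the short ID "inccu", the canonical long ID "Asia/Calcutta", and the single alias "Asia/Kolkata".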
For example, the type element for Kolkata defines the short time zone ID
"inccu" (for use in the Unicode locale extension), the
corresponding
CLDR canonical "long" ID
"Asia/Calcutta", and an alias "Asia/Kolkata".
3.6.4 U Extension Data
Files
The 'u' extension data is stored in multiple XML files
located under common/bcp47 directory in CLDR. Each file
contains the locale extension key/type values and their
backward compatibility mappings appropriate for a particular
domain.
common/bcp47/collation.xml
contains key/type values for
collation, including optional collation parameters and valid
type values for each key.
The 't' extension data is stored in
common/bcp47/transform.xml
The extension attribute in the key element specifies the
BCP 47 language tag extension type. The default value of the
extension attribute is "u" (Unicode locale extension).
In the Unicode locale extension 'u' and 't' data files, the
common attributes for the key, type and attribute elements are as follows:
name
The key or type name used by Unicode locale extension
with
'u' extension
syntax
or the 't' extension syntax. When
alias
below is absent, this name can also be used with the old
style
"@key=type"
syntax.
Most type names are
literal type names
which match exactly the same value. All of these have at
least one lowercase letter, such as "buddhist". There are a
small number of
indirect type names
, such
as "RG_KEY_VALUE". These have no lowercase letters. The
interpretation of each one is listed below.
CODEPOINTS
The type name
"CODEPOINTS"
is reserved
for a variable representing Unicode code point(s). The
syntax is:
EBNF
codepoints
= codepoint (sep codepoint)?
codepoint
= [0-9 A-F a-f]{4,6}
In addition, no codepoint may exceed 10FFFF. For
example, "00A0", "300b", "10D40C" and "00C1-00E1" are
valid, but "A0", "U060C" and "110000" are not.
In the current version of CLDR, the type "CODEPOINTS" is
only used for the deprecated locale extension key "vt"
(variableTop). The subtags forming the type for "vt"
represent an arbitrary string of characters. There is no
formal limit in the number of characters, although
practically anything above 1 will be rare, and anything
longer than 4 might be useless. Repetition is allowed, for
example, 0061-0061 ("aa") is a valid type value for "vt",
since the sequence may be a collating element. Order is
vital: 0061-0062 ("ab") is different than 0062-0061 ("ba").
Note that for variableTop any character sequence must be a
contraction which yields exactly one primary weight.
For example,
en-u-vt-00A4
: this indicates
English, with any characters sorting at or below "¤" (at
a primary level) considered Variable.
By default in UCA, variable characters are ignored in
sorting at a primary, secondary, and tertiary level. But in
CLDR, they are not ignorable by default. For more
information, see
Collation: Section
3.3
Setting Options
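A CODEPOINTS value can be checked and decoded with a few lines of Python. The sketch below accepts any number of hyphen-separated code points, following the prose above rather than the limit of two that the EBNF as printed would imply; that choice is an assumption, as is the function name. It checks syntax and range only, not whether the sequence is a collating element.

```python
import re

# One code point: 4 to 6 hexadecimal digits, no "U" or "U+" prefix.
_CODEPOINT = re.compile(r"[0-9A-Fa-f]{4,6}")


def parse_codepoints(value: str) -> str:
    """Decode a CODEPOINTS type value such as "0061-0062" into a string."""
    chars = []
    for part in value.split("-"):
        if not _CODEPOINT.fullmatch(part) or int(part, 16) > 0x10FFFF:
            raise ValueError(f"invalid code point subtag: {part!r}")
        chars.append(chr(int(part, 16)))
    # Order and repetition are preserved, since both are significant.
    return "".join(chars)
```

Under this sketch "0061-0062" decodes to "ab" and "0062-0061" to "ba", while "A0", "U060C" and "110000" are rejected.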
REORDER_CODE
The type name
"REORDER_CODE"
is
reserved for reordering block names (e.g. "latn", "digit"
and "others") defined in the
Root
Collation
. The type "REORDER_CODE" is used for
locale extension key "kr" (colReorder). The value of type
for "kr" is represented by one or more reordering block
names such as "latn-digit". For more information, see
Collation:
Section 3.12
Collation Reordering
RG_KEY_VALUE
The type name
"RG_KEY_VALUE"
is
reserved for region codes in the format required by the
"rg" key; this is a subdivision
code with idStatus='unknown' or 'regular' from the
idValidity data in common/validity/subdivision.xml.
SUBDIVISION_CODE
The type name
"SUBDIVISION_CODE"
is
reserved for subdivision codes in the format required by
the "sd" key; this is a subdivision code from the
idValidity data in common/validity/subdivision.xml,
excluding those with idStatus='unknown'. Codes with
idStatus='deprecated' should not be generated, and those
with idStatus='private_use' are only to be used with prior
agreement.
PRIVATE_USE
The type name
"PRIVATE_USE"
is reserved
for private use types. A valid type value is composed of
one or more subtags separated by hyphens and each subtag
consists of three to eight ASCII alphanumeric characters.
In the current version of CLDR,
"PRIVATE_USE"
is only used for transform
extension "x0".
valueType
The valueType attribute indicates how many subtags are
valid for a given key:
single
Either exactly one type value, or no type value
(but only if the value of "true" would be valid).
This is the default if no valueType attribute is
present.
incremental
Multiple type values are allowed, but only if a
prefix is also present, and the sequence is
explicitly listed. Each successive type value
indicates a refinement of its prefix. For
example:
Thus
ca-islamic-umalqura
is valid. However,
ca-gregory-japanese
is not valid, because
"gregory-japanese" is not listed as a type.
multiple
Multiple type values are allowed, but each may
only occur once. For example:
any
Any number of type values are allowed, with none
of the above restrictions. For example:
description
The description of the key, type or attribute element.
There is also some informative text about certain keys and
types in Section 3.6.1
Key And Type
Definitions.
deprecated
The deprecation status of the key, type or attribute
element. The value "true" indicates the element is
deprecated and no longer used in the version of CLDR. The
default value is "false".
preferred
The preferred value of the deprecated key, type or
attribute element. When a key, type or attribute element is
deprecated, this attribute is used for specifying a new
canonical form if available.
alias
(Not applicable to
The BCP 47 form is the canonical form, and recommended.
Other aliases are included only for backwards
compatibility.
Example:
description="Phonebook
style ordering (such as in German)"/>
The
preferred term, and the only one to be used in BCP 47, is
the name: in this example, "phonebk".
The alias is a key or type name used by Unicode locale
extensions with the old
"@key=type" syntax
. The
attribute value for type may contain multiple names
delimited by ASCII space characters. Of those aliases, the
first name is the preferred value.
since
The version of CLDR in which this key or type was
introduced. Absence of this attribute value implies the key
or type was available in CLDR 1.7.2.
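The valueType rules described above can be approximated with a short Python check. In this sketch, listed stands for the set of name attribute values of the type elements under one key; the function name is hypothetical, and requiring every value to be individually listed for "multiple" and "any" is an assumption that would not hold for keys whose types are indirect names such as REORDER_CODE.

```python
def is_valid_type(subtags: list[str], value_type: str, listed: set[str]) -> bool:
    """Check a sequence of type subtags against a key's valueType."""
    if value_type == "single":
        if not subtags:                   # no type value means "true"
            return "true" in listed
        return len(subtags) == 1 and subtags[0] in listed
    if value_type == "incremental":
        # Every prefix, and the full sequence, must be explicitly listed.
        prefixes = ["-".join(subtags[:n]) for n in range(1, len(subtags) + 1)]
        return bool(subtags) and all(p in listed for p in prefixes)
    if value_type == "multiple":
        # Several values are allowed, but each may occur only once.
        return (bool(subtags) and len(set(subtags)) == len(subtags)
                and all(s in listed for s in subtags))
    if value_type == "any":
        return bool(subtags) and all(s in listed for s in subtags)
    raise ValueError(f"unknown valueType: {value_type!r}")
```

With a listed set containing "gregory", "islamic" and "islamic-umalqura", the sequence islamic, umalqura passes the incremental check, while gregory, japanese fails because "gregory-japanese" is not listed.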
Note: There are no values defined for the locale
extension attribute in the current CLDR release.
For example,
...
...
The data above indicates:
type "pinyin" is valid for key "co", thus "u-co-pinyin"
is a valid Unicode locale extension.
type "pinyin" is not valid for key "ka", thus
"u-ka-pinyin" is not a valid Unicode locale extension.
type "pinyin" has no
alias
, so
"zh@collation=pinyin" is a valid Unicode locale identifier
according to the old syntax.
type "noignore" has an alias attribute, so
"en@colAlternate=noignore" is not a valid Unicode locale
identifier according to the old syntax.
type "aumel" is valid for key "tz", supported by CLDR
1.7.2 (default value) or later versions.
type "aumqi" is valid for key "tz", supported by CLDR
1.8.1 or later versions.
It is strongly recommended that all API methods accept all
possible aliases for keywords and types, but generate the
canonical form. For example, "ar-u-ca-islamicc" would be
equivalent to "ar-u-ca-islamic-civil" on input, but the latter
should be output. The one exception is where an alias would
only be well-formed with the old syntax, such as "gregorian"
(for "gregory").
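The recommendation above (accept aliases on input, emit the canonical form on output) can be sketched as follows. This is only an illustrative sketch: the table `TYPE_ALIASES` and the function name are hypothetical, and the table holds just the one alias mentioned in the text; a real implementation would load the alias data from the extension data files.

```python
# Hypothetical excerpt of alias data: (key, alias type) -> canonical type.
# Only the pair named in the text is listed here.
TYPE_ALIASES = {
    ("ca", "islamicc"): "islamic-civil",
}


def canonicalize_keyword(key: str, type_value: str) -> tuple[str, str]:
    """Return the canonical (key, type) pair for a 'u' extension keyword.

    Input is treated case-insensitively; unknown types pass through unchanged.
    """
    key = key.lower()
    type_value = type_value.lower()
    return key, TYPE_ALIASES.get((key, type_value), type_value)


# Both spellings are accepted on input; only the canonical one is produced.
print(canonicalize_keyword("ca", "islamicc"))
print(canonicalize_keyword("ca", "islamic-civil"))
```

An alias such as "gregorian" would not be added to a table used for BCP 47 input, since it is only well-formed in the old syntax.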
3.6.5 Subdivision Codes
The subdivision codes designate a subdivision of a country
or region. They are called various names, such as a
state
in the United States, or a
province
in
Canada. The codes in CLDR are based on ISO 3166-2 subdivision
codes. The ISO codes have a region code followed by a hyphen,
then a suffix consisting of 1..3 ASCII letters or digits.
The CLDR codes are designed to work in a
unicode_locale_id
(BCP47), and are
thus all lowercase, with no hyphen. For example, the following
are valid, and mean “English as used in California, USA”.
en-u-sd-
usca
en-US-u-sd-
usca
CLDR has additional subdivision codes. These may start with
a 3-digit region code or use a suffix of 4 ASCII letters or
digits, so they will not collide with the ISO codes.
Subdivision codes for unknown values are the region code plus
"zzzz", such as "uszzzz" for an unknown subdivision of the US.
Other codes may be added for stability.
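The relationship between the ISO 3166-2 form and the CLDR form described above can be sketched in a few lines. The function names are hypothetical, and the sketch only reshapes the string; it does not check that the result is a valid subdivision code.

```python
def iso_to_cldr_subdivision(iso_code: str) -> str:
    """Convert an ISO 3166-2 style code such as 'US-CA' to the CLDR form 'usca'."""
    region, hyphen, suffix = iso_code.partition("-")
    if not (region and hyphen and suffix):
        raise ValueError(f"not in region-suffix form: {iso_code!r}")
    # CLDR form: all lowercase, with no hyphen.
    return (region + suffix).lower()


def unknown_subdivision(region: str) -> str:
    """Return the code for an unknown subdivision of the given region."""
    return region.lower() + "zzzz"


print(iso_to_cldr_subdivision("US-CA"))
print(unknown_subdivision("US"))
```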
Like BCP 47, CLDR requires stable codes, which are not
guaranteed for ISO 3166-2 (nor have the ISO 3166-2 codes been
stable in the past). If an ISO 3166-2 code is removed, it
remains valid (though marked as deprecated) in CLDR. If an ISO
3166-2 code is reused (for the same region), then CLDR will
define a new equivalent code using a 4-character
suffix.
3.6.5.1
Validity
A
unicode_subdivision_id
is only
valid when it is present in the subdivision.xml file as
described in
Section 3.11
Validity
Data
. The data is in a compressed form, and thus needs
to be expanded before such a test is made.
Examples:
usca
is valid — there is an
id
element
…
ussct
is invalid — there is no
id
element
…
If a
unicode_locale_id
contains both a
unicode_region_subtag
and a
unicode_subdivision_id
it is only valid if the
unicode_subdivision_id
starts
with the
unicode_region_subtag
(case-insensitively).
It is recommended that a
unicode_locale_id
contain a
unicode_region_subtag
if it
contains a
unicode_subdivision_id
and the
region would not be added by adding likely subtags. That
produces better behavior if the
unicode_subdivision_id
is ignored
by an implementation or if the language tag is truncated.
Examples:
en-
US
-u-sd-
us
ca is
valid — the region "US" matches the first part of "usca"
en-u-sd-
us
ca is valid — it still works
after adding likely subtags.
en-
CA
-u-sd-
gb
sct is
invalid — the region "CA" does not match the first part of
"gbsct". An implementation should disregard the subdivision
id (or return an error).
en-u-sd-
gb
sct is valid but not
recommended — an implementation that ignores the
unicode_subdivision_id
can get
the wrong fallback behavior, or could add likely subtags and
get the invalid
en
-Latn-US
-u-sd-
gb
sct
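The region check described above amounts to a case-insensitive prefix test. The following is a minimal sketch with a hypothetical function name; it assumes the region subtag and subdivision id have already been extracted from the identifier, and it does not consult the validity data.

```python
from typing import Optional


def subdivision_matches_region(region: Optional[str], subdivision_id: str) -> bool:
    """Check the region/subdivision consistency rule.

    If the identifier has no region subtag (region is None), the rule does not
    apply and the check passes; the subdivision id must still be listed in the
    validity data, which this sketch does not look at.
    """
    if region is None:
        return True
    return subdivision_id.lower().startswith(region.lower())


print(subdivision_matches_region("US", "usca"))   # en-US-u-sd-usca
print(subdivision_matches_region("CA", "gbsct"))  # en-CA-u-sd-gbsct
```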
In version 28.0, the subdivisions in the validity files used
the ISO format, uppercase with a hyphen separating two
components, instead of the BCP 47 format.
3.7 Unicode BCP 47 T Extension
The Unicode Consortium has registered and is the maintaining
authority for two BCP 47 language tag extensions: the extension
'u' for Unicode locale extension [
RFC6067
] and extension 't' for transformed
content [
RFC6497
]. The Unicode BCP 47
extension data defines the complete list of valid subtags.
While the title of the RFC is “Transformed Content”, the
abstract makes it clear that the scope is broader than the term
"transformed" might indicate to a casual
reader: “including content that has been transliterated,
transcribed, or translated, or
in some other way
influenced by the source. It also provides for additional
information used for identification.”
The -t- Extension.
The syntax of 't'
extension subtags is defined by the rule
unicode_locale_extensions
in
Section 3.2 Unicode locale
identifier
, except the separator of subtags
sep
must be always hyphen '-' when the extension
is used as a part of BCP 47 language tag. For information about
the registration process, meaning, and usage of the 't'
extension, see [
RFC6497
].
These subtags are all in lowercase (that is the canonical
casing for these subtags); however, subtags are
case-insensitive and casing does not carry any specific
meaning. All subtags within the Unicode extensions are
alphanumeric, two to eight characters long, and meet the
rule
extension
in the [
BCP47
].
The following keys are defined for the -t- extension:
Keys
Description
Values in latest release
m0
Transform extension mechanism:
to
reference an authority or rules for a type of
transformation
transform.xml
s0, d0
Transform source/destination:
for
non-languages/scripts, such as fullwidth-halfwidth
conversion.
transform-destination.xml
i0
Input Method Engine transform:
Used
to indicate an input method transformation, such as one
used by a client-side input method. The first subfield in
a sequence would typically be a 'platform' or vendor
designation.
transform_ime.xml
k0
Keyboard transform:
Used to indicate
a keyboard transformation, such as one used by a
client-side virtual keyboard. The first subfield in a
sequence would typically be a 'platform' designation,
representing the platform that the keyboard is intended
for. The keyboard might or might not correspond to a
keyboard mapping shipped by the vendor for the platform.
One or more subsequent fields may occur, but are only
added where needed to distinguish from others.
transform_keyboard.xml
t0
Machine Translation:
Used to
indicate content that has been machine translated, or a
request for a particular type of machine translation of
content. The first subfield in a sequence would typically
be a 'platform' or vendor designation.
transform_mt.xml
h0
Hybrid Locale Identifiers:
h0 with
the value 'hybrid' indicates that the -t- value is a
language that is mixed into the main language tag to form
a hybrid. For more information, and examples, see
Section 3.10.2
Hybrid Locale
Identifiers
transform_hybrid.xml
x0
Private use transform
transform_private_use.xml
3.7.1 T Extension Data
Files
The overall structure of the data files is similar to
that of the U Extension, with the following exceptions.
In the transformed content 't' data file, the name attribute
in a
subtag. The name attribute in an enclosed
defines a valid field subtag for the field separator subtag.
For example:
since="21"/>
The data above indicates:
"m0" is a valid field separator for the transformed
content extension 't'.
field subtag "ungegn" is valid for field separator
"m0".
field subtag "ungegn" was introduced in CLDR 21.
The attributes are:
name
The name of the mechanism, limited to 3-8 characters (or
sequences of them). Any indirect type names are listed in
3.6.4
Extension Data Files
description
A description of the name, with all and only that
information necessary to distinguish one name from
others with which it might be confused. Descriptions
are not intended to provide general background
information.
since
Indicates the first version of CLDR where the name
appears. (Required for new items.)
alias
Alternative name, not limited in number of characters.
Aliases are intended for compatibility, not to provide all
possible alternate names or designations.
(Optional)
For information about the registration process, meaning, and
usage of the 't' extension, see [
RFC6497
].
3.8 Compatibility with
Older Identifiers
LDML versions before 1.7.2 used a slightly different syntax for
variant subtags and locale extensions. Implementations of LDML
may provide backward-compatible identifier support as described
in the following sections.
3.8.1 Old Locale Extension
Syntax
The LDML 1.7 and older specifications used a different syntax for
representing Unicode locale extensions. The previous definition
of Unicode locale extensions had the following structure:
EBNF
old_unicode_locale_extensions
= "@" old_key "=" old_type
(";" old_key "=" old_type)*
The new specification mandates keys to be two alphanumeric
characters and types to be three to eight alphanumeric
characters. As a result, new codes were assigned to all
existing keys and some types. For example, a new key "co"
replaced the previous key "collation", and a new type "phonebk"
replaced the previous type "phonebook". However, the existing
collation type "big5han" already satisfied the new requirement,
so no new type code was assigned to the type. All new keys and
types introduced after LDML 1.7 satisfy the new requirement, so
they do not have aliases dedicated for the old syntax, except
time zone types. The conversion between old types and new types
can be done regardless of key, with one known exception (old
type "traditional" is mapped to new type "trad" for collation
and "traditio" for numbering system), and this relationship
will be maintained in the future versions unless otherwise
noted.
The new specification introduced a new field
attribute
in addition to key/type pairs in the
Unicode locale extension. When it is necessary to map a new
Unicode locale identifier with
attribute
field to
a well-formed old locale identifier, a special key name
attribute
with the value of entire
attribute
subtags in the new identifier is used.
For example, a new identifier
ja-u-xxx-yyy-ca-japanese
is mapped to an old
identifier
ja@attribute=xxx-yyy;calendar=japanese
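The attribute mapping above can be sketched as a conversion from the new syntax to the old. This is a simplified illustration: the table `NEW_KEY_TO_OLD` is a hypothetical excerpt, types are passed through unchanged, and the sketch assumes the 'u' extension is the only extension in the identifier.

```python
# Hypothetical excerpt of the key mapping (new key -> old key).
NEW_KEY_TO_OLD = {"ca": "calendar", "co": "collation", "nu": "numbers"}


def new_to_old(tag: str) -> str:
    """Map a new-syntax identifier with a 'u' extension to the old @key=type form."""
    base, marker, extension = tag.partition("-u-")
    old_base = base.replace("-", "_")
    if not marker:
        return old_base

    attributes = []  # subtags that appear before the first key
    keywords = []    # (key, [type subtags]) in order of appearance
    for subtag in extension.split("-"):
        if len(subtag) == 2:          # keys are exactly two characters
            keywords.append((subtag, []))
        elif keywords:                # longer subtags after a key are its type
            keywords[-1][1].append(subtag)
        else:                         # longer subtags before any key are attributes
            attributes.append(subtag)

    parts = []
    if attributes:
        parts.append("attribute=" + "-".join(attributes))
    for key, types in keywords:
        parts.append(NEW_KEY_TO_OLD[key] + "=" + "-".join(types))
    return old_base + "@" + ";".join(parts)


print(new_to_old("ja-u-xxx-yyy-ca-japanese"))
```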
The chart below shows some example mappings between the new
syntax and the old syntax.
Locale Extension Mappings
Old (LDML 1.7 or older)
New
de_DE@collation=phonebook
de_DE_u_co_phonebk
zh_Hant_TW@collation=big5han
zh_Hant_TW_u_co_big5han
th_TH@calendar=gregorian;numbers=thai
th_TH_u_ca_gregory_nu_thai
en_US_POSIX@timezone=America/Los_Angeles
en_US_u_tz_uslax_va_posix
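The first three rows of the chart can be reproduced with a small conversion sketch. The two tables are hypothetical excerpts containing only the keys and types named in this section; the output uses "-" as the separator (the BCP 47 form) and sorts keywords by key. The POSIX variant and time zone types in the last row need additional handling that is not shown here.

```python
# Hypothetical excerpts of the alias data (old name -> new code).
OLD_KEY_TO_NEW = {"collation": "co", "calendar": "ca", "numbers": "nu"}
OLD_TYPE_TO_NEW = {"phonebook": "phonebk", "gregorian": "gregory"}


def old_to_new(locale_id: str) -> str:
    """Convert an old-syntax identifier (lang_REGION@key=type;...) to the new syntax."""
    base, _, extension = locale_id.partition("@")
    tag = base.replace("_", "-")
    if not extension:
        return tag

    keywords = []
    for item in extension.split(";"):
        old_key, _, old_type = item.partition("=")
        key = OLD_KEY_TO_NEW[old_key.lower()]
        # Types that already satisfy the new requirement have no new code.
        new_type = OLD_TYPE_TO_NEW.get(old_type.lower(), old_type.lower())
        keywords.append((key, new_type))

    keywords.sort()  # emit keywords in alphabetical order of key
    return tag + "-u-" + "-".join(f"{key}-{value}" for key, value in keywords)


print(old_to_new("de_DE@collation=phonebook"))
print(old_to_new("th_TH@calendar=gregorian;numbers=thai"))
```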
Where an API for the old syntax is supplied a BCP 47 language tag, or
vice versa, the recommendation is to:
Have all methods that take the old syntax also take the
new syntax, interpreted correctly. For example,
"zh-TW-u-co-pinyin" and "zh_TW@collation=pinyin" would both
be interpreted as meaning the same.
Have all methods (both for old and new syntax) accept all
possible aliases for keywords and types. For example,
"ar-u-ca-islamicc" would be equivalent to
"ar-u-ca-islamic-civil".
The one exception is where an alias would only be
well-formed with the old syntax, such as "gregorian" (for
"gregory").
Where an API cannot successfully accept the alternate
syntax, throw an exception (or otherwise indicate an error)
so that people can detect that they are using the wrong
method (or wrong input).
Provide a method that tests a purported locale ID string
to determine its status:
well-formed
- syntactically
correct
valid
- well-formed and only uses
registered language subtags, extensions, keywords,
types...
canonical
- valid and no deprecated
codes or structure.
3.8.2 Legacy Variants
Old LDML specifications allowed codes other than registered [
BCP47
] variant subtags to be used in Unicode
language and locale identifiers for representing variations of
locale data. Unicode locale identifiers including such variant
codes can be converted to the new [
BCP47
]-compatible identifiers by following the descriptions below:
Legacy Variant Mappings
Variant Code
Description
AALAND
Åland, variant of "sv" Swedish used in Finland. Use
"sv_AX" to indicate this.
BOKMAL
Bokmål, variant of "no" Norwegian. Use primary language
subtag "nb" to indicate this.
NYNORSK
Nynorsk, variant of "no" Norwegian. Use primary
language subtag "nn" to indicate this.
POSIX
POSIX variation of locale data. Use Unicode locale
extension "-u-va-posix" to indicate this.
POLYTONI
Polytonic, variant of "el" Greek. Use [
BCP47
] variant subtag "polyton" to indicate
this.
SAAHO
The Saaho variant of Afar. Use primary language subtag
"ssy" to indicate this.
When converting to old syntax, the Unicode locale extension
"-u-va-posix" should be converted to the "POSIX" variant,
not
to old extension syntax like "@va=posix". This is an
exception: The other mappings above should not be reversed.
Examples:
en_US_POSIX ↔ en-US-u-va-posix
en_US_POSIX@colNumeric=yes ↔ en-US-u-kn-va-posix
en-US-POSIX-u-kn-true → en-US-u-kn-va-posix
en-US-POSIX-u-kn-va-posix → en-US-u-kn-va-posix
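The table of legacy variants can be read as a small set of rewrite rules. The sketch below is illustrative only: the function name is hypothetical, it assumes the legacy variant is the last "_"-separated subtag, it does not handle an "@" extension part, and the treatment of AALAND (replacing the region with AX) is one interpretation of "Use sv_AX".

```python
# Legacy variant codes that are replaced by a different primary language subtag.
LANGUAGE_REPLACEMENTS = {"BOKMAL": "nb", "NYNORSK": "nn", "SAAHO": "ssy"}


def convert_legacy_variant(old_id: str) -> str:
    """Convert an old identifier ending in a legacy variant to a BCP 47 style tag."""
    *head, last = old_id.split("_")
    if not head:
        return old_id  # a single subtag cannot carry a variant
    variant = last.upper()
    if variant == "POSIX":
        return "-".join(head) + "-u-va-posix"
    if variant == "POLYTONI":
        return "-".join(head + ["polyton"])
    if variant == "AALAND":
        # Assumption: the region of the old identifier is replaced by AX.
        return head[0] + "-AX"
    if variant in LANGUAGE_REPLACEMENTS:
        return "-".join([LANGUAGE_REPLACEMENTS[variant]] + head[1:])
    return "-".join(head + [last])


print(convert_legacy_variant("en_US_POSIX"))
print(convert_legacy_variant("no_NO_NYNORSK"))
```

Only the POSIX rule would be applied in the reverse direction, as noted above.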
3.8.3 Relation to OpenI18n
The locale id format generally follows the description in
the
OpenI18N Locale Naming Guideline
[
NamingGuideline
], with some
enhancements. The main differences from those guidelines
are that the locale id:
does not
include a charset (since the data in LDML format always
provides a representation of all Unicode characters. The
repository is stored in UTF-8, although that can be
transcoded to other encodings as well.),
adds the
ability to have a variant, as in Java
adds the
ability to discriminate the written language by script (or
script variant).
is a
superset of [
BCP47
] codes.
3.9 Transmitting Locale
Information
In a world of on-demand software components, with arbitrary
connections between those components, it is important to get a
sense of where localization should be done, and how to transmit
enough information so that it can be done at that appropriate
place. End-users need to get messages localized to their
languages, messages that not only contain a translation of
text, but also contain variables such as date, time, number
formats, and currencies formatted according to the users'
conventions. The strategy for doing the so-called
JIT
localization
is made up of two parts:
Store and transmit
neutral-format
data wherever
possible.
Neutral-format data is data that is kept in a
standard format, no matter what the local user's
environment is. Neutral-format is also (loosely) called
binary data
, even though it actually could be
represented in many different ways, including a textual
representation such as in XML.
Such data should use accepted standards where
possible, such as for currency codes.
Textual data should also be in a uniform character
set (Unicode/10646) to avoid possible data corruption
problems when converting between encodings.
Localize that data as "
close
" to the end-user as
possible.
There are a number of advantages to this strategy. The
longer the data is kept in a neutral format, the more flexible
the entire system is. On a practical level, if transmitted data
is neutral-format, then it is much easier to manipulate the
data, debug the processing of the data, and maintain the
software connections between components.
Once data has been localized into a given language, it can
be quite difficult to programmatically convert that data into
another format, if required. This is especially true if the
data contains a mixture of translated text and formatted
variables. Once information has been localized into, say,
Romanian, it is much more difficult to localize that data into,
say, French. Parsing is more difficult than formatting, and may
run up against different ambiguities in interpreting text that
has been localized, even if the original translated message
text is available (which it may not be).
Moreover, the closer we are to end-user, the more we know
about that user's preferred formats. If we format dates, for
example, at the user's machine, then it can easily take into
account any customizations that the user has specified. If the
formatting is done elsewhere, either we have to transmit
whatever user customizations are in play, or we only transmit
the user's locale code, which may only approximate the desired
format. Thus the closer the localization is to the end user,
the less we need to ship all of the user's preferences around
to all the places that localization could possibly need to be
done.
Even though localization should be done as close to the
end-user as possible, there will be cases where different
components need to be aware of whatever settings are
appropriate for doing the localization. Thus information such
as a locale code or time zone needs to be communicated between
different components.
3.9.1 Message Formatting
and Exceptions
Windows (FormatMessage, String.Format), Java (MessageFormat),
and ICU (MessageFormat, umsg)
all provide methods of formatting variables (dates, times, etc.)
and inserting them at arbitrary positions in a string. This
avoids the manual string concatenation that causes severe
problems for localization. The question is, where to do this?
It is especially important since the original code site that
originates a particular message may be far down in the bowels
of a component, and passed up to the top of the component with
an exception. So we will take that case as representative of
this class of issues.
There are circumstances where the message can be
communicated with a language-neutral code, such as a numeric
error code or mnemonic string key, that is understood outside
of the component. If there are arguments that need to accompany
that message, such as a number of files or a datetime, those
need to accompany the numeric code so that when the
localization is finally done at some point, the full information can
be presented to the end-user. This is the best case for
localization.
More often, the exact messages that could originate from
within the component are not known outside of the component
itself; or at least they may not be known by the component that
is finally displaying text to the user. In such a case, the
information as to the user's locale needs to be communicated in
some way to the component that is doing the localization. That
locale information does not necessarily need to be communicated
deep within the component; ideally, any exceptions should
bundle up some language-neutral message ID, plus the arguments
needed to format the message (for example, datetime), but not
do the localization at the throw site. This approach has the
advantages noted above for JIT localization.
In addition, exceptions are often caught at a higher level;
they do not end up being displayed to any end-user at all. By
avoiding the localization at the throw site, one also avoids the cost of
doing formatting when that formatting is not really necessary.
In fact, in many running programs most of the exceptions that
are thrown at a low level never end up being presented to an
end-user, so this can have considerable performance
benefits.
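The pattern described in this section, an exception that carries a language-neutral message ID plus its arguments and is localized only where it is displayed, might look roughly like the following. All names here (`LocalizableError`, `CATALOGS`, the message ID, and the message texts) are hypothetical, and plain `str.format` stands in for a real message formatter such as those listed above.

```python
class LocalizableError(Exception):
    """An error that records what happened without formatting any text."""

    def __init__(self, message_id: str, **arguments):
        super().__init__(message_id)
        self.message_id = message_id
        self.arguments = arguments


# Hypothetical message catalogs, keyed by locale and then by message ID.
CATALOGS = {
    "en": {"files.missing": "{count} files could not be found."},
    "fr": {"files.missing": "{count} fichiers sont introuvables."},
}


def localize(error: LocalizableError, locale: str) -> str:
    """Format the error for display; called only if the error reaches a user."""
    pattern = CATALOGS[locale][error.message_id]
    return pattern.format(**error.arguments)


def load_files():
    # Deep inside a component: no locale is needed at the throw site.
    raise LocalizableError("files.missing", count=3)


try:
    load_files()
except LocalizableError as error:
    # Near the end-user: the locale is known here, so format here.
    print(localize(error, "fr"))
```

If the exception is caught and handled at a higher level without being shown, `localize` is never called and no formatting cost is paid.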
3.10
Unicode Language and Locale IDs
People have very slippery notions of what distinguishes a
language code versus a locale code. The problem is that both
are somewhat nebulous concepts.
In practice, many people use [
BCP47
] codes to mean locale codes instead of strictly language codes.
It is easy to see why this came about; because [
BCP47
] includes an explicit region (territory)
code, for most people it was sufficient for use as a locale
code as well. For example, when typical web software receives
an [
BCP47
] code, it will use it as a
locale code. Other typical software will do the same: in
practice, language codes and locale codes are treated
interchangeably. Some people recommend distinguishing on the
basis of "-" versus "_" (for example,
zh-TW
for language
code,
zh_TW
for locale code), but in practice that does
not work because of the free variation out in the world in the
use of these separators. Notice that Windows, for example, uses
"-" as a separator in its locale codes. So pragmatically one is
forced to treat "-" and "_" as equivalent when interpreting
either one on input.
Another reason for the conflation of these codes is that
very
little data in most systems is distinguished by
region alone; currency codes and measurement systems being some
of the few. Sometimes date or number formats are mentioned as
regional, but that really does not make much sense. If people
see the sentence "You will have to adjust the value to
१,२३४.५६७ from ૭૧,૨૩૪.૫૬" (using Indic digits), they would say
that sentence is simply not English. Number format is far more
closely associated with language than it is with region. The
same is true for date formats: people would never expect to see
intermixed a date in the format "2003年4月1日" (using Kanji) in
text purporting to be purely English. There are regional
differences in date and number format — differences which can
be important — but those are different in kind than other
language differences between regions.
As far as we are concerned —
as a completely practical
matter
— two languages are different if they require
substantially different localized resources. Distinctions
according to spoken form are important in some contexts, but
the written form is by far and away the most important issue
for data interchange. Unfortunately, this is not the principle
used in [
ISO639
], which has the fairly
unproductive notion (for data interchange) that only spoken
language matters (it is also not completely consistent about
this, however).
[
BCP47
]
can
express a
difference if the use of written languages happens to
correspond to region boundaries expressed as [
ISO3166
] region codes, and has recently added
codes that allow it to express some important cases that are
not distinguished by [
ISO3166
] codes.
These written languages include simplified and traditional
Chinese (both used in Hong Kong S.A.R.); Serbian in Latin
script; Azerbaijani in Arabic script, and so on.
Notice also that
currency codes
are different than
currency localizations
. The currency localizations
should largely be in the language-based resource bundles, not
in the territory-based resource bundles. Thus, the resource
bundle
en
contains the localized mappings in English for
a range of different currency codes: USD → US$, RUR → Rub, AUD
→ $A and so on. Of course, some currency symbols are used for
more than one currency, and in such cases specializations
appear in the territory-based bundles. Continuing the example,
en_US
would have USD → $, while
en_AU
would have
AUD → $. (In protocols, the currency codes should always
accompany any currency amounts; otherwise the data is
ambiguous, and software is forced to use the user's territory
to guess at the currency. For some informal discussion of this,
see
JIT Localization
.)
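The split between language-based and territory-based bundles for currency symbols can be sketched as a lookup with fallback. The data below is limited to the mappings mentioned in the text, the function name is hypothetical, and fallback is by simple truncation of the locale id, which is a simplification.

```python
# Symbols taken from the example in the text: the language bundle "en" holds
# the general mappings, and territory bundles hold only the specializations.
CURRENCY_SYMBOLS = {
    "en": {"USD": "US$", "RUR": "Rub", "AUD": "$A"},
    "en_US": {"USD": "$"},
    "en_AU": {"AUD": "$"},
}


def currency_symbol(locale_id: str, currency_code: str) -> str:
    """Look up a currency symbol, falling back from territory to language bundle."""
    current = locale_id
    while current:
        symbol = CURRENCY_SYMBOLS.get(current, {}).get(currency_code)
        if symbol is not None:
            return symbol
        current = current.rpartition("_")[0]  # drop the last subtag
    return currency_code  # no localization found: show the code itself


print(currency_symbol("en_US", "USD"))
print(currency_symbol("en_US", "AUD"))
print(currency_symbol("en_AU", "USD"))
```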
3.10.1 Written Language
Criteria for what makes a written language should be purely
pragmatic;
what would copy-editors say?
If one gave them
text like the following, they would respond that is far from
acceptable English for publication, and ask for it to be
redone:
"Theatre Center News: The date of the last
version of this document was 2003年3月20日. A copy can be
obtained for $50,0 or 1.234,57 грн. We would like to
acknowledge contributions by the following authors (in
alphabetical order): Alaa Ghoneim, Behdad Esfahbod, Ahmed
Talaat, Eric Mader, Asmus Freytag, Avery Bishop, and Doug
Felt."
So one would change it to either B or C below, depending on
which orthographic variant of English was the target for the
publication:
"Theater Center News: The date of the last version of
this document was 3/20/2003. A copy can be obtained for
$50.00 or 1,234.57 Ukrainian Hryvni. We would like to
acknowledge contributions by the following authors (in
alphabetical order): Alaa Ghoneim, Ahmed Talaat, Asmus
Freytag, Avery Bishop, Behdad Esfahbod, Doug Felt, Eric
Mader."
"Theatre Centre News: The date of the last version of
this document was 20/3/2003. A copy can be obtained for
$50.00 or 1,234.57 Ukrainian Hryvni. We would like to
acknowledge contributions by the following authors (in
alphabetical order): Alaa Ghoneim, Ahmed Talaat, Asmus
Freytag, Avery Bishop, Behdad Esfahbod, Doug Felt, Eric
Mader."
Clearly there are many acceptable variations on this text.
For example, copy editors might still quibble with the use of
first versus last name sorting in the list, but clearly the
first list was
not
acceptable English alphabetical
order. And in quoting a name, like "Theatre Centre News", one
may leave it in the source orthography even if it differs from
the publication target orthography. And so on. However, just as
clearly, there are limits on what is acceptable English, and
"2003年3月20日", for example, is
not.
Note that the language of locale data may differ from the
language of localized software or web sites, when those latter
are not localized into the user's preferred language. In such
cases, the kind of incongruous juxtapositions described above
may well appear, but this situation is usually preferable to
forcing unfamiliar date or number formats on the user as
well.
3.10.2 Hybrid Locale Identifiers
Hybrid locales have intermixed content from 2 (or more)
languages, often with one language's grammatical structure
applied to words in another. These are commonly referred to
with portmanteau words such as
Franglais,
Spanglish
or
Denglish
. Hybrid locales do
not
refer to text that simply contains two languages: a book of
parallel text containing English and French, such as the
following, is not Franglais:
On the 24th of
May, 1863, my uncle, Professor Liedenbrock, rushed into
his little house, No. 19 Königstrasse, one of the oldest
streets in the oldest portion of the city of
Hamburg…
Le 24 mai 1863, un
dimanche, mon oncle, le professeur Lidenbrock, revint
précipitamment vers sa petite maison située au numéro 19
de Königstrasse, l’une des plus anciennes rues du vieux
quartier de Hambourg…
While text in a document can be tagged as partly in one
language and partly in another, that is not the same as having a
hybrid locale. There is a difference between having a Spanglish
document, and a Spanish document that has some passages quoted
in English. Fine-grained tagging doesn't handle grammatical
combinations like Denglisch “
gedownloadet
”,
which is neither English nor German; similarly the Franglais “
downloadé
”.
More importantly, it doesn’t work for the very common use case
for a
unicode_locale_id
: locale selection.
To communicate requests for localized content and
internationalization services, locales are used. When people
pick a language from a menu, internally they are picking a
locale (en-GB, es-419, etc.). To allow an application to
support Spanglish or Hinglish locale selection,
unicode_locale_id
s can represent
hybrid locales using the T extension key-value 'h0-hybrid'.
(For more information on the T extension, see
Section 3.7
Unicode BCP 47 T
Extension.)
Examples:
hi-t-
en-h0-hybrid
Hinglish
Hindi-English hybrid locale
ta-t-
en-h0-hybrid
Tanglish
Tamil-English hybrid locale
bn-t-
en-h0-hybrid
Banglish
Bangla-English hybrid locale
en-t-
hi-h0-hybrid
Hinglish
English-Hindi hybrid locale
en-t-
zh-h0-hybrid
Chinglish
English-Chinese hybrid locale
Note: The
unicode_language_id
should be the
language used as the ‘scaffold’: for the fallback locale for
internationalization services, typically used for more of the
core vocabulary/structure in the content. Thus Hinglish
should be represented as hi-t-en-h0-hybrid where Hindi is the
scaffold, and as en-t-hi-h0-hybrid where English is.
The value of -t- is a full
unicode_language_id
, and can
contain subtags for script or region where it is important to
include them, as in the following. It may be useful in order to
emphasize the script, even where it is the default script for
the language, if it is not the same as the script of the main
language tag.
ru-t
-en-latn-gb-h0-hybrid
Runglish
Russian with an admixture of British English in Latin
script
ru-t-
en-cyrl-gb-h0-hybrid
Runglish
Russian with an admixture of British English in
Cyrillic script
Should there ever be strong need for hybrids of more than
two languages or for other purposes such as hybrid languages as
the source of translated content, additional structure could be
added.
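A hybrid locale identifier of the form shown in this section can be taken apart as follows. This is a simplified sketch with a hypothetical function name: it assumes the 't' extension is the only extension in the identifier and treats any two-character subtag consisting of a letter followed by a digit as a field separator.

```python
import re
from typing import Optional, Tuple

# A 't' extension field separator: one letter followed by one digit (for example "h0").
FIELD_SEPARATOR = re.compile(r"[a-z][0-9]")


def parse_hybrid(tag: str) -> Optional[Tuple[str, str]]:
    """Return (scaffold language, mixed-in language) or None if not a hybrid."""
    main, marker, extension = tag.lower().partition("-t-")
    if not marker:
        return None

    subtags = extension.split("-")
    # The language value of -t- is everything before the first field separator.
    index = 0
    while index < len(subtags) and not FIELD_SEPARATOR.fullmatch(subtags[index]):
        index += 1
    mixed_in, fields = subtags[:index], subtags[index:]

    is_hybrid = any(
        fields[i] == "h0" and fields[i + 1] == "hybrid"
        for i in range(len(fields) - 1)
    )
    if not mixed_in or not is_hybrid:
        return None
    return main, "-".join(mixed_in)


print(parse_hybrid("hi-t-en-h0-hybrid"))
print(parse_hybrid("ru-t-en-latn-gb-h0-hybrid"))
```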
3.11 Validity Data
The directory
common/validity
contains machine-readable data for validating the language,
region, script, and variant subtags, as well as currency,
subdivisions and measure units. Each file contains a number of
subtags with the following
idStatus
values:
regular
— the standard codes used for
the specific type of subtag
special
— certain exceptional language
codes like 'mul'
(languages only)
unknown
— the code used to indicate the
"unknown", "undetermined" or "invalid" values. For more
information, see
Section 3.5.1
Unknown or Invalid
Identifiers
macroregion
— the standard codes that are
macroregions
(for regions only).
Note that some two-letter region codes are
macroregions, and (in the future) some three-digit codes
may be regular codes.
For details as to which regions are contained within
which macroregions, see the
element of the
supplemental data.
deprecated
— codes that should not be
used. The
element in the
supplementalMeta file contains more information about these
codes, and which codes should be used instead.
private_use
— codes that, for CLDR, are
considered private use. Note that some private-use
codes in a source standard such as BCP47 have defined CLDR semantics, and are considered regular
codes. For more information, see
Section 3.5.3
Private Use Codes
reserved
— codes that are private use in a source standard, but are reserved for future use as regular codes by CLDR.
The list of subtags for each idStatus uses a compact format
as a space-delimited list of StringRanges, as defined in
Section
5.3.4 String
Range
. The separator for each StringRange is a
"~".
Each measure unit is a sequence of subtags, such as
“angle-arc-minute”. The first subtag provides a general
“category” of the unit.
In version 28.0, the subdivisions in the validity files used
the ISO format, uppercase with a hyphen separating two
components, instead of the BCP 47 format.
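Expanding the compact, space-delimited StringRange lists used in the validity files might be sketched as below. The interpretation used here is an assumption based on Section 5.3.4: for a range X~Y, the last len(Y) characters of X each run up to the corresponding character of Y, and any leading characters of X are a shared prefix. The function names are hypothetical.

```python
from itertools import product
from typing import List


def expand_string_range(item: str, separator: str = "~") -> List[str]:
    """Expand one StringRange such as 'ab~d' into the strings it stands for."""
    if separator not in item:
        return [item]
    start, end = item.split(separator, 1)
    if not (len(start) >= len(end) > 0):
        raise ValueError(f"malformed string range: {item!r}")

    # Split the start string into a fixed prefix and the part that varies.
    prefix = start[: len(start) - len(end)]
    varying = start[len(start) - len(end):]
    # Each varying position runs from its start character to its end character.
    choices = [
        [chr(code) for code in range(ord(low), ord(high) + 1)]
        for low, high in zip(varying, end)
    ]
    return [prefix + "".join(chars) for chars in product(*choices)]


def expand_id_list(text: str) -> List[str]:
    """Expand a space-delimited list of StringRanges."""
    result = []
    for item in text.split():
        result.extend(expand_string_range(item))
    return result


print(expand_id_list("aa ab~d"))
```

A membership test against the validity data would be made on the expanded list (or an equivalent set), not on the compact text.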
4 Locale Inheritance and Matching
The XML format relies on an inheritance model, whereby the
resources are collected into
bundles
, and the bundles
organized into a tree. Data for the many Spanish locales does
not need to be duplicated across all of the countries having
Spanish as a national language. Instead, common data is
collected in the Spanish language locale, and territory locales
only need to supply differences. The parent of all of the
language locales is a generic locale known as
root
Wherever possible, the resources in the root are language &
territory neutral. For example, the collation (sorting) order
in the root is based on the [
DUCET
] (see
Root
Collation
). Since English language collation has the
same ordering as the root locale, the 'en' locale data does not
need to supply any collation data, nor do the 'en_US', 'en_GB'
or any of the various other locales that use English.
Given a particular locale id "en_US_someVariant", the search
chain for a particular resource is the following.
en_US_someVariant
en_US
en
root
The inheritance is often not simple truncation, as will
be seen later in this section.
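The simple search chain shown above can be generated by repeated truncation. This sketch, with a hypothetical function name, covers only that simple case; as noted, actual inheritance is often not simple truncation.

```python
from typing import List


def truncation_chain(locale_id: str) -> List[str]:
    """Return the search chain from a locale id up to root, by truncation only."""
    chain = []
    current = locale_id
    while current and current != "root":
        chain.append(current)
        current = current.rpartition("_")[0]  # drop the last subtag
    chain.append("root")
    return chain


print(truncation_chain("en_US_someVariant"))
```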
If a type and key are supplied in the locale id, then
logically the chain from that id to the root is searched for a
resource tag with a given type, all the way up to root. If no
resource is found with that tag and type, then the chain is
searched again without the type.
Thus the data for any given locale will only contain
resources that are different from the parent locale. For
example, most territory locales will inherit the bulk of their
data from the language locale: "en" will contain the bulk of
the data; "en_IE" will only contain a few items like currency.
All data that is inherited from a parent is presumed to be
valid, just as valid as if it were physically present in the
file. This provides for much smaller resource bundles, and much
simpler (and less error-prone) maintenance. At the script or
region level, the "primary" child locale will be empty, since
its parent will contain all of the appropriate resources for
it. For more information see
CLDR Information : Section 9.3
Default
Content.
Certain data items depend only on the region specified in a
locale id (by a
unicode_region_subtag
or
an “rg”
Region Override
key),
and are obtained from supplemental data rather than through
locale resources. For example:
The currency for the specified region (see
Supplemental
Currency Data
)
The measurement system for the specified region (see
Measurement
System Data
)
The week conventions for the specified region (see
Week Data
)
(For more information on the specific items handled this
way, see
Territory-Based
Preferences
.) These items will be correct for the specified
region regardless of whether a locale bundle actually exists
with the same combination of language and region as in the
locale id. For example, suppose data is requested for the
locale id "fr_US" and there is no bundle for that combination.
Data obtained via locale inheritance, such as currency patterns
and currency symbols, will be obtained from the parent locale
"fr". However, currency amounts would be formatted by default
using US dollars, just displayed in the manner governed by the
locale "fr". When a locale id does not specify a region, the
region-specific items such as those above are obtained from the
likely region for the locale (obtained via
Likely Subtags
).
For the relationship between Inheritance, DefaultContent,
LikelySubtags, and LocaleMatching, see Section 4.2.6
Inheritance vs Related
Information
4.1
Lookup
If a language has more than one script in customary modern
use, then the CLDR file structure in common/main follows the
following model:
lang
lang_script
lang_script_region
lang_region
(aliases to lang_script_region)
4.1.1 Bundle
vs Item Lookup
There are actually two different kinds of inheritance
fallback:
resource bundle lookup
and
resource item lookup
. For the former, a
process is looking to find the first, best resource bundle it
can; for the latter, it is fallback within bundles on
individual items, like the translated name for the region "CN"
in Breton.
These are closely related, but distinct, processes. They are
illustrated in the table
Lookup
Differences
, where "key" stands for zero or more key/type
pairs. Logically speaking, when looking up an item for a given
locale, you first do a resource bundle lookup to find the best
bundle for the locale, then you do an inherited item lookup
starting with that resource bundle.
The table
Lookup
Differences
uses the naïve resource bundle lookup for
illustration. More sophisticated systems will get far better
results for resource bundle lookup if they use the algorithm
described in
Section 4.4
Language Matching
. That algorithm
takes into account both the user’s desired locale(s) and the
application’s supported locales, in order to get the best
match.
If the naïve resource bundle lookup is used, the desired
locale needs to be canonicalized using 4.3
Likely Subtags
and the supplemental alias
information, so that locales that CLDR considers identical are
treated as such. Thus eng-Latn-GB should be mapped to en-GB,
and cmn-TW mapped to zh-Hant-TW.
For the purposes of CLDR, everything with the <ldml>
dtd is treated logically as if it is one resource bundle, even
if the implementation separates data into separate physical
resource bundles. For example, suppose that there is a main XML
file for Nama (naq), but there are no <unit> elements for
it because the units are all inherited from root. If the
<unit> elements are separated into a separate data table
for modularity in the implementation, the Nama <unit>
resource bundle would be empty. However, for purposes of
resource-bundle lookup the resource bundle lookup still stops
at naq.xml.
Lookup Differences
Lookup Type
Example
Comments
Resource bundle
lookup
se-FI →
se →
default-locale* →
root
* The default-locale may have its own inheritance
chain; for example, it may be "en-GB → en".
In that case, the chain is expanded by inserting the
chain, resulting in:
se-FI →
se →
fi →
en-GB →
en →
root
Inherited item
lookup
se-FI+key →
se+key →
root_alias*+key
→ root+key
* If there is a root_alias to another key or
locale, then insert that entire chain. For example,
suppose that months for another calendar system have
a root alias to Gregorian months. In that case, the
root alias would change the key, and retry from se-FI
downward. This can happen multiple times.
se-FI+key →
se+key →
root_alias*+key →
se-FI+key2 →
se+key2 →
root_alias*+key2 →
root+key2
Both the resource bundle inheritance and the inherited item
inheritance use the parentLocale data, where available, instead
of simple truncation.
The fallback is a bit different for these two cases;
internal aliases and keys are not involved in the bundle
lookup, and the default locale is not involved in the item
lookup. If the default-locale were used in the resource-item
lookup, then strange results would occur. For example, suppose
that the default locale is Swedish, and there is a Nama locale
but no specific inherited item for collation. If the
default-locale were used in resource-item lookup, it would
produce odd and unexpected results for Nama sorting.
The default locale is not even always used in resource
bundle inheritance. For the following services, the fallback is
always directly to the root locale rather than through default
locale.
collation
break iteration
case mapping
transliteration
The lookup for transliteration is yet more
complicated because of the interplay of source and target
locales: see
Part 2 General, Section
10.1
Inheritance.
Thus if there is no Akan locale, for example, asking for a
collation for Akan should produce the root collation,
not
the Swedish collation.
The inherited item lookup must remain stable, because the
resources are built with a certain fallback in mind; changing
the core fallback order can render the bundle structure
incoherent.
Resource bundle lookup, on the other hand, is more flexible;
changes in the view of the "best" match between the input
request and the output bundle are more easily tolerated when they
represent overall improvements for users. For more information, see
A.1 Element
fallback
Where the LDML inheritance relationship does not match a
target system, such as POSIX, the data logically should be
fully resolved in converting to a format for use by that
system, by adding
all
inherited data to each locale data
set.
For a more complete description of how inheritance applies
to data, and the use of keywords, see
Section 4.2 Inheritance
The locale data does not contain general character
properties that are derived from the
Unicode Character Database [UAX44]. That data
being common across locales, it is not duplicated in the
bundles. Constructing a POSIX locale from the CLDR data
requires use of UCD data. In addition, POSIX locales may also
specify the character encoding, which requires the data to be
transformed into that target encoding.
Warning:
If a locale has a different script than its
parent (for example, sr_Latn), then special attention must be
paid to make sure that all inheritance is covered. For example,
auxiliary exemplar characters may need to be empty ("[]") to
block inheritance.
Empty Override:
There is one special value
reserved in LDML to indicate that a child locale is to have no
value for a path, even if the parent locale has a value for
that path. That value is "∅∅∅". For example, if there is no
phrase for "two days ago" in a language, that can be indicated
with:
4.1.2 Lateral
Inheritance
In clearly specified instances, resources may inherit from
within the same locale. For example, currency format symbols
inherit from the number format symbols; the Buddhist calendar
inherits from the Gregorian calendar. This
only
happens
where documented in this specification. In these special cases,
the inheritance functions as normal, up to the root. If the
data is not found along that path, then a second search is
made, logically changing the element/attribute to the alternate
values.
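For example, for the locale "en_US" the month data in the
Buddhist calendar inherits first from the Buddhist calendar in
"en", then in "root". If
not found there, then it inherits from the Gregorian calendar
(type="gregorian") in
"en_US", then "en", then in "root".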
There is one special case, for items with a "count"
parameter (used to select a plural form). In that case, the
inheritance works as follows:
If there is no value for a path, and that path has a
[@count="x"] attribute and value, then:
If "x" is anything but "other", it falls back to
[@count="other"], within the same locale.
In the special case of currencies, if the
[@count="other"] value is missing, it falls back to the path
that is completely missing the count item.
If there is no value within the same locale, the same
process is used in the parent locale, and so on.
Examples:
Count Fallback: normal
Locale  Path
fr-CA   //ldml/units/unitLength[@type="narrow"]/unit[@type="mass-gram"]/unitPattern[@count="x"]
fr-CA   //ldml/units/unitLength[@type="narrow"]/unit[@type="mass-gram"]/unitPattern[@count="other"]
fr      //ldml/units/unitLength[@type="narrow"]/unit[@type="mass-gram"]/unitPattern[@count="x"]
fr      //ldml/units/unitLength[@type="narrow"]/unit[@type="mass-gram"]/unitPattern[@count="other"]
root    //ldml/units/unitLength[@type="narrow"]/unit[@type="mass-gram"]/unitPattern[@count="x"]
root    //ldml/units/unitLength[@type="narrow"]/unit[@type="mass-gram"]/unitPattern[@count="other"]
Note that there may be an alias in root that changes the
path and starts again from the requested locale, such as:
<alias source="locale" path="../unitLength[@type='short']"/>
Count Fallback: currency
Locale  Path
fr-CA   //ldml/numbers/currencies/currency[@type="CAD"]/displayName[@count="x"]
fr-CA   //ldml/numbers/currencies/currency[@type="CAD"]/displayName[@count="other"]
fr-CA   //ldml/numbers/currencies/currency[@type="CAD"]/displayName
fr      //ldml/numbers/currencies/currency[@type="CAD"]/displayName[@count="x"]
fr      //ldml/numbers/currencies/currency[@type="CAD"]/displayName[@count="other"]
fr      //ldml/numbers/currencies/currency[@type="CAD"]/displayName
root    //ldml/numbers/currencies/currency[@type="CAD"]/displayName[@count="x"]
root    //ldml/numbers/currencies/currency[@type="CAD"]/displayName[@count="other"]
root    //ldml/numbers/currencies/currency[@type="CAD"]/displayName
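The count fallback order in the two tables above can be generated mechanically. The following is a minimal sketch, assuming the locale chain is already known and ignoring root aliases; the function name count_fallback_steps and its parameters are illustrative, not part of LDML.

```python
def count_fallback_steps(locale_chain, base_path, count, is_currency=False):
    """List (locale, path) pairs in the order they would be tried.

    locale_chain: locales from most specific to root, e.g. ["fr-CA", "fr", "root"].
    base_path:    the path without any [@count=...] attribute.
    """
    steps = []
    for locale in locale_chain:
        steps.append((locale, f'{base_path}[@count="{count}"]'))
        if count != "other":
            # Any count other than "other" falls back to "other" in the same locale.
            steps.append((locale, f'{base_path}[@count="other"]'))
        if is_currency:
            # Currencies additionally fall back to the path with no count at all.
            steps.append((locale, base_path))
    return steps


path = '//ldml/numbers/currencies/currency[@type="CAD"]/displayName'
for locale, p in count_fallback_steps(["fr-CA", "fr", "root"], path, "x", is_currency=True):
    print(locale, p)
```

With is_currency=True this yields the nine rows of the currency table; with the default it yields the six rows of the normal table.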
4.1.3 Parent Locales
In some cases, the normal truncation inheritance does not
function well. This happens when:
The child locale is of a different script. In this case,
mixing elements from the parent into the child data results
in a mishmash.
A large number of child locales behave similarly, and
differently from the truncation parent.
The
parentLocale
element is
used to override the normal inheritance when accessing CLDR
data.
For case 1, the children are script locales, and the parent
is "root". For example:
For case 2, the children and parent share the same primary
language, but the region is changed. For example:
Collation data, however, is an exception. Since collation
rules do not truly inherit data from the parent, the
parentLocale element is not necessary and not used for
collation. Thus, for a locale like zh_Hant in the example
above, the parentLocale element would dictate the parent as
"root" when referring to main locale data, but for collation
data, the parent locale would still be "zh", even though the
parentLocale element is present for that locale.
Since parentLocale information is not localizable on a per
locale basis, the parentLocale information is contained in
CLDR’s
supplemental data.
When a
parentLocale
element is
used to override normal inheritance, the following invariants
must always be true:
If X is the parentLocale of Y, then either X is the root
locale, or X has the same base language code as Y. For
example, the parent of "en" cannot be "fr", and the parent of
"en_YY" cannot be "fr" or "fr_XX".
If X is the parentLocale of Y, Y must not be a base
language locale. For example, the parent of "en" cannot be
"en_XX".
There can never be cycles, such as: X parent of Y ...
parent of X.
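The three invariants above lend themselves to a simple consistency check. The sketch below is illustrative only: it assumes the parentLocale data has already been loaded into a dictionary mapping each child locale to its explicit parent, and the helper names are hypothetical.

```python
def base_language(locale: str) -> str:
    return locale.split("_")[0]


def parent_of(locale: str, explicit: dict) -> str:
    """Explicit parentLocale if present, otherwise simple truncation."""
    if locale in explicit:
        return explicit[locale]
    head, sep, _ = locale.rpartition("_")
    return head if sep else "root"


def check_parent_locales(explicit: dict) -> list:
    """Return a list of human-readable invariant violations (empty if none)."""
    problems = []
    for child, parent in explicit.items():
        # Invariant 1: the parent is root or shares the child's base language.
        if parent != "root" and base_language(parent) != base_language(child):
            problems.append(f"{child}: parent {parent} has a different base language")
        # Invariant 2: a base language locale must not have an explicit parent.
        if "_" not in child:
            problems.append(f"{child}: base language locale has a parentLocale")
    # Invariant 3: following parents from any child must reach root without a cycle.
    for start in explicit:
        seen = {start}
        current = parent_of(start, explicit)
        while current != "root":
            if current in seen:
                problems.append(f"{start}: cycle through {current}")
                break
            seen.add(current)
            current = parent_of(current, explicit)
    return problems


print(check_parent_locales({"zh_Hant": "root", "es_AR": "es_419"}))  # []
```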
4.2
Inheritance and Validity
The following describes in more detail how to determine the
exact inheritance of elements, and the validity of a given
element in LDML.
4.2.1 Definitions
Blocking
elements are those whose subelements do not
inherit from parent locales. For example, a <collation>
element is a blocking element: everything in a <collation>
element is treated as a single lump of data,
as far as inheritance is concerned. For more information, see
Section 5.5 Valid Attribute
Values
Attributes that serve to distinguish multiple elements at
the same level are called
distinguishing
attributes. For
example, the
type
attribute distinguishes different
elements in lists of translations, such as:
Distinguishing attributes affect inheritance; two elements
with different distinguishing attributes are treated as
different for purposes of inheritance. For more information,
see
Section 5.5 Valid
Attribute Values
. Other attributes are called
nondistinguishing (or informational) attributes. These carry
separate information, and do not affect inheritance.
For any element in an XML file,
an element chain
is a
resolved [
XPath
] leading from the root to
an element, with attributes on each element in alphabetical
order. So in, say,
we may have:
...
Which gives the following element chains (among others):
//ldml/identity/version[@number="1.1"]
//ldml/localeDisplayNames/languages/language[@type="ar"]
An element chain A is an
extension
of an element
chain B if B is equivalent to an initial portion of A. For
example, #2 below is an extension of #1. (Equivalent, depending
on the tree, may not be "identical to". See below for an
example.)
//ldml/localeDisplayNames
//ldml/localeDisplayNames/languages/language[@type="ar"]
An LDML file can be thought of as an ordered list of
element pairs: <element chain, data>, where the
element chains are all the chains for the end-nodes. (This
works because of restrictions on the structure of LDML,
including that it does not allow mixed content.) The ordering
is the ordering that the element chains are found in the file,
and thus determined by the DTD.
For example, some of those pairs would be the following.
Notice that the first has the null string as element
contents.
//ldml/identity/version[@number="1.1"]
""
//ldml/localeDisplayNames/languages/language[@type="ar"]
"Αραβικά"
Note:
There are two exceptions to this:
Blocking nodes and their contents are treated as a
single end node.
In terms of computing inheritance, the element pair
consists of the element chain plus all distinguishing
attributes; the value consists of the value (if any) plus
any nondistinguishing attributes.
Thus instead of the element pair being (a) below, it is
(b):
(a) <//ldml/dates/calendars/calendar[@type='gregorian']/week/weekendStart[@day='sun'][@time='00:00'], "">
(b) <//ldml/dates/calendars/calendar[@type='gregorian']/week/weekendStart, [@day='sun'][@time='00:00']>
Two LDML element chains are
equivalent
when they
would be identical if all attributes and their values were
removed — except for distinguishing attributes. Thus the
following are equivalent:
//ldml/localeDisplayNames/languages/language[@type="ar"]
//ldml/localeDisplayNames/languages/language[@type="ar"][@draft="unconfirmed"]
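The equivalence test can be sketched by stripping nondistinguishing attributes before comparing. This is a rough illustration, assuming the set of nondistinguishing attributes is known in advance; the two names listed here (draft, references) are a small hand-picked subset, and the real classification comes from the DTD, not from this code.

```python
import re

# Assumption: a hand-picked subset of nondistinguishing attribute names.
NONDISTINGUISHING = {"draft", "references"}

# Matches one attribute predicate such as [@type="ar"] or [@day='sun'].
ATTRIBUTE = re.compile(r"""\[@(\w+)=(?:"[^"]*"|'[^']*')\]""")


def strip_nondistinguishing(chain: str) -> str:
    """Remove every nondistinguishing attribute predicate from an element chain."""
    def keep_or_drop(match):
        return "" if match.group(1) in NONDISTINGUISHING else match.group(0)
    return ATTRIBUTE.sub(keep_or_drop, chain)


def equivalent(chain_a: str, chain_b: str) -> bool:
    return strip_nondistinguishing(chain_a) == strip_nondistinguishing(chain_b)


a = '//ldml/localeDisplayNames/languages/language[@type="ar"]'
b = '//ldml/localeDisplayNames/languages/language[@type="ar"][@draft="unconfirmed"]'
print(equivalent(a, b))  # True
```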
For any locale ID, a
locale chain
is an ordered list
starting with the root and leading down to the ID. For
example:
4.2.2 Resolved Data File
To produce a fully resolved locale data file from CLDR for a
locale ID L, you start with L, and successively add unique
items from the parent locales until you get up to root. More
formally, this can be expressed as the following procedure.
1. Let Result be initially L.
2. For each Li in the locale chain for L, starting at L and going up to root:
   2.1 Let Temp be a copy of the pairs in the LDML file for Li
   2.2 Replace each alias in Temp by the resolved list of pairs it points to.
       2.2.1 The resolved list of pairs is obtained by recursively applying this procedure.
       2.2.2 That alias now blocks any inheritance from the parent. (See Section 5.1 Common Elements for an example.)
   2.3 For each element pair P in Temp:
       2.3.1 If P does not contain a blocking element, and Result does not have an element pair Q with an equivalent element chain, add P to Result.
Notes:
When adding an element pair to a result, it has to go in
the right order for it to be valid according to the DTD.
The identity element and its children are unaffected by
resolution.
The LDML data must be constructed so as to avoid
circularity in step 2.2.
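The core of the resolution procedure, where more specific locales win and parents only contribute pairs that are still missing, can be sketched as below. This is a simplified illustration under stated assumptions: each file is modeled as a dictionary from element chain to value, and alias replacement, blocking elements, chain equivalence and DTD ordering are all omitted. The name resolve and the sample data are hypothetical.

```python
def resolve(locale_chain, files):
    """Merge element pairs from L up to root; the first value found for a chain wins.

    locale_chain: locales from the requested locale up to root, e.g. ["en_IE", "en", "root"].
    files:        dict mapping locale id -> {element chain: value}.
    """
    result = {}
    for locale in locale_chain:
        for chain, value in files.get(locale, {}).items():
            # Only add the pair if no more specific locale already supplied it.
            result.setdefault(chain, value)
    return result


files = {
    "root": {"//ldml/a": "root-a", "//ldml/b": "root-b"},
    "en": {"//ldml/a": "en-a"},
    "en_IE": {"//ldml/c": "en_IE-c"},
}
print(resolve(["en_IE", "en", "root"], files))
# {'//ldml/c': 'en_IE-c', '//ldml/a': 'en-a', '//ldml/b': 'root-b'}
```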
4.2.3 Valid Data
The attribute
draft="x"
in LDML means that the data
has not been approved by the subcommittee. (For more
information, see
Process
). However,
some data that is not explicitly marked as
draft
may be
implicitly
draft
, either because it inherits it from a
parent, or from an enclosing element.
Example 2.
Suppose that new locale data is added for
af (Afrikaans). To indicate that all of the data is
unconfirmed
, the attribute can be added to the top
level.
Any data can be added to that file, and the status will all
be draft=
unconfirmed
. Once an item is vetted—
whether
it is inherited or explicitly in the file
—then its status
can be changed to
approved
. This can be done either by
leaving draft="unconfirmed" on the enclosing element and
marking the child with draft="approved", such as:
However, normally the draft attributes should be
canonicalized, which means they are pushed down to leaf nodes
as described in
Section 5.6
Canonical Form
. If an LDML file does have draft
attributes that are not on leaf nodes, the file should be
interpreted as if it were the canonicalized version of that
file.
More formally, here is how to determine whether data for an
element chain E is implicitly or explicitly draft, given a
locale L. Sections 1, 2, and 4 are simply formalizations of
what is in LDML already. Item 3 adds the new element.
4.2.4 Checking for Draft
Status
1. Parent Locale Inheritance
   1.1 Walk through the locale chain until you find a locale ID L' with a data file D. (L' may equal L).
   1.2 Produce the fully resolved data file D' for D.
   1.3 In D', find the first element pair whose element chain E' is either equivalent to or an extension of E.
   1.4 If there is no such E', return true
   1.5 If E' is not equivalent to E, truncate E' to the length of E.
2. Enclosing Element Inheritance
   2.1 Walk through the elements in E', from back to front.
       2.1.1 If you ever encounter draft=x, return x
   2.2 If L' = L, return false
3. Missing File Inheritance
   3.1 Otherwise, walk again through the elements in E', from back to front.
       3.1.1 If you encounter a validSubLocales attribute (deprecated):
             3.1.1.1 If L is in the attribute value, return false
             3.1.1.2 Otherwise return true
4. Otherwise
   4.1 Return true
The validSubLocales in the most specific (farthest from root
file) locale file "wins" through the full resolution step (data
from more specific files replacing data from less specific
ones).
4.2.5 Keyword and Default
Resolution
When accessing data based on keywords, the following process
is used. Consider the following example:
The locale 'de' has collation types A, B, C, and no default collation type specified.
The locale 'de_CH' has a default collation type of B.
Here are the searches for various combinations.
User Input            Lookup in Locale  For                      Comment
de_CH (no keyword)    de_CH             default collation type   finds "B"
                      de_CH             collation type=B         not found
                      de                collation type=B         found
de (no keyword)       de                default collation type   not found
                      root              default collation type   finds "standard"
                      de                collation type=standard  not found
                      root              collation type=standard  found
de_u_co_A             de                collation type=A         found
de_u_co_standard      de                collation type=standard  not found
                      root              collation type=standard  found
de_u_co_foobar        de                collation type=foobar    not found
                      root              collation type=foobar    not found, starts looking for default
                      de                default collation type   not found
                      root              default collation type   finds "standard"
                      de                collation type=standard  not found
                      root              collation type=standard  found
Examples of "search" collator lookup; 'de' has a
language-specific version, but 'en' does not:
User Input            Lookup in Locale  For                      Comment
de_CH_u_co_search     de_CH             collation type=search    not found
                      de                collation type=search    found
en_US_u_co_search     en_US             collation type=search    not found
                      en                collation type=search    not found
                      root              collation type=search    found
Examples of lookup for Chinese collation types. Note:
All of the Chinese-specific collation types are provided
in the 'zh' locale
For 'zh' the default collation type is "pinyin";
for 'zh_Hant' the default collation type is "stroke".
However any of the available Chinese collation types can be
explicitly requested for any Chinese locale.
User Input               Lookup in Locale  For                      Comment
zh_Hant (no keyword)     zh_Hant           default collation type   finds "stroke"
                         zh_Hant           collation type=stroke    not found
                         zh                collation type=stroke    found
zh_Hant_HK_u_co_pinyin   zh_Hant_HK        collation type=pinyin    not found
                         zh_Hant           collation type=pinyin    not found
                         zh                collation type=pinyin    found
zh (no keyword)          zh                default collation type   finds "pinyin"
                         zh                collation type=pinyin    found
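The search order illustrated in the tables above can be sketched as a small function. This is only an illustration under stated assumptions: collation data is modeled as a dictionary per locale with an optional "default" entry and a set of available "types", the locale chain is passed in explicitly, and the names resolve_collation and DATA are hypothetical.

```python
# Toy data mirroring the 'de' / 'de_CH' example; not real CLDR content.
DATA = {
    "root": {"default": "standard", "types": {"standard", "search"}},
    "de": {"types": {"A", "B", "C"}},
    "de_CH": {"default": "B", "types": set()},
}


def resolve_collation(locale_chain, requested=None, data=DATA):
    """Return (collation type, locale that supplies it), or None if nothing is found."""
    def find_type(collation_type):
        for locale in locale_chain:
            if collation_type in data.get(locale, {}).get("types", ()):
                return locale
        return None

    def find_default():
        for locale in locale_chain:
            default = data.get(locale, {}).get("default")
            if default is not None:
                return default
        return None

    if requested is not None:
        found_in = find_type(requested)
        if found_in is not None:
            return requested, found_in
        # Requested type not found anywhere: start looking for the default.
    default = find_default()
    if default is None:
        return None
    return default, find_type(default)


print(resolve_collation(["de_CH", "de", "root"]))      # ('B', 'de')
print(resolve_collation(["de", "root"], "foobar"))     # ('standard', 'root')
```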
Note:
It is an invariant that the default in root
for a given element must
always be a value that exists in root. So you can not have
the following in root:
For identifiers, such as language codes, script codes,
region codes, variant codes, types, keywords, currency symbols
or currency display names, the default value is the identifier
itself whenever no value is found in the root. Thus if there
is no display name for the region code 'QA' in root, then the
display name is simply 'QA'.
4.2.6
Inheritance vs Related Information
There are related types of data and processing that are easy
to confuse:
Inheritance
Part of the internal mechanism used by CLDR
to organize and manage locale data. This is used to share
common resources, and ease maintenance, and provide the
best fallback behavior in the absence of data.
Should
not be used for locale matching or likely
subtags.
Example:
parent(en_AU) ⇒ en_001
parent(en_001) ⇒ en
parent(en) ⇒ root
Data:
supplementalData.xml
Spec:
Section
4.2
Inheritance and Validity
DefaultContent
Part of the internal mechanism used by CLDR
to manage locale data. A particular sublocale is designated
the defaultContent for a parent, so that the parent
exhibits consistent behavior.
Should not be used for
locale matching or likely subtags.
Example:
addLikelySubtags(sr-ME) ⇒ sr-Latn-ME,
minimize(de-Latn-DE) ⇒ de
Data:
supplementalMetadata.xml
Spec:
Part 6: Section 9.3
Default
Content
LikelySubtags
Provides most likely full subtag (script
and region) in the absence of other information. A core
component of LocaleMatching.
Example:
addLikelySubtags(zh) ⇒ zh-Hans-CN
addLikelySubtags(zh-TW) ⇒ zh-Hant-TW
minimize(zh-Hant, favorRegion) ⇒ zh-TW
Data:
likelySubtags.xml
Spec:
Section
4.3 Likely
Subtags
LocaleMatching
Provides the best match for the user’s
language(s) among an application’s supported
languages.
Example:
bestLocale(userLangs=
appLangs=
Data:
languageInfo.xml
Spec:
Section
4.4
Language Matching
4.3 Likely Subtags
There are a number of situations where it is useful to be
able to find the most likely language, script, or region. For
example, given the language "zh" and the region "TW", what is
the most likely script? Given the script "Thai" what is the
most likely language or region? Given the region TW, what is
the most likely language and script?
Conversely, given a locale, it is useful to find out which
fields (language, script, or region) may be superfluous, in the
sense that they contain the likely tags. For example, "en_Latn"
can be simplified down to "en" since "Latn" is the likely
script for "en"; "ja_Jpan_JP" can be simplified down to
"ja".
The
likelySubtag
supplemental data provides default
information for computing these values. This data is based on
the default content data, the population data, and the
suppress-script data in [
BCP47
]. It is
heuristically derived, and may change over time.
For the relationship between Inheritance, DefaultContent,
LikelySubtags, and LocaleMatching, see
Section
4.2.6
Inheritance vs
Related Information
To look up data in the table, see if a locale matches one of
the
from
attribute values. If so, fetch the
corresponding
to
attribute value. For example, the
Chinese data looks like the following:
So looking up "zh_TW" returns "zh_Hant_TW", while looking up
"zh" returns "zh_Hans_CN".
In more detail, the data is designed to be used in the
following operations.
Note that as of CLDR v24, any field present in the 'from'
field, is also present in the 'to' field, so an input field
will not change in "Add Likely Subtags" operation. The data and
operations can also be used with language tags using [
BCP47
] syntax, with the appropriate changes. In
addition, certain common 'denormalized' language subtags such
as 'iw' (for 'he') may occur in both the 'from' and 'to'
fields. This allows for implementations that use those
denormalized subtags to use the data with only minor changes to
the operations.
An implementation may choose to exclude language tags with the language subtag "und" from the following operation. In such a case, only the canonicalization is done. An implementation can declare that it is doing the exclusion, or can take a parameter that controls whether or not to do it.
Add Likely Subtags:
Given a source locale
X, to return a locale Y where the empty subtags have been
filled in by the most likely subtags.
This is written as X
⇒ Y ("X maximizes to Y").
A subtag is called
empty
if it is a missing script
or region subtag, or it is a base language subtag with the
value "und". In the description below, a subscript on a subtag
indicates which tag it is from:
xₛ is in the source,
xₘ is in a match, and
xᵣ is in the final result.
This operation is performed in the following way.
1. Canonicalize.
   1.1 Make sure the input locale is in canonical form: uses the right separator, and has the right casing.
   1.2 Replace any deprecated subtags with their canonical values using the alias information in the supplemental metadata. Use the first value in the replacement list, if it exists. Language tag replacements may have multiple parts, such as "sh" ➞ "sr_Latn" or "mo" ➞ "ro_MD". In such a case, the original script and/or region are retained if there is one. Thus "sh_Arab_AQ" ➞ "sr_Arab_AQ", not "sr_Latn_AQ".
   1.3 If the tag is grandfathered (see the supplemental data), then return it.
   1.4 Remove the script code 'Zzzz' and the region code 'ZZ' if they occur.
   1.5 Get the components of the cleaned-up source tag (languageₛ, scriptₛ, and regionₛ), plus any variants and extensions.
2. Lookup. Lookup each of the following in order, and stop on the first match:
   2.1 languageₛ_scriptₛ_regionₛ
   2.2 languageₛ_regionₛ
   2.3 languageₛ_scriptₛ
   2.4 languageₛ
   2.5 und_scriptₛ
3. Return
   3.1 If there is no match, either return
       an error value, or
       the match for "und" (in APIs where a valid language tag is required).
   3.2 Otherwise there is a match = languageₘ_scriptₘ_regionₘ
   3.3 Let xᵣ = xₛ if xₛ is not empty, and xₘ otherwise.
   3.4 Return the language tag composed of languageᵣ _ scriptᵣ _ regionᵣ + variants + extensions.
The lookup can be optimized. For example, if any of the tags
in Step 2 are the same as previous ones in that list, they do
not need to be tested.
Example1:
Input is ZH-ZZZZ-SG.
Normalize to zh_SG.
Lookup in table. No match.
Lookup zh, and get the match (zh_Hans_CN). Substitute
SG, and return zh_Hans_SG.
To find the most likely language for a country, or language
for a script, use "und" as the language subtag. For example,
looking up "und_TW" returns zh_Hant_TW.
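The Add Likely Subtags operation can be sketched as follows. This is a minimal illustration, not a conformant implementation: the LIKELY_SUBTAGS table is a small hand-picked excerpt assumed for the example, the helper names are hypothetical, and the replacement of deprecated subtags and the grandfathered-tag check are omitted. When nothing matches, the sketch returns None as its error value.

```python
# Assumption: a tiny excerpt of likely-subtags data, enough for the examples.
LIKELY_SUBTAGS = {
    "und": "en_Latn_US",
    "und_TW": "zh_Hant_TW",
    "zh": "zh_Hans_CN",
    "zh_Hant": "zh_Hant_TW",
    "zh_TW": "zh_Hant_TW",
}


def split_tag(tag):
    """Canonicalize separator and casing; return (language, script, region, rest)."""
    parts = tag.replace("-", "_").split("_")
    language, rest = parts[0].lower(), parts[1:]
    script = region = ""
    if rest and len(rest[0]) == 4 and rest[0].isalpha():
        script = rest.pop(0).title()
    if rest and ((len(rest[0]) == 2 and rest[0].isalpha())
                 or (len(rest[0]) == 3 and rest[0].isdigit())):
        region = rest.pop(0).upper()
    # Remove the "unknown" script and region codes.
    if script == "Zzzz":
        script = ""
    if region == "ZZ":
        region = ""
    return language, script, region, rest


def join(*subtags):
    return "_".join(s for s in subtags if s)


def add_likely_subtags(tag, table=LIKELY_SUBTAGS):
    language, script, region, rest = split_tag(tag)
    candidates = [join(language, script, region), join(language, region),
                  join(language, script), language]
    if script:
        candidates.append(join("und", script))
    match = next((table[c] for c in candidates if c in table), None)
    if match is None:
        return None  # error value
    m_language, m_script, m_region, _ = split_tag(match)
    # Keep each non-empty source subtag; fill empty ones from the match.
    return join(m_language if language == "und" else language,
                script or m_script, region or m_region, *rest)


print(add_likely_subtags("ZH-ZZZZ-SG"))  # zh_Hans_SG
print(add_likely_subtags("und_TW"))      # zh_Hant_TW
```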
A goal of the algorithm is that if X ⇒ Y, and X' results
from replacing an empty subtag in X by the corresponding
subtag in Y, then X' ⇒ Y. For example, if und_AF ⇒ fa_Arab_AF,
then:
fa_Arab_AF ⇒ fa_Arab_AF
und_Arab_AF ⇒ fa_Arab_AF
fa_AF ⇒ fa_Arab_AF
There are a small number of exceptions to this goal in the
current data, where X ∈ {und_Bopo, und_Brai, und_Cakm,
und_Limb, und_Shaw}.
Remove
Likely Subtags:
Given a
locale, remove any fields that Add Likely Subtags would
add.
The reverse operation removes fields that would be added by
the first operation.
First get
max = AddLikelySubtags(inputLocale). If an error is signaled,
return it.
Remove
the variants from max.
Then for
trial
in {language, language _ region, language _
script}
If
AddLikelySubtags(
trial
) = max, then return
trial
+ variants.
If you do
not get a match, return max + variants.
Example:
Input is zh_Hant. Maximize to get zh_Hant_TW.
zh => zh_Hans_CN. No match, so continue.
zh_TW => zh_Hant_TW. Matches, so return zh_TW.
A variant of this favors the script over the region, thus
using {language, language_script, language_region} in the
above. If that variant is used, then the result in this example
would be zh_Hant instead of zh_TW.
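The Remove Likely Subtags steps can be sketched in the same spirit. To keep the example self-contained, the maximizing function is passed in as a parameter; here a toy dictionary lookup stands in for Add Likely Subtags. The function name, the favor_script flag and the toy table are assumptions for illustration, and variants are assumed to be absent.

```python
def remove_likely_subtags(tag, maximize, favor_script=False):
    """Return the shortest trial tag that maximizes to the same result as `tag`.

    maximize: a function implementing Add Likely Subtags; returns None on error.
    Assumes the input carries no variants or extensions.
    """
    max_tag = maximize(tag)
    if max_tag is None:
        return None
    language, script, region = max_tag.split("_")[:3]
    trials = [language, f"{language}_{region}", f"{language}_{script}"]
    if favor_script:
        # Variant that prefers keeping the script over the region.
        trials = [language, f"{language}_{script}", f"{language}_{region}"]
    for trial in trials:
        if maximize(trial) == max_tag:
            return trial
    return max_tag


# Toy stand-in for Add Likely Subtags, covering only the tags used below.
TOY_MAXIMIZE = {
    "zh": "zh_Hans_CN",
    "zh_TW": "zh_Hant_TW",
    "zh_Hant": "zh_Hant_TW",
    "zh_Hant_TW": "zh_Hant_TW",
}.get

print(remove_likely_subtags("zh_Hant", TOY_MAXIMIZE))                     # zh_TW
print(remove_likely_subtags("zh_Hant", TOY_MAXIMIZE, favor_script=True))  # zh_Hant
```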
4.4 Language Matching
) >
matchVariable*, languageMatch* ) >
matchVariable*, languageMatch* ) >
Implementers are often faced with the issue of how to match
the user's requested languages with their product's supported
languages. For example, suppose that a product supports {ja-JP,
de, zh-TW}. If the user understands written American English,
German, French, Swiss German, and Italian, then
de
would be the best match; if s/he
understands only Chinese (zh), then zh-TW would be the best
match.
The standard truncation-fallback algorithm does not work
well when faced with the complexities of natural language. The
language matching data is designed to fill that gap. Stated in
those terms, language matching can have the effect of a more
complex fallback, such as:
sr-Cyrl-RS
sr-Cyrl
sr-Latn-RS
sr-Latn
sr
hr-Latn
hr
Language matching is used to find the best supported locale
ID given a requested list of languages. The requested list
could come from different sources, such as the user's
list of preferred languages in the OS Settings, or from a
browser Accept-Language list. For example, if my native tongue
is English, I can understand Swiss German and German, my French
is rusty but usable, and Italian basic, ideally an
implementation would allow me to select {gsw, de, fr} as my
preferred list of languages, skipping Italian because my
comprehension is not good enough for arbitrary content.
Language Matching can also be used to get fallback data
elements. In many cases, there may not be full data for a
particular locale. For example, for a Breton speaker, the best
fallback if data is unavailable might be French. That is,
suppose we have found a Breton bundle, but it does not contain
translation for the key "CN" (for the country China). It is
best to return "chine", rather than falling back to the value
in a default language such as Russian and getting "Кітай". The
language matching data can be used to get the closest fallback
locales (of those supported) to a given language.
For the relationship between Inheritance, DefaultContent,
LikelySubtags, and LocaleMatching, see
Section
4.2.6
Inheritance vs
Related Information
When such fallback is used for inherited item lookup, the
normal order of inheritance applies,
except that before using any data from
root
the data for the fallback locales would be used if available.
Language matching does not interact with the fallback of
resources
within the locale-parent chain
. For
example, suppose that we are looking for the value for a
particular path P in nb-NO.
In the absence of aliases, normally the following lookup is
used.
nb-NO
nb
root
That is, we first look in nb-NO. If there
is no value for P there, then we look in nb.
If there is no value for P there, we return the value for P
in root (or a code value, if there is
nothing there). Remember that if there is an alias element
along this path, then the lookup may restart with a different
path in nb-NO (or another locale).
However, suppose that
nb-NO
has the
fallback values
[nn da sv en]
, derived from
language matching. In that case, an implementation
may
progressively lookup each of the listed locales, with the
appropriate substitutions, returning the first value that is
not found in
root
. This follows roughly the
following pseudocode:
value = lookup(P, nb-NO); if (locationFound != root) return value;
value = lookup(P, nn-NO); if (locationFound != root) return value;
value = lookup(P, da-NO); if (locationFound != root) return value;
value = lookup(P, sv-NO); if (locationFound != root) return value;
value = lookup(P, en-NO); return value;
The locales in the fallback list are not used recursively.
For example, for the lookup of a path in nb-NO, if
fr
were a fallback value for
da
, it would not matter for the above process.
Only the original language matters.
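The pseudocode above can be turned into a small runnable sketch. It assumes a caller-supplied lookup function that returns both the value and the locale in which it was found; that function, its signature and the toy data are hypothetical stand-ins for an implementation's own item lookup.

```python
def lookup_with_fallbacks(path, locale, fallback_languages, lookup):
    """Try the locale, then each fallback language with the same trailing subtags.

    lookup(path, locale) must return (value, found_in), where found_in is the
    locale that actually supplied the value ("root" if only root had it).
    """
    _, separator, rest = locale.partition("-")
    candidates = [locale] + [fb + separator + rest for fb in fallback_languages]
    value = None
    for candidate in candidates:
        value, found_in = lookup(path, candidate)
        if found_in != "root":
            return value
    # Every candidate resolved only in root: return the last value obtained.
    return value


# Toy lookup: only nn has a value for path "P"; everything else comes from root.
def toy_lookup(path, locale):
    if path == "P" and locale.startswith("nn"):
        return "nn value", "nn"
    return "root value", "root"


print(lookup_with_fallbacks("P", "nb-NO", ["nn", "da", "sv", "en"], toy_lookup))
# nn value
```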
The language matching data is intended to be used according
to the following algorithm. This is a logical description, and
can be optimized for production in many ways. In this
algorithm, the languageMatching data is interpreted as an
ordered list.
Distances between a given pair of subtags can be larger or smaller than the typical distances. For example, the distance between en and en-GB can be greater than that between en-GB and en-IE. In some cases, language and/or script differences can be as small as the typical region difference. (Example: sr-Latn vs. sr-Cyrl).
The distances resulting from the table are not linear, but are rather chosen to produce expected results. So a distance of 10 is not necessarily twice as "bad" as a distance of 5. Implementations may want to have a mode where script distances should swamp language distances. The tables are built such that this can be accomplished by multiplying the language distance by 0.25.
The language matching algorithm takes a list of a user’s
desired languages, and a list of the application’s supported
languages.
Set the best weighted distance BWD to ∞
Set the best desired language BD to null
Set the best supported language BS to null
For each desired language D
    Compute a demotion value F, based on the position in the list.
        This demotion value is up to the implementation,
        but is typically a positive value that increases
        according to how far D is from the start of the
        desired language list.
    For each supported language S
        Find the matching distance MD as described below.
        Compute the weighted distance WD as F + MD
        If WD < BWD
            BWD = WD
            BD = D
            BS = S
If the BWD is less than a threshold, return
    The threshold is implementation-defined, typically
    set to greater than a default region difference, and less
    than a default script difference.
Otherwise BD = the default supported language (like
English); return
To find the matching distance MD between any two languages,
perform the following steps.
Maximize each language using Section 4.3 Likely Subtags
    und is a special case: see below.
Set the match-distance MD to 0
For each subtag in {language, script, region}
    If respective subtags in each language tag are
    identical, remove the subtag from each (logically) and
    continue.
    Traverse the languageMatching data until a match is found.
        * matches any field.
        If the oneway flag is false, then the match is
        symmetric; otherwise only match one direction.
        For region matching, use the mechanisms in
        Section 4.4.1 Enhanced Language Matching
    Add the distance attribute value to MD.
        This used to be a percent attribute value, which was
        100 - the distance attribute value.
    Remove the subtag from each (logically)
Return MD
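The distance computation can be sketched as follows. This is a simplified illustration under several assumptions: both tags are already maximized and passed as (language, script, region) tuples, the special handling of und and the enhanced region matching are omitted, and `RULES` is a hypothetical stand-in for the ordered languageMatching data. Only the two distances of 4 come from the nn/nb example in this section; the other values are invented.

```python
from typing import List, Sequence, Tuple

Rule = Tuple[Tuple[str, ...], Tuple[str, ...], int, bool]  # desired, supported, distance, oneway

# Hypothetical ordered rule list; "*" matches any field.
RULES: List[Rule] = [
    (("nn",), ("nb",), 4, False),
    (("br",), ("fr",), 20, True),            # invented one-way rule
    (("*", "*", "*"), ("*", "*", "*"), 4, False),
    (("*", "*"), ("*", "*"), 50, False),     # invented script distance
    (("*",), ("*",), 80, False),             # invented language distance
]


def field_matches(pattern: Sequence[str], tag: Sequence[str]) -> bool:
    return len(pattern) == len(tag) and all(p == "*" or p == t for p, t in zip(pattern, tag))


def find_distance(desired: Sequence[str], supported: Sequence[str], rules: List[Rule]) -> int:
    """Traverse the ordered rules and return the distance of the first match."""
    for rule_desired, rule_supported, distance, oneway in rules:
        if field_matches(rule_desired, desired) and field_matches(rule_supported, supported):
            return distance
        if not oneway and field_matches(rule_desired, supported) and field_matches(rule_supported, desired):
            return distance
    raise LookupError("no matching rule")


def matching_distance(desired: Sequence[str], supported: Sequence[str], rules: List[Rule] = RULES) -> int:
    """Sum distances while logically removing region, then script, then language."""
    d, s = list(desired), list(supported)
    md = 0
    while d:
        if d[-1] != s[-1]:  # identical subtags contribute nothing
            md += find_distance(d, s, rules)
        d.pop()
        s.pop()
    return md
```

With these rules, comparing nn-Latn-DE with nb-Latn-FR yields 4 for the region step and 4 for the language step, a total distance of 8, which corresponds to the 92% figure in the older percent notation.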
It is typically useful to set the discount factor between
successive elements of the desired languages list to be
slightly greater than the default region difference. That
avoids the following problem:
Supported languages:
"de, fr, ja"
User's desired languages:
"de-AT, fr"
This user would expect to get "de", not "fr". In practice,
when a user selects a list of preferred languages, they don't
include all the regional variants ahead of their second base
language. Yet while the user's desired languages really don't
tell us the priority ranking among their languages, normally
the fall-off between the user's languages is substantially
greater than regional variants. But unless F is greater than
the distance between de-AT and de-DE, then the user’s
second-choice language would be returned.
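The outer loop of the matching algorithm, and the effect of the demotion value on the de-AT example, can be sketched as below. This is an illustration only: `toy_distance` is a hypothetical stand-in for the real matching distance (it treats any same-language pair as a region difference of 4), and returning `(None, default_supported)` when nothing is close enough is an assumption of this sketch.

```python
import math
from typing import Callable, Optional, Sequence, Tuple


def toy_distance(desired: str, supported: str) -> int:
    """Hypothetical distance: 0 if equal, 4 for a region-only difference, else 80."""
    if desired == supported:
        return 0
    if desired.split("-")[0] == supported.split("-")[0]:
        return 4
    return 80


def best_match(
    desired: Sequence[str],
    supported: Sequence[str],
    distance: Callable[[str, str], int],
    threshold: int,
    demotion_per_step: int,
    default_supported: str,
) -> Tuple[Optional[str], str]:
    best_wd = math.inf
    best_desired: Optional[str] = None
    best_supported: Optional[str] = None
    for index, d in enumerate(desired):
        demotion = index * demotion_per_step  # F grows with list position
        for s in supported:
            wd = demotion + distance(d, s)
            if wd < best_wd:
                best_wd, best_desired, best_supported = wd, d, s
    if best_wd < threshold and best_supported is not None:
        return best_desired, best_supported
    return None, default_supported  # assumption: fall back to a default


supported = ["de", "fr", "ja"]
desired = ["de-AT", "fr"]
good = best_match(desired, supported, toy_distance, 40, 5, "en")
bad = best_match(desired, supported, toy_distance, 40, 3, "en")
```

With a demotion step of 5 (greater than the toy region difference of 4) the result is the pair (de-AT, de). With a step of 3, the exact match on the second-choice language fr wins instead, which is the problem described above.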
The base language subtag "und" is a special case. Suppose we
have the following situation:
desired languages: {und, it}
supported languages: {en, it}
resulting language: en
Part of this is because 'und' has a special function in BCP
47; it stands in for 'no supplied base language'. To prevent
this from happening, if the desired base language is und, the
language matcher should not apply likely subtags to
it.
Examples:
For example, suppose that nn-DE and nb-FR are being
compared. They are first maximized to nn-Latn-DE and
nb-Latn-FR, respectively. The list is searched. The first match
is with "*-*-*", for a match of 96%. The languages are
truncated to nn-Latn and nb-Latn, then to nn and nb. The first
match is also for a value of 96%, so the result is 92%.
Note that language matching is orthogonal to how closely
two languages are related linguistically. For example, Breton
is more closely related to Welsh than to French, but French is
the better match (because it is more likely that a Breton
reader will understand French than Welsh). This also
illustrates that the matches are often asymmetric: it is not
likely that a French reader will understand Breton.
The "*" acts as a wild card, as shown in the following
example:
When the language+region is not matched, and there is
otherwise no reason to pick among the supported regions for
that language, then some measure of geographic "closeness" can
be used. The results may be more understandable by users.
Looking for en-SK, for example, should fall back to something
within Europe (e.g. en-GB) in preference to something far away
and unrelated (e.g. en-SG). Such a closeness metric does not need
to be exact; a small amount of data can be used to give an
approximate distance between any two regions. However, any such
data must be used carefully; although Hong Kong is closer to
India than to the UK, it is unlikely that en-IN would be a
better match to en-HK than en-GB would.
4.4.1
Enhanced Language Matching
The enhanced format for language matching adds structure to
enable better matching of languages. It is distinguished by
having a suffix "_new" on the type, as in the example below.
The extended structure allows matching to take into account
broad similarities that would give better results. For example,
for English the regions that are or inherit from US
(AS|GU|MH|MP|PR|UM|VI|US) form a “cluster”. Each region in that
cluster should be closer to each other than to any other
region. And a region outside the cluster should be closer to
another region outside that cluster than to one inside. We get
this issue with the “world languages” like English, Spanish,
Portuguese, Arabic, etc.
Example:
The
matchVariable
allows for a rule to
match multiple regions, as illustrated by
$maghreb
. The syntax is simple: it allows for
+ for
union
and - for
set difference
, but no
precedence. So A+B-A+D is interpreted as (((A+B)-A)+D), not as
(A+B)-(A+D). The variable
id
has a value of
the form [$][a-zA-Z0-9]+. If $X is defined, then $!X
automatically means all those regions that are not in $X.
When the set is interpreted, then macroregions
are (logically) transformed into a list of their contents, so
“053+GB” → “AU+GB+NF+NZ”. This is done recursively, so 009 →
“053+054+057+061+QO” → “AU+NF+NZ+FJ+NC+PG+SB+VU...”. Note that
we use 019 for all of the Americas in the variables above,
because en-US should be in the same cluster as es-419 and its
contents.
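The left-to-right evaluation of a matchVariable value, with recursive expansion of macroregions, can be sketched as follows. The `CONTAINS` table is a hypothetical, heavily abbreviated stand-in for the real territory containment data, and the function names are illustrative.

```python
import re
from typing import Dict, List, Set

# Hypothetical, abbreviated containment data (the real 009 has more children).
CONTAINS: Dict[str, List[str]] = {
    "009": ["053", "054"],
    "053": ["AU", "NF", "NZ"],
    "054": ["FJ", "NC", "PG", "SB", "VU"],
}


def expand(code: str) -> Set[str]:
    """Recursively replace a macroregion by the regions it contains."""
    if code not in CONTAINS:
        return {code}
    result: Set[str] = set()
    for child in CONTAINS[code]:
        result |= expand(child)
    return result


def evaluate(expression: str) -> Set[str]:
    """Evaluate + (union) and - (set difference) strictly left to right."""
    tokens = re.findall(r"[+-]|[A-Za-z0-9]+", expression)
    result = expand(tokens[0])
    for operator, operand in zip(tokens[1::2], tokens[2::2]):
        regions = expand(operand)
        result = result | regions if operator == "+" else result - regions
    return result
```

Because there is no precedence, `evaluate("A+B-A+D")` gives {B, D}, that is (((A+B)-A)+D), whereas (A+B)-(A+D) would give only {B}.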
In the rules, the percent value (100..0) is replaced by a
distance
value, which is the inverse
(0..100).
These new variables and rules divide up the world
into clusters, where items in the same clusters (for specific
languages) get the normal regional difference, and items in
different clusters get different weights.
Each cluster can have one or more associated
paradigmLocales
. These are locales that are
preferred within a cluster. So when matching desired=[en-SA]
against [en-GU en en-IN en-GB], the value en-GB is returned.
Both of {en-GU en} are in a different cluster. While {en-IN
en-GB} are in the same cluster, and the same distance from
en-SA, the preference is given to en-GB because it is in the
paradigm locales. It would be possible to express this in
rules, but using this mechanism handles these very common cases
without bulking up the tables.
The
paradigmLocales
also allow
matching to macroregions. For example, desired=[es-419] should
match to {es-MX} more closely than to {es}, and vice versa:
{es-MX} should match more closely to {es-419} than to {es}. But
es-MX should match more closely to es-419 than to any of the
other es-419 sublocales. In general, in the absence of other
distance data, there is a ‘paradigm’ in each cluster that the
others should match more closely to: en(-US), en-GB, es(-ES),
es-419, ru(-RU)...
XML Format
There are two kinds of data that can be expressed in LDML:
language-dependent data and supplementary data. In either case,
data can be split across multiple files, which can be in
multiple directory trees.
For example, the language-dependent data for Japanese in
CLDR is present in the following files:
common/collation/ja.xml
common/main/ja.xml
common/rbnf/ja.xml
common/segmentations/ja.xml
Data for cased languages such as French are in files
like:
common/casing/fr.xml
The status of the data is the same, whether or not data is
split. That is, for the purpose of validation and lookup, all
of the data for the above ja.xml files is treated as if it was
in a single file. These files have the <ldml> root
element and use ldml.dtd. The file name must match the identity
element. For example, the
contain the following elements:
Supplemental data can have different root elements,
currently: ldmlBCP47, supplementalData, keyboard, and platform.
Keyboard and platform files are considered distinct. The
ldmlBCP47 files and supplementalData files that have the same
root are all logically part of the same file; they are simply
split into separate files for convenience. Implementations may
split the files in different ways, also for their convenience.
The files in /properties are also supplemental data files, but
are structured like UCD properties.
For example, supplemental data relating to Japan or the
Japanese writing are in:
common/supplemental/ (in many files, such as
supplementalData.xml)
common/transforms/Hiragana-Katakana.xml
common/transforms/Hiragana-Latin.xml
common/properties/scriptMetadata.txt
common/bcp47/calendar.xml
uca/allkeys_CLDR.txt (sorting)
/keyboards/chromeos/ja-t-k0-chromeos.xml
...
Like the <ldml> files, the keyboard file names must
match internal data: in particular, the locale attribute on the
keyboard element must have a value that corresponds to the file
name, such as locale="af-t-k0-android" for the
file af-t-k0-android.xml.
The following sections describe the structure of the XML
format for language-dependent data. The more precise syntax is
in the ldml.dtd file
; however, the DTD does not describe all
the constraints on the structure.
To start with, the root element is <ldml>, with the
following DTD entry:
<!ELEMENT ldml (identity,(alias|(fallback*,localeDisplayNames?,layout?,contextTransforms?,characters?,
delimiters?,measurement?,dates?,numbers?,units?,listPatterns?,collations?,posix?,
segmentations?,rbnf?,annotations?,metadata?,references?,special*)))>
The XML structure is stable over releases. Elements and
attributes may be deprecated: they are retained in the DTD but
their usage is strongly discouraged. In most cases, an
alternate structure is provided for expressing the information.
There is only one exception: newer DTDs cannot be used with
version 1.1 files, without some modification.
In general, all translatable text in this format is in
element contents, while attributes are reserved for types and
non-translated information (such as numbers or dates). The
reason that attributes are not used for translatable text is
that spaces are not preserved, and we cannot predict where
spaces may be significant in translated material.
There are two kinds of elements in LDML:
rule
elements and
structure
elements. For structure elements,
there are restrictions to allow for effective inheritance and
processing:
There is no "mixed" content: if an element has textual
content, then it cannot contain any elements.
The [
XPath
] leading to the content
is unique; no two different pieces of textual content have
the same [
XPath
].
Rule elements do not have this restriction, but also do not
inherit, except as an entire block. The rule elements are
listed in serialElements in the supplemental metadata. See also
Section 4.2 Inheritance
and Validity
. For more technical details, see
Updating-DTDs
Note that the data in examples given below is purely
illustrative, and does not match any particular language. For a
more detailed example of this format, see [
Example
]. There is also a DTD for this format, but
remember that the DTD alone is not sufficient to understand
the semantics, the constraints, nor the
interrelationships between the different elements and
attributes
. You may wish to have copies of each of these to
hand as you proceed through the rest of this document.
In particular, all elements allow for draft versions to
coexist in the file at the same time. Thus most elements are
marked in the DTD as allowing multiple instances. However,
unless an element is listed as a serialElement, or has a
distinguishing attribute, it can only occur once as a
subelement of a given element. Thus, for example, the following
is illegal even though allowed by the DTD:
There must be only one instance of these per parent, unless
there are other distinguishing attributes (such as an alt
attribute).
In general, LDML data should be in NFC format. However,
certain elements may need to contain characters that are not in
NFC, including exemplars, transforms, segmentations, and
p/s/t/i/pc/sc/tc/ic rules in collation. These elements must not
be normalized (either to NFC or NFD), or their meaning may be
changed. Thus LDML documents must not be normalized as a whole.
To prevent problems with normalization, no element value can
start with a combining slash (U+0338 COMBINING LONG SOLIDUS
OVERLAY).
Lists, such as
singleCountries
are space-delimited. That
means that they are separated by one or more XML whitespace
characters,
singleCountries
preferenceOrdering
references
5.1 Common Elements
At any level in any element, two special elements are
allowed.
5.1.1
Element special
This element is designed to allow for arbitrary additional
annotation and data that is product-specific. It has one
required attribute
xmlns
, which
specifies the XML
namespace
of the
special data. For example, the following used the version 1.0
POSIX special element.
">
%posix;
]>
...
Yes
No
^[Yy].*
^[Nn].*
5.1.1.1
Sample Special Elements
The elements in this section are
not
part of
the Locale Data Markup Language 1.0 specification. Instead,
they are special elements used for application-specific data to
be stored in the Common Locale Repository. They may change or
be removed in future versions of this document, and are present
here more as examples of how to extend the format. (Some of
these items may move into a future version of the Locale Data
Markup Language specification.)
The above examples are old versions: consult the
documentation for the specific application to see which should
be used.
These DTDs use namespaces and the special element. To
include one or more, use the following pattern to import the
special DTDs that are used in the file:
1.0
" encoding="
UTF-8
" ?>
icu
SYSTEM "
">
openOffice
SYSTEM "
">
%icu;
%openOffice;
]>
Thus to include just the ICU DTD, one uses:
1.0
" encoding="
UTF-8
" ?>
">
%icu;
]>
Note:
A previous version of this document contained
a special element for
ISO TR 14652
compatibility data. That element has been
withdrawn, pending further investigation, since 14652 is a
Type 1 TR: "when the required support cannot be obtained for
the publication of an International Standard, despite
repeated effort". See the ballot comments on
14652 Comments
for details on the 14652 defects. For
example, most of these patterns make little provision for
substantial changes in format when elements are empty, so are
not particularly useful in practice. Compare, for example,
the mail-merge capabilities of production software such as
Microsoft Word or OpenOffice.
Note:
While the CLDR specification guarantees
backwards compatibility, the definition of specials is up to
other organizations. Any assurance of backwards compatibility
is up to those organizations.
A number of the elements above can have extra information
for
openoffice.org
, such as the following
example:
IGNORE_CASE
5.1.2 Element alias
The contents of any element in root can be replaced by an
alias, which points to the path where the data can be
found.
Aliases will only ever appear in root with the form
//ldml/.../alias[@source="locale"][@path="..."].
Consider the following example in root:
If the locale "de_DE" is being accessed for a month name for
format/abbreviated, then a resource bundle at "de_DE" will be
searched for a resource element at that path. If not found
there, then the resource bundle at "de" will be searched, and
so on. When the alias is found in root, then the search is
restarted, but searching for format/
wide
element instead of format/abbreviated.
If the
path
attribute is present, then its value is
an [
XPath
] that points to a different node
in the tree. For example:
The default value if the path is not present is the same
position in the tree. All of the attributes in the [
XPath
] must be
distinguishing
attributes. For
more details, see
Section
4.2 Inheritance and Validity
There is a special value for the source attribute, the
constant
source="locale"
. This special value is
equivalent to the locale being resolved. For example, consider
the following example, where locale data for 'de' is being
resolved:
Inheritance with
source="locale"
Root
de
Resolved
1
2
11
12
11
12
22
11
22
The first row shows the inheritance within the
element, whereby
shows the inheritance within the
,
root, but from an alias there. The alias in root is logically
replaced not by the elements in root itself, but by elements in
the 'target' locale.
For more details on data resolution, see
Section 4.2 Inheritance and
Validity
Aliases must be resolved recursively. An alias may point to
another path that results in another alias being found, and so
on. For example, looking up Thai buddhist abbreviated months
for the locale
xx-YY
may result in the
following chain of aliases being followed:
../../calendar[@type="buddhist"]/months/monthContext[@type="format"]/monthWidth[@type="abbreviated"]
xx-YY → xx → root // finds alias that changes path to:
../../calendar[@type="gregorian"]/months/monthContext[@type="format"]/monthWidth[@type="abbreviated"]
xx-YY → xx → root // finds alias that changes path to:
../../calendar[@type="gregorian"]/months/monthContext[@type="format"]/monthWidth[@type="wide"]
xx-YY → xx // finds value here
It is an error to have a circular chain of aliases. That is,
a collection of LDML XML documents must not have situations
where a sequence of alias lookups (including inheritance and
lateral inheritance) can be followed indefinitely without
terminating.
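Recursive alias resolution with detection of circular chains can be sketched as follows. This is a simplified model, not the CLDR implementation: paths are shortened to hypothetical keys such as `"buddhist/abbreviated"` instead of full XPaths, every alias is assumed to have source="locale", and inheritance is plain truncation.

```python
from dataclasses import dataclass
from typing import Dict, Iterator, Optional, Union


@dataclass(frozen=True)
class Alias:
    path: str  # target path; the source is assumed to be "locale"


# Hypothetical data mirroring the chain described above.
DATA: Dict[str, Dict[str, Union[str, Alias]]] = {
    "root": {
        "buddhist/abbreviated": Alias("gregorian/abbreviated"),
        "gregorian/abbreviated": Alias("gregorian/wide"),
    },
    "xx": {"gregorian/wide": "wide month names from xx"},
    "xx-YY": {},
}


def inheritance_chain(locale: str) -> Iterator[str]:
    """Yield xx-YY, xx, root."""
    while True:
        yield locale
        if locale == "root":
            return
        locale = locale.rsplit("-", 1)[0] if "-" in locale else "root"


def resolve(path: str, locale: str) -> Optional[str]:
    """Follow aliases, restarting each lookup from the requested locale."""
    seen = set()
    while True:
        if path in seen:
            raise ValueError(f"circular alias chain at {path!r}")
        seen.add(path)
        for current in inheritance_chain(locale):
            item = DATA.get(current, {}).get(path)
            if item is not None:
                break
        else:
            return None  # nothing found anywhere on the chain
        if isinstance(item, Alias):
            path = item.path  # restart the search with the new path
            continue
        return item
```

Tracking the set of visited paths is one simple way to turn a circular chain into an error rather than an endless loop; production code would more likely validate the data ahead of time.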
5.1.3 Element displayName
Many elements can have a display name. This is a translated
name that can be presented to users when discussing the
particular service. For example, a number format, used to
format numbers using the conventions of that locale, can have
a translated name for presentation in GUIs.
Prozentformat
...
Where present, the display names must be unique; that is,
two distinct codes would not get the same display name.
(There is one exception to this: in time zones, where parsing
results would give the same GMT offset, the standard and
daylight display names can be the same across different time
zone IDs.) Any translations should follow customary practice
for the locale in question. For more information, see [
Data Formats
].
5.1.4 Escaping Characters
Unfortunately, XML does not have the capability to contain
all Unicode code points. Due to this, in certain instances
extra syntax is required to represent those code points that
cannot be otherwise represented in element content. The
escaping syntax is only defined on a few types of elements,
such as in collation or exemplar sets, and uses the appropriate
syntax for that type.
The element
purpose, has been deprecated.
5.2 Common Attributes
5.2.1 Attribute type
The attribute
type
is also used to indicate an
alternate resource that can be selected with a matching
type=option in the locale id modifiers, or be referenced by a
default element. For example:
...
...
...
5.2.2 Attribute draft
If this attribute is present, it indicates the status of all
the data in this element and any subelements (unless they have
a contrary
draft
value), as per the following:
approved:
fully approved by the technical committee
(equals the CLDR 1.3 value of
false
, or an absent
draft
attribute). This does not mean that the data is
guaranteed to be error-free—this is the best judgment of the
committee.
contributed
: partially approved by the technical
committee.
provisional
: partially confirmed. Implementations may
choose to accept the provisional data, especially if there is
no translated alternative.
unconfirmed
: no confirmation available.
For more information on precisely how these values are
computed for any given release, see
Data Submission and Vetting Process
on the CLDR
website.
The draft attribute should only occur on "leaf" elements,
and is deprecated elsewhere. For a more formal description of
how elements are inherited, and what their draft status is, see
Section 4.2 Inheritance
and Validity
5.2.3 Attribute alt
This attribute labels an alternative value for an element.
The value is a
descriptor
that indicates what kind of
alternative it is, and takes one of the following forms:
variantname
meaning that the value is a variant of
the normal value, and may be used in its place in certain
circumstances. If a variant value is absent for a particular
locale, the normal value is used. The variant mechanism
should only be used when such a fallback is acceptable.
proposed
, optionally
followed by a number, indicating that the value is a proposed
replacement for an existing value.
variantname
-proposed
, optionally followed by a
number, indicating that the value is a proposed replacement
variant value.
"proposed" should only be
present if the draft status is not "approved". It indicates
that the data is proposed replacement data that has been added
provisionally until the differences between it and the other
data can be vetted. For example, suppose that the translation
for September for some language is "Settembru", and a bug
report is filed that that should be "Settembro". The new data
can be entered in, but marked as
alt="proposed"
until it
is vetted.
...
Now assume another bug report comes in, saying that the
correct form is actually "Settembre". Another alternative can
be added:
...
...
The values for variantname at this time include "variant", "list", "email", "www", "short", and "secondary".
For a more complete description of how draft applies to
data, see
Section 4.2
Inheritance and Validity
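The three forms of the alt value described in Section 5.2.3 can be separated with a small parser. This is a sketch that follows only the forms listed in this section (variantname, proposed with an optional number, and variantname-proposed with an optional number); the `Alt` type and `parse_alt` function are hypothetical names.

```python
import re
from typing import NamedTuple, Optional


class Alt(NamedTuple):
    variant: Optional[str]   # e.g. "short", or None for a plain proposal
    proposed: bool           # True if the value is a proposed replacement
    number: Optional[int]    # the optional trailing number, if present


def parse_alt(value: str) -> Alt:
    """Split an alt attribute value into variant name and proposal parts."""
    variant, _, tail = value.rpartition("-")
    match = re.fullmatch(r"proposed(\d*)", tail)
    if match:
        digits = match.group(1)
        return Alt(variant or None, True, int(digits) if digits else None)
    # No proposal suffix: the whole value is a variant name.
    return Alt(value, False, None)
```

For example, `parse_alt("short-proposed3")` separates the variant name `short` from proposal number 3, while `parse_alt("variant")` is reported as a plain variant.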
Attribute
references
The value of this attribute is a token representing a
reference for the information in the element, including
standards that it may conform to.
(In older versions of CLDR, the value of the attribute was freeform text.
That format is deprecated.)
Example:
The reference element may be inherited. Thus, for example,
R222 may be used in sv_SE.xml even though it is not defined
there, if it is defined in sv.xml.
<... allow="verbatim" ...> (deprecated)
This attribute was originally intended for use in marking
display names whose capitalization differed from what was
indicated by the now-deprecated
(perhaps, for example, because the names included a proper
noun). It was never supported in the dtd and is not needed for
use with the new
5.3 Common Structures
5.3.1 Date and Date Ranges
When attributes specify date ranges, it is usually done with
attributes
from
and
to
. The
from
attribute
specifies the starting point, and the
to
attribute
specifies the end point. The deprecated
time
attribute
was formerly used to specify time with the deprecated
weekEndStart and weekEndEnd elements, which were themselves
inherently
from
or
to
The data format is a restricted ISO 8601 format, restricted
to the fields
year, month, day, hour, minute,
and
second
in that order, with "-" used as a separator
between date fields, a space used as the separator between the
date and the time fields, and ":" used as a separator between
the time fields. If the minute or minute and second are absent,
they are interpreted as zero. If the hour is also missing, then
it is interpreted based on whether the attribute is
from
or
to
from
defaults to "00:00:00"
(midnight at the start of the day).
to
defaults to "24:00:00" (midnight
at the end of the day).
That is, Friday at 24:00:00 is the same time as
Saturday at 00:00:00. Thus when the hour is missing, the
from and to
are interpreted inclusively: the range
includes all of the day mentioned.
For example, the following are equivalent:
03
00
:00:00" .../>
If the
from
attribute is missing, it is assumed to be
as far backwards in time as there is data for; if the
to
attribute is missing, then it is from this point onwards, with no
known end point.
The dates and times are specified in local time, unless
otherwise noted. (In particular, the metazone values are in UTC
(also known as GMT).)
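The defaulting rules for from and to values can be sketched as a small parser. It assumes that year, month, and day are always present and that the result is a naive local time; `parse_boundary` is a hypothetical helper name.

```python
from datetime import datetime, timedelta


def parse_boundary(text: str, kind: str) -> datetime:
    """Parse a restricted ISO 8601 value; kind is "from" or "to"."""
    if kind not in ("from", "to"):
        raise ValueError("kind must be 'from' or 'to'")
    date_part, _, time_part = text.strip().partition(" ")
    year, month, day = (int(field) for field in date_part.split("-"))
    start_of_day = datetime(year, month, day)
    if not time_part:
        # Missing hour: "from" means 00:00:00, "to" means 24:00:00.
        return start_of_day if kind == "from" else start_of_day + timedelta(days=1)
    fields = [int(field) for field in time_part.split(":")]
    fields += [0] * (3 - len(fields))  # absent minute and second are zero
    hour, minute, second = fields
    # A timedelta is used so that 24:00:00 rolls over to the next day.
    return start_of_day + timedelta(hours=hour, minutes=minute, seconds=second)
```

Representing 24:00:00 as midnight of the following day makes the inclusive interpretation explicit: a to value without a time covers the whole of the named day.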
5.3.2 Text Directionality
The content of certain elements, such as date or number
formats, may consist of several sub-elements with an inherent
order (for example, the year, month, and day for dates). In
some cases, the order of these sub-elements may be changed
depending on the bidirectional context in which the element is
embedded.
For example, short date formats in languages such as Arabic
may contain neutral or weak characters at the beginning or end
of the element content. In such a case, the overall order of
the sub-elements may change depending on the surrounding
text.
Element content whose display may be affected in this way
should include an explicit direction mark, such as U+200E
LEFT-TO-RIGHT MARK or U+200F RIGHT-TO-LEFT MARK, at the
beginning or end of the element content, or both.
5.3.3 Unicode Sets
Some attribute values or element contents use
UnicodeSet
notation. A UnicodeSet represents a finite
set of Unicode code points and strings, and is defined by lists
of code points and strings, Unicode property sets, and set
operators, all bounded by square brackets. In this context, a
code point means a string consisting of exactly one code
point.
A UnicodeSet implements the semantics in
UTS #18: Unicode
Regular Expressions
UTS18
] Levels
1 & 2 that are relevant to determining sets of characters.
Note however that it may deviate from the syntax provided in
UTS18
], which
is illustrative rather than a requirement. There is one
exception to the supported semantics, Section
RL2.6
Wildcards in Property Values
. That feature can be
supported in clients such as ICU by implementing a “hook” as is
done in the
online UnicodeSet utilities
A UnicodeSet may be cited in specifications outside of the
domain of LDML. In such a case, the specification may specify a
subset of the syntax provided here.
The following provides EBNF syntax for a UnicodeSet:
Symbol        Expression                                          Examples
root          = prop                                              \p{x=y}, [abc]
              | '[-]'
              | '[' [\-\^]? s seq+ ']'
seq           = root (s [\&\-] s root)* s                         [abc]-[cde], a
              | range s
range         = char ('-' char)?                                  a, a-c, {abc}
              | '{' (s char)+ s '}'
prop          = '\\' [pP] '{' propName ([≠=] s value1+)? '}'      \p{x=y}, [:x=y:]
              | '[:' '^'? propName ([≠=] s value2+)? ':]'
propName      = s [A-Za-z0-9] [A-Za-z0-9_\x20]* s                 General_Category, General Category
value1        = [^\}]                                             Lm, \n, \}
              | '\\' quoted
value2        = [^:]                                              Lm, \n, \:
              | '\\' quoted
char          = [^\& \- \[ \[ \] \\ \} \{ [:Pat_WS:]]             a, b, c, \n
              | '\\' quoted
quoted        = 'u' (hex{4} | bracketedHex)
              | 'x' (hex{2} | bracketedHex)
              | 'U00' ('0' hex{5} | '10' hex{4})
              | 'N{' propName '}'
              | [\u0000-\U0010FFFF]
bracketedHex  = '{' s hexCodePoint (s hexCodePoint)* s '}'        {61 2019 62}
hexCodePoint  = hex{1,5} | '10' hex{4}
hex           = [0-9A-Fa-f]
s             = [:Pattern_White_Space:]*                          optional whitespace
Some constraints on UnicodeSet syntax are not captured by
this EBNF. Notably, property names and values are restricted to
those supported by the implementation.
The syntax characters are listed in the table below:
Char  Hex     Name                  Usage
$     U+0024  DOLLAR SIGN           Equivalent of \uFFFF (This is for implementations
                                    that return \uFFFF when accessing before the first or
                                    after the last character)
&     U+0026  AMPERSAND             Intersecting UnicodeSets
-     U+002D  HYPHEN-MINUS          Ranges of characters; also set difference.
:     U+003A  COLON                 POSIX-style property syntax
[     U+005B  LEFT SQUARE BRACKET   Grouping; POSIX property syntax
]     U+005D  RIGHT SQUARE BRACKET  Grouping; POSIX property syntax
\     U+005C  REVERSE SOLIDUS       Escaping
^     U+005E  CIRCUMFLEX ACCENT     POSIX negation syntax
{     U+007B  LEFT CURLY BRACKET    Strings in set; Perl property syntax
}     U+007D  RIGHT CURLY BRACKET   Strings in set; Perl property syntax
      U+0020 U+0009..U+000D U+0085  ASCII whitespace,          Ignored except when escaped
      U+200E U+200F                 LRM, RLM,
      U+2028 U+2029                 LINE/PARAGRAPH SEPARATOR
5.3.3.1 Lists of Code Points
Lists are a sequence of strings that may include ranges,
which are indicated by a '-' between two code points, as in
"a-z". The sequence
start-end
specifies the range of
all code points from the start to end, inclusive, in Unicode
order. For example,
[a c d-f m]
is equivalent to
[a c
d e f m]
. Whitespace can be freely used for clarity, as
[a c d-f m]
means the same as
[acd-fm]
A string with multiple code points is represented in a list
by being surrounded by curly braces, such as in
[a-z
{ch}]
. It can be used with the range notation, as
described in
Section
5.3.4 String
Range
. There is an additional restriction on string
ranges in a UnicodeSet: the number of codepoints in the first
string of the range must be identical to the number in the
second. Thus [{ab}-{c}] and [{ab}-c] are invalid.
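A simple list with code point ranges and brace-delimited strings can be expanded as in the sketch below. It deliberately covers only a small subset of the syntax: no escapes, no properties, no nested sets, and no string ranges (which are rejected). The function name `parse_simple_list` is hypothetical.

```python
from typing import List, Set, Tuple

# The ignorable whitespace characters listed in the syntax character table.
PATTERN_WHITE_SPACE = set("\u0020\u0009\u000A\u000B\u000C\u000D\u0085\u200E\u200F\u2028\u2029")


def parse_simple_list(pattern: str) -> Set[str]:
    """Expand a pattern such as "[a c d-f m {ch}]" into a set of strings."""
    if len(pattern) < 2 or pattern[0] != "[" or pattern[-1] != "]":
        raise ValueError("expected a bracketed list")
    body = pattern[1:-1]
    tokens: List[Tuple[str, str]] = []
    i = 0
    while i < len(body):
        ch = body[i]
        if ch in PATTERN_WHITE_SPACE:
            i += 1
        elif ch == "{":
            end = body.index("}", i)
            text = "".join(c for c in body[i + 1:end] if c not in PATTERN_WHITE_SPACE)
            tokens.append(("string", text))
            i = end + 1
        else:
            tokens.append(("dash" if ch == "-" else "char", ch))
            i += 1

    result: Set[str] = set()
    k = 0
    while k < len(tokens):
        kind, text = tokens[k]
        if kind == "dash":
            raise ValueError("range without a start code point")
        if k + 1 < len(tokens) and tokens[k + 1][0] == "dash":
            if kind != "char" or k + 2 >= len(tokens) or tokens[k + 2][0] != "char":
                raise ValueError("only code point ranges are supported in this sketch")
            start, end_cp = ord(text), ord(tokens[k + 2][1])
            if start > end_cp:
                raise ValueError("range start is after range end")
            result.update(chr(cp) for cp in range(start, end_cp + 1))
            k += 3
        else:
            result.add(text)
            k += 1
    return result
```

Since whitespace is skipped during tokenization, `[a c d-f m]` and `[acd-fm]` produce the same set.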
In UnicodeSets, there are two ways to quote syntax code
points:
Outside of single quotes, certain
backslashed code point sequences can be used to quote code
points:
\x{h...h}, \u{h...h}   list of 1-6 hex digits ([0-9A-Fa-f]), separated by spaces
\xhh                   1-2 hex digits
\uhhhh                 Exactly 4 hex digits
\Uhhhhhhhh             Exactly 8 hex digits
\a                     U+0007 (BEL / ALERT)
\b                     U+0008 (BACKSPACE)
\t                     U+0009 (TAB / CHARACTER TABULATION)
\n                     U+000A (LINE FEED)
\v                     U+000B (LINE TABULATION)
\f                     U+000C (FORM FEED)
\r                     U+000D (CARRIAGE RETURN)
\\                     U+005C (BACKSLASH / REVERSE SOLIDUS)
\N{name}               The Unicode code point named "name".
\p{…}, \P{…}           Unicode property (see below)
Anything else following a backslash is mapped to itself,
except the property syntax described below, or in an
environment where it is defined to have some special
meaning.
Any code point formed as the result of a backslash escape
loses any special meaning and is treated as a literal. In
particular, note that \x, \u and \U escapes create literal code
points. (In contrast, Java treats Unicode escapes as just a way
to represent arbitrary code points in an ASCII source file, and
any resulting code points are
not
tagged as
literals.)
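The backslash escapes listed above can be decoded with a regular expression, as in this sketch. It is an illustration rather than a conformant parser: it returns a plain string and therefore does not record that the resulting code points are literals, and it leaves the property syntax \p{…} and \P{…} untouched. The name `unescape` is hypothetical; `unicodedata.lookup` is the Python standard library call used for \N{name}.

```python
import re
import unicodedata

# Single-letter escapes from the table above.
SIMPLE = {
    "a": "\u0007", "b": "\u0008", "t": "\u0009", "n": "\u000A",
    "v": "\u000B", "f": "\u000C", "r": "\u000D", "\\": "\\",
}

ESCAPE = re.compile(
    r"\\(?:"
    r"[xu]\{(?P<list>[0-9A-Fa-f ]+)\}"   # \x{h...h} or \u{h...h}
    r"|U(?P<U>[0-9A-Fa-f]{8})"           # \Uhhhhhhhh
    r"|u(?P<u>[0-9A-Fa-f]{4})"           # \uhhhh
    r"|x(?P<x>[0-9A-Fa-f]{1,2})"         # \xhh
    r"|N\{(?P<name>[^}]+)\}"             # \N{name}
    r"|(?P<other>[^pP])"                 # anything else except \p and \P
    r")",
    re.DOTALL,
)


def unescape(text: str) -> str:
    """Replace backslash escapes by the code points they denote."""
    def replace(match: "re.Match[str]") -> str:
        if match.group("list") is not None:
            return "".join(chr(int(h, 16)) for h in match.group("list").split())
        for key in ("U", "u", "x"):
            if match.group(key) is not None:
                return chr(int(match.group(key), 16))
        if match.group("name") is not None:
            return unicodedata.lookup(match.group("name"))
        other = match.group("other")
        return SIMPLE.get(other, other)  # unknown escapes map to themselves

    return ESCAPE.sub(replace, text)
```

A real UnicodeSet parser would apply this decoding during tokenization, so that an escaped syntax character such as `\-` is never mistaken for an operator.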
Unicode property sets are defined as described
in
UTS #18: Unicode Regular Expressions
UTS18
], Level
1 and RL2.5, including the syntax where given. For an example
of a concrete implementation of this, see [
ICUUnicodeSet
].
5.3.3.2 Unicode Properties
Briefly, Unicode property sets are specified by any Unicode
property and a value of that property, such as
[:General_Category=Letter:]
for Unicode letters, or
\p{uppercase}
for the set of upper case letters in
Unicode. The property names are defined by the
PropertyAliases.txt file and the property values by the
PropertyValueAliases.txt file. For more information, see
UAX44
].
The syntax for specifying the property sets is an extension of
either POSIX or Perl syntax, by the addition of
"=value". That is, one can either use
the POSIX-style syntax:
[:General_Category=Letter:]
or use the Perl-style syntax
\p{General_Category=Letter}
Property names and values are case-insensitive, and
whitespace, "-", and "_" are ignored. The property name can be
omitted for the
General_Category
and
Script
properties, but is required for other
properties. If the property value is omitted, it is assumed to
represent a boolean property with the value "true". Thus
[:Letter:]
is equivalent to
[:General_Category=Letter:]
, and
[:Wh-ite-s
pa_ce:]
is equivalent to
[:Whitespace=true:]
The table below shows the two kinds of syntax: POSIX and
Perl style. Also, the table shows the "Negative" version, which
is a property that excludes all code points of a given kind.
For example,
[:^Letter:]
matches all code points that
are not
[:Letter:]
                     Positive          Negative
POSIX-style Syntax   [:type=value:]    [:^type=value:]
Perl-style Syntax    \p{type=value}    \P{type=value}
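The loose matching of property names and the two surface syntaxes can be illustrated with a small parser. This is a sketch only: `loose_key` and `parse_property` are hypothetical names, and deciding whether a bare name such as Letter is a General_Category value, a Script value, or a boolean property requires property data that is not modelled here.

```python
import re
from typing import Optional, Tuple

PROPERTY = re.compile(
    r"\[:(?P<caret>\^?)(?P<posix>[^:]+):\]"     # [:name=value:] or [:^name=value:]
    r"|\\(?P<p>[pP])\{(?P<perl>[^}]+)\}"        # \p{name=value} or \P{name=value}
)


def loose_key(name: str) -> str:
    """Ignore case, whitespace, '-' and '_' when comparing names and values."""
    return "".join(ch for ch in name if not ch.isspace() and ch not in "-_").lower()


def parse_property(expression: str) -> Tuple[str, Optional[str], bool]:
    """Return (name, value, negated); value is None when no '=' is present."""
    match = PROPERTY.fullmatch(expression)
    if match is None:
        raise ValueError("not a property expression")
    if match.group("posix") is not None:
        body, negated = match.group("posix"), match.group("caret") == "^"
    else:
        body, negated = match.group("perl"), match.group("p") == "P"
    name, separator, value = body.partition("=")
    return loose_key(name), loose_key(value) if separator else None, negated
```

Both `[:General_Category=Letter:]` and `\p{general category = LETTER}` reduce to the same normalized pair, and `[:Wh-ite-s pa_ce:]` reduces to the same key as `Whitespace`.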
5.3.3.3 Boolean Operations
The low-level lists or properties then can be freely
combined with the normal set operations (union, inverse,
difference, and intersection):
To union two sets, simply concatenate them. For example,
[[:letter:] [:number:]]
To intersect two sets, use the '&' operator. For
example,
[[:letter:] & [a-z]]
To take the set-difference of two sets, use the '-'
operator. For example,
[[:letter:] - [a-z]]
To invert a set, place a '^' immediately after the
opening '['. For example,
[^a-z]
. In any other
location, the '^' does not have a special meaning. The
inversion [^X] is equivalent to [[\x{0}-\x{10FFFF}]-[X]].
Thus multi-code point strings are discarded.
Symmetric difference (~) is not supported.
The binary operators '&', '-', and the implicit union
have equal precedence and bind left-to-right. Thus
[[:letter:]-[a-z]-[\u0100-\u01FF]]
is equal to
[[[:letter:]-[a-z]]-[\u0100-\u01FF]].
Another example is the set
[[ace][bdf] - [abc][def]],
which is not the empty set, but instead equal to
[[[[ace] [bdf]] - [abc]] [def]],
which equals
[[[abcdef] - [abc]] [def]],
which equals
[[def] [def]],
which equals
[def].
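The left-to-right evaluation above can be checked with ordinary Python sets. This is only a sketch of the precedence rule, not a UnicodeSet implementation: the helper combine_left_to_right is hypothetical, and the "|" token is an assumed stand-in for the implicit union, which has no operator character in UnicodeSet syntax.

```python
def combine_left_to_right(first, *steps):
    """Apply (operator, operand) steps strictly left to right, with equal precedence."""
    result = set(first)
    for operator, operand in steps:
        if operator == "|":      # stand-in for the implicit union
            result |= operand
        elif operator == "&":    # intersection
            result &= operand
        elif operator == "-":    # set difference
            result -= operand
        else:
            raise ValueError(f"unsupported operator: {operator!r}")
    return result


# [[ace][bdf] - [abc][def]] read as [[[[ace] [bdf]] - [abc]] [def]]
example = combine_left_to_right(
    set("ace"), ("|", set("bdf")), ("-", set("abc")), ("|", set("def"))
)
```

Here example is the set of the letters d, e, and f, matching the result derived in the text rather than the empty set.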
One caution:
the '&' and '-' operators
operate between sets. That is, they must be immediately
preceded and immediately followed by a set. For example, the
pattern
[[:Lu:]-A]
is illegal, since it is interpreted
as the set
[:Lu:]
followed by the incomplete range
-A
. To specify the set of upper case letters except for
'A', enclose the 'A' in brackets:
[[:Lu:]-[A]]
5.3.3.4 UnicodeSet Examples
The following table summarizes the syntax that can be
used.
Example
Description
[a]
The set containing 'a' alone
[a-z]
The set containing 'a' through 'z' and all letters in
between, in Unicode order.
Thus it is the same as [\u0061-\u007A].
[^a-z]
The set containing all code points but 'a' through
'z'.
Thus it is the same as [\u0000-\u0060
\u007B-\x{10FFFF}].
[[pat1][pat2]]
The union of sets specified by pat1 and pat2
[[pat1]&[pat2]]
The intersection of sets specified by pat1 and
pat2
[[pat1]-[pat2]]
The asymmetric difference of sets specified by pat1 and
pat2
[a {ab} {ac}]
The code point 'a' and the multi-code point strings
"ab" and "ac"
[x\u{61 2019 62}y]
Equivalent to [x\u0061\u2019\u0062y] (= [xa’by])
[{ax}-{bz}]
The set containing [{ax} {ay} {az} {bx} {by} {bz}],
using the range syntax to get all the strings from {ax} to
{bz} as described in
Section
5.3.4 String Range
[:Lu:]
The set of code points with a given property value, as
defined by PropertyValueAliases.txt. In this case, these
are the Unicode upper case letters. The long form for this
is
[:General_Category=Uppercase_Letter:]
[:L:]
The set of code points belonging to all Unicode
categories starting with 'L', that is,
[[:Lu:][:Ll:][:Lt:][:Lm:][:Lo:]]
. The long form for
this is
[:General_Category=Letter:]
5.3.4 String Range
A String Range is a compact format for specifying a list of
strings.
Syntax:
X sep Y
The separator and the format of strings X, Y may vary
depending on the domain. For example,
for the validity files the separator is ~,
for UnicodeSet the separator is -, and any
multi-codepoint string is enclosed in {…}.
Validity:
A string range X
sep
Y is valid iff len(X) ≥
len(Y) > 0, where len(X) is the length of X in code
points.
There may be additional, domain-specific requirements
for validity of the expansion of the string range.
Interpretation:
Break X into P and S, where len(S) = len(Y)
Note that P will be an empty string if the lengths of
X and Y are equal.
Form the combinations of all
P+(s₀..y₀)+(s₁..y₁)+...+(sₙ..yₙ),
where s₀ is the first code point in S, y₀ is the first code point in Y, and so on.
Examples:
ab-ad       ab ac ad
ab-d        ab ac ad
ab-cd       ab ac ad bb bc bd cb cc cd
👦🏻-👦🏿       👦🏻 👦🏼 👦🏽 👦🏾 👦🏿
👦🏻-🏿        👦🏻 👦🏼 👦🏽 👦🏾 👦🏿
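The interpretation steps can be sketched in Python, whose strings are already indexed by code point. The function name expand_string_range is hypothetical, and the check that each start code point does not exceed its end code point is an assumption of this sketch (the text leaves such requirements to the domain).

```python
from itertools import product


def expand_string_range(x: str, y: str) -> list[str]:
    """Expand the string range X sep Y into the list of strings it denotes."""
    if not (len(x) >= len(y) > 0):
        raise ValueError("a string range requires len(X) >= len(Y) > 0")
    split = len(x) - len(y)
    prefix, suffix = x[:split], x[split:]  # P and S, with len(S) == len(Y)
    per_position = []
    for start, end in zip(suffix, y):
        if ord(start) > ord(end):
            # Assumption of this sketch: descending sub-ranges are rejected.
            raise ValueError(f"start {start!r} is greater than end {end!r}")
        per_position.append([chr(cp) for cp in range(ord(start), ord(end) + 1)])
    # product() varies the last position fastest, which gives the order in the examples.
    return [prefix + "".join(combo) for combo in product(*per_position)]
```

For instance, expand_string_range("ab", "d") splits X into P = "a" and S = "b", then expands b..d after the prefix.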
5.4 Identity Elements
<!ELEMENT identity (alias | (version, generation?, language, script?, territory?, variant?, special*) ) >
The identity element contains information identifying the
target locale for this data, and general information about the
version of this data.
The version element provides, in an attribute, the version
of this file. The contents of the element can contain
textual notes about the changes between this version and the
last. For example:
Various notes and changes in version 1.1
This is not to be confused with the version attribute on
the ldml element, which tracks the DTD version.
The generation element is now deprecated. It was used to
contain the last modified date for the data. This could be in
two formats: ISO 8601 format, or CVS format (illustrated by the
example above).
The language code is the primary part of the specification
of the locale id, with values as described above.
The script code may be used in the identification of written
languages, with values described above.
The territory code is a common part of the specification of
the locale id, with values as described above.
The variant code is the tertiary part of the specification
of the locale id, with values as described above.
When combined according to the rules described in
Section
3, Unicode Language and Locale Identifiers
, the
language element, along with any of the optional script,
territory, and variant elements, must identify a known, stable
locale identifier. Otherwise, it is an error.
5.5 Valid Attribute Values
The DTD Annotations in Section 5.7 are used to determine whether elements, attributes, or attribute values are valid (or deprecated).
5.6 Canonical Form
The following are restrictions on the format of LDML files
to allow for easier parsing and comparison of files.
Peer elements have consistent order. That is, if the DTD or
this specification requires the following order in an element
foo
It can never require the reverse order in a different
element
bar
Note that there was one case that had to be corrected in
order to make this true. For that reason, pattern occurs twice
under currency:
decimal?, group?, special*)) >
XML files can have a wide variation in textual form, while
representing precisely the same data. Putting the LDML files in
the repository into a canonical form allows the simple diff
tools that are widely used (including in CVS) to detect
differences when vetting changes, without those tools being
confused. This is not a requirement on other uses of LDML; it is
simply a way to manage repository data more easily.
5.6.1 Content
All start elements are on their own line, indented by
depth
tabs.
All end elements (except for leaf nodes) are on their own
line, indented by
depth
tabs.
Any leaf node with empty content is in the form
There are no blank lines except within comments or
content.
Spaces are used within a start element. There are no
extra spaces within elements.
, not
, not
All attribute values use double quote ("), not single
(').
There are no CDATA sections, and no escapes except those
absolutely required.
no &apos; since it is not necessary
no '&#x61;', it would be just 'a'
All attributes with defaulted values are suppressed.
The draft and alt="proposed.*" attributes are only on
leaf elements.
The tzid are canonicalized in the following way:
All tzids as of CLDR 1.1 (2004.06.08) in
zone.tab are canonical.
After that point, the first time a tzid is
introduced, that is the canonical form.
That is, new IDs are added, but existing ones keep the
original form. The
TZ
timezone database keeps a set
of equivalences in the "backward" file. These are used to
map other tzids to the canonical form. For example, when
America/Argentina/Catamarca
was introduced as
the new name for the previous
America/Catamarca
, a link was added in the
backward file.
Link America/Argentina/Catamarca America/Catamarca
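As a rough sketch of how such links could be used, the following Python reads Link lines and maps each non-canonical tzid to its canonical counterpart. The function names and the canonical_ids argument are hypothetical; the sketch assumes the caller already knows which tzids are canonical under the two rules above, and it only resolves direct links.

```python
def build_tzid_aliases(link_lines, canonical_ids):
    """Map non-canonical tzids to canonical ones using 'Link A B' lines."""
    aliases = {}
    for line in link_lines:
        fields = line.split("#", 1)[0].split()  # drop trailing comments
        if len(fields) != 3 or fields[0] != "Link":
            continue
        first, second = fields[1], fields[2]
        # Whichever side is already canonical wins, regardless of link direction.
        if first in canonical_ids and second not in canonical_ids:
            aliases[second] = first
        elif second in canonical_ids and first not in canonical_ids:
            aliases[first] = second
    return aliases


def canonical_tzid(tzid, aliases):
    """Return the canonical form of tzid, or tzid itself if no alias is known."""
    return aliases.get(tzid, tzid)
```

In the Catamarca case, America/Catamarca was introduced first, so it stays canonical and the newer America/Argentina/Catamarca maps to it.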
Example:
5.6.2 Ordering
An element is ordered first by the element name, and then if
the element names are identical, by the sorted set of
attribute-value pairs. For the latter, compare the first pair
in each (in sorted order by attribute pair). If not identical,
go to the second pair, and so on.
Elements and attributes are ordered according to their order
in the respective DTDs. Attribute value comparison is a bit
more complicated, and may depend on the attribute and type.
This is currently done with specific ordering tables.
Any future additions to the DTD must be structured so as to
allow compatibility with this ordering. See also
Section 5.5 Valid Attribute
Values.
5.6.3 Comments
Comments are of the form <!-- stuff -->.
They are logically attached to a node. There are 4 kinds:
Inline always appear after a leaf node, on the same
line at the end. These are a single line.
Preblock comments always precede the attachment node,
and are indented on the same level.
Postblock comments always follow the attachment node,
and are indented on the same level.
Final comment, after </ldml>
Multiline comments (except the final comment) have each
line after the first indented to one deeper level.
Examples:
...
...
5.7 DTD Annotations
The information in a standard DTD is insufficient for use in
CLDR. To make up for that, DTD annotations are added. These are
of the form
<!--@...-->
and are included below the !ELEMENT or !ATTLIST line that they
apply to. The current annotations are:
Type
Description
<!--@VALUE-->
The attribute is not distinguishing, and is treated
like an element value
<!--@METADATA-->
The attribute is a “comment” on the data, like the
draft status. It is not typically used in
implementations.
<!--@ORDERED-->
The element's children are ordered, and do not
inherit.
<!--@DEPRECATED-->
The element or attribute is deprecated, and should not
be used.
<!--@DEPRECATED: attribute-value1, attribute-value2-->
The attribute values are deprecated, and should not be
used. Spaces between tokens are not significant.
<!--@MATCH:{attribute value constraint}-->
Requires the attribute value to match the constraint.
There is additional information in the
attributeValueValidity.xml file that is used internally for
testing. For example, the following line indicates that the
'currency' element in the ldml dtd must have values from the
bcp47 'cu' type.
attributes='type'>$_bcp47_cu
The element values may be literals, regular expressions, or
variables (some of which are set programmatically according to
other CLDR data, such as the above). However, the information at
this point does not cover all attribute values, is used only
for testing, and should not be used in implementations since
the structure may change without notice.
5.7.1 Attribute Value Constraints
The following are constraints on the attribute values. Note: in future versions, the format may change, and/or the constraints may be tightened.
Constraint                                   Comments
any                                          any string value
any/TODO                                     placeholder for future constraints
bcp47/anykey                                 any bcp47 key or tkey
bcp47/anyvalue                               any bcp47 value (type) or tvalue
literal/{literal values}                     comma separated
regex/{regex expression}                     valid regex expression
bcp47/{key or tkey}                          matches possible values for that key or tkey
metazone                                     valid metazone
range/{start_number}~{end_number}            number between (inclusive) start and end
time/{time or date or date-time pattern}     e.g., HH:mm
unicodeset/{unicodeset pattern}              valid unicodeset
validity/{field}                             currency, language, locale, region, script, subdivision, unit, variant
version                                      1 to 4 digit field version, such as 35.3.9
set/{match}                                  set of elements that match {match}
or/{match1}XX{match2}…                       matches at least one of {match1}, etc
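A few of the simpler constraints can be sketched as a checker in Python. This is an illustration under stated assumptions, not the CLDR tooling: the function matches_constraint is hypothetical, only the any, literal, regex, range, and version constraints are handled, and "version" is read here as one to four dot-separated numeric fields. Constraints that depend on CLDR data (bcp47, metazone, validity, and so on) are deliberately left out.

```python
import re


def matches_constraint(constraint: str, value: str) -> bool:
    """Check an attribute value against a small subset of the constraints."""
    kind, _, argument = constraint.partition("/")
    if kind == "any":  # also covers the any/TODO placeholder
        return True
    if kind == "literal":  # comma-separated list of allowed literals
        return value in [item.strip() for item in argument.split(",")]
    if kind == "regex":
        return re.fullmatch(argument, value) is not None
    if kind == "range":  # start~end, inclusive at both ends
        start, _, end = argument.partition("~")
        try:
            number = float(value)
        except ValueError:
            return False
        return float(start) <= number <= float(end)
    if kind == "version":  # assumed reading: 1 to 4 dot-separated numeric fields
        return re.fullmatch(r"[0-9]+(\.[0-9]+){0,3}", value) is not None
    raise NotImplementedError(f"constraint kind not covered by this sketch: {kind}")
```

Because the constraint format may change between versions, a real checker would need to be kept in step with the DTD annotations it reads.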
6 Property Data
Some data in CLDR does not use an XML format, but rather a
semicolon-delimited format derived from that of the Unicode
Character Database. That is because the data is more likely to
be parsed by implementations that already parse UCD data. Those
files are present in the common/properties directory.
Each file has a header that explains the format and usage of
the data.
6.1 Script Metadata
scriptMetadata.txt
This file provides general information about scripts that
may be useful to implementations processing text. The
information is the best currently available, and may change
between versions of CLDR. The format is similar to that of the
Unicode Character Database property files, and is documented in
the header of the data file.
6.2 Extended Pictographic
ExtendedPictographic.txt
This file was used to define the ExtendedPictographic data
used for “future-proofing” emoji behavior, especially in
segmentation. As of Emoji version 11.0, the set of
Extended_Pictographic is incorporated into the emoji data files
found at unicode.org/Public/emoji/.
6.3 Labels.txt
labels.txt
This file provides general information about associations of
labels to characters that may be useful to implementations of
character-picking applications. The information is the best
currently available, and may change between versions of CLDR.
The format is similar to that of the Unicode Character Database
property files, and is documented in the header of the data file.
Initially, the contents are focused on emoji, but may be
expanded in the future to other types of characters. Note that
a character may have multiple labels.
6.4 Segmentation Tests
CLDR provides a tailoring to the
Grapheme Cluster Break (gcb)
algorithm to avoid splitting Indic aksaras. The corresponding test files for that are located in common/properties/segments/, along with a readme.txt that provides more details. There are also specific test files for the supported Indic scripts in the unittest directory.
7 Issues in Formatting and Parsing
7.1 Lenient Parsing
7.1.1 Motivation
User input is frequently messy. Attempting to parse it by
matching it exactly against a pattern is likely to be
unsuccessful, even when the meaning of the input is clear to a
human being. For example, for a date pattern of "MM/dd/yy", the
input "June 1, 2006" will fail.
The goal of lenient parsing is to accept user input whenever
it is possible to decipher what the user intended. Doing so
requires using patterns as data to guide the parsing process,
rather than an exact template that must be matched. This
informative section suggests some heuristics that may be useful
for lenient parsing of dates, times, and numbers.
7.1.2 Loose Matching
Loose matching ignores attributes of the strings being
compared that are not important to matching. It involves the
following steps:
Remove "." from currency symbols and other fields used
for matching, and also from the input string unless:
"." is in the decimal set, and
its position in the input string is immediately
before a decimal digit
Ignore all format characters: in particular, ignore any
RLM, LRM or ALM used to control BIDI formatting.
Ignore all characters in [:Zs:] unless they occur between
letters. (In the heuristics below, even those between letters
are ignored except to delimit fields)
Map all characters in [:Dash:] to U+002D
HYPHEN-MINUS
Use the data in the
map equivalent characters (for example, curly to straight
apostrophes). Other apostrophe-like characters should also be
treated as equivalent, especially if the character actually
used in a format may be unavailable on some keyboards. For
example:
U+02BB MODIFIER LETTER TURNED COMMA (ʻ) might be
typed instead as U+2018 LEFT SINGLE QUOTATION MARK
(‘).
U+02BC MODIFIER LETTER APOSTROPHE (ʼ) might be typed
instead as U+2019 RIGHT SINGLE QUOTATION MARK (’), U+0027
APOSTROPHE, etc.
U+05F3 HEBREW PUNCTUATION GERESH (׳) might be typed
instead as U+0027 APOSTROPHE.
Apply mappings particular to the domain (i.e., for dates
or for numbers, discussed in more detail below)
Apply case folding (possibly including language-specific
mappings such as Turkish i)
Normalize to NFKC; thus no-break space will map to space;
half-width katakana will map to full-width.
Loose matching involves (logically) applying the above
transform to both the input text and to each of the field
elements used in matching, before applying the specific
heuristics below. For example, if the input number text is " -
NA f. 1,000.00", then it is mapped to "-naf1,000.00" before
processing. The currency signs are also transformed, so "NA f."
is converted to "naf" for purposes of matching. As with other
Unicode algorithms, this is a logical statement of the process;
actual implementations can optimize, such as by applying the
transform incrementally during matching.
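The transform can be sketched with Python's unicodedata module. This is a simplified illustration under several assumptions, each of which differs from a full implementation: the name loose_normalize is hypothetical; General_Category Cf stands in for "format characters" and Pd stands in for [:Dash:] (which also contains characters outside Pd); all [:Zs:] characters are dropped, as in the heuristics that ignore even spaces between letters; the apostrophe table is a small hand-written stand-in for CLDR data; and domain-specific and language-specific mappings are omitted.

```python
import unicodedata

# Assumed stand-in for CLDR equivalence data: fold apostrophe-like characters to U+0027.
APOSTROPHE_LIKE = {"\u2018": "'", "\u2019": "'", "\u02BB": "'", "\u02BC": "'", "\u05F3": "'"}


def loose_normalize(text: str, decimal_set=frozenset(".")) -> str:
    """Apply a simplified version of the loose matching transform to text."""
    # Step 1: drop "." unless it is a decimal separator immediately before a digit.
    kept = []
    for index, char in enumerate(text):
        if char == ".":
            following = text[index + 1] if index + 1 < len(text) else ""
            if not ("." in decimal_set and following.isdecimal()):
                continue
        kept.append(char)
    # Remaining character-level steps.
    mapped = []
    for char in kept:
        category = unicodedata.category(char)
        if category in ("Cf", "Zs"):  # format characters (RLM, LRM, ALM, ...) and spaces
            continue
        if category == "Pd":          # approximation of [:Dash:]
            char = "-"
        mapped.append(APOSTROPHE_LIKE.get(char, char))
    # Case folding, then NFKC normalization.
    return unicodedata.normalize("NFKC", "".join(mapped).casefold())
```

Applied to the example input " - NA f. 1,000.00", the sketch removes the "." after "f" (it is followed by a space, not a digit), keeps the "." before "00", drops the spaces, and case-folds the rest, giving "-naf1,000.00".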
7.2 Handling Invalid Patterns
Processes sometimes encounter invalid number or date
patterns, such as a number pattern with “¤¤¤¤¤” (valid pattern
character but invalid length in current CLDR), a date pattern
with “nn” (invalid pattern character in current CLDR), or a
date pattern with “MMMMMM” (invalid length in current CLDR).
The recommended behavior for handling such an invalid pattern
field is:
For a field using a currently-invalid length for a valid
pattern character:
In formatting, emit U+FFFD REPLACEMENT CHARACTER for the invalid field.
In parsing, the field may be parsed as if it had a valid length.
For a pattern that contains a currently-invalid pattern
character (applies only to date patterns, for which A-Za-z
are reserved as pattern characters but not all defined as
valid):
Produce an error (set an error code or throw an
exception) when an attempt is made to create a formatter
with such a pattern or to apply such a pattern to an
existing formatter.
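The two cases can be told apart by a small scan of a date pattern. The following Python sketch is illustrative only: the function classify_date_pattern is hypothetical, and MAX_FIELD_LENGTH is a deliberately tiny, assumed table of pattern characters and maximum lengths, not the full set defined in Part 4. A formatter built on it would emit U+FFFD for fields flagged as having an invalid length.

```python
# Assumed, partial table: pattern character -> maximum field length (None = unbounded).
MAX_FIELD_LENGTH = {"y": None, "M": 5, "d": 2, "H": 2, "m": 2, "s": 2}


def classify_date_pattern(pattern: str):
    """Return (symbol, length, has_valid_length) for each unquoted field.

    Raises ValueError for a letter that is not a known pattern character.
    """
    fields = []
    index = 0
    in_quote = False
    while index < len(pattern):
        char = pattern[index]
        if char == "'":
            if index + 1 < len(pattern) and pattern[index + 1] == "'":
                index += 2          # '' is a literal apostrophe
                continue
            in_quote = not in_quote
            index += 1
            continue
        if in_quote or not (char.isascii() and char.isalpha()):
            index += 1              # literal text
            continue
        if char not in MAX_FIELD_LENGTH:
            raise ValueError(f"invalid pattern character: {char!r}")
        end = index
        while end < len(pattern) and pattern[end] == char:
            end += 1
        length = end - index
        limit = MAX_FIELD_LENGTH[char]
        fields.append((char, length, limit is None or length <= limit))
        index = end
    return fields
```

Raising at classification time corresponds to producing an error when a formatter is created with, or has applied to it, a pattern containing an invalid pattern character.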
Annex A Deprecated Structure
The
DTD Annotations
in Section 5.7 are used to determine whether elements, attributes, or attribute values are deprecated.
While valid LDML, they are strongly
discouraged, and no longer used in CLDR.
The remainder of this section describes selected cases of
deprecated structure that were present in previous versions of
CLDR.
A.1 Element fallback
The fallback element is deprecated. Implementations should
use instead the information in
Section 4.4 Language Matching
for
doing language fallback.
A.2 BCP 47 Keyword Mapping
Note:
This structure is deprecated and replaced
with
Section
3.6.4 U Extension Data Files
mapTypes* ) >
This section defines mappings between old Unicode locale
identifier key/type values and their BCP 47 'u' extension
subtag representations. The 'u' extension syntax described in
Section 3.6 Unicode BCP 47 U
Extension
restricts a key to two ASCII alphanumerics and a
type to three to eight ASCII alphanumerics. A key or a type
which does not meet that syntax requirement is converted
according to the mapping data defined by the mapKeys or
mapTypes elements. For example, a keyword "collation=phonebook"
is converted to BCP 47 'u' extension subtags "co-phonebk" by
the mapping data below:
...
...
...
...
A.3 Choice Patterns
Note:
This structure is deprecated and replaced
with count attributes.
A choice pattern is a string that chooses among a number of
strings, based on numeric value. It has the following form:
<choice_pattern> = <choice> ( '|' <choice> )*
<choice> = <number><relation><string>
<number> = ('+' | '-')? ('∞' | [0-9]+ ('.' [0-9]+)?)
<relation> = '<' | '≤'
The interpretation of a choice pattern is that given a
number N, the pattern is scanned from right to left, for each
choice evaluating <number> <relation> N. The first
choice that matches results in the corresponding string. If no
match is found, then the first string is used. For example:
Pattern           N                       Result
0≤Rf|1≤Ru|1<Re    -3, -1, -0.000001       Rf (defaulted to first string)
                  0, 0.01, 0.9999         Rf
                  1                       Ru
                  1.00001, 5, 99, ∞       Re
Quoting is done using ' characters, as in date or number
formats.
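The right-to-left scan can be sketched in Python as follows. The function select_choice is hypothetical, the relations are taken to be '<' and '≤', and quoting with ' characters is not handled, so the sketch only applies to patterns whose strings contain no '|' or quotes.

```python
import math
import re

_CHOICE = re.compile(r"([+-]?(?:∞|[0-9]+(?:\.[0-9]+)?))([<≤])(.*)", re.DOTALL)


def _to_number(text: str) -> float:
    """Convert a choice number such as '-3', '0.5' or '∞' to a float."""
    sign = -1.0 if text.startswith("-") else 1.0
    digits = text.lstrip("+-")
    return sign * (math.inf if digits == "∞" else float(digits))


def select_choice(pattern: str, n: float) -> str:
    """Scan the choices from right to left and return the first matching string."""
    choices = []
    for part in pattern.split("|"):
        match = _CHOICE.fullmatch(part)
        if match is None:
            raise ValueError(f"malformed choice: {part!r}")
        number, relation, string = match.groups()
        choices.append((_to_number(number), relation, string))
    for number, relation, string in reversed(choices):
        matched = number < n if relation == "<" else number <= n
        if matched:
            return string
    return choices[0][2]  # no match: default to the first string
```

For the pattern 0≤Rf|1≤Ru|1<Re, a value of 1 fails 1<1 but satisfies 1≤1, so the scan stops at Ru.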
A.4 Element default
Note:
This structure is deprecated.
Use
replacement structure instead, for example:
For
For
locale is now specified by
Calendar
Preference Data
In some cases, a number of elements are present. The default
element can be used to indicate which of them is the default,
in the absence of other information. The value of the choice
attribute is to match the value of the type attribute for the
selected item.
h:mm:ss a z
h:mm:ss a z
h:mm:ss a
...
Like all other elements, the default element is
inherited. Thus, it can also refer to inherited resources. For
example, suppose that the above resources are present in fr,
and that in fr_BE we have the following:
In that case, the default time format for fr_BE would be the
inherited "long" resource from fr. Now suppose that we had in
fr_CA:
...
In this case, the default element
has the value "medium". It thus refers to this new "medium"
pattern in this resource bundle.
A.5 Deprecated Common Attributes
A.5.1 Attribute standard
Note:
This attribute is deprecated.
Instead, use a reference element with the attribute
standard="true".
The value of this attribute is a list of strings
representing standards: international, national, organization,
or vendor standards. The presence of this attribute indicates
that the data in this element is compliant with the indicated
standards. Where possible, for uniqueness, the string should be
a URL that represents that standard. The strings are separated
by commas; leading or trailing spaces on each string are not
significant. Examples:
...
A.5.2 Attribute draft in non-leaf elements
The draft attribute is deprecated except in leaf elements
(elements that do not have any subelements).
A.6 Element base
Note:
This element is deprecated.
Use the
collation
The optional base element
...
, contains an
alias element that points to another data source that defines a
base
collation. If present, it indicates that the
settings and rules in the collation are modifications applied
on
top of the
respective elements in the base collation.
That is, any successive settings, where present, override what
is in the base as described in
Setting Options
. Any
successive rules are concatenated to the end of the rules in
the base. The results of multiple rules applying to the same
characters is covered in
Orderings
A.7 Element rules
Note:
The XML collation syntax is deprecated; this
includes the
that the
subelement of
Use the basic collation
syntax with the
element
instead.
), ( reset | import | p | pc | s | sc | t | tc | i | ic | x)*
)) >
A.8 Deprecated subelements of
A.9 Deprecated
subelements of
forms are specified in the
monthNames, monthAbbr are equivalent to: using the months
element with the context type="format" and the width
type="wide" (for ...Names) and type="narrow" (for ...Abbr),
respectively.
are specified in the
dayNames, dayAbbr are equivalent to: using the days element
with the context type="format" and the width
type="wide" (for ...Names) and type="narrow" (for ...Abbr),
respectively.
is
deprecated in the main LDML files, because the data is more
appropriately organized as connected to territories, not to
linguistic data. Use the supplemental
element instead.
of the
located just under a
Calendar Fields
A.10 Deprecated
subelements of
(deprecated), e.g. "{0} Time ({1})" for "United States
Time (New York)"
modern zones; use metazones instead.
Primary Zones
A.11 Deprecated
subelements of
zone was commonly used in the locale.
A.12
Renamed attribute values for
element
The
CLDR 21. The values for its
type
attribute are
documented in
. In
CLDR 25, some of these values were renamed from their previous
values for improved clarity:
"type" was renamed to "keyValue"
"displayName" was renamed to "currencyName"
"displayName-count" was renamed to
"currencyName-count"
"tense" was renamed to "relative"
A.13 Deprecated
subelements of
and replaced with
A.14 Element cp
The cp element was used to escape characters that cannot be
represented in XML, even with NCRs. These escapes were only
allowed in certain elements, according to the DTD.
However, this mechanism is very clumsy, and was replaced by
specialized syntax.
Code Point
XML Example
U+0000
A.15 Attribute validSubLocales
The attribute
validSubLocales
allowed sublocales in a
given tree to be treated as though a file for them were present
when there was not one. It only had an effect for locales that
inherit from the current file where a file is missing.
Example 1.
Suppose that in a particular LDML tree,
there are no region locales for German, for example, there is a
de.xml file, but no files for de_AT.xml, de_CH.xml, or
de_DE.xml. Then no elements are valid for any of those region
locales. If we want to mark one of those files as having valid
elements, then we introduce an empty file, such as the
following.
With the
validSubLocales
attribute, instead of adding
the empty files for de_AT.xml, de_CH.xml, and de_DE.xml, in the
de file we could add to the parent locale a list of the child
locales that should behave as if files were present.
...
Now that the
validSubLocales
attribute has been
deprecated, it is recommended to simply add empty files to
specify which sublocales are valid. This convention is used
throughout the CLDR.
A.16 Elements postalCodeData,
postCodeRegex
The postal code validation data has been deprecated. Please
see other services that are kept up to date, such as:
...
See
Postal
Code Validation
A.17 Element telephoneCodeData
The element telephoneCodeData
has been deprecated and the data removed.
Annex B Links to Other Parts
The LDML specification is split into several
parts
by topic, with one HTML document per part.
The following tables provide redirects for links to specific
topics. Please update your links and bookmarks.
Part 1 Links: Core (this document): No redirects needed.
Part 2 Links
General
(display names &
transforms, etc.)
Old section
Section in new part
5.4
Display
Name Elements
Display Name
Elements
5.5
Layout Elements
Layout
Elements
5.6
Character
Elements
Character
Elements
5.6.1
Exemplar Syntax
3.1
Exemplar
Syntax
5.6.2 Restrictions
3.1
Exemplar
Syntax
5.6.3 Mapping
3.2
Mapping
5.6.4
Index Labels
3.3
Index
Labels
5.6.5 Ellipsis
3.4
Ellipsis
5.6.6 More Information
3.5
More
Information
5.7
Delimiter
Elements
Delimiter
Elements
C.6
Measurement System Data
Measurement
System Data
5.8
Measurement Elements
(deprecated)
5.1
Measurement
Elements (deprecated)
5.11
Unit Elements
Unit
Elements
5.12
POSIX Elements
POSIX
Elements
5.13
Reference
Element
Reference
Element
5.15
Segmentations
Segmentations
5.15.1
Segmentation
Inheritance
9.1
Segmentation
Inheritance
5.16
Transforms
10
Transforms
Transform Rules
10.3
Transform Rules
Syntax
5.18
List Patterns
11
List
Patterns
C.20
Gender of Lists
11.1
Gender of
Lists
5.19
ContextTransform
Elements
12
ContextTransform
Elements
Part 3 Links
Numbers
(number & currency
formatting)
Old section
Section in new part
C.13
Numbering
Systems
Numbering
Systems
5.10
Number Elements
Number
Elements
5.10.1
Number Symbols
2.3
Number
Symbols
Number Format Patterns
Number Format
Patterns
5.10.2
Currencies
Currencies
C.1
Supplemental Currency
Data
4.1
Supplemental
Currency Data
C.11
Language Plural Rules
Language Plural
Rules
5.17
Rule-Based Number
Formatting
Rule-Based
Number Formatting
Part 4 Links
Dates
(date, time, time zone
formatting)
Old section
Section in new part
5.9 Date Elements
Overview:
Dates Element, Supplemental Date and Calendar
Information
5.9.1 Calendar Elements
Calendar
Elements
Elements months, days,
quarters, eras
2.1
Elements
months, days, quarters, eras
Elements monthPatterns,
cyclicNameSets
2.2
Elements
monthPatterns, cyclicNameSets
Element dayPeriods
2.3
Element
dayPeriods
Element dateFormats
2.4
Element
dateFormats
Element timeFormats
2.5
Element
timeFormats
Element dateTimeFormats
2.6
Element
dateTimeFormats
5.9.2 Calendar Fields
Calendar
Fields
5.9.3
Time Zone Names
Time Zone
Names
C.5 Supplemental Calendar
Data
Supplemental
Calendar Data
C.7 Supplemental Time Zone
Data
Supplemental
Time Zone Data
C.15 Calendar Preference
Data
4.2
Calendar
Preference Data
C.17 DayPeriod Rules
4.5
Day
Period Rules
Appendix
F: Date Format Patterns
Date
Format Patterns
Date Field Symbol Table
Date
Field Symbol Table
F.1 Localized Pattern
Characters (deprecated)
8.1
Localized
Pattern Characters (deprecated)
Appendix J: Time Zone Display
Names
Using
Time Zone Names
fallbackFormat
fallbackFormat
O.4 Parsing Dates and Times
Parsing
Dates and Times
Part 5 Links
Collation
(sorting, searching,
grouping)
Old section
Section in new part
5.14
Collation
Elements
Collation
Tailorings
5.14.1
Version
3.1
Version
5.14.2
Collation
Element
3.2
Collation
Element
5.14.3
Setting
Options
3.3
Setting
Options
Table
Collation
Settings
Table
Collation
Settings
5.14.4
Collation Rule Syntax
3.4
Collation Rule
Syntax
5.14.5
Orderings
3.5
Orderings
5.14.6
Contractions
3.6
Contractions
5.14.7
Expansions
3.7
Expansions
5.14.8
Context Before
3.8
Context
Before
5.14.9
Placing Characters
Before Others
3.9
Placing
Characters Before Others
5.14.10
Logical Reset Positions
3.10
Logical Reset
Positions
5.14.11
Special-Purpose
Commands
3.11
Special-Purpose
Commands
5.14.12
Collation
Reordering
3.12
Collation
Reordering
5.14.13
Case
Parameters
3.13
Case
Parameters
Definition:
UncasedExceptions
removed: see 3.13
Case
Parameters
Definition:
LowerExceptions
removed: see 3.13
Case
Parameters
Definition:
UpperExceptions
removed: see 3.13
Case
Parameters
5.14.14
Visibility
3.14
Visibility
Part 6 Links
Supplemental
(supplemental data)
Old section
Section in new part
Supplemental Data
Introduction
Supplemental
Data
C.2
Supplemental Territory
Containment
1.1
Supplemental
Territory Containment
C.4
Supplemental Territory
Information
1.2
Supplemental
Territory Information
C.3
Supplemental Language
Data
Supplemental
Language Data
C.9
Supplemental Code
Mapping
Supplemental
Code Mapping
C.12
Telephone
Code Data
Telephone Code
Data
C.14
Postal Code Validation
Postal Code
Validation
C.8
Supplemental
Character Fallback Data
Supplemental
Character Fallback Data
Coverage Levels
Coverage
Levels
5.20
Metadata Elements
10
Locale
Metadata Element
Supplemental
Metadata
P.1
Supplemental Alias
Information
P.2
Supplemental
Deprecated Information
P.3
Default Content
Supplemental
Metadata
9.1
Supplemental
Alias Information
9.2
Supplemental
Deprecated Information
9.3
Default
Content
Part 7 Links
Keyboards
(keyboard mappings)
Old section
Section in new part
Keyboards
Keyboards
Goals and
Nongoals
Goals
and Nongoals
File
and Directory Structure
File and
Directory Structure
Element Hierarchy - Layout
File
Element
Hierarchy - Layout File
Element Hierarchy -
Platform File
Element
Hierarchy - Platform File
Invariants
Invariants
Data Sources
Data
Sources
Keyboard IDs
Keyboard
IDs
Platform Behaviors in
Edge Cases
Platform
Behaviors in Edge Cases
Element: keyboard
Element:
keyboard
Element: version
Element:
version
Element:
generation
Element:
generation
Element: names
Element:
names
Element: name
Element:
name
Element: settings
Element:
settings
Element: keyMap
Element:
keyMap
Element: map
Element:
map
Element:
transforms
Element:
transforms
Element: transform
Element:
transform
Element: platform
Element:
platform
Element:
hardwareMap
Element:
hardwareMap
Principles for Keyboard
Ids
Principles
for Keyboard Ids
References
Ancillary Information
To properly localize,
parse, and format data requires ancillary information,
which is not expressed in Locale Data Markup Language. Some
of the formats for values used in Locale Data Markup
Language are constructed according to external
specifications. The sources for this data and/or formats
include the following:
Bugs
CLDR Bug Reporting
form
Charts
The online code charts can
be found at
An index to character names with links to the corresponding
chart is found at
DUCET
The Default Unicode
Collation Element Table (DUCET)
For the base-level collation, of which all the collation
tables in this document are tailorings.
FAQ
Unicode
Frequently Asked Questions
For answers to common questions on technical
issues.
FCD
As defined in UTN #5
Canonical Equivalences in Applications
Glossary
Unicode Glossary
For explanations of
terminology used in this and other documents.
JavaChoice
Java ChoiceFormat
Olson
The
TZ
ID Database
(aka Olson timezone database)
Time zone and daylight savings information.
For archived data, see
ftp://ftp.iana.org/tz/releases/
Reports
Unicode Technical
Reports
For information on the status and development
process for technical reports, and for a list of technical
reports.
Unicode
The Unicode Consortium.
The Unicode Standard, Version
7.0.0
, (Mountain View, CA: The Unicode
Consortium, 2014. ISBN 978-1-936213-09-2)
Versions
Versions of the Unicode
Standard
For information on version numbering, and citing and
referencing the Unicode Standard, the Unicode Character
Database, and Unicode Technical Reports.
XPath
Other Standards
Various standards
define codes that are used as keys or values in Locale Data
Markup Language. These include:
BCP47
The Registry
ISO639
ISO Language Codes
Actual List
ISO1000
ISO 1000: SI units and
recommendations for the use of their multiples and of
certain other units, International Organization for
Standardization, 1992.
ISO3166
ISO Region Codes
Actual List
ISO4217
ISO Currency Codes
(Note that as of this point, there are significant
problems with this list. The supplemental data file
contains the best compendium of currency information
available.)
ISO8601
ISO Date and Time
Format
ISO15924
ISO Script Codes
Actual List
LOCODE
United Nations Code for
Trade and Transport Locations, commonly known as
"UN/LOCODE"
Download at:
RFC6067
BCP 47 Extension U
RFC6497
BCP 47 Extension T -
Transformed Content
UNM49
UN M.49: UN Statistics Division
Country or area & region codes
Composition of macro geographical (continental)
regions, geographical sub-regions, and selected economic
and other groupings
XML Schema
W3C XML Schema
General
The following are
general references from the text:
ByType
CLDR Comparison Charts
Calendars
Calendrical Calculations:
The Millennium Edition by Edward M. Reingold, Nachum
Dershowitz; Cambridge University Press; Book and CD-ROM
edition (July 1, 2001); ISBN: 0521777526. Note that the
algorithms given in this book are copyrighted.
Comparisons
Comparisons between locale
data from different sources
CurrencyInfo
UNECE Currency Data
DataFormats
CLDR Translation
Guidelines
Example
A sample in Locale Data
Markup Language
ICUCollation
ICU rule syntax
ICUTransforms
Transforms
Transforms Demo
ICUUnicodeSet
ICU UnicodeSet
API
ITUE164
International
Telecommunication Union: List Of ITU Recommendation E.164
Assigned Country Codes
available at
LocaleExplorer
ICU Locale Explorer
LocaleProject
Common Locale Data
Repository Project
NamingGuideline
OpenI18N Locale Naming
Guideline
formerly at
RBNF
Rule-Based Number
Format
RBBI
Rule-Based Break
Iterator
UCAChart
Collation Chart
UTCInfo
NIST Time and Frequency
Division Home Page
U.S. Naval Observatory: What is Universal Time?
WindowsCulture
Windows Culture Info
(with mappings from [BCP47]-style codes to LCIDs)
Acknowledgments
Special thanks to the following people for their continuing
overall contributions to the CLDR project, and for their
specific contributions in the following areas. These
descriptions only touch on the many contributions that they
have made.
Mark Davis for creating the initial version of LDML, and
adding to and maintaining this specification, and for his
work on the LDML code and tests, much of the supplemental
data and overall structure, and transforms and
keyboards.
John Emmons for the POSIX conversion tool and
metazones.
Deborah Goldsmith for her contributions to LDML
architecture and this specification.
Chris Hansten for coordinating and managing data
submissions and vetting.
Erkki Kolehmainen and his team for their work on
Finnish.
Steven R. Loomis for development of the survey tool and
database management.
Peter Nugent for his contributions to the POSIX tool and
from Open Office, and for coordinating and managing data
submissions and vetting.
George Rhoten for his work on currencies.
Roozbeh Pournader (روزبه پورنادر) for his work on South
Asian countries.
Ram Viswanadha (రఘురామ్ విశ్వనాధ) for all of his work on
LDML code and data integration, and for coordinating and
managing data submissions and vetting.
Vladimir Weinstein (Владимир Вајнштајн) for his work on
collation.
Yoshito Umaoka (馬岡 由人) for his work on the timezone
architecture.
Rick McGowan for his work gathering language, script and
region data.
Xiaomei Ji (吉晓梅) for her work on time intervals and
plural formatting.
David Bertoni for his contributions to the conversion
tools.
Mike Tardif for reviewing this specification and for
coordinating and vetting data submissions.
Peter Edberg for work on this specification,
monthPatterns, cyclicNameSets, contextTransforms and other
items.
Raymond Wainman and Cibu Johny for their work on
keyboards.
Jennifer Chye for her contributions to the conversion
tools.
Markus Scherer for a major rewrite of Part 5, Collation.
Shane Carr for his work on numbers and measurement units.
Other contributors to CLDR are listed on the CLDR Project Page.
Modifications
Revision 57
Part 1:
Core
(languages, locales, basic structure)
Section 3.2
Unicode Locale Identifier
Clarified differences between Unicode locale identifiers and RFC 6067.
CLDR-11770
Part 2:
General
(display names &
transforms, etc.)
Section 11
List Patterns
Added entries for standard-narrow and or-narrow patterns.
CLDR-13301
Section 14.1
Synthesizing Sequence Names
Documented changes in the Emoji derived name algorithm.
CLDR-11952
Part 3:
Numbers
(number & currency
formatting)
no changes
Part 4:
Dates
(date, time, time zone formatting)
no changes
Part 5:
Collation
(sorting,
searching, grouping)
no changes
Part 6:
Supplemental
(supplemental
data)
no changes
Part 7:
Keyboards
(keyboard
mappings)
no changes
Modifications in previous versions are listed in those
respective versions. Click on
Previous Version
in the header until you get to the desired version.
Copyright © 2001–2019 Unicode, Inc. All
Rights Reserved. The Unicode Consortium makes no expressed or
implied warranty of any kind, and assumes no liability for
errors or omissions. No liability is assumed for incidental and
consequential damages in connection with or arising out of the
use of the information or programs contained or accompanying
this technical report. The Unicode Terms of Use apply.
Unicode and the Unicode logo are
trademarks of Unicode, Inc., and are registered in some
jurisdictions.