UTS #35: Unicode Locale Data Markup Language
Technical
Reports
Unicode Technical
Standard #35
Unicode Locale Data Markup
Language (LDML)
Version
30
Editors
Mark Davis
markdavis@google.com
and
other CLDR committee
members
Date
2016-10-05
This Version
Previous Version
Latest Version
Corrigenda
Latest Proposed Update
Namespace
DTDs
Revision
45
Summary
This document describes an XML format (vocabulary) for the
exchange of structured locale data. This format is used in the
Unicode Common Locale Data Repository.
Status
This document has been reviewed by Unicode members and other
interested parties, and has been approved for publication by the
Unicode Consortium. This is a stable document and may be used as
reference material or cited as a normative reference by other
specifications.
A Unicode Technical Standard (UTS)
is an independent
specification. Conformance to the Unicode Standard does not imply
conformance to any UTS.
Please submit corrigenda and other comments with the CLDR bug
reporting form [
Bugs
].
Related information that is useful in understanding this document is
found in the
References
. For the latest
version of the Unicode Standard see [
Unicode
]. For a
list of current Unicode Technical Reports see [
Reports
]. For more
information about versions of the Unicode Standard, see [
Versions
].
Parts
The LDML specification is divided into the following parts:
Part 1:
Core
(languages,
locales, basic structure)
Part 2:
General
(display names & transforms, etc.)
Part 3:
Numbers
(number & currency formatting)
Part 4:
Dates
(date,
time, time zone formatting)
Part 5:
Collation
(sorting, searching, grouping)
Part 6:
Supplemental
(supplemental data)
Part 7:
Keyboards
(keyboard mappings)
Contents of Part 1, Core
Introduction
1.1
Conformance
What is a Locale?
Unicode Language and Locale
Identifiers
3.1
Unicode
Language Identifier
3.2
Unicode
Locale Identifier
3.3
BCP 47 Conformance
3.3.1
BCP
47 Language Tag Conversion
3.4
Language Identifier
Field Definitions
Table:
Language
Identifier Field Definitions
3.5
Special Codes
3.5.1
Unknown
or Invalid Identifiers
3.5.2
Numeric Codes
3.5.3
Private Use Codes
Table:
Private Use
Codes in CLDR
3.6
Unicode
BCP 47 U Extension
3.6.1
Key And
Type Definitions
Table:
Key/Type
Definitions
3.6.2
Numbering
System Data
3.6.3
Time Zone
Identifiers
3.6.4
Extension Data Files
3.6.4.1
CODEPOINTS
3.6.4.2
REORDER_CODE
3.6.4.3
RG_KEY_VALUE
3.6.4.4
SUBDIVISION_CODE
3.6.4.5
PRIVATE_USE
3.6.5
Subdivision
Codes
3.6.5.1
Validity
3.7
Unicode BCP 47 T Extension
3.7.1
Extension Data Files
3.8
Compatibility
with Older Identifiers
3.8.1
Old
Locale Extension Syntax
Table:
Locale
Extension Mappings
3.8.2
Legacy Variants
Table:
Legacy
Variant Mappings
3.8.3
Relation to
OpenI18n
3.9
Transmitting
Locale Information
3.9.1
Message
Formatting and Exceptions
3.10
Unicode
Language and Locale IDs
3.10.1
Written Language
3.11
Validity Data
Locale Inheritance and
Matching
4.1
Lookup
4.1.1
Bundle vs
Item Lookup
Table:
Lookup
Differences
4.1.2
Lateral
Inheritance
Table:
Count
Fallback: normal
Table:
Count
Fallback: currency
4.1.3
Parent Locales
4.2
Inheritance
and Validity
4.2.1
Definitions
4.2.2
Resolved Data
File
4.2.3
Valid Data
4.2.4
Checking
for Draft Status
4.2.5
Keyword
and Default Resolution
4.3
Likely Subtags
4.4
Language Matching
XML Format
5.1
Common Elements
5.1.1
Element special
5.1.1.1
Sample
Special Elements
5.1.2
Element alias
Table:
Inheritance
with source="locale"
5.1.3
Element
displayName
5.1.4
Escaping
Characters
5.2
Common Attributes
5.2.1
Attribute type
5.2.2
Attribute draft
5.2.3
Attribute alt
5.3
Common Structures
5.3.1
Date and Date Ranges
5.3.2
Text
Directionality
5.3.3
Unicode Sets
5.3.3.1
Lists of
Code Points
5.3.3.2
Unicode
Properties
5.3.3.3
Boolean
Operations
5.3.3.4
UnicodeSet
Examples
5.3.4
String Range
5.4
Identity Elements
5.5
Valid Attribute
Values
5.6
Canonical Form
5.6.1
Content
5.6.2
Ordering
5.6.3
Comments
5.7
DTD Annotations
Property Data
6.1
Script Metadata
6.2
Extended Pictographic
Issues in Formatting
and Parsing
7.1
Lenient Parsing
7.1.1
Motivation
7.1.2
Loose Matching
7.2
Handling Invalid
Patterns
Deprecated Structure
8.1
Element fallback
8.2
BCP 47 Keyword
Mapping
8.3
Choice Patterns
8.4
Element default
8.5
Deprecated
Common Attributes
8.5.1
Attribute
standard
8.5.2
Attribute
draft in non-leaf elements
8.6
Element base
8.7
Element rules
8.8
Deprecated
subelements of
8.9
Deprecated
subelements of
8.10
Deprecated
subelements of
8.11
Deprecated
subelements of and
8.12
Renamed
attribute values for element
8.13
Deprecated
subelements of
8.14
Element cp
8.15
Attribute
validSubLocales
8.16
Elements
postalCodeData, postCodeRegex
Links to Other Parts
Table:
Part 2 Links: General
(display names & transforms, etc.)
Table:
Part 3 Links: Numbers
(number & currency formatting)
Table:
Part 4 Links: Dates
(date, time, time zone formatting)
Table:
Part 5 Links: Collation
(sorting, searching, grouping)
Table:
Part 6 Links:
Supplemental (supplemental data)
Table:
Part 7 Links: Keyboards
(keyboard mappings)
References
Acknowledgments
Modifications
1 Introduction
Not long ago, computer systems were like separate worlds,
isolated from one another. The internet and related events have
changed all that. A single system can be built of many different
components, hardware and software, all needing to work together. Many
different technologies have been important in bridging the gaps; in
the internationalization arena, Unicode has provided a lingua franca
for communicating textual data. However, there remain differences in
the locale data used by different systems.
The best practice for internationalization is to store and
communicate language-neutral data, and format that data for the
client. This formatting can take place on any of a number of the
components in a system; a server might format data based on the
user's locale, or it could be that a client machine does the
formatting. The same goes for parsing data, and locale-sensitive
analysis of data.
But there remain significant differences across systems and
applications in the locale-sensitive data used for such formatting,
parsing, and analysis. Many of those differences are simply
gratuitous; all within acceptable limits for human beings, but
yielding different results. In many other cases there are outright
errors. Whatever the cause, the differences can cause discrepancies
to creep into a heterogeneous system. This is especially serious in
the case of collation (sort-order), where different collation causes
not only ordering differences, but also different results of queries!
That is, with a query of customers with names between "Abbot,
Cosmo" and "Arnold, James", if different systems have
different sort orders, different lists will be returned. (For
comparisons across systems formatted as HTML tables, see [
Comparisons
].)
Note:
There are many different equally valid ways in which
data can be judged to be "correct" for a particular
locale. The goal for the common locale data is to make it as
consistent as possible with existing locale data, and acceptable to
users in that locale.
This document specifies an XML format for the communication of
locale data: the Unicode Locale Data Markup Language (LDML). This
provides a common format for systems to interchange locale data so
that they can get the same results in the services provided by
internationalization libraries. It also provides a standard format
that can allow users to customize the behavior of a system. With it,
for example, collation (sorting) rules can be exchanged, allowing two
implementations to exchange a specification of tailored collation
rules. Using the same specification, the two implementations will
achieve the same results in comparing strings. Unicode LDML can also
be used to let a user encapsulate specialized sorting behavior for a
specific domain, or create a customized locale for a minority
language. Unicode LDML is also used in the Unicode Common Locale Data
Repository (CLDR). CLDR uses an open process for reconciling
differences between the locale data used on different systems and
validating the data, to produce a useful, common, consistent
base of locale data.
For more information, see the Common Locale Data Repository project
page [
LocaleProject
].
As LDML is an interchange format, it was designed for ease of
maintenance and simplicity of transformation into other formats,
above efficiency of run-time lookup and use. Implementations should
consider converting LDML data into a more compact format prior to
use.
1.1 Conformance
There are many ways to use the Unicode LDML format and the data
in CLDR, and the Unicode Consortium does not restrict the ways in
which the format or data are used. However, an implementation may
also claim conformance to LDML or to CLDR, as follows:
UAX35-C1.
An implementation that claims conformance to
this specification shall:
Identify the sections of the specification that it conforms
to.
For example, an implementation might claim conformance to
all LDML features except for
transforms
and
segments
Interpret the relevant elements and attributes of LDML
documents in accordance with the descriptions in those sections.
For example, an implementation that claims conformance to
the date format patterns must interpret the characters in such
patterns according to
Date Field
Symbol Table
Declare which types of CLDR data it uses.
For example, an implementation might declare that it only
uses language names, and those with a
draft
status of
contributed
or
approved
UAX35-C2.
An implementation that claims conformance to
Unicode locale or language identifiers shall:
Specify whether Unicode locale extensions are allowed
Specify the canonical form used for identifiers in terms of
casing and field separator characters.
External specifications may also reference particular
components of Unicode locale or language identifiers, such as:
Field X can contain any Unicode region subtag values as given
in Unicode Technical Standard #35: Unicode Locale Data Markup
Language (LDML), excluding grouping codes.
2 What is a Locale?
Before diving into the XML structure, it is helpful to describe
the model behind the structure. People do not have to subscribe to
this model to use data in LDML, but they do need to understand it so
that the data can be correctly translated into whatever model their
implementation uses.
The first issue is basic:
what is a locale?
In this model, a
locale is an identifier (id) that refers to a set of user preferences
that tend to be shared across significant swaths of the world.
Traditionally, the data associated with this id provides support for
formatting and parsing of dates, times, numbers, and currencies; for
measurement units, for sort-order (collation), plus translated names
for time zones, languages, countries, and scripts. The data can also
include support for text boundaries (character, word, line, and
sentence), text transformations (including transliterations), and
other services.
Locale data is not cast in stone: the data used on
someone's machine generally may reflect the US format, for
example, but preferences can typically be set to override particular
items, such as setting the date format for 2002.03.15, or using
metric or Imperial measurement units. In the abstract, locales are
simply one of many sets of preferences that, say, a website may want
to remember for a particular user. Depending on the application, it
may want to also remember the user's time zone, preferred
currency, preferred character set, smoker/non-smoker preference, meal
preference (vegetarian, kosher, and so on), music preference,
religion, party affiliation, favorite charity, and so on.
Locale data in a system may also change over time: country
boundaries change; governments (and currencies) come and go;
committees impose new standards; bugs are found and fixed in the
source data; and so on. Thus the data needs to be versioned for
stability over time.
In general terms, the locale id is a parameter that is supplied to a
particular service (date formatting, sorting, spell-checking, and so
on). The format in this document does not attempt to represent all
the data that could conceivably be used by all possible services.
Instead, it collects together data that is in common use in systems
and internationalization libraries for basic services. The main
difference among locales is in terms of language; there may also be
some differences according to different countries or regions.
However, the line between
locales
and
languages
, as
commonly used in the industry, is rather fuzzy. Note also that the
vast majority of the locale data in CLDR is in fact language data;
all non-linguistic data is separated out into a separate tree. For
more information, see
Section
3.10 Language and Locale IDs
We will speak of data as being "in locale X". That does not
imply that a locale
is
a collection of data; it is simply
shorthand for "the set of data associated with the locale id
X". Each individual piece of data is called a
resource
or
field
, and a tag indicating the key of the resource is called a
resource tag.
3 Unicode
Language and Locale Identifiers
Unicode LDML uses stable identifiers based on [BCP47]
for distinguishing among languages, locales, regions, currencies,
time zones, transforms, and so on. There are many systems for
identifiers for these entities. The Unicode LDML identifiers may not
match the identifiers used on a particular target system. If so, some
process of identifier translation may be required when using LDML
data.
The BCP47 extensions (-u- and -t-) are described in
Section
3.6
Unicode BCP 47 U Extension
and
Section 3.7
Unicode
BCP 47 T Extension
3.1 Unicode Language
Identifier
A Unicode language identifier has the following structure
(provided in either EBNF (Perl-based) or ABNF [RFC5234]).
The following table defines syntactically well-formed identifiers:
they are not necessarily valid identifiers. For additional validity
criteria, see the links on the right.
Each rule is given in EBNF and in ABNF, followed by its validity links where they exist.

unicode_language_id
  EBNF: = "root" | (unicode_language_subtag (sep unicode_script_subtag)? | unicode_script_subtag) (sep unicode_region_subtag)? (sep unicode_variant_subtag)* ;
  ABNF: = "root" / (unicode_language_subtag [sep unicode_script_subtag] / unicode_script_subtag) [sep unicode_region_subtag] *(sep unicode_variant_subtag)

unicode_language_subtag
  EBNF: = alpha{2,3} | alpha{5,8} ;
  ABNF: = 2*3ALPHA / 5*8ALPHA
  Validity: validity, latest-data

unicode_script_subtag
  EBNF: = alpha{4} ;
  ABNF: = 4ALPHA
  Validity: validity, latest-data

unicode_region_subtag
  EBNF: = (alpha{2} | digit{3}) ;
  ABNF: = 2ALPHA / 3DIGIT
  Validity: validity, latest-data

unicode_variant_subtag
  EBNF: = (alphanum{5,8} | digit alphanum{3}) ;
  ABNF: = 5*8alphanum / (DIGIT 3alphanum)
  Validity: validity, latest-data

sep
  EBNF: = [-_] ;
  ABNF: = "-" / "_"

digit
  EBNF: = [0-9] ;

alpha
  EBNF: = [A-Z a-z] ;

alphanum
  EBNF: = [0-9 A-Z a-z] ;
  ABNF: = ALPHA / DIGIT
The semantics of the various subtags is explained in
Section
3.4
Language Identifier Field
Definitions
; there are also direct links from
unicode_language_subtag
, etc. While theoretically the
unicode_language_subtag
may have more than 3 letters through the IANA registration process,
in practice that has not occurred. The
unicode_language_subtag
"und" may be omitted when there is a
unicode_script_subtag
; for that reason
unicode_language_subtag
values with 4 letters are not permitted. However, such
unicode_language_id
values are not intended for general interchange, because they are not
valid BCP47 tags. Instead, they are intended for certain protocols
such as the identification of transliterators or font ScriptLangTag
values.
For example, "en-US" (American English),
"en_GB" (British English), "es-419" (Latin
American Spanish), and "uz-Cyrl" (Uzbek in Cyrillic) are
all valid Unicode language identifiers.
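As an illustration of the grammar above, the following is a minimal Python sketch of a well-formedness check for unicode_language_id. The function name is_well_formed_language_id is a hypothetical name chosen for this example. The check covers syntax only; it does not consult the validity data, so a well-formed identifier may still be invalid.

```python
import re

# Building blocks that mirror the EBNF rules in this section.
_SEP = r"[-_]"
_LANG = r"(?:[A-Za-z]{2,3}|[A-Za-z]{5,8})"
_SCRIPT = r"[A-Za-z]{4}"
_REGION = r"(?:[A-Za-z]{2}|[0-9]{3})"
_VARIANT = r"(?:[0-9A-Za-z]{5,8}|[0-9][0-9A-Za-z]{3})"

# "root" | (language (sep script)? | script) (sep region)? (sep variant)*
_LANGUAGE_ID = re.compile(
    rf"(?:root|(?:{_LANG}(?:{_SEP}{_SCRIPT})?|{_SCRIPT})"
    rf"(?:{_SEP}{_REGION})?(?:{_SEP}{_VARIANT})*)"
)


def is_well_formed_language_id(identifier: str) -> bool:
    """Return True if identifier matches the unicode_language_id syntax.

    Hypothetical helper: this checks well-formedness only, not validity.
    """
    return _LANGUAGE_ID.fullmatch(identifier) is not None


# The examples from the text are all well-formed.
for sample in ("en-US", "en_GB", "es-419", "uz-Cyrl"):
    print(sample, is_well_formed_language_id(sample))
```

Because the check is purely syntactic, an implementation would normally follow it with a lookup against the validity data described in Section 3.11.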
3.2 Unicode Locale Identifier
A Unicode locale identifier
is composed of a Unicode language
identifier plus (optional) locale extensions (U and T). It has the
following structure. The semantics of the U and T extensions are
explained in
Section 3.6
Unicode
BCP 47 U Extension
and
Section 3.7
Unicode
BCP 47 T Extension
. The following table defines syntactically
well-formed identifiers: they are not necessarily valid identifiers.
For additional validity criteria, see the links on the right.
Each rule is given in EBNF and in ABNF, followed by its validity links where they exist.

unicode_locale_id
  EBNF: = unicode_language_id (transformed_extensions unicode_locale_extensions? | unicode_locale_extensions? transformed_extensions?) ;
  ABNF: = unicode_language_id ([transformed_extensions [unicode_locale_extensions]] / [unicode_locale_extensions [transformed_extensions]])

unicode_locale_extensions
  EBNF: = sep "u" ((sep keyword)+ | (sep attribute)+ (sep keyword)*) ;
  ABNF: = sep "u" (1*(sep keyword) / 1*(sep attribute) *(sep keyword))

transformed_extensions
  EBNF: = sep "t" (("-" tlang ("-" tfield)*) | ("-" tfield)+) ;
  ABNF: = sep "t" (("-" tlang *("-" tfield)) / 1*("-" tfield))

keyword
  EBNF: = key (sep type)? ;
  ABNF: = key [sep type]

key
  EBNF: = alphanum alpha ;
  ABNF: = alphanum ALPHA
  Validity: validity, latest-data

type
  EBNF: = alphanum{3,8} (sep alphanum{3,8})* ;
  ABNF: = 3*8alphanum *(sep 3*8alphanum)
  Validity: validity, latest-data

attribute
  EBNF: = alphanum{3,8} ;
  ABNF: = 3*8alphanum

unicode_subdivision_id
  EBNF: = unicode_region_subtag unicode_subdivision_suffix ;
  ABNF: = unicode_region_subtag unicode_subdivision_suffix
  Validity: validity, latest-data

unicode_subdivision_suffix
  EBNF: = alphanum{1,4} ;
  ABNF: = 1*4alphanum

unicode_measure_unit
  EBNF: = alphanum{3,8} (sep alphanum{3,8})* ;
  ABNF: = 3*8alphanum *(sep 3*8alphanum)
  Validity: validity, latest-data

tlang
  EBNF: = unicode_language_subtag ("-" unicode_script_subtag)? ("-" unicode_region_subtag)? ("-" unicode_variant_subtag)* ;
  ABNF: = unicode_language_subtag ["-" unicode_script_subtag] ["-" unicode_region_subtag] *("-" unicode_variant_subtag)

tfield
  EBNF: = tkey tvalue ;
  ABNF: = tkey tvalue
  Validity: validity, latest-data

tkey
  EBNF: = alpha digit ;
  ABNF: = ALPHA DIGIT

tvalue
  EBNF: = ("-" alphanum{3,8})+ ;
  ABNF: = 1*("-" 3*8alphanum)
For historical reasons, this is called a Unicode locale identifier.
However, it really functions (with few exceptions) as a
language
identifier, and accesses
language
-based
data. Except where it would be unclear, this document uses the term
"locale" data loosely to encompass both types of data: for
more information, see
Section
3.10 Language and Locale IDs
Although not shown in the syntax above, Unicode locale identifiers
may also have [
BCP47
] extensions (other than
"u" and "t") and private use subtags; these are
not, however, relevant to their use in Unicode.
As for terminology, the term
code
may also be used instead of
"subtag", and "territory" instead of
"region". The primary language subtag is also called the
base
language code
. For example, the base language code for
"en-US" (American English) is "en" (English). The
type
may also be referred to as a
value
or
key-value
The identifiers can vary in case and in the separator characters. The
"-" and "_" separators are treated as equivalent.
All identifier field values are case-insensitive. Although case
distinctions do not carry any special meaning, an implementation of
LDML should use the casing recommendations in [
BCP47
],
especially when a Unicode locale identifier is used for locale data
exchange in software protocols. The recommendation is that: the
region subtag is in uppercase, the script subtag is in title case,
and all other subtags are in lowercase.
Note:
The current version of CLDR uses upper case letters for
variant subtags in its file names for backward compatibility reasons.
This might be changed in future CLDR releases.
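The casing recommendation can be sketched in Python as follows. The function name canonical_case is hypothetical, and the sketch assumes a well-formed identifier as input. It classifies subtags by shape only (four letters for a script, two letters for a region) and, as a simplifying assumption, lowercases everything from the first single-character extension subtag onward.

```python
import re


def canonical_case(identifier: str) -> str:
    """Apply the recommended casing to a well-formed identifier.

    Hypothetical helper: region in uppercase, script in title case,
    all other subtags in lowercase, and "-" as the separator.
    """
    subtags = re.split(r"[-_]", identifier)
    result = []
    in_extension = False
    for index, subtag in enumerate(subtags):
        if len(subtag) == 1:
            # A singleton such as "u" or "t" starts an extension.
            in_extension = True
        if in_extension or (index == 0 and subtag.lower() == "root"):
            result.append(subtag.lower())
        elif len(subtag) == 4 and subtag.isalpha():
            result.append(subtag.title())   # script subtag
        elif index > 0 and len(subtag) == 2 and subtag.isalpha():
            result.append(subtag.upper())   # alphabetic region subtag
        else:
            result.append(subtag.lower())   # language, numeric region, variant
    return "-".join(result)


print(canonical_case("ZH_hant_hk"))
print(canonical_case("de_de_U_CO_PHONEBK"))
```

Note that this follows the recommendation in the text and therefore lowercases variant subtags, unlike the CLDR file names mentioned in the note above.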
3.3 BCP
47 Conformance
Unicode language and locale identifiers inherit the design and the
repertoire of subtags from [
BCP47
] Language
Tags. There are some extensions and restrictions made for the use of
the Unicode locale identifier in CLDR:
It does not allow for the full syntax of [
BCP47
]:
No irregular or BCP47 grandfathered tags are allowed
No extlang subtags are allowed
It allows for certain additions:
For field separator characters, the "_" character can be
used as well as the "-" used in [
BCP47
].
"root" to indicate the generic locale used as the parent
of all languages in the CLDR data model.
Defined semantics of certain private use codes, and some
"macrolanguage" codes.
3.3.1 BCP 47 Language Tag
Conversion
A Unicode language/locale identifier can be converted to a valid [
BCP 47
] language tag by performing the following
transformation.
Replace the "_" separators with "-"
Replace the special language identifier "root" with the BCP
47 primary language tag "und"
For example:
en_US → en-US
de_DE_u_co_phonebk → de-DE-u-co-phonebk
root → und
root_u_cu_usd → und-u-cu-usd
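The two replacement steps for converting a Unicode language/locale identifier to a BCP 47 language tag can be sketched in Python as below. The function name unicode_locale_id_to_bcp47 is hypothetical, and the sketch assumes the input is already a well-formed Unicode identifier; it does not change casing.

```python
def unicode_locale_id_to_bcp47(identifier: str) -> str:
    """Convert a Unicode language/locale identifier to a BCP 47 tag.

    Hypothetical helper implementing only the two steps in the text.
    """
    # Step 1: replace "_" separators with "-".
    subtags = identifier.replace("_", "-").split("-")
    # Step 2: replace the special language identifier "root" with "und".
    if subtags[0].lower() == "root":
        subtags[0] = "und"
    return "-".join(subtags)


for sample in ("en_US", "de_DE_u_co_phonebk", "root", "root_u_cu_usd"):
    print(sample, "->", unicode_locale_id_to_bcp47(sample))
```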
A valid [
BCP 47
] language tag can be converted
to a valid Unicode language/locale identifier by performing the
following transformation.
1. Canonicalize the language tag (afterwards, there will be no extlang subtag).
2. Replace the BCP 47 primary language subtag "und" with "root" if no script, region, or variant subtags are present.
3. If the BCP 47 primary language subtag matches the type attribute of a languageAlias element in Supplemental Data, replace the language subtag with the replacement value.
   3.1. If there are additional subtags in the replacement value, add them to the result, but only if there is no corresponding subtag already in the tag.
4. If the BCP 47 region subtag matches the type attribute of a territoryAlias element in Supplemental Data, replace the region subtag with the replacement value, as follows:
   4.1. If there is a single territory in the replacement, use it.
   4.2. If there are multiple territories:
        4.2.1. Look up the most likely territory for the base language code (and script, if there is one).
        4.2.2. If that likely territory is in the list, use it.
        4.2.3. Otherwise, use the first territory in the list.
Examples
Original | Result | Comments
en-US | en-US | no changes
und | root | changes, because no script, region, or variant subtag is present
und-US | und-US | no changes, because region subtag is present
und-u-cu-USD | root-u-cu-usd | changes, because no script, region, or variant subtag is present
cmn-TW | zh-TW | language alias
sr-CS | sr-RS | territory alias
sh | sr-Latn | multiple replacement subtags, 3.1 above
sh-Cyrl | sr-Cyrl | no replacement with multiple replacement subtags, 3.1 above
hy-SU | hy-AM | multiple territory values, 4.2 above: <territoryAlias type="SU" replacement="RU AM AZ BY EE GE KZ KG LV LT MD TJ TM UA UZ" …/>
Note:
In some rare cases, BCP 47 language tags cannot be
converted to valid Unicode language/locale identifiers, such as
certain [
BCP 47
] grandfathered tags.
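The conversion from a BCP 47 language tag to a Unicode language/locale identifier can be sketched in Python as follows. This is an illustration under stated assumptions, not a reference implementation: the input is assumed to be already canonicalized (the first step is not performed), and the three lookup tables are small hypothetical stand-ins built from the examples in this section. Real implementations would load the languageAlias, territoryAlias, and likely-subtags data from CLDR supplemental data.

```python
# Hypothetical stand-in tables derived from the examples in this section.
LANGUAGE_ALIASES = {"cmn": "zh", "sh": "sr-Latn"}
TERRITORY_ALIASES = {
    "CS": ["RS"],
    "SU": ["RU", "AM", "AZ", "BY", "EE", "GE", "KZ", "KG",
           "LV", "LT", "MD", "TJ", "TM", "UA", "UZ"],
}
LIKELY_TERRITORY = {("hy", None): "AM"}  # keyed by (language, script)


def bcp47_to_unicode_locale_id(tag: str) -> str:
    """Convert an already-canonicalized BCP 47 tag (hypothetical helper)."""
    subtags = tag.split("-")
    language = subtags[0].lower()
    script = region = None
    variants, rest = [], []
    for i, s in enumerate(subtags[1:], start=1):
        if len(s) == 1:  # extension singleton: keep the remainder as-is
            rest = [x.lower() for x in subtags[i:]]
            break
        if len(s) == 4 and s.isalpha() and not (script or region or variants):
            script = s.title()
        elif ((len(s) == 2 and s.isalpha()) or (len(s) == 3 and s.isdigit())) \
                and not (region or variants):
            region = s.upper()
        else:
            variants.append(s.lower())

    # "und" becomes "root" only without script, region, or variants.
    if language == "und" and not (script or region or variants):
        language = "root"

    # Language alias; extra subtags fill only empty slots.
    if language in LANGUAGE_ALIASES:
        replacement = LANGUAGE_ALIASES[language].split("-")
        language = replacement[0]
        for extra in replacement[1:]:
            if len(extra) == 4 and extra.isalpha():
                script = script or extra
            else:
                region = region or extra

    # Territory alias; prefer the likely territory when there are several.
    if region in TERRITORY_ALIASES:
        candidates = TERRITORY_ALIASES[region]
        likely = (LIKELY_TERRITORY.get((language, script))
                  or LIKELY_TERRITORY.get((language, None)))
        region = likely if likely in candidates else candidates[0]

    parts = [language] + [p for p in (script, region) if p] + variants + rest
    return "-".join(parts)


for sample in ("und", "und-u-cu-USD", "cmn-TW", "sh", "sh-Cyrl", "hy-SU"):
    print(sample, "->", bcp47_to_unicode_locale_id(sample))
```

With a single candidate territory the likely-territory lookup has no effect, so the single-territory and multiple-territory cases share one code path.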
3.4
Language Identifier Field Definitions
Unicode language and locale identifier field values are provided in
the following table. Note that some private-use BCP 47 field values
are given specific meanings in CLDR. While field values are based on
[BCP47] subtag values, their validity status in
CLDR is specified by means of machine-readable files in the
common/validity/
subdirectory, such as language.xml. For the format of those files and
more information, see
Section
3.11 Validity Data
Language Identifier
Field Definitions
Field
Valid values
unicode_language_subtag
(also known as a
Unicode base language code)
Subtags in the language.xml file (see
Section 3.11
Validity Data
). These are based on [
BCP47
] subtag values
marked as
Type: language
ISO 639-3 introduces the notion of
"macrolanguages", where certain ISO 639-1 or ISO 639-2
codes are given broad semantics, and additional codes are given
for the narrower semantics. For backwards compatibility, Unicode
language identifiers retain use of the narrower semantics for
these codes. For example:
For | Use | Not
Standard Chinese (Mandarin) | zh | cmn
Standard Arabic | ar | arb
Standard Malay | ms | zsm
Standard Swahili | sw | swh
Standard Uzbek | uz | uzn
Standard Konkani | kok | knn
If a language subtag matches the type attribute of a languageAlias
element, then the replacement value is used instead. For example,
because "swh" occurs as the type of a languageAlias element whose
replacement is "sw", "sw" must be used instead of "swh". Thus Unicode language
identifiers use "ar-EG" for Standard Arabic (Egypt), not
"arb-EG"; they use "zh-TW" for Mandarin
Chinese (Taiwan), not "cmn-TW".
The private use codes from
qfz..qtz
will never be given specific semantics in Unicode identifiers, and
are thus safe for use for other purposes by other applications.
The CLDR provides data for normalizing language/locale
codes, including mapping overlong codes like "eng-840"
or "eng-USA" to the correct code "en-US".
unicode_script_subtag
(also known as a
Unicode script code)
Subtags in the script.xml file (see
Section 3.11
Validity Data
). These are based on [
BCP47
] subtag values marked as
Type:
script
In most cases the script is not necessary, since the
language is only customarily written in a single script. Examples
of cases where it is used are:
az_Arab: Azerbaijani in Arabic script
az_Cyrl: Azerbaijani in Cyrillic script
az_Latn: Azerbaijani in Latin script
zh_Hans: Chinese, in simplified script (=zh, zh-Hans, zh-CN, zh-Hans-CN)
zh_Hant: Chinese, in traditional script
Unicode identifiers give specific semantics to the following Unicode Script values. For more information, see also [UAX24]:
Qaai | Inherited | deprecated: the canonicalized form is Zinh
Zinh | Inherited |
Zsye | Emoji Style | Prefer emoji style for characters that have both text and emoji styles available.
Zsym | Text Style | Prefer text style for characters that have both text and emoji styles available.
Zxxx | Unwritten | Indicates spoken or otherwise unwritten content. For example:
    Sample(s) | Description
    uz | either written or spoken content
    uz-Latn or uz-Arab | written-only content (particular script)
    uz-Zyyy | written-only content (unspecified script)
    uz-Zxxx | spoken-only content
    uz-Latn, uz-Zxxx | both specific written and spoken content (using a language list)
Zyyy | Common |
Zzzz | Unknown |
The private use subtags from Qaaq..Qabx will never be given
specific semantics in Unicode identifiers, and are thus safe for
use for other purposes by other applications.
unicode_region_subtag
(also known as a
Unicode region code,
or
a Unicode
territory code)
Subtags in the region.xml file (see
Section 3.11
Validity Data
). These are based on [
BCP47
] subtag values marked as
Type:
region
Unicode identifiers give specific semantics to the following
subtags:
Code | Name | Comment | ISO 3166-1 status
QO | Outlying Oceania | countries in Oceania [009] that do not have a subcontinent | private use
QU | European Union | deprecated: the canonicalized form is EU | private use
UK | United Kingdom | deprecated: the canonicalized form is GB | exceptionally reserved
XK | Kosovo | industry practice | private use
ZZ | Unknown or Invalid Territory | used in APIs or as replacement for invalid code | private use
The private use subtags from XA..XZ will normally never be
given specific semantics in Unicode identifiers, and are thus safe
for use for other purposes by other applications. However, LDML
may follow widespread industry practice in the use of some of
these codes, such as for XK.
The CLDR provides data for normalizing territory/region
codes, including mapping overlong codes like "eng-840"
or "eng-USA" to the correct code "en-US".
Special Codes:
The territory code 'UK' has a special status in ISO, and
is used for the domain name instead of GB. It is thus recognized
by CLDR as being an alternate (unnormalized) form of 'GB'.
The territory code '001' (the World) is used to indicate
a standardized form, such as "ar-001" for Modern
Standard Arabic.
unicode_variant_subtag
(also known as a
Unicode language variant code)
Subtags in the variant.xml file (see
Section 3.11
Validity Data
). These are based on [
BCP47
] subtag values
marked as
Type: variant
CLDR provides data for normalizing variant codes. For handling
of the "POSIX" variant, see
Section 3.8.2,
Legacy Variants
Examples:
en
fr_BE
zh-Hant-HK
Deprecated
codes—such as QU above—are valid, but strongly
discouraged.
A locale that only has a language subtag (and optionally a script
subtag) is called a
language locale
; one with both language
and territory subtag is called a
territory locale
(or
country
locale
).
3.5 Special Codes
3.5.1 Unknown or Invalid
Identifiers
The following identifiers are used to indicate an unknown or
invalid code in Unicode language and locale identifiers. For Unicode
identifiers, the region code uses a private use ISO 3166 code, and
Time Zone code uses an additional code; the others are defined by the
relevant standards. When these codes are used in APIs connected with
Unicode identifiers, the meaning is that either there was no
identifier available, or that at some point an input identifier value
was determined to be invalid or ill-formed.
Code Type | Value | Description in Referenced Standards
Language | und | Undetermined language
Script | Zzzz | Code for uncoded script, Unknown [UAX24]
Region | ZZ | Unknown or Invalid Territory
Currency | XXX | The codes assigned for transactions where no currency is involved
Time Zone | unk | Unknown or Invalid Time Zone
Subdivision | ZZZZ | Unknown or Invalid Subdivision
When only the script or region is known, then a locale ID will
use "und" as the language subtag portion. Thus the locale
tag "und_Grek" represents the Greek script;
"und_US" represents the US territory.
3.5.2 Numeric Codes
For region codes, ISO and the UN establish a mapping to
three-letter codes and numeric codes. However, this does not extend
to the private use codes, which are the codes 900-999 (total: 100),
and AAA-AAZ, QMA-QZZ, XAA-XZZ, and ZZA-ZZZ (total: 1092). Unicode identifiers
supply a standard mapping to these: for the numeric codes, it uses
the top of the numeric private use range; for the 3-letter codes it
doubles the final letter. These are the resulting mappings for all of
the private use region codes:
Region | UN/ISO Numeric | ISO 3-Letter
AA | 958 | AAA
QM..QZ | 959..972 | QMM..QZZ
XA..XZ | 973..998 | XAA..XZZ
ZZ | 999 | ZZZ
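The mapping for private use region codes can be computed rather than stored. The following Python sketch derives the numeric and 3-letter forms from the ranges listed above; the function name private_use_region_codes is hypothetical.

```python
from typing import Tuple


def private_use_region_codes(region: str) -> Tuple[int, str]:
    """Return (numeric code, 3-letter code) for a private use region code.

    Hypothetical helper based on the mapping ranges in this section.
    """
    region = region.upper()
    if len(region) != 2 or not region.isalpha():
        raise ValueError(f"not a 2-letter region code: {region!r}")
    # The 3-letter form doubles the final letter.
    alpha3 = region + region[-1]
    first, second = region
    if region == "AA":
        return 958, alpha3
    if region == "ZZ":
        return 999, alpha3
    if first == "Q" and "M" <= second <= "Z":
        return 959 + ord(second) - ord("M"), alpha3
    if first == "X":
        return 973 + ord(second) - ord("A"), alpha3
    raise ValueError(f"not a private use region code: {region!r}")


for code in ("AA", "QM", "QZ", "XA", "XZ", "ZZ"):
    print(code, private_use_region_codes(code))
```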
For script codes, ISO 15924 supplies a mapping (however, the
numeric codes are not in common use):
Script | Numeric
Qaaa..Qabx | 900..949
3.5.3
Private Use Codes
Private use codes fall into three groups.
defined:
those that are given particular
semantics currently in CLDR
reserved:
those that may be given
particular semantics in future versions of CLDR
excluded:
those that will never be given
particular CLDR semantics in the future, and thus can normally be
used by applications without worrying about collisions. However,
CLDR may follow widespread industry practice in the use of some of
these codes, such as for XK.
Private Use
Codes in CLDR
category | status | codes
base language | defined | none
base language | reserved | qaa..qfy
base language | excluded | qfz..qtz
script | defined | Qaai (obsolete)
script | reserved | Qaaa..Qaap
script | excluded | Qaaq..Qabx
region | defined | QO, QU, UK, XK, ZZ
region | reserved | AA, QM..QZ
region | excluded | XA..XJ, XL..XZ
timezone | defined | IANA: Etc/Unknown; bcp47: as listed in the bcp47 file
timezone | reserved | bcp47: all non-5 letter codes not starting with x
timezone | excluded | bcp47: all non-5 letter codes starting with x
See also
Section 3.5.1
Unknown or Invalid
Identifiers
3.6 Unicode BCP 47 U
Extension
[BCP47] Language Tags provides a mechanism for
extending language tags for use in various applications by extension
subtags. Each extension subtag is identified by a single alphanumeric
character subtag assigned by IANA.
The Unicode Consortium has registered and is the maintaining
authority for two BCP 47 language tag extensions: the extension 'u'
for Unicode locale extension [
RFC6067
] and
extension 't' for transformed content [
RFC6497
].
The Unicode BCP 47 extension data defines the complete list of valid
subtags.
These subtags are all in lowercase (that is the canonical casing for
these subtags); however, subtags are case-insensitive and casing does
not carry any specific meaning. All subtags within the Unicode
extensions are alphanumeric, two to eight characters in length, and
meet the rule extension in [BCP47].
The -u- Extension.
The syntax of 'u' extension
subtags is defined by the rule
unicode_locale_extensions
in
Section 3.2 Unicode
locale identifier
, except the separator of subtags
sep
must always be a hyphen '-' when the extension is used as a part of a BCP
47 language tag.
A 'u' extension may contain multiple attributes or keywords as defined in
Section 3.2 Unicode locale identifier. Although the order of attributes or
keywords does not matter, this specification defines the canonical form as
below:
All attributes are sorted in alphabetical order.
All keywords are sorted by alphabetical order of keys.
All keywords are in lowercase.
All keys and types use the canonical form (from the name
attribute; see
Section
3.6.4 U Extension Data Files
).
Type value "true" is removed.
For example, the canonical form of 'u' extension
"u-foo-bar-nu-thai-ca-buddhist-kk-true" is
"u-bar-foo-ca-buddhist-kk-nu-thai". The attributes "foo" and "bar" in
this example are provided only for illustration; no attribute subtags
are defined by the current CLDR specification.
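The ordering rules above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function name `canonicalize_u_extension` is an assumption, and the sketch does not replace aliases with canonical key and type names, because that step needs the bcp47 data files.

```python
def canonicalize_u_extension(ext: str) -> str:
    """Reorder a 'u' extension such as 'u-nu-thai-ca-buddhist' canonically."""
    subtags = ext.lower().split("-")
    if not subtags or subtags[0] != "u":
        raise ValueError(f"not a 'u' extension: {ext!r}")

    attributes = []   # subtags that appear before the first key
    keywords = {}     # key (two characters) -> list of type subtags
    current_key = None
    for subtag in subtags[1:]:
        if len(subtag) == 2:            # keys are exactly two characters
            current_key = subtag
            keywords.setdefault(current_key, [])
        elif current_key is None:       # no key seen yet: an attribute
            attributes.append(subtag)
        else:                           # part of the current key's type
            keywords[current_key].append(subtag)

    parts = ["u"] + sorted(attributes)
    for key in sorted(keywords):
        parts.append(key)
        if keywords[key] != ["true"]:   # the type value "true" is removed
            parts.extend(keywords[key])
    return "-".join(parts)


print(canonicalize_u_extension("u-foo-bar-nu-thai-ca-buddhist-kk-true"))
# u-bar-foo-ca-buddhist-kk-nu-thai
```

Type subtags keep their original order within a keyword, since only attributes and keys are sorted.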
See also
Unicode
Extensions for BCP 47
on the CLDR site.
3.6.1
Key And Type Definitions
The following chart contains a set of U extension key values
that are currently available, with a description or sampling of the U
extension type values. Each category is associated with an XML file
in the bcp47 directory.
For the complete list of valid keys and types defined for Unicode
locale extensions, see
Section
3.6.4 U Extension Data Files
. For information on the process for
adding a new
key
or
type
, see [
LocaleProject
].
Most type values are represented by a single subtag in the current
version of CLDR. There are exceptions, such as types used for key
"ca" (calendar) and "kr" (collation reordering). If the type is not
included, then the type value "true" is assumed. Note that the
default for a key with a possible "true" value is often
"false", but may not always be. Note also that
"true"/"True" is not a valid script code, since
the ISO
15924 Registration Authority has exceptionally reserved it
, which
means that it will not be assigned for any purpose.
The BCP47 form for keys and types is the canonical form, and
recommended. Other aliases are included for backwards compatibility.
Key/Type
Definitions
key
(old key name)
key description
example type
(old type name)
type description
Unicode
Calendar Identifier
defines a type of calendar. The valid values
are those
name
attribute values in the
type
elements of key name="ca" in bcp47/
calendar.xml
"ca"
(calendar)
Calendar algorithm
(For
information on the calendar algorithms associated with the data
used with these, see [
Calendars
].)
"buddhist"
Thai Buddhist calendar (same as Gregorian except for the
year)
"chinese"
Traditional Chinese calendar
"gregory"
(gregorian)
Gregorian calendar
"islamic"
Islamic calendar
"islamic-civil"
Islamic calendar, tabular (intercalary years
[2,5,7,10,13,16,18,21,24,26,29] - civil epoch)
"islamic-umalqura"
Islamic calendar, Umm al-Qura
Note:
Some calendar types are
represented by two subtags. In such cases, the first subtag
specifies a generic calendar type and the second subtag specifies
a calendar algorithm variant. The CLDR uses generic calendar types
(single subtag types) for tagging data when calendar algorithm
variations within a generic calendar type are irrelevant. For
example, type "islamic" is used for specifying Islamic calendar
formatting data for all Islamic calendar types, including
"islamic-civil" and "islamic-umalqura".
Unicode Currency Format
Identifier
defines a style for currency formatting. The valid
values are those
name
attribute values in the
type
elements of key name="cf" in bcp47/
currency.xml
"cf"
Currency Format style
"standard"
Negative numbers use the minusSign symbol (the default).
"account"
Negative numbers use parentheses or equivalent.
Unicode Collation Identifier
defines a type of collation (sort order). The valid values are
those
name
attribute values in the
type
elements
of bcp47/
collation.xml
For information on each collation
setting parameter, from
ka
to
vt
see
Setting
Options
"co"
(collation)
Collation type
"standard"
The default ordering for each language. For root it is
based on the [
DUCET
] (Default Unicode
Collation Element Table): see
Root Collation
. Each
other locale is based on that, except for appropriate modifications
to certain characters for that language.
"search"
A special collation type dedicated for string search—it is
not used to determine the relative order of two strings, but only
to determine whether they should be considered equivalent for the
specified strength, using the string search matching rules
appropriate for the language. Compared to the normal collator for
the language, this may add or remove primary equivalences, may make
additional characters ignorable or change secondary equivalences,
and may modify contractions to allow matching within them,
depending on the desired behavior. For example, in Czech, the
distinction between ‘a’ and ‘á’ is secondary for normal collation,
but primary for search; a search for ‘a’ should never match ‘á’ and
vice versa. A search collator is normally used with strength set to
PRIMARY or SECONDARY (should be SECONDARY if using “asymmetric”
search as described in the [
UCA
] section
Asymmetric Search). The search collator in root supplies matching
rules that are appropriate for most languages (and which are
different than the root collation behavior); language-specific
search collators may be provided to override the matching rules for
a given language as necessary.
Other keywords provide additional choices for certain locales;
they
only have effect in certain locales.
"phonetic"
Requests a phonetic variant if available, where text is
sorted based on pronunciation. It may interleave different scripts,
if multiple scripts are in common use.
"pinyin"
Pinyin ordering for Latin and for CJK characters; that is,
an ordering for CJK characters based on a character-by-character
transliteration into a pinyin. (used in Chinese)
"reformed"
Reformed collation (such as in Swedish)
"searchjl"
Special collation type for a modified string search in
which a pattern consisting of a sequence of Hangul initial
consonants (jamo lead consonants) will match a sequence of Hangul
syllable characters whose initial consonants match the pattern. The
jamo lead consonants can be represented using conjoining or
compatibility jamo. This search collator is best used at SECONDARY
strength with an "asymmetric" search as described in the [
UCA
] section
Asymmetric Search and obtained, for example, using ICU4C's usearch
facility with attribute USEARCH_ELEMENT_COMPARISON set to value
USEARCH_PATTERN_BASE_WEIGHT_IS_WILDCARD; this ensures that a full
Hangul syllable in the search pattern will only match the same
syllable in the searched text (instead of matching any syllable
with the same initial consonant), while a Hangul initial consonant
in the search pattern will match any Hangul syllable in the
searched text with the same initial consonant.
Unicode
Currency Identifier
defines a type of currency. The valid values
are those
name
attribute values in the
type
elements of key name="cu" in bcp47/
currency.xml
"cu"
(currency)
Currency type
ISO 4217 code,
plus others in common use
Codes consisting of 3 ASCII letters that are or have been valid in
ISO 4217, plus certain additional codes that are or have been in
common use. The list of countries and time periods associated with
each currency value is available in
Supplemental
Currency Data
, plus the default number of decimals.
The XXX code is given a broader interpretation as
Unknown
or Invalid Currency
Unicode
Emoji Presentation Style Identifier
specifies a request for
the preferred emoji presentation style. This can be used as part of
the value for an HTML lang attribute, for example

The valid values are those
name
attribute values
in the
type
elements of key name="em" in bcp47/
variant.xml
"em"
Emoji presentation style
"emoji"
Use an emoji presentation for emoji characters if possible.
"text"
Use a text presentation for emoji characters if possible.
"default"
Use the default presentation for emoji characters as specified in UTR #51 Section 4,
Presentation Style
Unicode
First Day Identifier
defines the preferred first day of the week
for calendar display. Specifying "fw" in a locale identifier
overrides the default value specified by supplemental week data
(see Part 4 Dates, section 4.3
Week
Data
). The valid values are those
name
attribute values
in the
type
elements of key name="fw" in bcp47/
calendar.xml
"fw"
First day of week
"sun"
Sunday
"mon"
Monday
"sat"
Saturday
Unicode Hour Cycle
Identifier
defines the preferred time cycle. Specifying "hc" in a
locale identifier overrides the default value specified by
supplemental time data (see Part 4 Dates, section 4.4
Time Data
). The valid values
are those
name
attribute values in the
type
elements of key name="hc" in bcp47/
calendar.xml
"hc"
Hour cycle
"h12"
Hour system using 1–12; corresponds to 'h' in patterns
"h23"
Hour system using 0–23; corresponds to 'H' in patterns
"h11"
Hour system using 0–11; corresponds to 'K' in patterns
"h24"
Hour system using 1–24; corresponds to 'k' in patterns
Unicode Line Break
Style Identifier
defines a preferred line break style
corresponding to the CSS level 3
line-break
option
. Specifying "lb" in a locale identifier overrides the
locale's default style (which may correspond to "normal" or
"strict"). The valid values are those
name
attribute
values in the
type
elements of key name="lb" in bcp47/
segmentation.xml
"lb"
Line break style
"strict"
CSS level 3 line-break=strict, e.g. treat CJ as NS
"normal"
CSS level 3 line-break=normal, e.g. treat CJ as ID, break
before hyphens for ja,zh
"loose"
CSS level 3 line-break=loose
Unicode Line Break Word
Identifier
defines preferred line break word handling behavior
corresponding to the CSS level 3
word-break
option
. The valid values are those
name
attribute values
in the
type
elements of key name="lw" in bcp47/
segmentation.xml
"lw"
Line break word handling
"normal"
CSS level 3 word-break=normal, normal script/language
behavior for midword breaks
"breakall"
CSS level 3 word-break=break-all, allow midword breaks
unless forbidden by lb setting
"keepall"
CSS level 3 word-break=keep-all, prohibit midword breaks
except for dictionary breaks
Unicode Measurement
System Identifier
defines a preferred measurement system.
Specifying "ms" in a locale identifier overrides the default value
specified by supplemental measurement system data (see Part 2
General, section 5
Measurement
System Data
). The valid values are those
name
attribute
values in the
type
elements of key name="ms" in bcp47/
measure.xml
"ms"
Measurement system
"metric"
Metric System
"ussystem"
US System of measurement: feet, pints, etc.; pints are 16oz
"uksystem"
UK System of measurement: feet, pints, etc.; pints are 20oz
Unicode Number System
Identifier
defines a type of number system. The valid values are
those
name
attribute values in the
type
elements
of bcp47/
number.xml
"nu"
(numbers)
Numbering system
Unicode script subtag
Four-letter types indicating the primary numbering system for the
corresponding script represented in Unicode. Unless otherwise
specified, it is a decimal numbering system using digits
[:GeneralCategory=Nd:]. For example, "latn" refers to
the ASCII / Western digits 0-9, while "taml" is an
algorithmic (non-decimal) numbering system. (The code "tamldec"
indicates the "modern Tamil decimal digits".)
For more information, see
Numbering Systems
"arabext"
Extended Arabic-Indic digits ("arab" means the base
Arabic-Indic digits)
"armnlow"
Armenian lowercase numerals
"roman"
Roman numerals
"romanlow"
Roman lowercase numerals
"tamldec"
Modern Tamil decimal digits
Region Override
specifies an alternate
region to use for obtaining certain region-specific default values
(those specified by the

element), instead of using the region specified by the
unicode_region_subtag
in the
Unicode Language Identifier (or inferred from the
unicode_language_subtag
).
"rg"
Region Override
"uszzzz"
The value is a
unicode_region_subtag
for a regular region (not a macroregion), suffixed by "ZZZZ" (case
is not significant). For example, “en-GB-u-rg-uszzzz” represents a
locale for British English but with region-specific defaults set to
US for items such as default currency, default calendar and week
data, default time cycle, and default measurement system and unit
preferences.
Unicode Subdivision
Identifier
defines a regional subdivision used for locales. The
valid values are based on the
subdivisionContainment
element as described in
Section
3.6.5 Subdivision Codes
"sd"
Regional Subdivision
"gbsct"
unicode_subdivision_id
, which is
unicode_region_subtag
concatenated
with a unicode_subdivision_suffix.
For example,
gbsct
is “gb”+“sct” (where sct
represents the subdivision code for Scotland). Thus
“en-GB-u-sd-gbsct” represents the language variant “English as used
in Scotland”. And both “en-u-sd-usca” and “en-US-u-sd-usca”
represent “English as used in California”. See
3.6.5
Subdivision Codes
Unicode
Sentence Break Suppressions Identifier
defines a set of data to
be used for suppressing certain sentence breaks that would
otherwise be found by UAX #29 rules. The valid values are those
name
attribute values in the
type
elements of key name="ss" in
bcp47/
segmentation.xml
"ss"
Sentence break suppressions
"none"
Don’t use sentence break suppressions data (the default).
"standard"
Use sentence break suppressions data of type "standard"
Unicode
Timezone Identifier
defines a timezone. The valid values are
those name attribute values in the
type
elements of
bcp47/
timezone.xml
"tz"
(timezone)
Time zone
Unicode short time zone IDs
Short identifiers defined in terms of a TZ time zone database [
Olson
] identifier in the
common/bcp47/timezone.xml file, plus a few extra values.
For more information, see
Section
3.7.1.2 Time Zone Identifiers
CLDR provides data for normalizing timezone codes.
Unicode
Variant Identifier
defines a special variant used for locales.
The valid values are those name attribute values in the
type
elements of bcp47/
variant.xml
"va"
Common variant type
"posix"
POSIX style locale variant. About handling of the "POSIX"
variant see
Section 3.8.2,
Legacy
Variants
For more information on the allowed keys and types, see the specific
elements below, and
Section
3.6.4 U Extension Data Files
Additional keys or types might be added in future versions.
Implementations of LDML should be robust to handle any syntactically
valid key or type values.
3.6.2
Numbering System Data
LDML supports multiple numbering systems. The identifiers for those
numbering systems are defined in the file
bcp47/number.xml
For example, for the 'trunk' version of the data see
bcp47/number.xml
Details about those numbering systems are defined in
supplemental/numberingSystems.xml
For example, for the 'trunk' version of the data see
supplemental/numberingSystems.xml
LDML makes certain stability guarantees on this data:
Like other BCP47 identifiers, once a numeric identifier is
added to
bcp47/number.xml
or
numberingSystems.xml
it will never be removed from either of those files.
If an identifier has type="numeric" in numberingSystems.xml,
then
It is a decimal, positional numbering system with an
attribute digits=X, where X is a string with the 10 digits in
order used by the numbering system.
The values of the type and digits will never change.
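Because a numeric numbering system is decimal and positional, its digits string is enough to render a non-negative integer. The following sketch assumes a hypothetical helper `format_with_digits`. The Thai digits are built from the code points U+0E50..U+0E59 rather than read from numberingSystems.xml.

```python
def format_with_digits(value: int, digits: str) -> str:
    """Render a non-negative integer with the 10 digits of a numeric system."""
    if len(digits) != 10:
        raise ValueError("a numeric numbering system has exactly 10 digits")
    if value < 0:
        raise ValueError("this sketch handles non-negative integers only")
    # Each ASCII digit of the decimal form indexes into the digits string.
    return "".join(digits[int(ch)] for ch in str(value))


# Assumption for illustration: Thai digits are U+0E50 (zero) to U+0E59 (nine).
THAI_DIGITS = "".join(chr(0x0E50 + i) for i in range(10))

print(format_with_digits(2016, THAI_DIGITS))
```

Algorithmic systems such as "roman" or "taml" cannot be handled this way, since they are not positional.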
3.6.3
Time Zone Identifiers
LDML inherits time zone IDs from the tz database [
Olson
].
Because these IDs from the tz database do not satisfy the BCP 47
language subtag syntax requirements, CLDR defines short identifiers
for the use in the Unicode locale extension. The short identifiers
are defined in the file
common/bcp47/timezone.xml
The short identifiers use UN/LOCODE [
LOCODE
] codes (excluding the space character) where possible. For example, the
short identifier for "America/Los_Angeles" is "uslax" (the LOCODE for
Los Angeles, US is "US LAX"). Identifiers of length not equal to 5
are used where there is no corresponding UN/LOCODE, such as
"usnavajo" for "America/Shiprock", or "utcw01" for "Etc/GMT+1", so
that they do not overlap with future UN/LOCODE.
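As a rough illustration of the naming convention described above, the sketch below derives a short identifier from a UN/LOCODE and tests whether a short identifier has the five-character LOCODE shape. Both function names are assumptions. The authoritative mapping is always the timezone.xml file, not this derivation.

```python
def short_id_from_locode(locode: str) -> str:
    """Drop the space from a UN/LOCODE and lowercase it ('US LAX' -> 'uslax')."""
    short = locode.replace(" ", "").lower()
    if len(short) != 5 or not (short.isascii() and short.isalnum()):
        raise ValueError(f"unexpected UN/LOCODE: {locode!r}")
    return short


def may_be_locode_based(short_id: str) -> bool:
    """Identifiers not based on UN/LOCODE never have exactly 5 characters."""
    return len(short_id) == 5


print(short_id_from_locode("US LAX"))      # uslax
print(may_be_locode_based("usnavajo"))     # False
```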
Although the first two letters of a short identifier may match
an ISO 3166 two-letter country code, a user should not assume that
the time zone belongs to the country. The first two letters in an
identifier of length not equal to 5 have no meaning. Also, the
identifiers are stabilized, meaning that they will not change no
matter what changes happen in the base standard. So if Hawaii leaves
the US and joins Canada as a new province, the short time zone
identifier "ushnl" would not change in CLDR even if the UN/LOCODE
changes to "cahnl" or something else.
There is a special code "unk" for an Unknown or Invalid time
zone. This can be expressed in the tz database style ID
"Etc/Unknown", although it is not defined in the tz database.
Stability of Time Zone Identifiers
Although the short time zone identifiers are guaranteed to be stable,
the preferred IDs in the tz database (such as those found in the
zone.tab
file) might change from time to time. For example, "Asia/Calcutta" was
replaced with "Asia/Kolkata" and moved to the
backward
file in the tz database. Because CLDR contains locale data using a time zone
ID from the tz database as the key, stability of the IDs is critical.
To maintain the stability of "long" IDs (for those inherited from the
tz database), a special rule is applied to the
alias
attribute in
the element for "tz": the first "long" ID is the CLDR
canonical "long" time zone ID.
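The "first alias is canonical" rule can be sketched as follows. The XML snippet is a hypothetical element written for this illustration (modeled on the short ID "inccu" discussed in this section), and `long_ids` is an assumed helper name.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample element; real data lives in common/bcp47/timezone.xml.
SAMPLE = '<type name="inccu" alias="Asia/Calcutta Asia/Kolkata"/>'


def long_ids(type_element):
    """Return (canonical long ID, other long IDs) for a "tz" type element."""
    aliases = type_element.get("alias", "").split()   # space-delimited names
    canonical = aliases[0] if aliases else None       # first one is canonical
    return canonical, aliases[1:]


element = ET.fromstring(SAMPLE)
canonical, others = long_ids(element)
print(element.get("name"), canonical, others)
```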
For example:

The above element defines the short time zone ID "inccu"
(for use in the Unicode locale extension), the corresponding
CLDR
canonical "long" ID
"Asia/Calcutta", and an alias "Asia/Kolkata".
3.6.4 U Extension
Data Files
The 'u' extension data is stored in multiple XML files located under
common/bcp47 directory in CLDR. Each file contains the locale
extension key/type values and their backward compatibility mappings
appropriate for a particular domain.
common/bcp47/collation.xml
contains key/type values for collation, including optional collation
parameters and valid type values for each key.
The 't' extension data is stored in
common/bcp47/transform.xml

NMTOKEN #IMPLIED>
#REQUIRED>
#IMPLIED>
"false">

| incremental | any) #IMPLIED >
CDATA #IMPLIED>

#REQUIRED>
#IMPLIED>
"false">

since CDATA #IMPLIED>

NMTOKEN #REQUIRED>
CDATA #IMPLIED>
| false ) "false">
NMTOKEN #IMPLIED>
#IMPLIED>
The extension attribute in element specifies the
BCP 47 language tag extension type. The default value of the
extension attribute is "u" (Unicode locale extension). The
element is only applicable to the enclosing .
In the Unicode locale extension 'u' and
't' data files, the common attributes for the ,
and elements are as follows:
name
The key or type name used by Unicode locale extension with
'u' extension syntax
or the 't' extensions syntax. When
alias
below is absent, this name can be also used with the old style
"@key=type" syntax
Most type names are
literal type names
, which
match exactly the same value. All of these have at least one
lowercase letter, such as "buddhist". There are a small
number of
indirect type names
, such as
"RG_KEY_VALUE". These have no lowercase letters. The
interpretation of each one is listed below.
3.6.4.1 CODEPOINTS
The type name
"CODEPOINTS"
is reserved for a
variable representing Unicode code point(s). The syntax is:
rule | EBNF | ABNF
codepoints | = codepoint (sep codepoint)* | = codepoint *(sep codepoint)
codepoint | = [0-9 A-F a-f]{4,6} | = 4*6HEXDIG
In addition, no codepoint may exceed 10FFFF. For example,
"00A0", "300b", "10D40C" and "00C1-00E1" are valid, but "A0",
"U060C" and "110000" are not.
In the current version of CLDR, the type "CODEPOINTS" is only
used for the deprecated locale extension key "vt" (variableTop).
The subtags forming the type for "vt" represent an arbitrary string
of characters. There is no formal limit in the number of
characters, although practically anything above 1 will be rare, and
anything longer than 4 might be useless. Repetition is allowed, for
example, 0061-0061 ("aa") is a valid type value for "vt", since the
sequence may be a collating element. Order is vital: 0061-0062
("ab") is different than 0062-0061 ("ba"). Note that for
variableTop any character sequence must be a contraction which
yields exactly one primary weight.
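The CODEPOINTS syntax and the 10FFFF limit can be checked with a short function. This is a sketch under two assumptions: the separator is a hyphen, as in a BCP 47 language tag, and the name `is_valid_codepoints` is illustrative. It checks syntax only and does not test whether the sequence is a contraction with one primary weight.

```python
import re

_CODEPOINT = re.compile(r"[0-9A-Fa-f]{4,6}")


def is_valid_codepoints(value: str) -> bool:
    """True if value is one or more 4-6 digit hex code points up to 10FFFF."""
    return all(
        _CODEPOINT.fullmatch(part) is not None and int(part, 16) <= 0x10FFFF
        for part in value.split("-")
    )


for sample in ("00A0", "300b", "10D40C", "00C1-00E1", "A0", "U060C", "110000"):
    print(sample, is_valid_codepoints(sample))
```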
For example,
en-u-vt-00A4
: this indicates English, with any
characters sorting at or below " ¤" (at a primary level)
considered Variable.
By default in UCA, variable characters are ignored in sorting at a
primary, secondary, and tertiary level. But in CLDR, they are not
ignorable by default. For more information, see
Collation: Section
3.3
Setting Options
3.6.4.2
REORDER_CODE
The type name
"REORDER_CODE"
is reserved for
reordering block names (e.g. "latn", "digit" and "others") defined
in the
Root
Collation
. The type "REORDER_CODE" is used for locale extension
key "kr" (colReorder). The value of type for "kr" is represented by
one or more reordering block names such as "latn-digit". For more
information, see
Collation:
Section 3.12
Collation Reordering
3.6.4.3
RG_KEY_VALUE
The type name
"RG_KEY_VALUE"
is reserved for
region codes in the format required by the "rg" key; this is a
region code from the idValidity data in common/validity/region.xml
(with certain exclusions, listed below) followed by "zzzz". The
excluded region codes are those with idStatus='unknown' and
'macroregion'; region codes with idStatus='deprecated' should not
be generated, and those with idStatus='private_use' are only to be
used with prior agreement. Thus the value for the "rg" key will
normally be a region code with idStatus='regular' followed by
"zzzz"; this set of values is the same as the subdivision codes
with idStatus='unknown' from the idValidity data in
common/validity/subdivision.xml.
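A minimal sketch of building and reading an "rg" value follows. The set `REGULAR_REGIONS` is a tiny hypothetical stand-in for the region codes with idStatus='regular' in common/validity/region.xml, and both function names are assumptions.

```python
# Hypothetical subset of regions with idStatus='regular'.
REGULAR_REGIONS = {"US", "GB", "FR"}


def rg_value(region: str) -> str:
    """Build the type value for the "rg" key: region code followed by "zzzz"."""
    region = region.upper()
    if region not in REGULAR_REGIONS:
        raise ValueError(f"not a regular region in this sample data: {region}")
    return region.lower() + "zzzz"


def region_from_rg_value(value: str) -> str:
    """Recover the region code from an "rg" value (case is not significant)."""
    value = value.lower()
    if not value.endswith("zzzz") or len(value) not in (6, 7):
        raise ValueError(f"not an rg value: {value!r}")
    return value[:-4].upper()


print(rg_value("us"))                    # uszzzz
print(region_from_rg_value("USZZZZ"))    # US
```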
3.6.4.4
SUBDIVISION_CODE
The type name
"SUBDIVISION_CODE"
is reserved for
subdivision codes in the format required by the "sd" key; this is a
subdivision code from the idValidity data in
common/validity/subdivision.xml, excluding those with
idStatus='unknown'. Codes with idStatus='deprecated' should not be
generated, and those with idStatus='private_use' are only to be
used with prior agreement.
3.6.4.5 PRIVATE_USE
The type name
"PRIVATE_USE"
is reserved for
private use types. A valid type value is composed of one or more
subtags separated by hyphens and each subtag consists of three to
eight ASCII alphanumeric characters. In the current version of
CLDR,
"PRIVATE_USE"
is only used for transform
extension "x0".
valueType
The valueType attribute indicates how many
subtags are valid for a given key:
single
Only a single type value is allowed. This is the default
if no valueType attribute is present.
incremental
Multiple type values are allowed, but only if a prefix
is also present, and the sequence is explicitly listed. Each
successive type value indicates a refinement of its prefix. For
example:
description="Calendar algorithm key"
valueType="incremental"
name="islamic" description="Islamic
calendar"/>
name="islamic-umalqura" description="Islamic
calendar, Umm al-Qura"/>
Thus
ca-islamic-umalqura
is valid. However,
ca-gregory-japanese
is not valid,
because "gregory-japanese" is not listed as a type.
multiple
Multiple type values are allowed, but each may only
occur once. For example:
description="Collation reorder codes"
valueType="multiple"

any
Any number of type values are allowed, with none of the
above restrictions. For example:
extension="t" name="x0"
description="Private
use transform type key."
valueType="any"
name="PRIVATE_USE" …/>
description
The description of the key, type or attribute element. There is
also some informative text about certain keys and types in the
Section 3.6.1
Key And Type
Definitions
deprecated
The deprecation status of the key, type or attribute element.
The value "true" indicates the element is deprecated and no longer
used in the version of CLDR. The default value is "false".
preferred
The preferred value of the deprecated key, type or attribute
element. When a key, type or attribute element is deprecated, this
attribute is used for specifying a new canonical form if available.
alias
(Not applicable to )
The BCP47 form is the canonical form, and recommended. Other
aliases are included only for backwards compatibility.
Example:
alias="phonebook"
description="Phonebook style ordering (such as in German)"/>
The preferred term, and the only one to be used in BCP47, is the
name: in this example, "phonebk".
The alias is a key or type name used by Unicode locale extensions
with the old
"@key=type"
syntax
. The attribute value for type may contain multiple names
delimited by ASCII space characters. Of those aliases, the first
name is the preferred value.
since
The version of CLDR in which this key or type was
introduced. Absence of this attribute value implies the key or type
was available in CLDR 1.7.2.
Note: There are no values defined for the locale extension
attribute in the current CLDR release.
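The four valueType settings described above can be sketched as a single check. This is an illustration only: `is_valid_type_value` and its parameters are assumed names, the listed types are passed in as a plain set instead of being read from the data files, and indirect type names such as "REORDER_CODE" are not resolved.

```python
import re

_SUBTAG = re.compile(r"[0-9a-z]{3,8}")   # each type subtag: 3-8 alphanumerics


def is_valid_type_value(subtags, listed_types, value_type="single"):
    """Check the subtags of one type value against a key's valueType."""
    if not subtags or not all(_SUBTAG.fullmatch(s) for s in subtags):
        return False
    if value_type == "single":
        return len(subtags) == 1 and subtags[0] in listed_types
    if value_type == "incremental":
        # Every prefix, and the full sequence, must be explicitly listed.
        return all("-".join(subtags[:n]) in listed_types
                   for n in range(1, len(subtags) + 1))
    if value_type == "multiple":
        # Each subtag must be listed and may occur only once.
        return (len(set(subtags)) == len(subtags)
                and all(s in listed_types for s in subtags))
    if value_type == "any":
        return True
    raise ValueError(f"unknown valueType: {value_type!r}")


CALENDARS = {"gregory", "japanese", "islamic", "islamic-umalqura"}
print(is_valid_type_value(["islamic", "umalqura"], CALENDARS, "incremental"))
print(is_valid_type_value(["gregory", "japanese"], CALENDARS, "incremental"))
```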
For example,

...

...

The data above indicates:
type "pinyin" is valid for key "co", thus "u-co-pinyin" is a
valid Unicode locale extension.
type "pinyin" is not valid for key "ka", thus "u-ka-pinyin"
is not a valid Unicode locale extension.
type "pinyin" has no
alias
, so "zh@collation=pinyin"
is a valid Unicode locale identifier according to the old syntax.
type "noignore" has an alias attribute, so
"en@colAlternate=noignore" is not a valid Unicode locale identifier
according to the old syntax.
type "aumel" is valid for key "tz", supported by CLDR 1.7.2
(default value) or later versions.
type "aumqi" is valid for key "tz", supported by CLDR 1.8.1
or later versions.
It is strongly recommended that all API methods accept all
possible aliases for keywords and types, but generate the canonical
form. For example, "ar-u-ca-islamicc" would be equivalent
to "ar-u-ca-islamic-civil" on input, but the latter should
be output. The one exception is where an alias would only be
well-formed with the old syntax, such as "gregorian" (for
"gregory").
3.6.5
Subdivision Codes
The subdivision codes designate a
subdivision of a country or region. They are called various names,
such as a
state
in the United States, or a
province
in Canada. The codes in CLDR
are based on ISO 3166-2 subdivision codes. The
ISO codes have a region code followed by a hyphen, then a suffix
consisting of 1..3 ASCII letters or digits.
The CLDR codes are designed to work in a
unicode_locale_id
(BCP47), and are
thus all lowercase, with no hyphen.
For example, the following are valid, and mean “English as used in
California, USA”.
en-u-sd-
usca
en-US-u-sd-
usca
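Converting an ISO 3166-2 style code to the CLDR form amounts to removing the hyphen and lowercasing. The sketch below assumes a helper named `iso_to_cldr_subdivision`. It only reshapes the string and does not check that the result is a valid subdivision code.

```python
def iso_to_cldr_subdivision(iso_code: str) -> str:
    """Turn an ISO 3166-2 style code ('US-CA') into the CLDR form ('usca')."""
    region, hyphen, suffix = iso_code.partition("-")
    if not hyphen or not region.isalnum():
        raise ValueError(f"expected REGION-SUFFIX, got {iso_code!r}")
    if not (1 <= len(suffix) <= 3 and suffix.isalnum()):
        raise ValueError(f"ISO suffix must be 1..3 letters or digits: {iso_code!r}")
    return (region + suffix).lower()


print("en-u-sd-" + iso_to_cldr_subdivision("US-CA"))   # en-u-sd-usca
```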
CLDR has additional subdivision codes. These
may start with a 3-digit region code or use a suffix of 4 ASCII
letters or digits, so they will not collide with the ISO codes.
Subdivision codes for unknown values are the region code plus
"zzzz", such as "uszzzz" for an unknown
subdivision of the US. Other codes may be added for stability.
Like BCP47, CLDR requires stable codes, which are not guaranteed for
ISO 3166-2 (nor have the ISO 3166-2
codes been stable in the past). If an ISO 3166-2 code is removed, it
remains valid (though marked as deprecated) in CLDR. If an ISO 3166-2
code is reused (for the same region), then CLDR will define a new
equivalent code using one of these 4-character suffixes.
3.6.5.1 Validity
unicode_subdivision_id
is only valid when it is present in the
subdivision.xml file as described in
Section 3.11
Validity Data
The data is in a compressed form, and thus needs to be expanded
before such a test is made.
Examples:
usca
is valid — there is an
id
element
… usca
…
ussct
is invalid — there is no
id
element
… ussct
…
unicode_subdivision_id
is only
valid in a
unicode_locale_id
if it starts with the
unicode_region_subtag
in the
unicode_language_id
(after adding
likely subtags, and comparing
case-insensitively).
Examples:
en-US-u-sd-usca is valid: the region "US" matches
en-CA-u-sd-gbsct is invalid: the region "gb" does not match "CA"
en-u-sd-gbsct is invalid: after adding likely subtags, this becomes
en-Latn-US-u-sd-gbsct, where the region "gb" does not match "US"
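The region-prefix rule can be sketched as below. The table `LIKELY_REGION` is a one-entry hypothetical stand-in for the likely subtags data, and `subdivision_matches_region` is an assumed name. The sketch does not test whether the subdivision code itself appears in the validity data.

```python
# Hypothetical excerpt of likely-subtags data: language -> likely region.
LIKELY_REGION = {"en": "us"}


def _is_region(subtag: str) -> bool:
    """A region subtag is two letters or three digits."""
    return ((len(subtag) == 2 and subtag.isalpha())
            or (len(subtag) == 3 and subtag.isdigit()))


def subdivision_matches_region(locale_id: str) -> bool:
    """True if the "sd" value starts with the (explicit or likely) region."""
    subtags = locale_id.lower().split("-")   # comparison is case-insensitive
    if "u" not in subtags:
        return False
    u_index = subtags.index("u")
    language_id, extension = subtags[:u_index], subtags[u_index + 1:]
    if not language_id or "sd" not in extension:
        return False
    sd_index = extension.index("sd")
    if sd_index + 1 >= len(extension):
        return False
    subdivision = extension[sd_index + 1]
    # Skip the language subtag itself, then look for an explicit region.
    region = next((s for s in language_id[1:] if _is_region(s)), None)
    if region is None:
        region = LIKELY_REGION.get(language_id[0])
    return region is not None and subdivision.startswith(region)


for tag in ("en-US-u-sd-usca", "en-CA-u-sd-gbsct", "en-u-sd-gbsct"):
    print(tag, subdivision_matches_region(tag))
```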
In version 28.0, the subdivisions in the
validity files used the ISO format, uppercase with a hyphen separating two
components, instead of the BCP47 format.
3.7 Unicode BCP 47 T Extension
The Unicode Consortium has registered and is the maintaining
authority for two BCP 47 language tag extensions: the extension 'u'
for Unicode locale extension [
RFC6067
] and
extension 't' for transformed content [
RFC6497
].
The Unicode BCP 47 extension data defines the complete list of valid
subtags.
The -t- Extension.
The syntax of 't' extension
subtags is defined by the rule
unicode_locale_extensions
in
Section 3.2
Unicode locale identifier
, except that the separator of subtags
sep
must always be a hyphen '-' when the extension is used as part of a BCP
47 language tag. For information about the registration process,
meaning, and usage of the 't' extension, see [
RFC6497
].
These subtags are all in lowercase (that is the canonical casing for
these subtags); however, subtags are case-insensitive and casing does
not carry any specific meaning. All subtags within the Unicode
extensions are alphanumeric, two to eight characters in length, and
meet the rule
extension
in [
BCP47
].
3.7.1 T Extension Data
Files
The overall structure of the data files is similar to that of the U
Extension, with the following exceptions.
In the transformed content 't' data file, the name attribute in
a element defines a valid field separator subtag. The
name attribute in an enclosed element defines a valid
field subtag for the field separator subtag. For example:
description="Transform extension mechanism">
description="United Nations Group of Experts on Geographical Names"
since="21"/>

The data above indicates:
"m0" is a valid field separator for the transformed content
extension 't'.
field subtag "ungegn" is valid for field separator "m0".
field subtag "ungegn" was introduced in CLDR 21.
The attributes are:
name
The name of the mechanism, limited to 3-8 characters (or sequences
of them). Any indirect type names are
listed in 3.6.4
Extension Data Files
description
A description of the name, with all and only that
information necessary to distinguish one name from
others with which it might be confused. Descriptions are not
intended to provide general background information.
since
Indicates the first version of CLDR where the name appears.
(Required for new items.)
alias
Alternative name, not limited in number of characters. Aliases are
intended for compatibility, not to provide all possible alternate
names or designations.
(Optional)
For information about the registration process, meaning, and usage of
the 't' extension, see [
RFC6497
].
3.8 Compatibility
with Older Identifiers
LDML versions before 1.7.2 used a slightly different syntax for
variant subtags and locale extensions. Implementations of LDML may
provide backward compatible identifier support as described in
the following sections.
3.8.1 Old Locale Extension
Syntax
The LDML 1.7 and older specifications used a different syntax for
representing Unicode locale extensions. The previous definition of
Unicode locale extensions had the following structure:
rule | EBNF | ABNF
old_unicode_locale_extensions | = "@" old_key "=" old_type (";" old_key "=" old_type)* | = "@" old_key "=" old_type *(";" old_key "=" old_type)
The new specification mandates keys to be two alphanumeric
characters and types to be three to eight alphanumeric characters. As
a result, new codes were assigned to all existing keys and some
types. For example, a new key "co" replaced the previous key
"collation", a new type "phonebk" replaced the previous type
"phonebook". However, the existing collation type "big5han" already
satisfied the new requirement, so no new type code was assigned to
the type. All new keys and types introduced after LDML 1.7 satisfy
the new requirement, so they do not have aliases dedicated for the
old syntax, except time zone types. The conversion between old types
and new types can be done regardless of key, with one known exception
(old type "traditional" is mapped to new type "trad" for collation
and "traditio" for numbering system), and this relationship will be
maintained in the future versions unless otherwise noted.
The new specification introduced a new field
attribute
in addition to key/type pairs in the Unicode locale extension. When
it is necessary to map a new Unicode locale identifier with
attribute
field to a well-formed old locale identifier, a special key name
attribute
with the value of the entire
attribute
subtags in the new identifier is used. For example, a new identifier
ja-u-xxx-yyy-ca-japanese
is mapped to an old identifier
ja@attribute=xxx-yyy;calendar=japanese
The chart below shows some example mappings between the new
syntax and the old syntax.
Locale Extension Mappings
Old (LDML 1.7 or older)                     New
de_DE@collation=phonebook                   de_DE_u_co_phonebk
zh_Hant_TW@collation=big5han                zh_Hant_TW_u_co_big5han
th_TH@calendar=gregorian;numbers=thai       th_TH_u_ca_gregory_nu_thai
en_US_POSIX@timezone=America/Los_Angeles    en_US_u_tz_uslax_va_posix
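As an illustration only, the mappings in the chart above can be sketched in Python. The alias tables below contain just the keys and types mentioned in this section; a real implementation would load the complete alias data from CLDR. The names old_to_new, KEY_ALIASES and TYPE_ALIASES are hypothetical, and the sketch handles neither the special "attribute" key nor the "traditional" exception described earlier.

```python
# Hypothetical, deliberately incomplete alias tables (only values named in this section).
KEY_ALIASES = {"collation": "co", "calendar": "ca", "numbers": "nu", "timezone": "tz"}
TYPE_ALIASES = {
    "phonebook": "phonebk",
    "gregorian": "gregory",
    "America/Los_Angeles": "uslax",
}


def old_to_new(old_id: str) -> str:
    """Convert an old-syntax locale id (lang_REGION@key=type;...) to the new syntax."""
    base, _, extension = old_id.partition("@")
    subtags = base.split("_")
    keywords = {}

    # The legacy POSIX variant becomes the keyword va=posix.
    if "POSIX" in subtags:
        subtags.remove("POSIX")
        keywords["va"] = "posix"

    if extension:
        for pair in extension.split(";"):
            old_key, _, old_type = pair.partition("=")
            new_key = KEY_ALIASES.get(old_key, old_key)
            keywords[new_key] = TYPE_ALIASES.get(old_type, old_type.lower())

    if not keywords:
        return "_".join(subtags)

    # Keywords are emitted in alphabetical order of key, as in the chart.
    parts = subtags + ["u"]
    for key in sorted(keywords):
        parts += [key, keywords[key]]
    return "_".join(parts)


print(old_to_new("th_TH@calendar=gregorian;numbers=thai"))
```

The output uses "_" as the separator only to match the chart; the same logic applies with "-".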
Where the old API is supplied the BCP 47 language code, or vice
versa, the recommendation is to:
Have all methods that take the old syntax also take the new
syntax, interpreted correctly. For example,
"zh-TW-u-co-pinyin" and "zh_TW@collation=pinyin"
would both be interpreted as meaning the same.
Have all methods (both for old and new syntax) accept all
possible aliases for keywords and types. For example,
"ar-u-ca-islamicc" would be equivalent to
"ar-u-ca-islamic-civil".
The one exception is where an alias would only be
well-formed with the old syntax, such as "gregorian"
(for "gregory").
Where an API cannot successfully accept the alternate
syntax, throw an exception (or otherwise indicate an error) so that
people can detect that they are using the wrong method (or wrong
input).
Provide a method that tests a purported locale ID string to
determine its status:
well-formed
- syntactically correct
valid
- well-formed and only uses
registered language subtags, extensions, keywords, types...
canonical
- valid and no deprecated codes
or structure.
3.8.2 Legacy
Variants
Older versions of the LDML specification allowed codes other than
registered [BCP47] variant subtags to be used in Unicode language
and locale identifiers for representing variations of locale data.
Unicode locale identifiers including such variant codes can be
converted to the new [BCP47] compatible
identifiers by following the descriptions below:
Legacy
Variant Mappings
Variant Code
Description
AALAND
Åland, variant of "sv" Swedish used in Finland. Use "sv_AX"
to indicate this.
BOKMAL
Bokmål, variant of "no" Norwegian. Use primary language
subtag "nb" to indicate this.
NYNORSK
Nynorsk, variant of "no" Norwegian. Use primary language
subtag "nn" to indicate this.
POSIX
POSIX variation of locale data. Use Unicode locale
extension "-u-va-posix" to indicate this.
POLYTONI
Polytonic, variant of "el" Greek. Use [BCP47]
variant subtag "polyton" to indicate this.
SAAHO
The Saaho variant of Afar. Use primary language subtag
"ssy" to indicate this.
When converting to old syntax, the Unicode locale extension
"-u-va-posix" should be converted to the "POSIX" variant,
not
to old extension syntax like "@va=posix". This is an exception: The
other mappings above should not be reversed.
Examples:
en_US_POSIX ↔ en-US-u-va-posix
en_US_POSIX@colNumeric=yes ↔ en-US-u-kn-va-posix
en-US-POSIX-u-kn-true → en-US-u-kn-va-posix
en-US-POSIX-u-kn-va-posix → en-US-u-kn-va-posix
3.8.3
Relation to OpenI18n
The locale id format generally follows the description in the
OpenI18N Locale Naming Guideline [NamingGuideline],
with some enhancements. The main differences from those
guidelines are that the locale id:
does not include a charset, since the data in LDML format always
provides a representation of all Unicode characters (the repository
is stored in UTF-8, although that can be transcoded to other
encodings as well),
adds the ability to have a variant, as in Java,
adds the ability to discriminate the written language by script (or
script variant),
is a superset of [BCP47] codes.
3.9 Transmitting Locale
Information
In a world of on-demand software components, with arbitrary
connections between those components, it is important to get a sense
of where localization should be done, and how to transmit enough
information so that it can be done at that appropriate place.
End-users need to get messages localized to their languages, messages
that not only contain a translation of text, but also contain
variables such as date, time, number formats, and currencies
formatted according to the users' conventions. The strategy for
doing the so-called
JIT localization
is made up of two parts:
Store and transmit neutral-format data wherever possible.
Neutral-format data is data that is kept in a standard
format, no matter what the local user's environment is.
Neutral-format is also (loosely) called binary data, even
though it actually could be represented in many different ways,
including a textual representation such as in XML.
Such data should use accepted standards where possible,
such as for currency codes.
Textual data should also be in a uniform character set
(Unicode/10646) to avoid possible data corruption problems when
converting between encodings.
Localize that data as "close" to the end-user as possible.
There are a number of advantages to this strategy. The longer
the data is kept in a neutral format, the more flexible the entire
system is. On a practical level, if transmitted data is
neutral-format, then it is much easier to manipulate the data, debug
the processing of the data, and maintain the software connections
between components.
Once data has been localized into a given language, it can be
quite difficult to programmatically convert that data into another
format, if required. This is especially true if the data contains a
mixture of translated text and formatted variables. Once information
has been localized into, say, Romanian, it is much more difficult to
localize that data into, say, French. Parsing is more difficult than
formatting, and may run up against different ambiguities in
interpreting text that has been localized, even if the original
translated message text is available (which it may not be).
Moreover, the closer we are to the end-user, the more we know about
that user's preferred formats. If we format dates, for example,
at the user's machine, then it can easily take into account any
customizations that the user has specified. If the formatting is done
elsewhere, either we have to transmit whatever user customizations
are in play, or we only transmit the user's locale code, which
may only approximate the desired format. Thus the closer the
localization is to the end user, the less we need to ship all of the
user's preferences around to all the places that localization
could possibly need to be done.
Even though localization should be done as close to the
end-user as possible, there will be cases where different components
need to be aware of whatever settings are appropriate for doing the
localization. Thus information such as a locale code or time zone
needs to be communicated between different components.
3.9.1 Message
Formatting and Exceptions
Windows (FormatMessage, String.Format), Java (MessageFormat) and
ICU (MessageFormat, umsg)
all provide methods of formatting variables (dates, times, etc.) and
inserting them at arbitrary positions in a string. This avoids the
manual string concatenation that causes severe problems for
localization. The question is, where to do this? It is especially
important since the original code site that originates a particular
message may be far down in the bowels of a component, and passed up
to the top of the component with an exception. So we will take that
case as representative of this class of issues.
There are circumstances where the message can be communicated
with a language-neutral code, such as a numeric error code or
mnemonic string key, that is understood outside of the component. If
there are arguments that need to accompany that message, such as a
number of files or a datetime, those need to accompany the numeric
code so that when the localization is finally done at some point, the full
information can be presented to the end-user. This is the best case
for localization.
More often, the exact messages that could originate from within
the component are not known outside of the component itself; or at
least they may not be known by the component that is finally
displaying text to the user. In such a case, the information as to
the user's locale needs to be communicated in some way to the
component that is doing the localization. That locale information
does not necessarily need to be communicated deep within the
component; ideally, any exceptions should bundle up some
language-neutral message ID, plus the arguments needed to format the
message (for example, datetime), but not do the localization at the
throw site. This approach has the advantages noted above for JIT
localization.
In addition, exceptions are often caught at a higher level;
they do not end up being displayed to any end-user at all. By
avoiding the localization at the throw site, we avoid the cost of doing
formatting when that formatting is not really necessary. In fact, in
many running programs most of the exceptions that are thrown at a low
level never end up being presented to an end-user, so this can have
considerable performance benefits.
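The approach described above can be sketched as follows. This is a minimal illustration, not a prescribed design: the exception class NeutralError, the message ID "files.missing", the CATALOG table and the use of str.format are all assumptions made for the example. A real system would use a locale-aware message formatter such as those named above.

```python
class NeutralError(Exception):
    """Exception carrying a language-neutral message ID plus raw arguments."""

    def __init__(self, message_id, **arguments):
        super().__init__(message_id)
        self.message_id = message_id
        self.arguments = arguments


# Hypothetical message catalog, keyed by locale and then by message ID.
CATALOG = {
    "en": {"files.missing": "{count} files could not be found."},
    "fr": {"files.missing": "{count} fichiers sont introuvables."},
}


def low_level_operation():
    # No localization happens at the throw site; only neutral data is bundled.
    raise NeutralError("files.missing", count=3)


def localize(error, locale):
    """Format the message only when it is actually shown to a user."""
    messages = CATALOG.get(locale, CATALOG["en"])
    return messages[error.message_id].format(**error.arguments)


try:
    low_level_operation()
except NeutralError as error:
    print(localize(error, "fr"))
```

If the exception is caught and handled at a higher level without ever being displayed, localize is never called and no formatting cost is incurred.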
3.10
Unicode Language and Locale IDs
People have very slippery notions of what distinguishes a
language code versus a locale code. The problem is that both are
somewhat nebulous concepts.
In practice, many people use [BCP47] codes to
mean locale codes instead of strictly language codes. It is easy to
see why this came about; because [BCP47]
includes an explicit region (territory) code, for most people it was
sufficient for use as a locale code as well. For example, when
typical web software receives a [BCP47] code,
it will use it as a locale code. Other typical software will do the
same: in practice, language codes and locale codes are treated
interchangeably. Some people recommend distinguishing on the basis of
"-" versus "_" (for example,
zh-TW
for
language code,
zh_TW
for locale code), but in practice that
does not work because of the free variation out in the world in the
use of these separators. Notice that Windows, for example, uses
"-" as a separator in its locale codes. So pragmatically
one is forced to treat "-" and "_" as equivalent
when interpreting either one on input.
Another reason for the conflation of these codes is that
very
little data in most systems is distinguished by region alone;
currency codes and measurement systems being some of the few.
Sometimes date or number formats are mentioned as regional, but that
really does not make much sense. If people see the sentence "You
will have to adjust the value to १,२३४.५६७ from ૭૧,૨૩૪.૫૬"
(using Indic digits), they would say that sentence is simply not
English. Number format is far more closely associated with language
than it is with region. The same is true for date formats: people
would never expect to see intermixed a date in the format
"2003年4月1日" (using Kanji) in text purporting to be purely
English. There are regional differences in date and number format —
differences which can be important — but those are different in kind
than other language differences between regions.
As far as we are concerned —
as a completely practical matter
— two languages are different if they require substantially different
localized resources. Distinctions according to spoken form are
important in some contexts, but the written form is by far and away
the most important issue for data interchange. Unfortunately, this is
not the principle used in [
ISO639
], which has
the fairly unproductive notion (for data interchange) that only
spoken language matters (it is also not completely consistent about
this, however).
[BCP47] can express a difference
if the use of written languages happens to correspond to region
boundaries expressed as [ISO3166] region
codes, and has recently added codes that allow it to express some
important cases that are not distinguished by [ISO3166]
codes. These written languages include simplified and traditional
Chinese (both used in Hong Kong S.A.R.); Serbian in Latin script;
Azerbaijani in Arabic script, and so on.
Notice also that
currency codes
are different than
currency
localizations
. The currency localizations should largely be in the
language-based resource bundles, not in the territory-based resource
bundles. Thus, the resource bundle
en
contains the localized
mappings in English for a range of different currency codes: USD →
US$, RUR → Rub, AUD → $A and so on. Of course, some currency symbols
are used for more than one currency, and in such cases
specializations appear in the territory-based bundles. Continuing the
example,
en_US
would have USD → $, while
en_AU
would
have AUD → $. (In protocols, the currency codes should always
accompany any currency amounts; otherwise the data is ambiguous, and
software is forced to use the user's territory to guess at the
currency. For some informal discussion of this, see
JIT
Localization
.)
3.10.1
Written Language
Criteria for what makes a written language should be purely
pragmatic;
what would copy-editors say?
If one gave them text
like the following, they would respond that is far from acceptable
English for publication, and ask for it to be redone:
A. "Theatre Center News: The date of the last
version of this document was 2003年3月20日. A copy can be obtained for
$50,0 or 1.234,57 грн. We would like to acknowledge contributions by
the following authors (in alphabetical order): Alaa Ghoneim, Behdad
Esfahbod, Ahmed Talaat, Eric Mader, Asmus Freytag, Avery Bishop, and
Doug Felt."
So one would change it to either B or C below, depending on
which orthographic variant of English was the target for the
publication:
B. "Theater Center News: The date of the last version of
this document was 3/20/2003. A copy can be obtained for $50.00 or
1,234.57 Ukrainian Hryvni. We would like to acknowledge
contributions by the following authors (in alphabetical order): Alaa
Ghoneim, Ahmed Talaat, Asmus Freytag, Avery Bishop, Behdad Esfahbod,
Doug Felt, Eric Mader."
C. "Theatre Centre News: The date of the last version of
this document was 20/3/2003. A copy can be obtained for $50.00 or
1,234.57 Ukrainian Hryvni. We would like to acknowledge
contributions by the following authors (in alphabetical order): Alaa
Ghoneim, Ahmed Talaat, Asmus Freytag, Avery Bishop, Behdad Esfahbod,
Doug Felt, Eric Mader."
Clearly there are many acceptable variations on this text. For
example, copy editors might still quibble with the use of first
versus last name sorting in the list, but clearly the first list was
not
acceptable English alphabetical order. And in quoting a
name, like "Theatre Centre News", one may leave it in the
source orthography even if it differs from the publication target
orthography. And so on. However, just as clearly, there are limits on
what is acceptable English, and "2003年3月20日", for example,
is not.
Note that the language of locale data may differ from the
language of localized software or web sites, when those latter are
not localized into the user's preferred language. In such cases,
the kind of incongruous juxtapositions described above may well
appear, but this situation is usually preferable to forcing
unfamiliar date or number formats on the user as well.
3.11 Validity Data


The directory
common/validity
contains machine-readable data for validating the language, region,
script, and variant subtags, as well as currency, subdivisions and
measure units. Each file contains a number of subtags with the
following
idStatus
values:
regular
— the standard codes used for the
specific type of subtag
special
— certain
exceptional language codes like 'mul'
(languages only)
unknown
— the code used to indicate the
"unknown", "undetermined" or "invalid"
values. For more information, see Section 3.5.1 Unknown or Invalid
Identifiers.
macroregion
— the standard codes that are
macroregions
(for regions only).
Note that some two-letter region codes are macroregions,
and (in the future) some three-digit codes may be regular codes.
For details as to which regions are contained within which
macroregions, see the

element
of the supplemental data.
deprecated
— codes that should not be used.
The

element in the supplementalMeta
file contains more information about these codes, and which codes
should be used instead.
private_use
— codes that, for CLDR, are
considered private use. Note that some BCP47 private-use codes have
defined CLDR semantics, and are considered regular codes. For more
information, see Section 3.5.3 Private Use Codes.
The list of subtags for each idStatus uses a compact format: a
space-delimited list of StringRanges, as defined in Section
5.3.4 String Range.
The separator for each StringRange is a "~".
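The compact format can be expanded with a few lines of code. The sketch below is illustrative and rests on an assumption about the StringRange semantics of Section 5.3.4, which are not restated here: the end string may be shorter than the start string, in which case the missing leading characters are taken from the start string, and each remaining character position ranges independently. The function names are hypothetical.

```python
from itertools import product


def expand_string_range(token, separator="~"):
    """Expand one StringRange such as 'ab~d' into ['ab', 'ac', 'ad']."""
    if separator not in token:
        return [token]
    start, end = token.split(separator, 1)
    if len(end) > len(start):
        raise ValueError(f"end of range is longer than start: {token!r}")

    # Characters of `start` not covered by `end` form a fixed prefix.
    prefix = start[: len(start) - len(end)]
    tail = start[len(prefix):]

    per_position = []
    for low, high in zip(tail, end):
        if ord(low) > ord(high):
            raise ValueError(f"descending range in {token!r}")
        per_position.append([chr(c) for c in range(ord(low), ord(high) + 1)])

    return [prefix + "".join(combo) for combo in product(*per_position)]


def expand_validity_list(text):
    """Expand a space-delimited list of StringRanges into individual subtags."""
    subtags = []
    for token in text.split():
        subtags.extend(expand_string_range(token))
    return subtags


print(expand_validity_list("aa~c fr"))
```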
Each measure unit is a sequence of subtags, such as
“angle-arc-minute”. The first subtag provides a general “category” of
the unit.
In version 28.0, the subdivisions in the
validity files used the ISO format, uppercase with a hyphen separating two
components, instead of the BCP47 format.
4 Locale
Inheritance and Matching
The XML format relies on an inheritance model, whereby the resources
are collected into
bundles
, and the bundles organized into a
tree. Data for the many Spanish locales does not need to be
duplicated across all of the countries having Spanish as a national
language. Instead, common data is collected in the Spanish language
locale, and territory locales only need to supply differences. The
parent of all of the language locales is a generic locale known as
root
Wherever possible, the resources in the root are language &
territory neutral. For example, the collation (sorting) order in the
root is based on the [
DUCET
] (see
Root Collation
). Since
English language collation has the same ordering as the root locale,
the 'en' locale data does not need to supply any collation
data, nor do the 'en_US', 'en_GB' or any of the
various other locales that use English.
Given a particular locale id "en_US_someVariant", the
search chain for a particular resource is the following.
en_US_someVariant
en_US
en
root
The inheritance is often not simple truncation, as will be
seen later in this section.
If a type and key are supplied in the locale id, then logically
the chain from that id to the root is searched for a resource tag
with a given type, all the way up to root. If no resource is found
with that tag and type, then the chain is searched again without the
type.
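The search described above can be sketched as follows. This is a simplified illustration that uses plain truncation only (the text notes that real inheritance is often not simple truncation), and the in-memory bundle layout, a dict keyed by (tag, type), is an assumption made for the example.

```python
def truncation_chain(locale_id):
    """Return the naive search chain, e.g. en_US_someVariant, en_US, en, root."""
    chain = []
    current = locale_id
    while current:
        chain.append(current)
        # Drop the last "_"-separated subtag; yields "" when none is left.
        current = current.rpartition("_")[0]
    chain.append("root")
    return chain


def lookup(bundles, locale_id, tag, type_=None):
    """Search the whole chain with the type first, then again without it."""
    chain = truncation_chain(locale_id)
    attempts = [type_, None] if type_ is not None else [None]
    for attempt in attempts:
        for locale in chain:
            value = bundles.get(locale, {}).get((tag, attempt))
            if value is not None:
                return value
    return None


# Hypothetical data: only "de" has a phonebook collation.
bundles = {
    "root": {("collation", None): "root-standard"},
    "de": {("collation", "phonebook"): "de-phonebook"},
}
print(truncation_chain("en_US_someVariant"))
print(lookup(bundles, "de_DE", "collation", "phonebook"))
```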
Thus the data for any given locale will only contain resources that
are different from the parent locale. For example, most territory
locales will inherit the bulk of their data from the language locale:
"en" will contain the bulk of the data: "en_IE"
will only contain a few items like currency. All data that is
inherited from a parent is presumed to be valid, just as valid as if
it were physically present in the file. This provides for much
smaller resource bundles, and much simpler (and less error-prone)
maintenance. At the script or region level, the "primary"
child locale will be empty, since its parent will contain all of the
appropriate resources for it. For more information see CLDR
Information: Section 9.3 Default Content.
Certain data items depend only on the region specified in a locale id
(by a
unicode_region_subtag
or
an “rg”
Region Override
key)
, and are obtained from supplemental data rather than through locale
resources. For example:
The currency for the specified region (see Supplemental
Currency Data)
The measurement system for the specified region (see Measurement
System Data)
The week conventions for the specified region (see Week Data)
(For more information on the specific
items handled this way, see
Territory-Based
Preferences
.)
These items will be correct for the specified region regardless of
whether a locale bundle actually exists with the same combination of
language and region as in the locale id. For example, suppose data is
requested for the locale id "fr_US" and there is no bundle for that
combination. Data obtained via locale inheritance, such as currency
patterns and currency symbols, will be obtained from the parent
locale "fr". However, currency amounts would be formatted by default
using US dollars, just displayed in the manner governed by the locale
"fr". When a locale id does not specify a region, the region-specific
items such as those above are obtained from the likely region for the
locale (obtained via
Likely Subtags
).
4.1 Lookup
If a language has more than one script in customary modern use,
then the CLDR file structure in common/main follows the following
model:
lang
lang_script
lang_script_region
lang_region
(aliases to lang_script_region)
4.1.1
Bundle vs Item Lookup
There are actually two different kinds of inheritance fallback:
resource bundle lookup
and
resource item lookup
. For the former, a
process is looking to find the first, best resource bundle it can;
for the latter, it is fallback within bundles on individual
items, like the translated name for the region "CN" in
Breton.
These are closely related, but distinct, processes. They are
illustrated in the table
Lookup
Differences
, where "key" stands for zero or more key/type
pairs. Logically speaking, when looking up an item for a given
locale, you first do a resource bundle lookup to find the best bundle
for the locale, then you do an inherited item lookup starting with
that resource bundle.
The table
Lookup Differences
uses
the naïve resource bundle lookup for illustration. More sophisticated
systems will get far better results for resource bundle lookup if
they use the algorithm described in
Section 4.4
Language Matching
. That algorithm takes
into account both the user’s desired locale(s) and the application’s
supported locales, in order to get the best match.
If the naïve resource bundle lookup is used, the desired locale needs
to be canonicalized using 4.3
Likely
Subtags
and the supplemental alias information, so that locales that
CLDR considers identical are treated as such. Thus eng-Latn-GB should
be mapped to en-GB, and cmn-TW mapped to zh-Hant-TW.
For the purposes of CLDR, everything with the dtd
is treated logically as if it is one resource bundle, even if the
implementation separates data into separate physical resource
bundles. For example, suppose that there is a main XML file for Nama
(naq), but there are no <units> elements for it because the
units are all inherited from root. If the <units> elements are
separated into a separate data tree for modularity in the
implementation, the Nama <units> resource bundle would be empty.
However, for purposes of resource-bundle lookup the resource bundle
lookup still stops at naq.xml.
Lookup
Differences
Lookup
Type
Example
Comments
Resource bundle
lookup
se-FI →
se →
default-locale* →
root
* The default-locale may have its own inheritance chain;
for example, it may be "en-GB → en". In that
case, the chain is expanded by inserting that chain, resulting
in:
se-FI →
se →
fi →
en-GB →
en →
root
Inherited item
lookup
se-FI+key →
se+key →
root_alias*+key
→ root+key
* If there is a root_alias to another key or locale, then
insert that entire chain. For example, suppose that months for
another calendar system have a root alias to Gregorian months.
In that case, the root alias would change the key, and retry
from se-FI downward. This can happen multiple times.
se-FI+key →
se+key →
root_alias*+key →
se-FI+key2 →
se+key2 →
root_alias*+key2 →
root+key2
Both the resource bundle inheritance and the inherited item
inheritance use the parentLocale data, where available, instead of
simple truncation.
The fallback is a bit different for these two cases; internal
aliases and keys are not involved in the bundle lookup, and the
default locale is not involved in the item lookup. If the
default-locale were used in the resource-item lookup, then strange
results would occur. For example, suppose that the default locale is
Swedish, and there is a Nama locale but no specific inherited item
for collation. If the default-locale were used in resource-item
lookup, it would produce odd and unexpected results for Nama sorting.
The default locale is not even always used in resource bundle
inheritance. For the following services, the fallback is always
directly to the root locale rather than through default locale.
collation
break iteration
case mapping
transliteration
The lookup for transliteration is yet more complicated
because of the interplay of source and target locales: see
Part
2 General, Section 10.1
Inheritance.
Thus if there is no Akan locale, for example, asking for a collation
for Akan should produce the root collation,
not the Swedish
collation.
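The difference between the two kinds of fallback can be sketched as follows. This is a naive illustration under stated assumptions: it uses simple truncation rather than parentLocale data or the Language Matching algorithm, and the function names and the use_default flag are hypothetical.

```python
def truncate(locale_id):
    """'se-FI' -> ['se-FI', 'se'] (simple truncation, no parentLocale data)."""
    parts = locale_id.split("-")
    return ["-".join(parts[:n]) for n in range(len(parts), 0, -1)]


def bundle_chain(requested, default_locale):
    """Resource bundle lookup: requested chain, then default-locale chain, then root."""
    chain = truncate(requested)
    chain += [loc for loc in truncate(default_locale) if loc not in chain]
    return chain + ["root"]


def item_chain(bundle_locale):
    """Inherited item lookup: the default locale is never consulted."""
    return truncate(bundle_locale) + ["root"]


def find_bundle(requested, default_locale, available, use_default=True):
    """Pick the first available bundle; services such as collation skip the default."""
    if use_default:
        chain = bundle_chain(requested, default_locale)
    else:
        chain = item_chain(requested)
    for locale in chain:
        if locale == "root" or locale in available:
            return locale


available = {"sv", "en"}
print(bundle_chain("se-FI", "en-GB"))
print(find_bundle("ak", "sv", available))                     # general resources
print(find_bundle("ak", "sv", available, use_default=False))  # e.g. collation
```

With use_default=False, a request for Akan falls directly to root rather than to the Swedish default locale, which is the behavior described above for collation, break iteration, case mapping and transliteration.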
The inherited item lookup must remain stable, because the
resources are built with a certain fallback in mind; changing the
core fallback order can render the bundle structure incoherent.
Resource bundle lookup, on the other hand, is more flexible; changes
in the view of the "best" match between the input request
and the output bundle are more tolerable when they represent overall
improvements for users. For more information, see
Section 8.1 Element fallback.
Where the LDML inheritance relationship does not match a target
system, such as POSIX, the data logically should be fully resolved in
converting to a format for use by that system, by adding
all
inherited data to each locale data set.
For a more complete description of how inheritance applies to data,
and the use of keywords, see
Section 4.2 Inheritance.
The locale data does not contain general character properties that
are derived from the
Unicode Character Database [UAX44]. Because that data
is common across locales, it is not duplicated in the bundles.
Constructing a POSIX locale from the CLDR data requires use of UCD
data. In addition, POSIX locales may also specify the character
encoding, which requires the data to be transformed into that target
encoding.
Warning:
If a locale has a different script than its parent
(for example, sr_Latn), then special attention must be paid to make
sure that all inheritance is covered. For example, auxiliary exemplar
characters may need to be empty ("[]") to block
inheritance.
Empty Override:
There is one special value reserved
in LDML to indicate that a child locale is to have no value for a
path, even if the parent locale has a value for that path. That value
is "∅∅∅". For example, if there is no phrase for "two
days ago" in a language, that can be indicated with:

∅∅∅
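A resolver has to treat this special value differently from a missing value: a missing value continues up the inheritance chain, while the empty override stops the search. The following is a minimal sketch; the bundle layout and the path string "relative-day[-2]" are hypothetical placeholders, not real LDML paths.

```python
NO_VALUE = "\u2205\u2205\u2205"  # the reserved value "∅∅∅"


def resolve(bundles, chain, path):
    """Walk the inheritance chain; an empty override blocks further inheritance."""
    for locale in chain:
        value = bundles.get(locale, {}).get(path)
        if value == NO_VALUE:
            return None  # explicitly no value, do not consult the parent
        if value is not None:
            return value
    return None


# Hypothetical data: the child blocks a phrase that its parent defines.
bundles = {
    "xx": {"relative-day[-2]": "two days ago"},
    "xx_YY": {"relative-day[-2]": NO_VALUE},
}
print(resolve(bundles, ["xx_YY", "xx", "root"], "relative-day[-2]"))
print(resolve(bundles, ["xx", "root"], "relative-day[-2]"))
```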
4.1.2 Lateral Inheritance
In clearly specified instances, resources may inherit from within the
same locale. For example, currency format symbols inherit from the
number format symbols; the Buddhist calendar inherits from the
Gregorian calendar. This
only
happens where documented in this
specification. In these special cases, the inheritance functions as
normal, up to the root. If the data is not found along that path,
then a second search is made, logically changing the
element/attribute to the alternate values.
For example, for the locale "en_US" the month data in
the Buddhist calendar inherits first from the Buddhist calendar in "en",
then in "root". If not found there, then it inherits from
the Gregorian calendar in "en_US", then "en", then "root".
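The two-pass search can be sketched as follows. This is illustrative only: the path strings and the bundle layout are hypothetical, and which paths have a lateral alternate is determined by this specification, not by the code.

```python
def lookup_with_lateral(bundles, chain, path, alternate_path):
    """Search the full chain for `path`; only then search it again for the alternate."""
    for candidate in (path, alternate_path):
        for locale in chain:
            value = bundles.get(locale, {}).get(candidate)
            if value is not None:
                return value
    return None


# Hypothetical data: no Buddhist month names anywhere, Gregorian ones in "en".
bundles = {
    "en": {"calendar/gregorian/months/1": "January"},
    "root": {"calendar/gregorian/months/1": "M01"},
}
chain = ["en_US", "en", "root"]
print(lookup_with_lateral(
    bundles, chain,
    "calendar/buddhist/months/1",
    "calendar/gregorian/months/1",
))
```

Note that the first pass runs all the way to root before the second pass starts again from the requested locale.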
There is one special case, for items with a "count"
parameter (used to select a plural form). In that case, the
inheritance works as follows:
If there is no value for a path, and that path has a
[@count="x"] attribute and value, then:
If "x" is anything but "other", it falls
back to [@count="other"], within that same locale.
In the special case of currencies, if the
[@count="other"] value is missing, it falls back to the
path that is completely missing the count item.
If there is no value within the same locale, the same
process is used in the parent locale, and so on.
Examples:
Count Fallback: normal
Locale   Path
fr-CA    //ldml/units/unitLength[@type="narrow"]/unit[@type="mass-gram"]/unitPattern[@count="x"]
fr-CA    //ldml/units/unitLength[@type="narrow"]/unit[@type="mass-gram"]/unitPattern[@count="other"]
fr       //ldml/units/unitLength[@type="narrow"]/unit[@type="mass-gram"]/unitPattern[@count="x"]
fr       //ldml/units/unitLength[@type="narrow"]/unit[@type="mass-gram"]/unitPattern[@count="other"]
root     //ldml/units/unitLength[@type="narrow"]/unit[@type="mass-gram"]/unitPattern[@count="x"]
root     //ldml/units/unitLength[@type="narrow"]/unit[@type="mass-gram"]/unitPattern[@count="other"]
Note that there may be an alias in root that changes the path
and starts again from the requested locale, such as:

<alias source="locale" path="../unitLength[@type='short']"/>

Count Fallback: currency
Locale   Path
fr-CA    //ldml/numbers/currencies/currency[@type="CAD"]/displayName[@count="x"]
fr-CA    //ldml/numbers/currencies/currency[@type="CAD"]/displayName[@count="other"]
fr-CA    //ldml/numbers/currencies/currency[@type="CAD"]/displayName
fr       //ldml/numbers/currencies/currency[@type="CAD"]/displayName[@count="x"]
fr       //ldml/numbers/currencies/currency[@type="CAD"]/displayName[@count="other"]
fr       //ldml/numbers/currencies/currency[@type="CAD"]/displayName
root     //ldml/numbers/currencies/currency[@type="CAD"]/displayName[@count="x"]
root     //ldml/numbers/currencies/currency[@type="CAD"]/displayName[@count="other"]
root     //ldml/numbers/currencies/currency[@type="CAD"]/displayName
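The order in which paths are tried in the two tables can be generated mechanically. The sketch below is illustrative: the function name and the use of None to stand for "path without a count attribute" are assumptions, and it does not model aliases in root that restart the search.

```python
def count_fallback(locale_chain, count, is_currency=False):
    """Return the (locale, count) pairs to try, in order.

    A count of None stands for the path that omits the count attribute.
    """
    steps = []
    for locale in locale_chain:
        steps.append((locale, count))
        if count != "other":
            steps.append((locale, "other"))
        if is_currency:
            # Currencies additionally fall back to the path with no count.
            steps.append((locale, None))
    return steps


chain = ["fr-CA", "fr", "root"]
for step in count_fallback(chain, "x", is_currency=True):
    print(step)
```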
4.1.3 Parent
Locales

<!ATTLIST parentLocale parent NMTOKEN #REQUIRED >

In some cases, the normal truncation inheritance does not
function well. This happens when:
The child locale is of a different script. In this case,
mixing elements from the parent into the child data results in a
mishmash.
A large number of child locales behave similarly, and
differently from the truncation parent.
The
parentLocale
element is used to
override the normal inheritance when accessing CLDR data.
For case 1, the children are script locales, and the parent is
"root". For example:

For case 2, the children and parent share the same primary
language, but the region is changed. For example:

Collation data, however, is an exception. Since collation rules
do not truly inherit data from the parent, the parentLocale element
is not necessary and not used for collation. Thus, for a locale like
zh_Hant in the example above, the parentLocale element would dictate
the parent as "root" when referring to main locale data,
but for collation data, the parent locale would still be
"zh", even though the parentLocale element is present for
that locale.
Since parentLocale information is not localizable on a per locale
basis, the parentLocale information is contained in CLDR’s
supplemental data.
When a
parentLocale
element is used to
override normal inheritance, the following invariants must always be
true:
If X is the parentLocale of Y, then either X is the root
locale, or X has the same base language code as Y. For example, the
parent of "en" cannot be "fr", and the parent of
"en_YY" cannot be "fr" or "fr_XX".
If X is the parentLocale of Y, Y must not be a base language
locale. For example, the parent of "en" cannot be
"en_XX".
There can never be cycles, such as: X parent of Y ... parent
of X.
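The override and its invariants can be sketched as follows. This is a minimal illustration: the mapping below is sample data (the zh_Hant entry follows the text above; the es_AR entry is an assumed example of case 2), and the function names are hypothetical. Real data comes from the parentLocale elements in CLDR's supplemental data.

```python
def base_language(locale):
    return locale.split("_")[0]


def parent_of(locale, parent_locales):
    """Explicit parentLocale data wins; otherwise fall back to truncation."""
    if locale == "root":
        return None
    if locale in parent_locales:
        return parent_locales[locale]
    head, separator, _ = locale.rpartition("_")
    return head if separator else "root"


def check_invariants(parent_locales):
    """Return a list of human-readable violations of the three invariants."""
    errors = []
    for child, parent in parent_locales.items():
        if parent != "root" and base_language(parent) != base_language(child):
            errors.append(f"{child}: parent {parent} has a different base language")
        if "_" not in child:
            errors.append(f"{child}: a base language locale must not be overridden")
    for child in parent_locales:
        seen, current = set(), child
        while current is not None:
            if current in seen:
                errors.append(f"{child}: inheritance cycle")
                break
            seen.add(current)
            current = parent_of(current, parent_locales)
    return errors


PARENT_LOCALES = {"zh_Hant": "root", "es_AR": "es_419"}  # sample data only
print(parent_of("zh_Hant", PARENT_LOCALES), parent_of("es_AR", PARENT_LOCALES))
print(check_invariants({"en": "en_XX"}))
```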
4.2
Inheritance and Validity
The following describes in more detail how to determine the
exact inheritance of elements, and the validity of a given element in
LDML.
4.2.1 Definitions
Blocking
elements are those whose subelements do not inherit
from parent locales. For example, a <collation> element is a
blocking element: everything in a <collation> element is
treated as a single lump of data, as far as inheritance is concerned.
For more information, see
Section 5.5 Valid Attribute Values.
Attributes that serve to distinguish multiple elements at the same
level are called
distinguishing
attributes. For example, the
type
attribute distinguishes different elements in lists of translations,
such as:
<language type="aa">Afar</language>
<language type="ab">Abkhazian</language>
Distinguishing attributes affect inheritance; two elements with
different distinguishing attributes are treated as different for
purposes of inheritance. For more information, see
Section 5.5 Valid Attribute
Values
. Other attributes are called nondistinguishing (or
informational) attributes. These carry separate information, and do
not affect inheritance.
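For any element in an XML file, an element chain is a resolved
[XPath] leading from the root to an element, with attributes on each
element in alphabetical order. So in, say, a Greek locale file we may
have:
<ldml>
  <identity>
    <version number="1.1"/>
    ...
  </identity>
  <localeDisplayNames>
    <languages>
      <language type="ar">Αραβικά</language>
      ...
    </languages>
  </localeDisplayNames>
</ldml>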
Which gives the following element chains (among others):
//ldml/identity/version[@number="1.1"]
//ldml/localeDisplayNames/languages/language[@type="ar"]
An element chain A is an extension of an element chain B if B
is equivalent to an initial portion of A. For example, #2 below is an
extension of #1. (Equivalent, depending on the tree, may not be
"identical to". See below for an example.)
1. //ldml/localeDisplayNames
2. //ldml/localeDisplayNames/languages/language[@type="ar"]
An LDML file can be thought of as an ordered list of element
pairs: <element chain, data>, where the element chains are all
the chains for the end-nodes. (This works because of restrictions on
the structure of LDML, including that it does not allow mixed
content.) The ordering is the ordering that the element chains are
found in the file, and thus determined by the DTD.
For example, some of those pairs would be the following. Notice
that the first has the null string as element contents.
<//ldml/identity/version[@number="1.1"], "">
<//ldml/localeDisplayNames/languages/language[@type="ar"], "Αραβικά">
Note: There are two exceptions to this:
1. Blocking nodes and their contents are treated as a single
   end node.
2. In terms of computing inheritance, the element pair
   consists of the element chain plus all distinguishing attributes;
   the value consists of the value (if any) plus any nondistinguishing
   attributes.
Thus instead of the element pair being (a) below, it is (b):
(a) <//ldml/dates/calendars/calendar[@type='gregorian']/week/weekendStart[@day='sun'][@time='00:00'], "">
(b) <//ldml/dates/calendars/calendar[@type='gregorian']/week/weekendStart, [@day='sun'][@time='00:00']>
Two LDML element chains are equivalent when they would be
identical if all attributes and their values were removed, except
for distinguishing attributes. Thus the following are equivalent:
//ldml/localeDisplayNames/languages/language[@type="ar"]
//ldml/localeDisplayNames/languages/language[@type="ar"][@draft="unconfirmed"]
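The definitions of equivalence and extension can be expressed as a comparison of normalized chains. The sketch below is illustrative only: the set of distinguishing attributes is an assumption made for the example (the real set is defined by the LDML DTD), and the function names are hypothetical.

```python
import re

# Assumption for illustration: treat only these attributes as distinguishing.
DISTINGUISHING = {"type", "key", "alt"}

# One path step: an element name followed by zero or more [@name="value"] parts.
STEP = re.compile(r"([^/\[\]]+)((?:\[@[^\]]+\])*)")
ATTR = re.compile(r'''\[@([^=\]]+)=["']([^"']*)["']\]''')


def normalize(chain):
    """Split a chain into steps, keeping only distinguishing attributes."""
    steps = []
    for name, attrs in STEP.findall(chain.lstrip("/")):
        kept = sorted((k, v) for k, v in ATTR.findall(attrs) if k in DISTINGUISHING)
        steps.append((name, tuple(kept)))
    return steps


def equivalent(a, b):
    """True if the chains match once nondistinguishing attributes are dropped."""
    return normalize(a) == normalize(b)


def is_extension(a, b):
    """True if b is equivalent to an initial portion of a."""
    na, nb = normalize(a), normalize(b)
    return na[:len(nb)] == nb


plain = '//ldml/localeDisplayNames/languages/language[@type="ar"]'
drafted = plain + '[@draft="unconfirmed"]'
print(equivalent(plain, drafted), is_extension(plain, "//ldml/localeDisplayNames"))
```

Because draft is not in the assumed distinguishing set, the two chains from the example above compare as equivalent.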
For any locale ID, a locale chain is an ordered list starting
with the root and leading down to the ID. For example, the locale
chain for de_CH is:
<root, de, de_CH>
4.2.2 Resolved Data File
To produce a fully resolved locale data file from CLDR for a
locale ID L, you start with L, and successively add unique items from
the parent locales until you get up to root. More formally, this can
be expressed as the following procedure.
1. Let Result be initially L.
2. For each Li in the locale chain for L, starting at L and
   going up to root:
   2.1. Let Temp be a copy of the pairs in the LDML file for Li.
   2.2. Replace each alias in Temp by the resolved list of pairs
        it points to.
        - The resolved list of pairs is obtained by recursively
          applying this procedure.
        - That alias now blocks any inheritance from the parent.
          (See Section 5.1 Common Elements for an example.)
   2.3. For each element pair P in Temp:
        - If P does not contain a blocking element, and Result
          does not have an element pair Q with an equivalent element
          chain, add P to Result.
Notes:
When adding an element pair to a result, it has to go in the
right order for it to be valid according to the DTD.
The identity element and its children are unaffected by
resolution.
The LDML data must be constructed so as to avoid circularity
in step 2.2.
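The core of the procedure, leaving out aliases, blocking elements, and DTD ordering, can be sketched as follows. This is a simplified illustration, not the full algorithm: it assumes each locale's data is already available as a mapping from normalized element chain to value, and that the locale chain is obtained by plain truncation. All names are hypothetical.

```python
def locale_chain(locale_id):
    """Root-first chain by truncation only (no parentLocale overrides)."""
    parts = locale_id.split("_")
    return ["root"] + ["_".join(parts[:n]) for n in range(1, len(parts) + 1)]


def resolve(locale_id, files):
    """Merge element pairs from L up to root; more specific locales win."""
    result = {}
    for loc in reversed(locale_chain(locale_id)):  # L first, root last
        for chain, value in files.get(loc, {}).items():
            # Add the pair only if no equivalent chain is already present.
            result.setdefault(chain, value)
    return result


files = {
    "root": {"//ldml/a": "root-a", "//ldml/b": "root-b"},
    "de": {"//ldml/a": "de-a"},
    "de_CH": {"//ldml/c": "ch-c"},
}
print(resolve("de_CH", files))
```

The resolved data for de_CH takes //ldml/c from de_CH itself, //ldml/a from de, and //ldml/b from root.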
4.2.3 Valid Data
The attribute draft="x" in LDML means that the data
has not been approved by the subcommittee. (For more information, see
Process.)
However, some data that is not explicitly marked as draft may
be implicitly draft, either because it inherits it from a
parent, or from an enclosing element.
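Example 2. Suppose that new locale data is added for af
(Afrikaans). To indicate that all of the data is unconfirmed,
the attribute can be added to the top level.
<ldml version="1.1" draft="unconfirmed">
  <identity>
    <version number="1.1" />
    <language type="af" />
  </identity>
  <characters>...</characters>
  <localeDisplayNames>...</localeDisplayNames>
</ldml>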
Any data can be added to that file, and the status will all be
draft="unconfirmed". Once an item is vetted (whether it is inherited
or explicitly in the file), its status can be changed to approved.
This can be done by leaving draft="unconfirmed" on the
enclosing element and marking the child with
draft="approved", such as:
<ldml version="1.1" draft="unconfirmed">
  <identity>
    <version number="1.1" />
    <language type="af" />
  </identity>
  <characters draft="approved">...</characters>
  <localeDisplayNames>...</localeDisplayNames>
</ldml>
However, normally the draft attributes should be canonicalized, which
means they are pushed down to leaf nodes as described in
Section 5.6 Canonical Form. If an LDML
file does have draft attributes that are not on leaf nodes, the file
should be interpreted as if it were the canonicalized version of that
file.
More formally, here is how to determine whether data for an
element chain E is implicitly or explicitly draft, given a locale L.
Sections 1, 2, and 4 are simply formalizations of what is in LDML
already. Item 3 adds the new element.
4.2.4 Checking for Draft Status
1. Parent Locale Inheritance
   1.1. Walk through the locale chain until you find a locale ID
        L' with a data file D. (L' may equal L.)
   1.2. Produce the fully resolved data file D' for D.
   1.3. In D', find the first element pair whose element chain
        E' is either equivalent to or an extension of E.
   1.4. If there is no such E', return true.
   1.5. If E' is not equivalent to E, truncate E' to the
        length of E.
2. Enclosing Element Inheritance
   2.1. Walk through the elements in E', from back to front.
        - If you ever encounter draft=x, return x.
   2.2. If L' = L, return false.
3. Missing File Inheritance
   3.1. Otherwise, walk again through the elements in E', from
        back to front.
        - If you encounter a validSubLocales attribute
          (deprecated):
          - If L is in the attribute value, return false.
          - Otherwise return true.
4. Otherwise
   4.1. Return true.
The validSubLocales in the most specific (farthest from root
file) locale file "wins" through the full resolution step
(data from more specific files replacing data from less specific
ones).
4.2.5 Keyword and Default Resolution
When accessing data based on keywords, the following process is
used. Consider the following example:
- The locale 'de' has collation types A, B, C, and no <default>
  element
- The locale 'de_CH' has <default type='B'>
Here are the searches for various combinations.
User Input         | Lookup in Locale | For                     | Comment
de_CH (no keyword) | de_CH            | default collation type  | finds "B"
                   | de_CH            | collation type=B        | not found
                   | de               | collation type=B        | found
de (no keyword)    | de               | default collation type  | not found
                   | root             | default collation type  | finds "standard"
                   | de               | collation type=standard | not found
                   | root             | collation type=standard | found
de_u_co_A          | de               | collation type=A        | found
de_u_co_standard   | de               | collation type=standard | not found
                   | root             | collation type=standard | found
de_u_co_foobar     | de               | collation type=foobar   | not found
                   | root             | collation type=foobar   | not found, starts looking for default
                   | de               | default collation type  | not found
                   | root             | default collation type  | finds "standard"
                   | de               | collation type=standard | not found
                   | root             | collation type=standard | found
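The lookups in the table above can be reproduced with a short sketch. It is a simplified model under stated assumptions: the data dictionary is a hypothetical stand-in for the collation files of 'root', 'de', and 'de_CH' as described in the example, parents are found by truncation, and the function names are illustrative.

```python
# Hypothetical data mirroring the example: 'de' has types A, B, C and no
# default; 'de_CH' only sets the default to B; root provides "standard".
DATA = {
    "root": {"default": "standard", "types": {"standard", "search"}},
    "de": {"default": None, "types": {"A", "B", "C"}},
    "de_CH": {"default": "B", "types": set()},
}


def lookup_chain(locale):
    """Most specific locale first, ending in root (truncation only)."""
    parts = locale.split("_")
    return ["_".join(parts[:n]) for n in range(len(parts), 0, -1)] + ["root"]


def find_default(locale):
    """First default collation type found while walking up to root."""
    for loc in lookup_chain(locale):
        default = DATA.get(loc, {}).get("default")
        if default:
            return default
    return None


def find_type(locale, collation_type):
    """Locale in the chain that actually provides the collation type."""
    for loc in lookup_chain(locale):
        if collation_type in DATA.get(loc, {}).get("types", ()):
            return loc
    return None


def resolve_collation(locale, requested=None):
    """Return (collation type, locale supplying it) for a request."""
    wanted = requested
    if wanted is None or find_type(locale, wanted) is None:
        wanted = find_default(locale)  # fall back to the default type
    return wanted, find_type(locale, wanted)


print(resolve_collation("de_CH"), resolve_collation("de", "foobar"))
```

An unknown type such as "foobar" falls through to the default search, which ends at the "standard" type in root, as in the last rows of the table.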
Examples of "search" collator lookup; 'de' has a
language-specific version, but 'en' does not:
User Input        | Lookup in Locale | For                   | Comment
de_CH_u_co_search | de_CH            | collation type=search | not found
                  | de               | collation type=search | found
en_US_u_co_search | en_US            | collation type=search | not found
                  | en               | collation type=search | not found
                  | root             | collation type=search | found
Examples of lookup for Chinese collation types. Note:
- All of the Chinese-specific collation types are provided in
  the 'zh' locale.
- For 'zh' the <default> element specifies
  "pinyin"; for 'zh_Hant' the <default> element
  specifies "stroke". However, any of the available Chinese
  collation types can be explicitly requested for any Chinese locale.
User Input             | Lookup in Locale | For                    | Comment
zh_Hant (no keyword)   | zh_Hant          | default collation type | finds "stroke"
                       | zh_Hant          | collation type=stroke  | not found
                       | zh               | collation type=stroke  | found
zh_Hant_HK_u_co_pinyin | zh_Hant_HK       | collation type=pinyin  | not found
                       | zh_Hant          | collation type=pinyin  | not found
                       | zh               | collation type=pinyin  | found
zh (no keyword)        | zh               | default collation type | finds "pinyin"
                       | zh               | collation type=pinyin  | found
Note: It is an invariant that the default in root for a given
element must always be a value that exists in root. So you
cannot have the following in root:
<someElements>
  <default type='a'/>
  <someElement type='b'>...</someElement>
  ...
</someElements>
For identifiers, such as language codes, script codes, region
codes, variant codes, types, keywords, currency symbols or currency
display names, the default value is the identifier itself whenever
no value is found in root. Thus if there is no display name for
the region code 'QA' in root, then the display name is simply
'QA'.
4.3 Likely Subtags

<!ATTLIST likelySubtag from NMTOKEN #REQUIRED>
<!ATTLIST likelySubtag to NMTOKEN #REQUIRED>
There are a number of situations where it is useful to be able
to find the most likely language, script, or region. For example,
given the language "zh" and the region "TW", what
is the most likely script? Given the script "Thai" what is
the most likely language or region? Given the region TW, what is the
most likely language and script?
Conversely, given a locale, it is useful to find out which
fields (language, script, or region) may be superfluous, in the sense
that they contain the likely tags. For example, "en_Latn"
can be simplified down to "en" since "Latn" is
the likely script for "en"; "ja_Jpan_JP" can be
simplified down to "ja".
The likelySubtag supplemental data provides default
information for computing these values. This data is based on the
default content data, the population data, and the
suppress-script data in [BCP47]. It is
heuristically derived, and may change over time.
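To look up data in the table, see if a locale matches one of the
'from' attribute values. If so, fetch the corresponding 'to' attribute
value. For example, the Chinese data looks like the following:
<likelySubtag from="zh" to="zh_Hans_CN"/>
<likelySubtag from="zh_HK" to="zh_Hant_HK"/>
<likelySubtag from="zh_Hani" to="zh_Hani_CN"/>
<likelySubtag from="zh_Hant" to="zh_Hant_TW"/>
<likelySubtag from="zh_MO" to="zh_Hant_MO"/>
<likelySubtag from="zh_TW" to="zh_Hant_TW"/>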
So looking up "zh_TW" returns "zh_Hant_TW",
while looking up "zh" returns "zh_Hans_CN".
In more detail, the data is designed to be used in the
following operations.
Note that as of CLDR v24, any field present in the 'from' field is
also present in the 'to' field, so an input field will not change in
the "Add Likely Subtags" operation. The data and operations can
also be used with language tags using [BCP47]
syntax, with the appropriate changes. In addition, certain common
'denormalized' language subtags such as 'iw' (for 'he') may occur in
both the 'from' and 'to' fields. This allows for implementations that
use those denormalized subtags to use the data with only minor
changes to the operations.
Add Likely Subtags: Given a source locale X, return a locale Y
where the empty subtags have been filled in by the most likely
subtags. This is written as X ⇒ Y ("X maximizes to Y").
A subtag is called empty if it is a missing script or region
subtag, or it is a base language subtag with the value
"und". In the description below, a subscript on a subtag
indicates which tag it is from: xₛ is in the source, xₘ is in a
match, and xᵣ is in the final result.
This operation is performed in the following way.
1. Canonicalize.
   - Make sure the input locale is in canonical form: uses the
     right separator, and has the right casing.
   - Replace any deprecated subtags with their canonical values using
     the <alias> data in supplemental metadata. Use the first value
     in the replacement list, if it exists. Language tag replacements
     may have multiple parts, such as "sh" ➞ "sr_Latn" or
     "mo" ➞ "ro_MD". In such a case, the original script and/or
     region are retained if there is one. Thus
     "sh_Arab_AQ" ➞ "sr_Arab_AQ", not "sr_Latn_AQ".
   - If the tag is grandfathered (see
     <variable id="$grandfathered" type="choice"> in the
     supplemental data), then return it.
   - Remove the script code 'Zzzz' and the region code
     'ZZ' if they occur.
   - Get the components of the cleaned-up source tag
     (languageₛ, scriptₛ, and regionₛ), plus any variants and
     extensions.
2. Lookup. Look up each of the following in order, and stop on the
   first match:
   - languageₛ_scriptₛ_regionₛ
   - languageₛ_regionₛ
   - languageₛ_scriptₛ
   - languageₛ
   - und_scriptₛ
3. Return.
   - If there is no match, either return
     - an error value, or
     - the match for "und" (in APIs where a valid
       language tag is required).
   - Otherwise there is a match = languageₘ_scriptₘ_regionₘ.
   - Let xᵣ = xₛ if xₛ is not empty, and xₘ otherwise.
   - Return the language tag composed of
     languageᵣ_scriptᵣ_regionᵣ + variants + extensions.
The lookup can be optimized. For example, if any of the tags in
Step 2 are the same as previous ones in that list, they do not need
to be tested.
Example 1:
- Input is ZH-ZZZZ-SG.
- Normalize to zh_SG.
- Lookup in table. No match.
- Lookup zh, and get the match (zh_Hans_CN). Substitute SG, and
  return zh_Hans_SG.
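The operation can be sketched in a few lines. This is a simplified illustration under several assumptions: the LIKELY table is a tiny hand-picked excerpt rather than the real supplemental data, canonicalization only covers separators, casing, and the 'Zzzz'/'ZZ' codes (deprecated and grandfathered tags are not handled), and the function names are hypothetical.

```python
# Hand-picked excerpt for illustration; real data comes from likelySubtags.
LIKELY = {
    "zh": "zh_Hans_CN",
    "zh_HK": "zh_Hant_HK",
    "zh_Hant": "zh_Hant_TW",
    "zh_TW": "zh_Hant_TW",
    "und_TW": "zh_Hant_TW",
}


def is_region(subtag):
    """Two letters or three digits."""
    return (len(subtag) == 2 and subtag.isalpha()) or (len(subtag) == 3 and subtag.isdigit())


def parse(tag):
    """Split into (language, script, region, rest); empty subtags are ''."""
    parts = tag.replace("-", "_").split("_")
    language = parts.pop(0).lower()
    script = region = ""
    if parts and len(parts[0]) == 4 and parts[0].isalpha():
        script = parts.pop(0).title()
    if parts and is_region(parts[0]):
        region = parts.pop(0).upper()
    # 'Zzzz' and 'ZZ' mean "unknown" and are treated as empty.
    return language, "" if script == "Zzzz" else script, "" if region == "ZZ" else region, parts


def join(*subtags):
    return "_".join(s for s in subtags if s)


def add_likely_subtags(tag):
    """Return the maximized tag, or None if the table has no match."""
    lang, script, region, rest = parse(tag)
    candidates = [join(lang, script, region), join(lang, region), join(lang, script), lang]
    if script:
        candidates.append(join("und", script))
    match = next((LIKELY[c] for c in candidates if c in LIKELY), None)
    if match is None:
        return None
    m_lang, m_script, m_region, _ = parse(match)
    # Keep each non-empty source subtag; fill the empty ones from the match.
    return join(m_lang if lang == "und" else lang, script or m_script, region or m_region, *rest)


print(add_likely_subtags("ZH-ZZZZ-SG"))
```

The usage line follows Example 1: the region SG from the source is kept, while the script comes from the match for "zh".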
To find the most likely language for a country, or language for
a script, use "und" as the language subtag. For example,
looking up "und_TW" returns zh_Hant_TW.
A goal of the algorithm is that if X ⇒ Y, and X' results from
replacing an empty subtag in X by the corresponding subtag in Y,
then X' ⇒ Y. For example, if und_AF ⇒ fa_Arab_AF, then:
fa_Arab_AF ⇒ fa_Arab_AF
und_Arab_AF ⇒ fa_Arab_AF
fa_AF ⇒ fa_Arab_AF
There are a small number of exceptions to this goal in the
current data, where X ∈ {und_Bopo, und_Brai, und_Cakm, und_Limb,
und_Shaw}.
Remove Likely Subtags: Given a locale, remove any fields that
Add Likely Subtags would add.
The reverse operation removes fields that would be added by the
first operation.
1. First get max = AddLikelySubtags(inputLocale). If an error is
   signaled, return it.
2. Remove the variants from max.
3. Then for trial in {language, language _ region, language _ script}:
   - If AddLikelySubtags(trial) = max, then return trial + variants.
4. If you do not get a match, return max + variants.
Example:
Input is zh_Hant. Maximize to get zh_Hant_TW.
zh => zh_Hans_CN. No match, so continue.
zh_TW => zh_Hant_TW. Matches, so return zh_TW.
A variant of this favors the script over the region, thus using
{language, language_script, language_region} in the above. If that
variant is used, then the result in this example would be zh_Hant
instead of zh_TW.
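The removal steps, including the script-favoring variant, can be sketched as follows. To keep the example self-contained, the maximization function is passed in as a parameter, and a small dictionary stands in for AddLikelySubtags; both the dictionary contents and the function names are assumptions made for illustration. Variants are assumed to follow the region subtag.

```python
def remove_likely_subtags(tag, maximize, favor_script=False):
    """Minimize a tag; `maximize` is a stand-in for AddLikelySubtags."""
    maxed = maximize(tag)
    if maxed is None:
        return None  # propagate the error value
    parts = maxed.split("_")
    language, script, region = parts[:3]
    variants = parts[3:]
    max_core = "_".join((language, script, region))
    trials = [language, f"{language}_{region}", f"{language}_{script}"]
    if favor_script:
        # Variant order: language, language_script, language_region.
        trials[1], trials[2] = trials[2], trials[1]
    for trial in trials:
        if maximize(trial) == max_core:
            return "_".join([trial, *variants])
    return "_".join([max_core, *variants])


# Stand-in table covering only the tags needed for the example.
MAXIMIZED = {
    "zh": "zh_Hans_CN",
    "zh_TW": "zh_Hant_TW",
    "zh_Hant": "zh_Hant_TW",
    "zh_Hant_TW": "zh_Hant_TW",
}

print(remove_likely_subtags("zh_Hant", MAXIMIZED.get))
print(remove_likely_subtags("zh_Hant", MAXIMIZED.get, favor_script=True))
```

With the default order the result follows the example above (zh_TW); with favor_script=True the script is kept instead (zh_Hant).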
4.4 Language Matching

<!ATTLIST languageMatch desired CDATA #REQUIRED >
<!ATTLIST languageMatch supported CDATA #REQUIRED >
<!ATTLIST languageMatch percent NMTOKEN #REQUIRED >
<!ATTLIST languageMatch oneway ( true | false ) #IMPLIED >
Implementers are often faced with the issue of how to match the
user's requested languages with their product's supported languages.
For example, suppose that a product supports {ja-JP, de, zh-TW}. If
the user understands written American English, German, French, Swiss
German, and Italian, then de would be the best match; if the user
understands only Chinese (zh), then zh-TW would be the best match.
The standard truncation-fallback algorithm does not work well
when faced with the complexities of natural language. The language
matching data is designed to fill that gap. Stated in those terms,
language matching can have the effect of a more complex fallback,
such as:
sr-Cyrl-RS
sr-Cyrl
sr-Latn-RS
sr-Latn
sr
hr-Latn
hr
Language matching is used to find the best supported locale ID
given a requested list of languages. The requested list could come
from different sources, such as the user's list of preferred
languages in the OS Settings, or from a browser Accept-Language list.
For example, if my native tongue is English, I can understand Swiss
German and German, my French is rusty but usable, and Italian basic,
ideally an implementation would allow me to select {gsw, de, fr} as
my preferred list of languages, skipping Italian because my
comprehension is not good enough for arbitrary content.
Language matching can also be used to get fallback data elements. In
many cases, there may not be full data for a particular locale. For
example, for a Breton speaker, the best fallback if data is
unavailable might be French. That is, suppose we have found a Breton
bundle, but it does not contain a translation for the key "CN"
(for the country China). It is best to return "Chine",
rather than falling back to the value in a default language such as
Russian and getting "Китай". The language matching data can be
used to get the closest fallback locales (of those supported) to a
given language.
When such fallback is used for inherited item lookup, the normal
order of inheritance is used, except that before using any data from
root, the data for the fallback locales would be used if available.
Language matching does not interact with the fallback of resources
within the locale-parent chain. For example, suppose that we are
looking for the value for a particular path P in nb-NO.
In the absence of aliases, normally the following lookup is used.
nb-NO → nb → root
That is, we first look in nb-NO. If there is no
value for P there, then we look in nb.
If there is no value for P there, we return the
value for P in root (or a code value, if there is
nothing there). Remember that if there is an alias element along this
path, then the lookup may restart with a different path in
nb-NO (or another locale).
However, suppose that nb-NO has the fallback values
[nn da sv en], derived from language matching. In
that case, an implementation may progressively look up each
of the listed locales, with the appropriate substitutions, returning
the first value that is not found in root. This
follows roughly the following pseudocode:
value = lookup(P, nb-NO); if (locationFound != root) return value;
value = lookup(P, nn-NO); if (locationFound != root) return value;
value = lookup(P, da-NO); if (locationFound != root) return value;
value = lookup(P, sv-NO); if (locationFound != root) return value;
value = lookup(P, en-NO); return value;
The locales in the fallback list are not used recursively. For
example, for the lookup of a path in nb-NO, if fr were a fallback
value for da, it would not matter for the above process. Only the
original language matters.
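The fallback lookup described above can be sketched in Python. This is an illustrative model only: locale data is represented as a nested dictionary, inheritance within a locale is by truncation, aliases are ignored, and both function names are hypothetical.

```python
def lookup(path, locale, data):
    """Return (value, location) using truncation inheritance down to root."""
    parts = locale.split("-")
    while parts:
        loc = "-".join(parts)
        if path in data.get(loc, {}):
            return data[loc][path], loc
        parts.pop()
    return data.get("root", {}).get(path), "root"


def lookup_with_fallback(path, locale, fallbacks, data):
    """Try the locale, then each fallback language with the same region."""
    language, _, rest = locale.partition("-")
    value = None
    for lang in [language, *fallbacks]:
        candidate = f"{lang}-{rest}" if rest else lang
        value, found_in = lookup(path, candidate, data)
        if found_in != "root":
            return value
    # Nothing better than root was found: return the root value.
    return value


data = {"root": {"P": "root value"}, "da": {"P": "dansk"}}
print(lookup_with_fallback("P", "nb-NO", ["nn", "da", "sv", "en"], data))
```

The fallback list is applied only to the original language; the fallbacks of "da" itself are never consulted, matching the non-recursive behavior described above.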
The language matching data is intended to be used according to
the following algorithm. This is a logical description, and can be
optimized for production in many ways. In this algorithm, the
languageMatching data is interpreted as an ordered list.
The language matching algorithm takes a list of a user’s
desired languages, and a list of the application’s supported
languages.
1. Set the best weighted distance BWD to ∞
2. Set the best desired language BD to null
3. For each desired language D
   3.1. Compute a discount factor F, based on the position in the
        list.
        - This discount factor is up to the implementation, but is
          typically a positive value that increases according to how
          far D is from the start of the desired language list.
   3.2. For each supported language S
        - Find the matching distance MD as described below.
        - Compute the weighted distance WD as F + MD
        - If WD < BWD
          - BWD = WD
          - BD = D
4. If the BWD is less than a threshold, return BD.
   - The threshold is implementation-defined, typically set to
     greater than a default region difference, and less than a default
     script difference.
5. Otherwise return a default supported language (like
   English).
To find the matching distance MD between any two languages,
perform the following steps.
1. Maximize each language using Section 4.3 Likely Subtags.
   - und is a special case: see below.
2. Set the match-distance MD to 0
3. For each subtag in the list, starting from the end: region,
   script, base-language
   3.1. If respective subtags in each language tag are identical,
        remove the subtag from each (logically) and continue.
   3.2. Traverse the languageMatching data until a match is found.
        - * matches any field.
        - If the oneway flag is false, then the match is
          symmetric.
   3.3. Add 100 minus the percent attribute value to MD.
   3.4. Remove the subtag from each (logically)
4. Return MD
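The distance computation can be sketched as follows. The sketch makes several assumptions: both inputs are already maximized and hyphen-separated, the RULES list is a small hypothetical excerpt (the wildcard defaults plus one language-specific rule), and each rule is a tuple of desired pattern, supported pattern, percent, and oneway flag.

```python
# Hypothetical excerpt of ordered languageMatch data:
# (desired, supported, percent, oneway)
RULES = [
    ("nn", "nb", 96, False),
    ("*", "*", 1, False),
    ("*-*", "*-*", 20, False),
    ("*-*-*", "*-*-*", 96, False),
]


def fields_match(pattern, fields):
    """True if the pattern has as many fields as the tag and each one fits."""
    wanted = pattern.split("-")
    return len(wanted) == len(fields) and all(w == "*" or w == f for w, f in zip(wanted, fields))


def matching_distance(desired, supported, rules=RULES):
    """Distance between two maximized tags such as 'nn-Latn-DE'."""
    d, s = desired.split("-"), supported.split("-")
    md = 0
    for level in (3, 2, 1):  # region, then script, then base language
        dl, sl = d[:level], s[:level]
        if dl[-1] == sl[-1]:
            continue  # identical subtags contribute no distance
        for want, have, percent, oneway in rules:
            forward = fields_match(want, dl) and fields_match(have, sl)
            backward = not oneway and fields_match(want, sl) and fields_match(have, dl)
            if forward or backward:
                md += 100 - percent
                break
    return md


print(matching_distance("nn-Latn-DE", "nb-Latn-FR"))
```

With these rules, the region difference and the nn/nb language difference each add 4, so the usage line yields a distance of 8.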
It is typically useful to set the discount factor between successive
elements of the desired languages list to be slightly greater than
the default region difference. That avoids the following problem:
Supported languages:
"de, fr, ja"
User's desired languages:
"de-AT, fr"
This user would expect to get "de", not "fr". In practice, when
a user selects a list of preferred languages, they don't include all
the regional variants ahead of their second base language. Yet while
the list of desired languages really doesn't tell us the priority
ranking among the user's languages, normally the fall-off between the
user's languages is substantially greater than that between regional
variants. But unless F is greater than the distance between de-AT and
de-DE, the user’s second-choice language would be returned.
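The effect of the discount factor can be seen in a small sketch of the outer matching loop. Several things here are assumptions for illustration: toy_distance is a crude stand-in for the matching distance (4 for a regional difference, 99 for a different language), the discount grows linearly with list position, and the sketch records the supported language of the best pair, since the stated purpose is to return a supported locale.

```python
import math


def toy_distance(desired, supported):
    """Crude stand-in for MD: 0 if equal, 4 for a region gap, 99 otherwise."""
    if desired.split("-")[0] != supported.split("-")[0]:
        return 99
    return 0 if desired == supported else 4


def best_match(desired, supported, distance, discount=5, threshold=50, default="en"):
    """Return the supported language with the smallest weighted distance."""
    best_wd, best = math.inf, None
    for position, d in enumerate(desired):
        f = position * discount  # later entries in the list are penalized
        for s in supported:
            wd = f + distance(d, s)
            if wd < best_wd:
                best_wd, best = wd, s
    return best if best_wd < threshold else default


supported = ["de", "fr", "ja"]
desired = ["de-AT", "fr"]
print(best_match(desired, supported, toy_distance, discount=5))
print(best_match(desired, supported, toy_distance, discount=3))
```

With a discount of 5 (greater than the regional distance of 4) the result is "de"; with a discount of 3 the exact match on the second-choice "fr" wins, which is the problem described above.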
The base language subtag "und" is a special case.
Suppose we have the following situation:
desired languages: {und, it}
supported languages: {en, it}
resulting language: en
Part of this is because 'und' has a special function in BCP47;
it stands in for 'no supplied base language'. Applying likely subtags
to it produces a concrete language (en_Latn_US in the CLDR data),
which then matches 'en' exactly. To prevent this from
happening, if the desired base language is und, the language matcher
should not apply likely subtags to it.
Examples:
Suppose that nn-DE and nb-FR are being compared.
They are first maximized to nn-Latn-DE and nb-Latn-FR, respectively.
The list is searched. The first match is with "*-*-*", for
a match of 96%. The languages are truncated to nn-Latn and nb-Latn,
then to nn and nb. The first match is also for a value of 96%, so the
distance is (100 − 96) + (100 − 96) = 8, and the result is 92%.
Note that language matching is orthogonal to how closely
two languages are related linguistically. For example, Breton is more
closely related to Welsh than to French, but French is the better
match (because it is more likely that a Breton reader will understand
French than Welsh). This also illustrates that the matches are often
asymmetric: it is not likely that a French reader will understand
Breton.
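The "*" acts as a wild card, as shown in the
following example:
<languageMatch desired="es-*-ES" supported="es-*-ES" percent="100"/>
<languageMatch desired="es-*-ES" supported="es-*-*" percent="93"/>
<languageMatch desired="*" supported="*" percent="1"/>
<languageMatch desired="*-*" supported="*-*" percent="20"/>
<languageMatch desired="*-*-*" supported="*-*-*" percent="96"/>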
When the language+region is not matched, and there is otherwise
no reason to pick among the supported regions for that language, then
some measure of geographic "closeness" can be used. The
results may be more understandable by users. Looking for en-SK, for
example, should fall back to something within Europe (e.g., en-GB) in
preference to something far away and unrelated (e.g., en-SG). Such a
closeness metric does not need to be exact; a small amount of data
can be used to give an approximate distance between any two regions.
However, any such data must be used carefully; although Hong Kong is
closer to India than to the UK, it is unlikely that en-IN would be a
better match to en-HK than en-GB would.
5 XML Format
There are two kinds of data that can be expressed in LDML:
language-dependent data and supplementary data. In either case, data
can be split across multiple files, which can be in multiple
directory trees.
For example, the language-dependent data for Japanese in CLDR
is present in the following files:
common/collation/ja.xml
common/main/ja.xml
common/rbnf/ja.xml
common/segmentations/ja.xml
Data for cased languages such as French are in files like:
common/casing/fr.xml
The status of the data is the same, whether or not data is
split. That is, for the purpose of validation and lookup, all of the
data for the above ja.xml files is treated as if it were in a single
file. These files have the <ldml> root element and use
ldml.dtd. The file name must match the identity element. For example,
the file pa_Arab_PK.xml must contain the following
elements: