Unicode Locale Data Markup Language (LDML)
Technical Reports
Unicode Technical Standard #35
Unicode Locale Data Markup Language (LDML)
Version
47
Editors
Mark Davis (markdavis@google.com) and other CLDR committee members
Date
2025-03-11
This Version
Previous Version
Latest Version
Corrigenda
Latest Proposed Update
Namespace
DTDs
Change History
Modifications
Summary
This document describes an XML format (vocabulary) for the exchange of structured locale data. This format is used in the Unicode Common Locale Data Repository.
Status
This document has been reviewed by Unicode members and other interested parties, and has been approved for publication by the Unicode Consortium.
This is a stable document and may be used as reference material or cited as a normative reference by other specifications.
A Unicode Technical Standard (UTS) is an independent specification. Conformance to the Unicode Standard does not imply conformance to any UTS.
Please submit corrigenda and other comments with the CLDR bug reporting form [Bugs].
Related information that is useful in understanding this document is found in the References.
For the latest version of the Unicode Standard see [Unicode].
For more information see About Unicode Technical Reports and the Specifications FAQ.
Unicode Technical Reports are governed by the Unicode
Parts
The LDML specification is divided into the following parts:
Part 1: Core (languages, locales, basic structure)
Part 2: General (display names & transforms, etc.)
Part 3: Numbers (number & currency formatting)
Part 4: Dates (date, time, time zone formatting)
Part 5: Collation (sorting, searching, grouping)
Part 6: Supplemental (supplemental data)
Part 7: Keyboards (keyboard mappings)
Part 8: Person Names (person names)
Part 9: MessageFormat (message format)
Contents of Part 1, Core
Introduction
Conformance
Unicode Locale Identifiers
Unicode Locale Inheritance and Matching
Units of Measurement
Number Formatting
Date Formatting
Collation
Grammar
Miscellaneous
Customization
Omitting data
Adding data
Overriding data
Testing
EBNF
What is a Locale?
Unicode Language and Locale Identifiers
Unicode Language Identifier
Unicode Locale Identifier
Canonical Unicode Locale Identifiers
BCP 47 Conformance
BCP 47 Language Tag Conversion
Table:
BCP 47 Language Tag to Unicode BCP 47 Locale Identifier
Examples
Unicode Locale Identifier: CLDR to BCP 47
Unicode Locale Identifier: BCP 47 to CLDR
Truncation
Language Identifier Field Definitions
unicode_language_subtag
(also known as a
Unicode base language code
unicode_script_subtag
(also known as a
Unicode script code
unicode_region_subtag
(also known as a
Unicode region code,
or a
Unicode territory code
unicode_variant_subtag
(also known as a
Unicode language variant code
Special Codes
Unknown or Invalid Identifiers
Numeric Codes
Private Use Codes
Table:
Private Use Codes in CLDR
Special Script Codes
Unicode BCP 47 U Extension
Key And Type Definitions
Table:
Key/Type Definitions
Numbering System Data
Time Zone Identifiers
U Extension Data Files
Subdivision Codes
Validity
Unicode BCP 47 T Extension
T Extension Data Files
Compatibility with Older Identifiers
Old Locale Extension Syntax
Table:
Locale Extension Mappings
Legacy Variants
Table:
Legacy Variant Mappings
Relation to OpenI18n
Transmitting Locale Information
Message Formatting and Exceptions
Unicode Language and Locale IDs
Written Language
Hybrid Locale Identifiers
Validity Data
Locale Inheritance and Matching
Lookup
Bundle vs Item Lookup
Table:
Lookup Differences
Lateral Inheritance
Table:
Count Fallback: normal
Table:
Count Fallback: currency
Inheritance Marker
Parent Locales
Region-Priority Inheritance
Inheritance and Validity
Definitions
Resolved Data File
Valid Data
Checking for Draft Status
Keyword and Default Resolution
Inheritance vs Related Information
Likely Subtags
Language Matching
Enhanced Language Matching
XML Format
Common Elements
Element special
Sample Special Elements
Element alias
Table:
Inheritance with
source="locale"
Element displayName
Escaping Characters
Common Attributes
Attribute type
Attribute draft
Attribute alt
Attribute references
Common Structures
Date and Date Ranges
Text Directionality
Unicode Sets
UnicodeSet syntax
Syntax Special Case Examples
Lists of Code Points
Backslash Escapes
Unicode Properties
Boolean Operations
Variables in UnicodeSets
UnicodeSet Examples
String Range
Identity Elements
Valid Attribute Values
Canonical Form
Content
Ordering
Comments
DTD Annotations
Property Data
Script Metadata
Extended Pictographic
Labels.txt
Segmentation Tests
Issues in Formatting and Parsing
Lenient Parsing
Motivation
Loose Matching
Handling Invalid Patterns
Data Size Reduction
Vertical Slicing
Horizontal Slicing
Annex A Deprecated Structure
A.1 Element fallback
A.2 BCP 47 Keyword Mapping
A.3 Choice Patterns
A.4 Element default
A.5 Deprecated Common Attributes
A.5.1 Attribute standard
A.5.2 Attribute draft in non-leaf elements
A.6 Element base
A.7 Element rules
A.8 Deprecated subelements of

A.9 Deprecated subelements of

A.10 Deprecated subelements of

A.11 Deprecated subelements of

and

A.12 Renamed attribute values for

element
A.13 Deprecated subelements of

A.14 Element cp
A.15 Attribute validSubLocales
A.16 Elements postalCodeData, postCodeRegex
A.17 Element telephoneCodeData
Annex B Links to Other Parts
Table:
Part 2 Links
General
(display names & transforms, etc.)
Table:
Part 3 Links
Numbers
(number & currency formatting)
Table:
Part 4 Links
Dates
(date, time, time zone formatting)
Table:
Part 5 Links
Collation
(sorting, searching, grouping)
Table:
Part 6 Links
Supplemental
(supplemental data)
Table:
Part 7 Links
Keyboards
(keyboard mappings)
Annex C. LocaleId Canonicalization
LocaleId Definitions
1. Multimap interpretation
2. Alias elements
Matches
4. Replacement
Territory Exception
5. Canonicalizing Syntax
Preprocessing
Processing LanguageIds
Processing LocaleIds
Optimizations
References
Acknowledgments
Modifications
Locale identifiers
Number symbols and formats without numberSystem
Clarified
currencyData
element ordering
Semantic Datetime Skeletons
Timezones
Unit Identifiers
DTD Annotations
Documented Inheritance Marker
Improvements to Keyboard Transforms
MessageFormat
Introduction
Not long ago, computer systems were like separate worlds, isolated from one another. The internet and related events have changed all that. A single system can be built of many different components, hardware and software, all needing to work together. Many different technologies have been important in bridging the gaps; in the internationalization arena, Unicode has provided a lingua franca for communicating textual data. However, there remain differences in the locale data used by different systems.
The best practice for internationalization is to store and communicate language-neutral data, and format that data for the client. This formatting can take place on any of a number of the components in a system; a server might format data based on the user's locale, or it could be that a client machine does the formatting. The same goes for parsing data, and locale-sensitive analysis of data.
But there remain significant differences across systems and applications in the locale-sensitive data used for such formatting, parsing, and analysis. Many of those differences are simply gratuitous; all within acceptable limits for human beings, but yielding different results. In many other cases there are outright errors. Whatever the cause, the differences can cause discrepancies to creep into a heterogeneous system. This is especially serious in the case of collation (sort-order), where different collation causes not only ordering differences, but also different results of queries! That is, with a query of customers with names between "Abbot, Cosmo" and "Arnold, James", if different systems have different sort orders, different lists will be returned. (For comparisons across systems formatted as HTML tables, see [Comparisons].)
Note: There are many different equally valid ways in which data can be judged to be "correct" for a particular locale. The goal for the common locale data is to make it as consistent as possible with existing locale data, and acceptable to users in that locale.
This document specifies an XML format for the communication of locale data: the Unicode Locale Data Markup Language (LDML). This provides a common format for systems to interchange locale data so that they can get the same results in the services provided by internationalization libraries. It also provides a standard format that can allow users to customize the behavior of a system. With it, for example, collation (sorting) rules can be exchanged, allowing two implementations to exchange a specification of tailored collation rules. Using the same specification, the two implementations will achieve the same results in comparing strings. Unicode LDML can also be used to let a user encapsulate specialized sorting behavior for a specific domain, or create a customized locale for a minority language. Unicode LDML is also used in the Unicode Common Locale Data Repository (CLDR). CLDR uses an open process for reconciling differences between the locale data used on different systems and validating the data, to produce a useful, common, consistent base of locale data.
For more information, see the Common Locale Data Repository project page [LocaleProject].
As LDML is an interchange format, it was designed for ease of maintenance and simplicity of transformation into other formats, above efficiency of run-time lookup and use. Implementations should consider converting LDML data into a more compact format prior to use.
Conformance
There are many ways to use the Unicode LDML specification and the CLDR data.
The Unicode Consortium does not restrict the ways in which the format or data are used.
However, an implementation may also claim conformance to the LDML specification and/or to CLDR data, as follows:
UAX35-C1. An implementation that claims conformance to this specification shall:
1. Identify the sections of the specification that it conforms to.
For example, an implementation might claim conformance to all LDML features except for transforms and segments.
The names of sections may change for clarity, so the associated links should be included in any reference; links into LDML will remain stable.
2. Interpret the relevant elements and attributes of LDML data in accordance with the descriptions in those sections.
For example, an implementation that claims conformance to the date format patterns must interpret the characters in such patterns according to Date Field Symbol Table.
3. Declare which types of CLDR data it uses.
For example, an implementation might declare that it only uses language names, and those with a draft status of contributed or approved.
4. Declare when it overrides CLDR data, or uses alt data.
For example, for //ldml/numbers/symbols/group an implementation could use alt="official" data.
An implementation may also make a general claim of conformance to the LDML specification and/or CLDR data. Such a claim is understood to claim conformance to all portions of this specification that are relevant to the operations performed by the implementation, except for those specifically declared as exceptions.
For example, if an implementation making a general claim of conformance performs date formatting, and does not declare date formatting as an exception, it is understood to be claiming conformance to date formatting as described in the section listed below.
UAX35-C2. An implementation that claims conformance to Unicode locale or language identifiers shall:
1. Specify whether Unicode locale extensions are allowed.
2. Specify the canonical form used for identifiers in terms of casing and field separator characters.
External specifications may also reference particular components of Unicode locale or language identifiers, such as:
Field X can contain any Unicode region subtag values as given in Unicode Technical Standard #35: Unicode Locale Data Markup Language (LDML), excluding grouping codes.
NOTE: UAX35-C2 is replaced by the following generalization.
The following lists the high-level sections with structures and/or processing algorithms.
Conformance to a particular section may reference and require conformance to another section.
Unicode Locale Identifiers
Sections | Topics
Unicode Locale Identifier | identifier syntax, interpretation, and validity
Annex C. LocaleId Canonicalization | canonicalize
CLDR to BCP 47, BCP 47 to CLDR | convert
Language Identifier Field Definitions | interpretation and validity of -u key-value pairs
Locale Display Name Algorithm | locale display names
Unicode Locale Inheritance and Matching
Sections | Topics
Locale Inheritance and Matching | locale inheritance
Likely Subtags | likely subtags
Language Matching | locale matching
Units of Measurement
Sections | Topics
Unit Identifiers | unit identifier syntax, interpretation, and validity
Unit Identifier Normalization | identifier normalization
Unit Conversion | unit conversion
Unit Preferences | evaluation of user preferences
Unit Identifier Uniqueness | converting units into BCP47 format
Compound Units | unit display names
Number Formatting
Sections | Topics
Number Format Patterns | number format patterns, syntax and interpretation
Compact Number Formats | compact number formats
Rule-Based Number Formatting | spell-out number formatting
Date Formatting
Sections | Topics
Elements availableFormats, appendItems | date formatting, patterns
Date Format Patterns | date format patterns and symbols
Using Time Zone Names | timezone forms, fallback and parsing
Collation
Sections | Topics
Root Collation | Root collation syntax and structure
Collation Tailorings | Rule syntax and interpretation for language-specific ordering
Grammar
Sections | Topics
Grammatical Features | noun classes (except for plurals)
Language Plural Rules | plural and ordinal category rules, ranges
Miscellaneous
Sections | Topics
Unicode Sets | Unicode set syntax and interpretation
String Range | string-range syntax and interpretation
Transforms | transform identifier and rule syntax and interpretation
Segmentations | segmentation customizations
Synthesizing Sequence Names | constructing derived emoji names
Formatting Process | person name formatting
Part 7: Keyboards | keyboard structure and interpretation
Conformance (Message Format) | message formatting
Customization
Conformant implementations cannot modify CLDR structures, such as the syntax or interpretation of locale identifiers.
There are usually mechanisms for implementations to customize these to a certain extent, using what are known as private-use codes.
For example, an implementation could use the private-use language code qfz to mean a language that was not covered by BCP 47, or use a private use extension in a Unicode locale identifier, or use a private-use unit such as xxx-smoot-per-second.
An implementation may also use a deprecated code instead of the corresponding preferred code.
The most frequent case of this is an implementation whose earlier versions predated BCP 47 and used iw for Hebrew, rather than the BCP 47 (and CLDR) code he.
When this is done, the CLDR data needs to be modified in appropriate places, not just in some file names.
For example, the languageAlias data requires modification, from:

to

Minimized locale identifiers are also not required. For example, an implementation could consistently expand locale identifiers to include regions, such as en to en_DE or de to de-AT.
Implementations may customize CLDR data, as long as they declare that they are doing so. This may include:
Omitting data
An implementation may dispense with locale data for locales that it does not support. For locales it does support, it may dispense with data that is at CoverageLevel=Comprehensive, or with particular sorts of data, such as annotations for emoji.
Adding data
An implementation could add data for a locale that CLDR does not yet support, or add higher-coverage data for a locale than what CLDR has.
Overriding data
CLDR has a mechanism for overriding data using the alt mechanism.
At build time, an implementation could override the default value by using an alt value.
For example, take the following data:
Sonderverwaltungsregion Hongkong
Hongkong
An implementation could, at build time, substitute the short value for the regular value, getting "Hongkong".
It could instead support both values at runtime, using display option settings to pick between the regular value and the short value.
Implementations can override the data in other ways as well, such as changing the spelling of a particular value.
Testing
The files in testData can be used to test conformance.
Brief instructions for use are supplied in _readme.txt files in the different directories and/or in the headers of the files in question.
For example, the following is from a sample header:
# Format:
# ;
# The data lines are divided into 4 sets:
# explicit: a short list of explicit test cases.
# fromAliases: test cases generated from the alias data.
# decanonicalized: test cases generated by reversing the normalization process.
# withIrrelevants: test cases generated from the others by adding irrelevant fields where possible,
# to ensure that the canonicalization implementation is not sensitive to irrelevant fields. These include:
# Language: aaa
# Script: Adlm
# Region: AC
# Variant: fonipa
If an implementation overrides CLDR data, then various lines in the relevant test files may need to be modified correspondingly, or skipped.
EBNF
The EBNF syntax used in LDML is a variant of the Extended Backus-Naur Form (EBNF) notation used in W3C XML Notation. The main differences are:
Bounded repetition following Perl regex syntax is allowed, such as digit{3} for 3 digits, digit{3,5} for 3 to 5 digits, and digit{3,} for 3 or more digits.
Whitespace inside bracketed enumerations and ranges is ignored. For example, [A-Z a-z] is the same as [A-Za-z].
A backslash may be used to escape a following "x"-prefixed hexadecimal code point or the immediately following character. For example, \x20 is the same as #x20 and [\&\-] is the same as [#x26#x2D].
Constraints (well-formedness or validity) may use separate notes, and/or the W3C notations [ wfc: ... ] and [ vc: ... ].
In the text, this is sometimes referred to as "EBNF (Perl-based)".
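As an illustration only, the conventions above map closely onto Python's re syntax. The sketch below is not part of LDML; the variable names are hypothetical, and the one notable difference (Python does not ignore whitespace inside a bracketed class by default) is handled by hand.

```python
import re

# digit{3}, digit{3,5}, and digit{3,} written as Python regular expressions.
DIGIT = "[0-9]"
exactly_three = re.compile(f"{DIGIT}{{3}}")
three_to_five = re.compile(f"{DIGIT}{{3,5}}")
three_or_more = re.compile(f"{DIGIT}{{3,}}")

# In the LDML EBNF, [A-Z a-z] means [A-Za-z]. Python would treat the space
# as a member of the class, so it is stripped here before compiling.
alpha = re.compile("[A-Z a-z]".replace(" ", ""))

# \x20 in the LDML EBNF denotes U+0020, the same code point as #x20.
space = re.compile(r"\x20")

print(bool(exactly_three.fullmatch("419")))   # a 3-digit string
print(bool(alpha.fullmatch(" ")))             # a space is not an alpha
```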
What is a Locale?
Before diving into the XML structure, it is helpful to describe the model behind the structure. People do not have to subscribe to this model to use data in LDML, but they do need to understand it so that the data can be correctly translated into whatever model their implementation uses.
The first issue is basic: what is a locale?
In this model, a locale is an identifier (id) that refers to a set of user preferences that tend to be shared across significant swaths of the world. Traditionally, the data associated with this id provides support for formatting and parsing of dates, times, numbers, and currencies; for measurement units, for sort-order (collation), plus translated names for time zones, languages, countries (regions), and scripts. The data can also include support for text boundaries (character, word, line, and sentence), text transformations (including transliterations), and other services.
Locale data is not cast in stone: the data used on someone's machine generally may reflect the US format, for example, but preferences can typically be set to override particular items, such as setting the date format for 2002.03.15, or using metric or Imperial measurement units. In the abstract, locales are simply one of many sets of preferences that, say, a website may want to remember for a particular user. Depending on the application, it may want to also remember the user's time zone, preferred currency, preferred character set, smoker/non-smoker preference, meal preference (vegetarian, kosher, and so on), music preference, religion, party affiliation, favorite charity, and so on.
Locale data in a system may also change over time: country boundaries change; governments (and currencies) come and go; committees impose new standards; bugs are found and fixed in the source data; and so on. Thus the data needs to be versioned for stability over time.
In general terms, the locale id is a parameter that is supplied to a particular service (date formatting, sorting, spell-checking, and so on). The format in this document does not attempt to represent all the data that could conceivably be used by all possible services. Instead, it collects together data that is in common use in systems and internationalization libraries for basic services. The main difference among locales is in terms of language; there may also be some differences according to different countries or regions. However, the line between locales and languages, as commonly used in the industry, is rather fuzzy. Note also that the vast majority of the locale data in CLDR is in fact language data; all non-linguistic data is separated out into a separate tree. For more information, see Language and Locale IDs.
We will speak of data as being "in locale X". That does not imply that a locale is a collection of data; it is simply shorthand for "the set of data associated with the locale id X". Each individual piece of data is called a resource or field, and a tag indicating the key of the resource is called a resource tag.
Unicode Language and Locale Identifiers
Unicode LDML uses stable identifiers based on [BCP47] for distinguishing among languages, locales, regions, currencies, time zones, transforms, and so on. There are many systems for identifiers for these entities. The Unicode LDML identifiers may not match the identifiers used on a particular target system. If so, some process of identifier translation may be required when using LDML data.
The BCP 47 extensions (-u- and -t-) are described in Unicode BCP 47 U Extension and Unicode BCP 47 T Extension.
Unicode Language Identifier
A Unicode language identifier has the following structure (provided in EBNF (Perl-based)). The following table defines syntactically well-formed identifiers: they are not necessarily valid identifiers. For additional validity criteria, see the links on the right.
EBNF    (Validity / Comments)
unicode_language_id = "root" | (unicode_language_subtag (sep unicode_script_subtag)? | unicode_script_subtag) (sep unicode_region_subtag)? (sep unicode_variant_subtag)* ;    ("root" is treated as a special unicode_language_subtag)
unicode_language_subtag = alpha{2,3} | alpha{5,8} ;    (validity, latest-data)
unicode_script_subtag = alpha{4} ;    (validity, latest-data)
unicode_region_subtag = (alpha{2} | digit{3}) ;    (validity, latest-data)
unicode_variant_subtag = (alphanum{5,8} | digit alphanum{3}) ;    (validity, latest-data)
sep = [-_] ;
digit = [0-9] ;
alpha = [A-Z a-z] ;
alphanum = [0-9 A-Z a-z] ;
The following is an additional well-formedness constraint:
[ wfc: The sequence of variant subtags must not have any duplicates (eg, de-1996-fonipa-1996 is not syntactically well-formed). ]
The semantics of the various subtags is explained in Language Identifier Field Definitions; there are also direct links from unicode_language_subtag, etc. While theoretically the unicode_language_subtag may have more than 3 letters through the IANA registration process, in practice that has not occurred. The unicode_language_subtag "und" may be omitted when there is a unicode_script_subtag; for that reason unicode_language_subtag values with 4 letters are not permitted. However, such unicode_language_id values are not intended for general interchange, because they are not valid BCP 47 tags. Instead, they are intended for certain protocols such as the identification of transliterators or font ScriptLangTag values. For more information on language subtags with 4 letters, see BCP 47 Language Tag to Unicode BCP 47 Locale Identifier.
For example, "en-US" (American English), "en_GB" (British English), "es-419" (Latin American Spanish), and "uz-Cyrl" (Uzbek in Cyrillic) are all valid Unicode language identifiers.
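The grammar and the duplicate-variant constraint above can be sketched as a well-formedness check. This is a minimal illustration, not a reference implementation: the function name is_well_formed_language_id and the pattern names are hypothetical, and the check covers syntax only, not validity against CLDR data.

```python
import re

ALPHA = "[A-Za-z]"
ALNUM = "[0-9A-Za-z]"
SEP = "[-_]"

LANGUAGE = f"(?:{ALPHA}{{2,3}}|{ALPHA}{{5,8}})"
SCRIPT = f"{ALPHA}{{4}}"
REGION = f"(?:{ALPHA}{{2}}|[0-9]{{3}})"
VARIANT = f"(?:{ALNUM}{{5,8}}|[0-9]{ALNUM}{{3}})"

# Mirrors the unicode_language_id production: "root", or a language subtag
# with an optional script, or a bare script; then optional region and variants.
LANGUAGE_ID = re.compile(
    f"(?:(?i:root)"
    f"|{LANGUAGE}(?:{SEP}{SCRIPT})?"
    f"|{SCRIPT})"
    f"(?:{SEP}{REGION})?"
    f"(?P<variants>(?:{SEP}{VARIANT})*)"
)


def is_well_formed_language_id(tag):
    """Return True if tag matches the grammar and repeats no variant."""
    match = LANGUAGE_ID.fullmatch(tag)
    if match is None:
        return False
    # Subtags are case-insensitive, so compare variants in lowercase.
    variants = [v.lower() for v in re.split(SEP, match.group("variants")) if v]
    return len(variants) == len(set(variants))


for tag in ["en-US", "en_GB", "es-419", "uz-Cyrl", "de-1996-fonipa-1996"]:
    print(tag, is_well_formed_language_id(tag))
```

Note that a well-formed identifier is not necessarily valid: a string such as "xx-Abcd-ZZ" passes this check even if its subtags are not in the validity data.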
Unicode Locale Identifier
A Unicode locale identifier is composed of a Unicode language identifier plus (optional) locale extensions. It has the following structure. The semantics of the U and T extensions are explained in Unicode BCP 47 U Extension and Unicode BCP 47 T Extension. Other extensions and private use extensions are supported for pass-through. The following table defines syntactically well-formed identifiers: they are not necessarily valid identifiers. For additional validity criteria, see the links on the right.
EBNF    (Validity / Comments)
unicode_locale_id = unicode_language_id extensions* pu_extensions? ;
extensions = unicode_locale_extensions | transformed_extensions | other_extensions ;
unicode_locale_extensions = sep [uU] ((sep keyword)+ | (sep attribute)+ (sep ufield)*) ;
transformed_extensions = sep [tT] ((sep tlang (sep tfield)*) | (sep tfield)+) ;
pu_extensions = sep [xX] (sep alphanum{1,8})+ ;
other_extensions = sep [alphanum-[tTuUxX]] (sep alphanum{2,8})+ ;
ufield (also known as keyword) = ukey (sep uvalue)? ;
ukey (also known as key) = alphanum alpha ;    (Note that this is narrower than in [RFC6067], so that it is disjoint with tkey.) (validity, latest-data)
uvalue (also known as type) = alphanum{3,8} (sep alphanum{3,8})* ;    (validity, latest-data)
attribute = alphanum{3,8} ;
unicode_subdivision_id = unicode_region_subtag unicode_subdivision_suffix ;    (validity, latest-data)
unicode_subdivision_suffix = alphanum{1,4} ;
unicode_measure_unit = alphanum{3,8} (sep alphanum{3,8})* ;    (validity, latest-data)
tlang = unicode_language_subtag (sep unicode_script_subtag)? (sep unicode_region_subtag)? (sep unicode_variant_subtag)* ;    (same as in unicode_language_id)
tfield = tkey tvalue;    (validity, latest-data)
tkey = alpha digit ;
tvalue = alphanum{3,8} (sep alphanum{3,8})+ ;
The following are additional well-formedness constraints:
[ wfc: There cannot be more than one extension with the same singleton. For example, en-u-ca-buddhist-u-cf-standard is ill-formed.]
[ wfc: The same ukey or tkey cannot occur more than once. For example, en-u-ca-buddhist-ca-islamic is ill-formed. ]
[ wfc: The sequence of variant subtags in a tlang must not have any duplicates. ]
[ wfc: The private use extension (-x-) must come after all other extensions. ]
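The first two constraints can be checked with a short pass over the subtags. The sketch below is illustrative only: split_extensions and extension_problems are hypothetical names, and the code assumes its input already matches the grammar. Because every subtag after -x- is treated as private use, the "private use comes last" constraint holds by construction here.

```python
import re


def split_extensions(locale_id):
    """Return (singleton, subtags) pairs in order of appearance."""
    subtags = re.split("[-_]", locale_id.lower())
    extensions = []
    for subtag in subtags[1:]:  # the first subtag belongs to the language id
        in_private_use = bool(extensions) and extensions[-1][0] == "x"
        if len(subtag) == 1 and not in_private_use:
            extensions.append((subtag, []))       # a new singleton
        elif extensions:
            extensions[-1][1].append(subtag)      # belongs to current extension
    return extensions


def extension_problems(locale_id):
    """List violations of the duplicate-singleton and duplicate-key rules."""
    problems = []
    extensions = split_extensions(locale_id)
    singletons = [singleton for singleton, _ in extensions]
    for singleton in sorted(set(singletons)):
        if singletons.count(singleton) > 1:
            problems.append(f"duplicate singleton {singleton!r}")
    for singleton, subtags in extensions:
        if singleton == "u":
            # Within -u-, every two-character subtag is a key.
            keys = [s for s in subtags if len(s) == 2]
        elif singleton == "t":
            # Within -t-, a tkey is a letter followed by a digit.
            keys = [s for s in subtags if re.fullmatch("[a-z][0-9]", s)]
        else:
            continue
        for key in sorted(set(keys)):
            if keys.count(key) > 1:
                problems.append(f"duplicate key {key!r} in -{singleton}-")
    return problems


print(extension_problems("en-u-ca-buddhist-u-cf-standard"))
print(extension_problems("en-u-ca-buddhist-ca-islamic"))
```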
For historical reasons, this is called a Unicode locale identifier. However, it also functions (with few exceptions) as a language identifier, and accesses language-based data. Except where it would be unclear, this document uses the term "locale" data loosely to encompass both types of data: for more information, see Language and Locale IDs.
As of the release of this specification, there were no other_extensions defined. The other_extensions are present in the syntax to allow implementations to preserve that information.
As for terminology, the term code may also be used instead of "subtag", and "territory" instead of "region". The primary language subtag is also called the base language code. For example, the base language code for "en-US" (American English) is "en" (English). The type may also be referred to as a value or key-value.
All identifier field values are case-insensitive. Although case distinctions do not carry any special meaning, an implementation of LDML should use the casing recommendations in [BCP47], especially when a Unicode locale identifier is used for locale data exchange in software protocols.
The identifiers can vary in case and in the separator characters. The "-" and "_" separators are treated as equivalent, although "-" is preferred.
A Unicode BCP 47 locale identifier (unicode_bcp47_locale_id) is a unicode_locale_id that meets the following additional constraints:
[ wfc: The EBNF sep is restricted to only [-] in unicode_language_id and unicode_locale_id. ]
[ wfc: The first subtag must be a unicode_language_subtag. ] Thus it can be neither of the following:
a unicode_script_subtag
a "root" subtag (the "und" unicode_language_subtag is used instead of "root").
A well-formed Unicode BCP 47 locale identifier is always a well-formed BCP 47 language tag. The reverse, however, is not guaranteed; a BCP 47 language tag that contains an extlang subtag, an irregular subtag, or an initial 'x' subtag would not be a well-formed Unicode BCP 47 locale identifier (for details see BCP 47 Conformance). However, any BCP 47 language tag can easily be converted to a Unicode BCP 47 locale identifier as specified in BCP 47 Language Tag Conversion.
A Unicode CLDR locale identifier (unicode_cldr_locale_id) is a unicode_locale_id that meets the following additional constraints:
[ wfc: The EBNF sep is restricted to only [_] in unicode_language_id and unicode_locale_id. ]
[ wfc: The unicode_language_id "und" is replaced by "root". ]
[ wfc: The first subtag cannot be a unicode_script_subtag. ]
Note: The current version of CLDR data uses Unicode CLDR locale identifiers for backward compatibility. This might be changed in future CLDR releases.
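The syntactic differences between the two subtypes can be sketched as a pair of conversions. This is a simplified illustration based only on the constraints listed above; the function names to_bcp47_form and to_cldr_form are hypothetical, and the sketch performs no canonicalization, casing changes, or alias replacement. Prepending "und" to an identifier that starts with a script subtag is an assumption drawn from the rule that the first subtag must be a unicode_language_subtag.

```python
import re


def to_bcp47_form(locale_id):
    """Use '-' separators, 'und' instead of 'root', and a language subtag first."""
    subtags = re.split("[-_]", locale_id)
    first = subtags[0].lower()
    if first == "root":
        subtags[0] = "und"
    elif re.fullmatch("[a-z]{4}", first):
        # Assumption: a leading script subtag gets an explicit "und" language.
        subtags.insert(0, "und")
    return "-".join(subtags)


def to_cldr_form(locale_id):
    """Use '_' separators, and 'root' when the language identifier is just 'und'."""
    subtags = re.split("[-_]", locale_id)
    # The language identifier is exactly "und" when nothing follows it,
    # or when the next subtag is an extension singleton.
    language_id_is_und = subtags[0].lower() == "und" and (
        len(subtags) == 1 or len(subtags[1]) == 1
    )
    if language_id_is_und:
        subtags[0] = "root"
    return "_".join(subtags)


print(to_bcp47_form("en_US"))
print(to_cldr_form("und"))
```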
Canonical Unicode Locale Identifiers
A unicode_locale_id has canonical syntax when:
It starts with a language subtag (those beginning with a script subtag are only for specialized use)
Casing
Any script subtag inside unicode_language_id is in title case (eg, Hant)
Any region subtag inside unicode_language_id is in uppercase (eg, DE)
All other subtags are in lowercase (eg, en, fonipa)
Order
Any variants are in alphabetical order (eg, en-fonipa-scouse, not en-scouse-fonipa)
Any extensions are in alphabetical order by their singleton (eg, en-t-xxx-u-yyy, not en-u-yyy-t-xxx)
All attributes are sorted in alphabetical order.
All keywords and tfields are sorted by alphabetical order of their keys, within their respective extensions.
Any type or tfield value "true" is removed.
For example, the canonical form of "en-u-foo-bar-nu-thai-ca-buddhist-kk-true" is "en-u-bar-foo-ca-buddhist-kk-nu-thai". The attributes "foo" and "bar" in this example are provided only for illustration; no attribute subtags are defined by the current CLDR specification.
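The casing and ordering rules can be sketched as follows. This is a partial illustration under stated assumptions, not the specification's algorithm: canonical_syntax and canonical_u_fields are hypothetical names, the input is assumed to be well-formed and to start with a language subtag, and only the -u- extension has its fields reordered (other extensions are lowercased and ordered by singleton, with -x- kept last). Alias replacement, which is needed for full canonical form, is not attempted.

```python
import re


def canonical_u_fields(fields):
    """Sort attributes, sort keywords by key, and drop the value "true"."""
    attributes, keywords = [], []
    for subtag in fields:
        if len(subtag) == 2:
            keywords.append([subtag])        # a new key
        elif keywords:
            keywords[-1].append(subtag)      # part of the current key's value
        else:
            attributes.append(subtag)        # attributes precede the first key
    out = sorted(attributes)
    for key, *value in sorted(keywords, key=lambda keyword: keyword[0]):
        out.append(key)
        if value != ["true"]:
            out += value
    return out


def canonical_syntax(locale_id):
    subtags = re.split("[-_]", locale_id.lower())

    # Separate the language identifier from the extensions.
    language_id, extensions = [], []
    for subtag in subtags:
        in_private_use = bool(extensions) and extensions[-1][0] == "x"
        if len(subtag) == 1 and not in_private_use:
            extensions.append([subtag])
        elif extensions:
            extensions[-1].append(subtag)
        else:
            language_id.append(subtag)

    # Casing and variant order inside the language identifier.
    result, variants = language_id[:1], []
    for subtag in language_id[1:]:
        if len(subtag) == 4 and subtag.isalpha():
            result.append(subtag.title())    # script, e.g. Hant
        elif (len(subtag) == 2 and subtag.isalpha()) or (
            len(subtag) == 3 and subtag.isdigit()
        ):
            result.append(subtag.upper())    # region, e.g. DE or 419
        else:
            variants.append(subtag)
    result += sorted(variants)

    # Extensions ordered by singleton, with private use (-x-) kept last.
    extensions.sort(key=lambda ext: (ext[0] == "x", ext[0]))
    for singleton, *fields in extensions:
        result.append(singleton)
        result += canonical_u_fields(fields) if singleton == "u" else fields
    return "-".join(result)


print(canonical_syntax("en-u-foo-bar-nu-thai-ca-buddhist-kk-true"))
```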
NOTE: Some people may wonder why CLDR uses alphabetical order for variants, rather than the ordering in Section 4.1 of BCP 47. Here are the considerations that led to that decision:
The ordering in Section 4.1 is recommended, but not required for conformance. In particular, use of and ordering by Prefix is recommended but not required.
Moreover, Section 4.5 states that “If more than one variant appears within a tag, processors MAY reorder the variants to obtain better matching behavior or more consistent presentation.”
The best practices for internationalization have moved well beyond some of the guidelines and recommendations in BCP 47, especially for language matching and language fallback.
Robust implementations will accept the variants in any order, just as they accept extensions in any order.
A canonical order allows for determination of identity of identifiers via string comparison.
The ordering in Section 4.1 does not result in a determinate order for canonicalization, since the mechanism for determining “importance” is not specified: ca-valencia-fonipa and ca-fonipa-valencia could both be ‘canonical’ variants of one another.
Pure alphabetical order is determinate and simple to implement, while the ordering in Section 4.1 is indeterminate, more complex, and provides no significant benefit in modern applications.
A unicode_locale_id is in canonical form when it has canonical syntax and contains no aliased subtags. A unicode_locale_id can be transformed into canonical form according to Annex C. LocaleId Canonicalization.
A unicode_locale_id is maximal when the unicode_language_id and tlang (if any) have been transformed by the Add Likely Subtags operation in Likely Subtags, excluding "und".
Example: the maximal form of ja-Kana-t-it is ja-Kana-JP-t-it-latn-it. Note that the latn and final it don't use any uppercase characters, since they are not inside unicode_language_id.
Two unicode_locale_ids are equivalent when their maximal canonical forms are identical.
Example: "IW-HEBR-u-ms-imperial" ~ "he-u-ms-uksystem"
The equivalence relationship may change over time, such as when subtags are deprecated or likely subtag mappings change. For example, if two countries were to merge, then various subtags would become deprecated. These kinds of changes are generally very infrequent.
BCP 47 Conformance
Unicode language and locale identifiers inherit the design and the repertoire of subtags from [
BCP47
] Language Tags. There are some extensions and restrictions made for the use of the Unicode locale identifier in CLDR:
It does not allow for the full syntax of [
BCP47
]:
No extlang subtags are allowed (as in the BCP 47 canonical form, see BCP 47
Section 4.5
and
Section 3.1.7)
No irregular BCP 47 legacy language tags (marked as “Type: grandfathered” in BCP 47) are allowed (these are all deprecated in BCP 47)
A tag must not start with the subtag "x": thus a
privateuse
(e.g., x-abc) can only appear after a language subtag, such as "und"
It allows for certain semantic additions and constraints:
Certain codes that are private-use in BCP 47 and ISO are given semantics by LDML
Each macrolanguage has an identified primary encompassed language, which is treated as an alias for the macrolanguage, and thus is replaced when canonicalizing (as allowed by BCP 47, see
Section 4.1.2)
It allows certain syntax for backwards compatibility (not BCP 47-compatible):
The "_" character for field separator characters, as well as the "-" used in [
BCP47
] (however, the canonical form is with "-")
The subtag "root" to indicate the generic locale used as the parent of all languages in the CLDR data model ("und" can be used instead)
The language tag may begin with a script subtag rather than a language subtag. This is specialized use only, and not required for CLDR conformance.
There are thus two subtypes of Unicode locale identifiers, as defined above.
Unicode
BCP 47
locale identifier
(unicode_bcp47_locale_id
).
A well-formed
Unicode BCP 47 locale identifier
is also a well-formed
BCP 47 language tag
A well-formed
BCP 47 language tag
might not be a well-formed
Unicode BCP 47 locale identifier
Unicode
CLDR
locale identifier
(unicode_cldr_locale_id)
These can both be easily converted to and from
BCP 47 language tags
as described below.
BCP 47 Language Tag Conversion
The different identifiers can be converted to one another as described in this section.
A valid [
BCP47
] language tag can be converted to a valid Unicode BCP 47 locale identifier according to
Annex C. LocaleId Canonicalization
The result is a Unicode BCP 47 locale identifier, in canonical form. It is both a BCP 47 language tag and a Unicode locale identifier. Because the process maps from all BCP 47 language tags into a subset of BCP 47 language tags, the format changes are not reversible, much as a lowercase transformation of the string “McGowan” is not reversible.
Table:
BCP 47 Language Tag to Unicode BCP 47 Locale Identifier
Examples
BCP 47 language tag
Unicode BCP 47 locale identifier
Comments
en-US
en-US
no changes
iw-FX
he-FR
BCP 47 canonicalization
cmn-TW
zh-TW
language alias
zh-cmn-TW
zh-TW
BCP 47 canonicalization, then language alias
sr-CS
sr-RS
territory alias
sh
sr-Latn
multiple replacement subtags
sh-Cyrl
sr-Cyrl
no replacement with multiple replacement subtags
hy-SU
hy-AM
multiple territory values

i-enochian
und-x-i-enochian
prefix any legacy language tags (marked as “Type: grandfathered” in BCP 47) with "und-x-"
x-abc
und-x-abc
prefix with "und-", so that there is always a base language subtag
Unicode Locale Identifier: CLDR to BCP 47
A Unicode CLDR locale identifier can be converted to a valid [
BCP47
] language tag (which is also a Unicode BCP 47 locale identifier) by performing the following transformation.
Replace the "_" separators with "-"
Replace the special language identifier "root" with the BCP 47 primary language tag "und"
Add an initial "und" primary language subtag if the first subtag is a script.
Examples:
Unicode CLDR locale identifier
BCP 47 language tag
Comments
en_US
en-US
change separator
de_DE_u_co_phonebk
de-DE-u-co-phonebk
change separator
root
und
change to "und"
root_u_cu_usd
und-u-cu-usd
change to "und"
Latn_DE
und-Latn-DE
add "und"
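The three steps above can be sketched as follows. This is an illustration rather than a reference implementation: the function name `cldr_to_bcp47` is hypothetical, and the sketch assumes that a leading subtag of exactly four letters is a script, since language subtags are never four letters long.

```python
import re


def cldr_to_bcp47(locale_id):
    """Convert a Unicode CLDR locale identifier to a BCP 47 language tag."""
    # Step 1: replace the "_" separators with "-".
    subtags = locale_id.replace("_", "-").split("-")
    if subtags[0].lower() == "root":
        # Step 2: replace the special language identifier "root" with "und".
        subtags[0] = "und"
    elif re.fullmatch(r"[A-Za-z]{4}", subtags[0]):
        # Step 3: add an initial "und" if the first subtag is a script.
        subtags.insert(0, "und")
    return "-".join(subtags)


for example in ["en_US", "de_DE_u_co_phonebk", "root", "root_u_cu_usd", "Latn_DE"]:
    print(example, "->", cldr_to_bcp47(example))
```

The sketch only changes separators and the leading subtag; it does not validate or canonicalize the remaining subtags.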
Unicode Locale Identifier: BCP 47 to CLDR
A Unicode BCP 47 locale identifier can be transformed into a Unicode CLDR locale identifier by performing the following transformation.
the separator is changed to "_"
the primary language subtag "und" is replaced with "root" if no script, region, or variant subtags are present.
Examples:
BCP 47 language tag
Unicode CLDR locale identifier
Comments
en-US
en_US
changes separator
und
root
changes to "root", because no script, region, or variant tag is present
und-US
und_US
no change to "und", because a region subtag is present
und-u-cu-USD
root_u_cu_usd
changes to "root", because no script, region, or variant tag is present
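A minimal sketch of the reverse transformation follows. The function name `bcp47_to_cldr` is hypothetical. The sketch treats everything before the first single-character subtag as the language identifier, and it lowercases the subtags from that point onward; the lowercasing is an assumption made so that the last example row (und-u-cu-USD) is reproduced, and is not one of the two steps listed above.

```python
def bcp47_to_cldr(tag):
    """Convert a Unicode BCP 47 locale identifier to a CLDR locale identifier."""
    subtags = tag.split("-")
    # Index of the first singleton (extension or private-use introducer);
    # everything before it is language, script, region, and variants.
    first_singleton = next(
        (i for i, s in enumerate(subtags) if i > 0 and len(s) == 1),
        len(subtags),
    )
    # Assumption: extension subtags are emitted in lowercase.
    subtags[first_singleton:] = [s.lower() for s in subtags[first_singleton:]]
    # "und" becomes "root" only when no script, region, or variant follows it.
    if subtags[0].lower() == "und" and first_singleton == 1:
        subtags[0] = "root"
    return "_".join(subtags)


for example in ["en-US", "und", "und-US", "und-u-cu-USD"]:
    print(example, "->", bcp47_to_cldr(example))
```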
Truncation
BCP 47 requires that implementations allow for language tags of at least 35 characters, in
Section 4.1.1
To allow for use of extensions, CLDR extends that minimum to 255 for Unicode locale identifiers.
Theoretically, a language tag could be far longer, due to the possibility of a large number of variants and extensions.
In practice, the typical size of a locale or language identifier will be much smaller, so implementations can optimize for smaller sizes, as long as there is an escape mechanism allowing for up to 255.
Language Identifier Field Definitions
Unicode language and locale identifier field values are provided in the following table. Note that some private-use BCP 47 field values are given specific meanings in CLDR. While field values are based on [
BCP47
] subtag values, their validity status in CLDR is specified by means of machine-readable files in the
common/validity/
subdirectory, such as language.xml. For the format of those files and more information, see
Validity Data
unicode_language_subtag
(also known as a
Unicode base language code
Subtags in the language.xml file (see
Validity Data
). These are based on [
BCP47
] subtag values marked as
Type: language
ISO 639-3 introduces the notion of "macrolanguages", where certain ISO 639-1 or ISO 639-2 codes are given broad semantics, and additional codes are given for the narrower semantics. For backwards compatibility, Unicode language identifiers retain use of the narrower semantics for these codes. For example:
For
Use
Not
Standard Chinese (Mandarin)
zh
cmn
Standard Arabic
ar
arb
Standard Malay
ms
zsm
Standard Swahili
sw
swh
Standard Uzbek
uz
uzn
Standard Konkani
kok
gom
Northern Kurdish
ku
kmr
If a language subtag matches the
type
attribute of a
languageAlias
element, then the replacement value is used instead. For example, because "swh" occurs in
<languageAlias type="swh" replacement="sw"/>
, "sw" must be used instead of "swh". Thus Unicode language identifiers use "ar-EG" for Standard Arabic (Egypt), not "arb-EG"; they use "zh-TW" for Mandarin Chinese (Taiwan), not "cmn-TW".
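The replacement can be sketched with a small lookup table. The dictionary below is only an excerpt built from the "Use" and "Not" columns of the table above; a real implementation would read the languageAlias elements from the CLDR data. The names `LANGUAGE_ALIASES` and `replace_language_alias` are hypothetical.

```python
# Excerpt only: encompassed-language codes mapped to the codes to use instead.
LANGUAGE_ALIASES = {
    "cmn": "zh",
    "arb": "ar",
    "zsm": "ms",
    "swh": "sw",
    "uzn": "uz",
    "gom": "kok",
    "kmr": "ku",
}


def replace_language_alias(locale_id):
    """Replace an aliased language subtag at the start of a hyphenated identifier."""
    subtags = locale_id.split("-")
    subtags[0] = LANGUAGE_ALIASES.get(subtags[0].lower(), subtags[0])
    return "-".join(subtags)


for example in ["arb-EG", "cmn-TW", "swh", "en-US"]:
    print(example, "->", replace_language_alias(example))
```

This handles only the language subtag; full canonicalization also covers other alias types and is described in Annex C.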
The private use codes listed as
excluded
in
Private Use Codes
will never be given specific semantics in Unicode identifiers, and are thus safe for use for other purposes by other applications.
The CLDR provides data for normalizing language/locale codes, including mapping overlong codes like "eng-840" or "eng-USA" to the correct code "en-US"; see the
Aliases
Chart.
The following are special language subtags:
Name
Comment
mis
Uncoded languages
The content is in a language that doesn't yet have an ISO 639 code.
mul
Multiple languages
The content contains more than one language or text that is simultaneously in multiple languages (such as brand names).
zxx
No linguistic content
The content is not in any particular language (such as images, symbols, etc.)
unicode_script_subtag
(also known as a
Unicode script code
Subtags in the script.xml file (see
Validity Data
). These are based on [
BCP47
] subtag values marked as
Type: script
In most cases the script is not necessary, since the language is only customarily written in a single script. Examples of cases where it is used are:
Subtag
Description
az_Arab
Azerbaijani in Arabic script
az_Cyrl
Azerbaijani in Cyrillic script
az_Latn
Azerbaijani in Latin script
zh_Hans
Chinese, in simplified script (=zh, zh-Hans, zh-CN, zh-Hans-CN)
zh_Hant
Chinese, in traditional script
Unicode identifiers give specific semantics to certain Unicode Script values. For more information, see also [
UAX24
]:
Qaag
Zawgyi
Qaag is a special script code for identifying the non-standard use of Myanmar characters for display with the Zawgyi font. The purpose of the code is to enable migration to standard, interoperable use of Unicode by providing an identifier for Zawgyi for tagging text, applications, input methods, font tables, transformations, and other mechanisms used for migration.
Qaai
Inherited
deprecated
: the
canonicalized
form is Zinh
Zinh
Inherited
Zsye
Emoji Style
Prefer emoji style for characters that have both text and emoji styles available.
Zsym
Text Style
Prefer text style for characters that have both text and emoji styles available.
Zxxx
Unwritten
Indicates spoken or otherwise unwritten content. For example:
Sample(s)
Description
uz
either written or spoken content
uz-Latn
or
uz-Arab
written-only content (particular script)
uz-Zyyy
written-only content (unspecified script)
uz-Zxxx
spoken-only content
uz-Latn, uz-Zxxx
both specific written and spoken content (using a
language list)
Zyyy
Common
Zzzz
Unknown
The private use subtags listed as
excluded
in
Private Use Codes
will never be given specific semantics in Unicode identifiers, and are thus safe for use for other purposes by other applications.
unicode_region_subtag
(also known as a
Unicode region code,
or a
Unicode territory code
Subtags in the region.xml file (see
Validity Data
). These are based on [
BCP47
] subtag values marked as
Type: region
Unicode identifiers give specific semantics to the following subtags.
(The alpha2 codes are used as Unicode region subtags. The alpha3 and numeric codes are derived according to
Numeric Codes
and listed here for additional documentation.)
alpha2
alpha3
num
Name
Comment
ISO 3166-1 status
QO
QOO
961
Outlying Oceania
countries in Oceania [009] that do not have a
subcontinent
private use
QU
QUU
967
European Union
deprecated
: the
canonicalized
form is EU
private use
UK
United Kingdom
deprecated
: the
canonicalized
form is GB
exceptionally reserved
XA
XAA
973
Pseudo-Accents
special code indicating derived testing locale with English + added accents and lengthened
private use
XB
XBB
974
Pseudo-Bidi
special code indicating derived testing locale with forced RTL English
private use
XK
XKK
983
Kosovo
industry practice
private use
ZZ
ZZZ
999
Unknown or Invalid Territory
used in APIs or as replacement for invalid code
private use
The private use subtags listed as
excluded
in
Private Use Codes
will normally never be given specific semantics in Unicode identifiers, and are thus safe for use for other purposes by other applications. However, LDML may follow widespread industry practice in the use of some of these codes, such as for XK.
The CLDR provides data for normalizing territory/region codes, including mapping overlong codes like "eng-840" or "eng-USA" to the correct code "en-US".
Special Codes:
The territory code 'UK' has a special status in ISO, and is used for the domain name instead of GB. It is thus recognized by CLDR as being an alternate (unnormalized) form of 'GB'.
The territory code '001' (the World) is used to indicate a standardized form, such as "ar-001" for Modern Standard Arabic.
unicode_variant_subtag
(also known as a
Unicode language variant code
Subtags in the variant.xml file (see
Validity Data
). These are based on [
BCP47
] subtag values marked as
Type: variant
. The sequence of variant tags must not have any duplicates: thus de-1996-fonipa-1996 is invalid, while de-1996-fonipa and de-fonipa-1996 are both valid.
CLDR provides data for normalizing variant codes. For handling of the "POSIX" variant, see
Legacy Variants
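The no-duplicates rule for variants can be checked with a few lines. This is a sketch that assumes the variant subtags have already been extracted from the identifier and that comparison is case-insensitive; the function name `has_duplicate_variants` is hypothetical.

```python
def has_duplicate_variants(variants):
    """Return True if any variant subtag occurs more than once."""
    lowered = [v.lower() for v in variants]
    return len(lowered) != len(set(lowered))


print(has_duplicate_variants(["1996", "fonipa", "1996"]))  # de-1996-fonipa-1996
print(has_duplicate_variants(["1996", "fonipa"]))          # de-1996-fonipa
print(has_duplicate_variants(["fonipa", "1996"]))          # de-fonipa-1996
```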
Examples:
en
fr_BE
zh-Hant-HK
Deprecated
codes—such as QU above—are valid, but strongly discouraged.
A locale that only has a language subtag (and optionally a script subtag) is called a
language locale
; one with both language and territory subtag is called a
territory locale
(or
country locale
).
Special Codes
Unknown or Invalid Identifiers
The following identifiers are used to indicate an unknown or invalid code in Unicode language and locale identifiers. For Unicode identifiers, the region code uses a private use ISO 3166 code, and Time Zone code uses an additional code; the others are defined by the relevant standards. When these codes are used in APIs connected with Unicode identifiers, the meaning is that either there was no identifier available, or that at some point an input identifier value was determined to be invalid or ill-formed.
Code Type
Value
Description in Referenced Standards
Language
und
Undetermined language, also used for “root”
Script
Zzzz
Code for uncoded script, Unknown [
UAX24]
Region
ZZ
Unknown or Invalid Territory
Currency
XXX
The codes assigned for transactions where no currency is involved
Time Zone
unk
Unknown or Invalid Time Zone
Subdivision
zzzz
Unknown or Invalid Subdivision
When only the script or region are known, then a locale ID will use "und" as the language subtag portion. Thus the locale tag "und_Grek" represents the Greek script; "und_US" represents the US territory.
Numeric Codes
For region codes, ISO and the UN establish a mapping to three-letter codes and numeric codes. However, this does not extend to the private use codes, which are the codes 900-999 (total: 100), and AAA, QMA-QZZ, XAA-XZZ, and ZZZ (total: 1092). Unicode identifiers supply a standard mapping to these: for the numeric codes, it uses the top of the numeric private use range; for the 3-letter codes it doubles the final letter. These are the resulting mappings for all of the private use region codes:
Region
UN/ISO Numeric
ISO 3-Letter
AA
958
AAA
QM..QZ
959..972
QMM..QZZ
XA..XZ
973..998
XAA..XZZ
ZZ
999
ZZZ
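The region table above can be expressed as a small function. This is a sketch under the stated mapping only: the name `private_use_region_codes` is hypothetical, and the function rejects any code outside the four private-use ranges in the table.

```python
def private_use_region_codes(alpha2):
    """Return (alpha3, numeric) for a private-use two-letter region code."""
    code = alpha2.upper()
    if len(code) != 2 or not code.isascii() or not code.isalpha():
        raise ValueError(f"not a two-letter region code: {alpha2!r}")
    first, second = code
    if code == "AA":
        numeric = 958
    elif first == "Q" and "M" <= second <= "Z":
        numeric = 959 + ord(second) - ord("M")  # QM..QZ -> 959..972
    elif first == "X":
        numeric = 973 + ord(second) - ord("A")  # XA..XZ -> 973..998
    elif code == "ZZ":
        numeric = 999
    else:
        raise ValueError(f"not a private-use region code: {alpha2!r}")
    # The three-letter code doubles the final letter.
    return code + second, numeric


for region in ["QO", "QU", "XA", "XB", "XK", "ZZ"]:
    print(region, private_use_region_codes(region))
```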
For script codes, ISO 15924 supplies a mapping (however, the numeric codes are not in common use):
Script
Numeric
Qaaa..Qabx
900..949
Private Use Codes
Private use codes fall into three groups.
defined:
those that are given particular semantics currently in CLDR
reserved:
those that may be given particular semantics in future versions of CLDR
excluded:
those that will never be given particular CLDR semantics in the future, and thus can normally be used by applications without worrying about collisions. However, CLDR may follow widespread industry practice in the use of some of these codes, such as for XA, XB, and XK.
Table:
Private Use Codes in CLDR
category
status
codes
base language
defined
none
reserved
qaa..qfy
excluded
qfz..qtz
script
defined
Qaai (obsolete), Qaag
reserved
Qaaa..Qaaf Qaah Qaaj..Qaap
excluded
Qaaq..Qabx
region
defined
QO, QU, UK, XA, XB, XK, ZZ
reserved
AA QM..QN QP..QT QV..QZ
excluded
XC..XJ, XL..XZ
timezone
defined
IANA: Etc/Unknown
bcp47: as listed in bcp47/timezone.xml
reserved
bcp47: all non-5 letter codes not starting with x
excluded
bcp47: all non-5 letter codes starting with x
See also
Unknown or Invalid Identifiers
Special Script Codes
Certain valid script codes require special handling.
These are the codes in
Script Codes
with the words "variant" or "alias" within parentheses,
excluding Zsye.
The Compound codes include characters in multiple scripts;
the Visual variants are distinct in appearance, but otherwise encompass a single script;
and the Subsets exclude certain characters from a script.
The Equivalents for Subsets are not as well defined, so the "Equivalents" are marked as approximate.
Variant
Script
Equivalent
Compound
Jpan
≡ Hani ∪ Hira ∪ Kana
Hrkt
≡ Hira ∪ Kana
Kore
≡ Hani ∪ Hang
Hanb
≡ Hani ∪ Bopo
Visual
Aran
≡ Arab (Nastaliq variant)
Cyrs
≡ Cyrl (Old Church Slavonic variant)
Latf
≡ Latn (Fraktur variant)
Latg
≡ Latn (Gaelic variant)
Syrn
≡ Syrc (Eastern variant)
Syre
≡ Syrc (Estrangelo variant)
Syrj
≡ Syrc (Western variant)
Subset
Jamo
≃ Hang − LVT − LV
Hans
≃ Hani − Traditional-only
Hant
≃ Hani − Simplified-only
The special codes most frequently used are in the locale identifiers zh-Hans, zh-Hant, ja-Jpan, and ko-Kore: the first two are Subsets, and the last two are Compounds.
These are used, for example, in Likely Subtags in LDML.
The
Equivalent
values in the
Subset
variants are only approximate,
and
the variants are also visual variants.
Thus
Hans
is a request for:
Not using characters that are Traditional-only
Characters common between Simplified and Traditional to be given a Simplified rendering.
Visual
variant script codes (that are not
Subset
variants) can be used in a locale identifier to request a particular rendering.
For example, ar_Aran could be used to request that ar_Arab data be used, but with a Nastaliq-style font.
However, the few variant script codes represent only a very small fraction of the different script variants in use.
Moreover, this feature is not widely supported, and may give unexpected results when not supported.
For example, an implementation might not recognize
Aran
in
uz-Aran
at all, and return results for
uz-Latn
Some of the special codes are used in other specifications,
such as in
Mixed_Script_Detection
Unicode BCP 47 U Extension
[BCP47
] Language Tags provides a mechanism for extending language tags for use in various applications by extension subtags. Each extension subtag is identified by a single alphanumeric character subtag assigned by IANA.
The Unicode Consortium has registered and is the maintaining authority for two BCP 47 language tag extensions: the extension 'u' for Unicode locale extension [
RFC6067
] and extension 't' for transformed content [
RFC6497
]. The Unicode BCP 47 extension data defines the complete list of valid subtags.
These subtags are all lowercase, which is their canonical casing; however, subtags are case-insensitive, and casing does not carry any specific meaning. All subtags within the Unicode extensions are two to eight alphanumeric characters long and meet the rule
extension
in the [
BCP47
].
The -u- Extension.
The syntax of 'u' extension subtags is defined by the rule
unicode_locale_extensions
in
Unicode locale identifier
, except the separator of subtags
sep
must always be a hyphen '-' when the extension is used as part of a BCP 47 language tag.
A 'u' extension may contain multiple
attribute
s or
keyword
s as defined in
Unicode locale identifier
. The canonical syntax is defined as in
Canonical Unicode Locale Identifiers
See also
Unicode Extensions for BCP 47
on the CLDR site.
Key And Type Definitions
The following chart contains a set of U extension key values that are currently available, with a description or sampling of the U extension type values. Each category is associated with an XML file in the bcp47 directory.
For the complete list of valid keys and types defined for Unicode locale extensions, see
U Extension Data Files
. For information on the process for adding new
key
type
, see [
LocaleProject
].
Most type values are represented by a single subtag in the current version of CLDR. There are exceptions, such as types used for key "ca" (calendar) and "kr" (collation reordering). If the type is not included, then the type value "true" is assumed. Note that the default for a key with a possible "true" value is often "false", but not always. Note also that "true"/"True" is not a valid script code, since
the ISO 15924 Registration Authority has exceptionally reserved it
, which means that it will not be assigned for any purpose.
Note that canonicalization does not change invalid locales to valid locales. For example, und-u-ka-true canonicalizes to und-u-ka, but:
"und-u-ka-true" — is invalid, since "true" is not a valid value for ka
"und-u-ka" — is invalid, since the value "true" is assumed whenever there is no value, and "true" is not a valid value for ka
The BCP 47 form for keys and types is the canonical form, and recommended. Other aliases are included for backwards compatibility.
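The rules that a type may span several subtags and that a missing type means "true" can be sketched with a small parser. This is a simplified illustration, not a validator: the name `parse_u_keywords` is hypothetical, attributes are skipped, and the sketch assumes the first single-character subtag "u" after the language subtag starts the extension (so a "u" inside a private-use sequence would be misread).

```python
def parse_u_keywords(tag):
    """Return {key: type} for the -u- extension of a locale identifier."""
    subtags = tag.lower().replace("_", "-").split("-")
    try:
        start = subtags.index("u", 1)
    except ValueError:
        return {}  # no -u- extension present
    keywords = {}
    key = None
    for subtag in subtags[start + 1:]:
        if len(subtag) == 1:
            break  # next extension or private-use sequence
        if len(subtag) == 2:
            key = subtag  # two-character subtags are keys
            keywords[key] = []
        elif key is not None:
            keywords[key].append(subtag)  # longer subtags extend the type
        # Subtags before the first key are attributes; they are ignored here.
    # A key with no type subtags is given the assumed type "true".
    return {k: "-".join(v) or "true" for k, v in keywords.items()}


print(parse_u_keywords("de-DE-u-co-phonebk"))
print(parse_u_keywords("ar-u-ca-islamic-civil-nu-latn"))
print(parse_u_keywords("en-u-kn"))
```

The parser only checks well-formedness loosely; whether a key and type pair is valid still depends on the bcp47 data files.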
Table:
Key/Type Definitions
key
(old key name)
key description
example type
(old type name)
type description
Unicode Calendar Identifier
defines a type of calendar.
Well-formed values match
uvalue
The valid values are those
name
attribute values in the
type
elements of key name="ca"
in bcp47/
calendar.xml
This selects calendar-specific data within a locale used for formatting and parsing, such as date/time symbols and patterns; it also selects supplemental
calendarData used for calendrical calculations.
The value can affect the computation of the first day of the week: see
First Day Overrides
ca
(calendar)
Calendar algorithm
(For information on the calendar algorithms associated with the data used with these, see [
Calendars
].)
buddhist
Thai Buddhist calendar (same as Gregorian except for the year)
chinese
Traditional Chinese calendar
gregory
Gregorian calendar
islamic
Islamic calendar
islamic-civil
Islamic calendar, tabular (intercalary years [2,5,7,10,13,16,18,21,24,26,29] - civil epoch)
islamic-umalqura
Islamic calendar, Umm al-Qura
Note:
Some calendar types are represented by two subtags. In such cases, the first subtag specifies a generic calendar type and the second subtag specifies a calendar algorithm variant. The CLDR uses generic calendar types (single subtag types) for tagging data when calendar algorithm variations within a generic calendar type are irrelevant. For example, type "islamic" is used for specifying Islamic calendar formatting data for all Islamic calendar types, including "islamic-civil" and "islamic-umalqura".
Unicode Currency Format Identifier
defines a style for currency formatting.
Well-formed values match
uvalue
The valid values are those
name
attribute values in the
type
elements of key name="cf" in
bcp47/
currency.xml
This selects the specific type of currency formatting pattern within a locale.
cf
Currency Format style
standard
Negative numbers use the minusSign symbol (the default).
account
Negative numbers use parentheses or equivalent.
Unicode Collation Identifier
defines a type of collation (sort order).
Well-formed values match
uvalue
The valid values are those
name
attribute values in the
type
elements of bcp47/
collation.xml
For information on each collation setting parameter, from
ka
to
vt
, see
Setting Options
co
(collation)
Collation type
standard
The default ordering for each language. For root it is based on the [
DUCET
] (Default Unicode Collation Element Table): see
Root Collation
. Each other locale is based on that, except for appropriate modifications to certain characters for that language.
search
A special collation type dedicated for string search—it is not used to determine the relative order of two strings, but only to determine whether they should be considered equivalent for the specified strength, using the string search matching rules appropriate for the language. Compared to the normal collator for the language, this may add or remove primary equivalences, may make additional characters ignorable or change secondary equivalences, and may modify contractions to allow matching within them, depending on the desired behavior. For example, in Czech, the distinction between ‘a’ and ‘á’ is secondary for normal collation, but primary for search; a search for ‘a’ should never match ‘á’ and vice versa. A search collator is normally used with strength set to PRIMARY or SECONDARY (should be SECONDARY if using “asymmetric” search as described in the [
UCA
] section Asymmetric Search). The search collator in root supplies matching rules that are appropriate for most languages (and which are different than the root collation behavior); language-specific search collators may be provided to override the matching rules for a given language as necessary.
Other keywords provide additional choices for certain locales;
they only have effect in certain locales.
phonetic
Requests a phonetic variant if available, where text is sorted based on pronunciation. It may interleave different scripts, if multiple scripts are in common use.
pinyin
Pinyin ordering for Latin and for CJK characters; that is, an ordering for CJK characters based on a character-by-character transliteration into a pinyin. (used in Chinese)
searchjl
Special collation type for a modified string search in which a pattern consisting of a sequence of Hangul initial consonants (jamo lead consonants) will match a sequence of Hangul syllable characters whose initial consonants match the pattern. The jamo lead consonants can be represented using conjoining or compatibility jamo. This search collator is best used at SECONDARY strength with an "asymmetric" search as described in the [
UCA
] section Asymmetric Search and obtained, for example, using ICU4C's usearch facility with attribute USEARCH_ELEMENT_COMPARISON set to value USEARCH_PATTERN_BASE_WEIGHT_IS_WILDCARD; this ensures that a full Hangul syllable in the search pattern will only match the same syllable in the searched text (instead of matching any syllable with the same initial consonant), while a Hangul initial consonant in the search pattern will match any Hangul syllable in the searched text with the same initial consonant.
Unicode Currency Identifier
defines a type of currency.
Well-formed values match
uvalue
The valid values are those
name
attribute values in the
type
elements of key name="cu" in bcp47/
currency.xml
cu
(currency)
Currency type
ISO 4217 code,
plus others in common use
Well-formed codes are of the form
[A-Za-z]{3}
, with the canonical format being
[A-Z]{3}
The valid codes are ones that are or have been valid in ISO 4217, plus certain additional codes that are or have been in common use.
Supplemental Currency Data
provides the list of countries (regions) and time periods associated with each currency code.
It also supplies the default number of decimals.
The XXX code is given a broader interpretation than in ISO 4217, as
Unknown or Invalid Currency
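The two patterns above translate directly into a well-formedness check. The sketch below tests syntax only; whether a code is valid still requires the currency data. The function names are hypothetical.

```python
import re


def is_well_formed_currency(code):
    """True if the code matches [A-Za-z]{3}."""
    return re.fullmatch(r"[A-Za-z]{3}", code) is not None


def canonical_currency(code):
    """Return the code in the canonical [A-Z]{3} format."""
    if not is_well_formed_currency(code):
        raise ValueError(f"ill-formed currency code: {code!r}")
    return code.upper()


print(is_well_formed_currency("usd"), canonical_currency("usd"))
print(is_well_formed_currency("US1"))
```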
Unicode Dictionary Break Exclusion Identifier
specifies scripts to be excluded from dictionary-based text break
(for words and lines).
Well-formed values match
uvalue
The valid values are of one or more items of type SCRIPT_CODE as specified in the
name
attribute value in the
type
element of
key name="dx" in bcp47/
segmentation.xml
This affects break iteration regardless of locale.
dx
Dictionary break script exclusions
unicode_script_subtag
values
One or more items of type SCRIPT_CODE (as usual, separated by hyphens), which are valid
unicode_script_subtag
values.
Each of the values for the DX key must be a short script property value in the UCD, or one of the compound script values like jpan. The compound script values are expanded when interpreted, e.g., -dx-jpan = -dx-hani-hira-kana
The values may be in any order, e.g., -dx-thai-hani = -dx-hani-thai. However, the canonical order for the bcp47 subtag is alphabetical, e.g., -dx-hani-thai
Dictionary-based break iterators will ignore each character whose Script_Extension value set intersects with the DX value set.
The code Zyyy (Common) can be specified to exclude all scripts, if and only if it is the only SCRIPT_CODE value specified. If it is not the only script code, Zyyy has the normal meaning: excluding Script_Extension=Common.
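The ordering, expansion, and Zyyy rules for the dx key can be sketched as follows. The compound expansions are taken from the Compound rows of the Special Script Codes table; the names `COMPOUND_SCRIPTS`, `canonical_dx`, and `expand_dx` are hypothetical, and returning `None` to mean "exclude all scripts" is a convention of this sketch only.

```python
# Compound script codes and the scripts they include (lowercase subtags).
COMPOUND_SCRIPTS = {
    "jpan": {"hani", "hira", "kana"},
    "hrkt": {"hira", "kana"},
    "kore": {"hani", "hang"},
    "hanb": {"hani", "bopo"},
}


def canonical_dx(values):
    """Return the dx keyword with its values in alphabetical order."""
    return "dx-" + "-".join(sorted(v.lower() for v in values))


def expand_dx(values):
    """Return the set of excluded scripts, or None if all scripts are excluded."""
    lowered = [v.lower() for v in values]
    if lowered == ["zyyy"]:
        return None  # Zyyy alone excludes all scripts
    expanded = set()
    for value in lowered:
        # Compound codes are replaced by their constituents; others are kept.
        expanded |= COMPOUND_SCRIPTS.get(value, {value})
    return expanded


print(canonical_dx(["thai", "hani"]))
print(sorted(expand_dx(["jpan"])))
print(expand_dx(["zyyy"]))
```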
Unicode Emoji Presentation Style Identifier
specifies a request for the preferred emoji presentation style. This can be used as part of the value for an HTML lang attribute, for example

Well-formed values match
uvalue
The valid values are those
name
attribute values in the
type
elements of key name="em" in bcp47/
variant.xml
em
Emoji presentation style
emoji
Use an emoji presentation for emoji characters if possible.
text
Use a text presentation for emoji characters if possible.
default
Use the default presentation for emoji characters as specified in UTR #51
Presentation Style
Unicode First Day Identifier
defines the preferred first day of the week for calendar display. Specifying "fw" in a locale identifier overrides the default value specified by supplemental
week data for the region (see Part 4 Dates,
Week Data
).
Well-formed values match
uvalue
The valid values are those
name
attribute values in the
type
elements
of key name="fw" in bcp47/
calendar.xml
The value can affect the computation of the first day of the week: see
First Day Overrides
fw
First day of week
sun
Sunday
mon
Monday
sat
Saturday
Unicode Hour Cycle Identifier
defines the preferred time cycle. Specifying "hc" in a locale identifier overrides the default value specified by supplemental time data for the region
(see Part 4 Dates,
Time Data
).
Well-formed values match
uvalue
The valid values are those
name
attribute values in the
type
elements of
key name="hc" in bcp47/
calendar.xml
hc
Hour cycle
h12
Hour system using 1–12; corresponds to 'h' in patterns
h23
Hour system using 0–23; corresponds to 'H' in patterns
h11
Hour system using 0–11; corresponds to 'K' in patterns
h24
Hour system using 1–24; corresponds to 'k' in patterns
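The four hour cycles differ only in how an hour of the day is numbered. The sketch below converts an hour in the range 0 to 23 into the number shown under each cycle; the names `HOUR_PATTERN_LETTER` and `display_hour` are hypothetical, and the pattern letters are the ones listed in the table above.

```python
# Pattern letter that corresponds to each hc value.
HOUR_PATTERN_LETTER = {"h12": "h", "h23": "H", "h11": "K", "h24": "k"}


def display_hour(hour, hc):
    """Map an hour of the day (0..23) to the number shown under an hour cycle."""
    if not 0 <= hour <= 23:
        raise ValueError("hour must be in 0..23")
    if hc == "h23":
        return hour             # 0..23
    if hc == "h24":
        return hour or 24       # 1..24, midnight shown as 24
    if hc == "h11":
        return hour % 12        # 0..11
    if hc == "h12":
        return hour % 12 or 12  # 1..12
    raise ValueError(f"unknown hour cycle: {hc!r}")


for hc in ["h12", "h23", "h11", "h24"]:
    print(hc, HOUR_PATTERN_LETTER[hc], [display_hour(h, hc) for h in (0, 12, 13, 23)])
```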
Unicode Line Break Style Identifier
defines a preferred line break style corresponding to the CSS level 3
line-break option
Specifying "lb" in a locale identifier overrides the locale’s default style (which may correspond to "normal" or "strict").
Well-formed values match
uvalue
The valid values are those
name
attribute values in the
type
elements of key name="lb" in bcp47/
segmentation.xml
lb
Line break style
strict
CSS level 3 line-break=strict, e.g. treat CJ as NS
normal
CSS level 3 line-break=normal, e.g. treat CJ as ID, break before hyphens for ja,zh
loose
CSS level 3 line-break=loose
Unicode Line Break Word Identifier
defines preferred line break word handling behavior corresponding to the CSS level 3
word-break option
Specifying "lw" in a locale identifier overrides the locale’s default style (which may correspond to "normal" or "keepall").
Well-formed values match
uvalue
The valid values are those
name
attribute values in the
type
elements of key name="lw" in bcp47/
segmentation.xml
lw
Line break word handling
normal
CSS level 3 word-break=normal, normal script/language behavior for midword breaks
breakall
CSS level 3 word-break=break-all, allow midword breaks unless forbidden by lb setting
keepall
CSS level 3 word-break=keep-all, prohibit midword breaks except for dictionary breaks
phrase
Prioritize keeping natural phrases (of multiple words) together when breaking, used in short text like title and headline
Unicode Measurement System Identifier
defines a preferred measurement system. Specifying "ms" in a locale identifier overrides the default value specified by supplemental measurement system data for the region
(see Part 2 General,
Measurement System Data
).
Well-formed values match
uvalue
The valid values are those
name
attribute values in the
type
elements of key name="ms" in bcp47/
measure.xml
The determination of preferred units depends on the locale identifier: the keys ms, mu, rg, the base locale (language, script, region) and the user preferences.
For information about preferred units and unit conversion, see
Unit Conversion
and
Unit Preferences
ms
Measurement system
metric
Metric System
ussystem
US System of measurement: feet, pints, etc.; pints are 16oz
uksystem
UK System of measurement: feet, pints, etc.; pints are 20oz
Measurement Unit Preference Override
defines an override for measurement unit preference.
Well-formed values match
uvalue
The valid values are those
name
attribute values in the
type
elements of key name="mu" in
bcp47/
measure.xml
For information about preferred units and unit conversion, see
Unit Conversion
and
Unit Preferences
mu
Measurement unit override
celsius
Celsius as temperature unit
kelvin
Kelvin as temperature unit
fahrenhe
Fahrenheit as temperature unit
Unicode Number System Identifier
defines a type of number system.
Well-formed values match
uvalue
The valid values are those
name
attribute values in the
type
elements of bcp47/
number.xml
nu
(numbers)
Numbering system
Unicode script subtag
Four-letter types indicating the primary numbering system for the corresponding script represented in Unicode. Unless otherwise specified, it is a decimal numbering system using digits [:GeneralCategory=Nd:]. For example, "latn" refers to the ASCII / Western digits 0-9, while "taml" is an algorithmic (non-decimal) numbering system. (The code "tamldec" indicates the "modern Tamil decimal digits".)
For more information, see
Numbering Systems
arabext
Extended Arabic-Indic digits ("arab" means the base Arabic-Indic digits)
armnlow
Armenian lowercase numerals
roman
Roman numerals
romanlow
Roman lowercase numerals
tamldec
Modern Tamil decimal digits
Region Override
specifies an alternate region to use for obtaining
certain region-specific default values (those specified by the

element), instead of using the region
specified by the
unicode_region_subtag
in the Unicode Language Identifier (or inferred from the
unicode_language_subtag
).
rg
Region Override
uszzzz
The valid values are a
unicode_subdivision_id
of type “unknown” or “regular”;
this consists of a
unicode_region_subtag
for a regular region (not a macroregion),
suffixed either by “zzzz” (case is not significant) to designate the region as a whole,
or by a unicode_subdivision_suffix to provide more specificity.
For example, “en-GB-u-rg-uszzzz” represents a locale for British English but with region-specific defaults set to US for items such as default currency, default calendar and week data, default time cycle, and default measurement system and unit preferences.
The determination of preferred units depends on the locale identifier: the keys ms, mu, rg, the base locale (language, script, region) and the user preferences.
The value can affect the computation of the first day of the week: see
First Day Overrides
For information about preferred units and unit conversion, see
Unit Conversion
and
Unit Preferences
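As a rough illustration of the "rg" value structure described above, the following sketch splits a value into its region part and its suffix. The function name `parse_rg_value` is hypothetical. The sketch assumes the region part is two letters or three digits and the suffix is one to four alphanumeric characters, and it checks shape only, not validity against CLDR data.

```python
import re

# Shape check only: a region (2 letters or 3 digits) followed by a
# 1-4 character suffix. This does not consult CLDR validity data.
_RG_VALUE = re.compile(r"([a-z]{2}|[0-9]{3})([a-z0-9]{1,4})")


def parse_rg_value(value):
    """Split an "rg" value into (region, suffix, is_whole_region)."""
    match = _RG_VALUE.fullmatch(value.lower())  # case is not significant
    if match is None:
        raise ValueError(f"not shaped like an rg value: {value!r}")
    region, suffix = match.group(1).upper(), match.group(2)
    # The suffix "zzzz" designates the region as a whole.
    return region, suffix, suffix == "zzzz"


print(parse_rg_value("uszzzz"))
print(parse_rg_value("GBSCT"))
```

A caller would still need to confirm that the region is a regular region (not a macroregion) and that any subdivision suffix is present in the validity data.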
Unicode Subdivision Identifier
defines a regional subdivision used for locales.
Well-formed values match
uvalue
The valid values are based on the
subdivisionContainment
element as described in
Section
3.6.5 Subdivision Codes
sd
Regional Subdivision
gbsct
unicode_subdivision_id
, which is a
unicode_region_subtag
concatenated with a unicode_subdivision_suffix.
For example,
gbsct
is “gb”+“sct” (where sct represents the subdivision code for Scotland). Thus “en-GB-u-sd-gbsct” represents the language variant “English as used in Scotland”. And both “en-u-sd-usca” and “en-US-u-sd-usca” represent “English as used in California”. See
3.6.5 Subdivision Codes
The value can affect the computation of the first day of the week: see
First Day Overrides
Unicode Sentence Break Suppressions Identifier
defines a set of data to be used for suppressing certain sentence breaks that would otherwise be found by UAX #29 rules.
Well-formed values match
uvalue
The valid values are those
name
attribute values in the
type
elements of key name="ss" in bcp47/
segmentation.xml
ss
Sentence break suppressions
none
Don’t use sentence break suppressions data (the default).
standard
Use sentence break suppressions data of type "standard"
Unicode Timezone Identifier
defines a timezone.
Well-formed values match
uvalue
The valid values are those name attribute values in the
type
elements of bcp47/
timezone.xml
tz
(timezone)
Time zone
Unicode short time zone IDs
Short identifiers defined in terms of a TZ time zone database [
Olson
] identifier in the common/bcp47/timezone.xml file, plus a few extra values.
For more information, see
Time Zone Identifiers
CLDR provides data for normalizing timezone codes.
Unicode Variant Identifier
defines a special variant used for locales.
Well-formed values match
uvalue
The valid values are those name attribute values in the
type
elements of bcp47/
variant.xml
va
Common variant type
posix
POSIX style locale variant. About handling of the "POSIX" variant see
Legacy Variants
For more information on the allowed keys and types, see the specific elements below, and
U Extension Data Files
Additional keys or types might be added in future versions. Implementations of LDML should be robust enough to handle any syntactically valid key or type values.
Numbering System Data
LDML supports multiple numbering systems. The identifiers for those numbering systems are defined in the file
bcp47/number.xml
. For example, for the latest version of the data see
bcp47/number.xml
Details about those numbering systems are defined in
supplemental/numberingSystems.xml
. For example, for the latest version of the data see
supplemental/numberingSystems.xml
LDML makes certain stability guarantees on this data:
Like other BCP 47 identifiers, once a numeric identifier is added to
bcp47/number.xml
or
numberingSystems.xml
, it will never be removed from either of those files.
If an identifier has type="numeric" in numberingSystems.xml, then
It is a decimal, positional numbering system with an attribute
digits=X
, where
X is a string with the 10 digits, in order, used by the numbering system.
The values of the type and digits will never change.
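Because a numeric numbering system is decimal and positional, its digits string is enough to render an integer. The sketch below illustrates this under that assumption; the function name `format_with_digits` and the constants are hypothetical, and the Arabic-Indic digits are built from code points U+0660 to U+0669 rather than read from numberingSystems.xml.

```python
def format_with_digits(value, digits):
    """Render a non-negative integer using a 10-character digits string."""
    if len(digits) != 10:
        raise ValueError("a numeric numbering system has exactly 10 digits")
    if value < 0:
        raise ValueError("this sketch handles non-negative integers only")
    # str(value) yields ASCII digits; each one indexes into the digits string.
    return "".join(digits[int(ch)] for ch in str(value))


LATN = "0123456789"
# Arabic-Indic digits U+0660..U+0669, built from code points to avoid typos.
ARAB = "".join(chr(0x0660 + i) for i in range(10))

print(format_with_digits(2025, LATN))
print(format_with_digits(2025, ARAB))
```

Algorithmic systems such as "roman" or "taml" cannot be handled this way, since they have no digits string.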
Time Zone Identifiers
LDML inherits time zone IDs from the tz database [
Olson
]. Because these IDs from the tz database do not satisfy the BCP 47 language subtag syntax requirements, CLDR defines short identifiers for use in the Unicode locale extension. The short identifiers are defined in the file
common/bcp47/timezone.xml
The short identifiers use UN/LOCODE [
LOCODE
] (excluding a space character) codes where possible. For example, the short identifier for "America/Los_Angeles" is "uslax" (the LOCODE for Los Angeles, US is "US LAX"). Identifiers of length not equal to 5 are used where there is no corresponding UN/LOCODE, such as "usnavajo" for "America/Shiprock", or "utcw01" for "Etc/GMT+1", so that they do not overlap with future UN/LOCODE.
Although the first two letters of a short identifier may match an ISO 3166 two-letter country code, a user should not assume that the time zone belongs to the country. The first two letters in an identifier of length not equal to 5 have no meaning. Also, the identifiers are stabilized, meaning that they will not change no matter what changes happen in the base standard. So if Hawaii leaves the US and joins Canada as a new province, the short time zone identifier "ushnl" would not change in CLDR even if the UN/LOCODE changes to "cahnl" or something else.
There is a special code "unk" for an Unknown or Invalid time zone. This can be expressed in the tz database style ID "Etc/Unknown", although it is not defined in the tz database.
Stability of Time Zone Identifiers
Although the short time zone identifiers are guaranteed to be stable, the preferred IDs in the tz database (as those found in
zone.tab
file) might be changed from time to time. For example, "Asia/Calcutta" was replaced with "Asia/Kolkata" and moved to the
backward
file in the tz database. Because CLDR contains locale data that uses time zone IDs from the tz database as keys, the stability of those IDs is critical.
To maintain the stability of "long" IDs (for those inherited from the tz database), a special rule is applied to the
alias
attribute in the

element for "tz": the first "long" ID is the CLDR canonical "long" time zone ID. In addition, the
iana
attribute specifies the preferred ID in the tz database if it's different from the CLDR canonical "long" ID.
For example:

The above

element defines the short time zone ID "inccu" (for use in the Unicode locale extension), the corresponding
CLDR canonical "long" ID
"Asia/Calcutta", and an alias "Asia/Kolkata". In the tz database, the preferred ID for this time zone is "Asia/Kolkata".
Links in the tz database
Not all TZDB links are in CLDR aliases.
CLDR purposefully does not exactly match the Link structure in the TZDB.
The links are maintained in the TZDB, and it would duplicate information that could fall out of sync (especially because the TZDB can be updated many times in a single month).
The TZDB went through a change a few years ago where it dropped the mappings to countries (regions), whereas CLDR still maintains that distinction.
Because several different time zones in the TZDB all link together, matching the Link structure would make a single long ID an alias for several different short identifiers.
CLDR doesn't alias across country boundaries because countries are useful for timezone selection.
Even if, for example, Serbia and Croatia share the same rules, CLDR maintains the difference so that the user can either pick "Serbia time" or "Croatia time".
The Croat is not forced to pick "Serbia time" (Europe/Belgrade) nor the Serb forced to pick “Croatia time” (Europe/Zagreb).
U Extension Data Files
The 'u' extension data is stored in multiple XML files located under the common/bcp47 directory in CLDR. Each file contains the locale extension key/type values and their backward compatibility mappings appropriate for a particular domain.
common/bcp47/collation.xml
contains key/type values for collation, including optional collation parameters and valid type values for each key.
The 't' extension data is stored in
common/bcp47/transform.xml

The extension attribute in

element specifies the BCP 47 language tag extension type. The default value of the extension attribute is "u" (Unicode locale extension). The

element is only applicable to the enclosing

In the Unicode locale extension 'u' and 't' data files, the common attributes for the

and

elements are as follows:
name
The key or type name used by Unicode locale extension with
'u' extension syntax
or the 't' extensions syntax. When
alias
below is absent, this name can also be used with the old style
"@key=type" syntax
Most type names are
literal type names
, which match exactly the same value. All of these have at least one lowercase letter, such as "buddhist". There are a small number of
indirect type names
, such as "RG_KEY_VALUE". These have no lowercase letters. The interpretation of each one is listed below.
CODEPOINTS
The type name
"CODEPOINTS"
is reserved for a variable representing Unicode code point(s). The syntax is:
EBNF
codepoints
= codepoint (sep codepoint)?
codepoint
= [0-9 A-F a-f]{4,6}
In addition, no codepoint may exceed 10FFFF. For example, "00A0", "300b", "10D40C" and "00C1-00E1" are valid, but "A0", "U060C" and "110000" are not.
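The CODEPOINTS syntax and the upper bound can be checked with a short validator. This is a minimal sketch: the function name `is_valid_codepoints` is hypothetical, and it accepts one or more hyphen-separated code points. The EBNF as shown permits at most two, while the prose for "vt" mentions longer sequences, so the looser reading used here is an assumption.

```python
import re

# One code point: 4 to 6 hexadecimal digits, upper or lower case.
_CODEPOINT = re.compile(r"[0-9A-Fa-f]{4,6}")


def is_valid_codepoints(value):
    """Check a hyphen-separated sequence of code points for well-formedness."""
    return all(
        _CODEPOINT.fullmatch(part) is not None and int(part, 16) <= 0x10FFFF
        for part in value.split("-")
    )


for sample in ["00A0", "300b", "10D40C", "00C1-00E1", "A0", "U060C", "110000"]:
    print(sample, is_valid_codepoints(sample))
```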
In the current version of CLDR, the type "CODEPOINTS" is only used for the deprecated locale extension key "vt" (variableTop). The subtags forming the type for "vt" represent an arbitrary string of characters. There is no formal limit on the number of characters, although practically anything above 1 will be rare, and anything longer than 4 might be useless. Repetition is allowed; for example, 0061-0061 ("aa") is a valid type value for "vt", since the sequence may be a collating element. Order is vital: 0061-0062 ("ab") is different from 0062-0061 ("ba"). Note that for variableTop any character sequence must be a contraction which yields exactly one primary weight.
For example,
en-u-vt-00A4
: this indicates English, with any characters sorting at or below " ¤" (at a primary level) considered Variable.
By default in UCA, variable characters are ignored in sorting at a primary, secondary, and tertiary level. But in CLDR, they are not ignorable by default. For more information, see
Collation:
Setting Options
REORDER_CODE
The type name
"REORDER_CODE"
is reserved for reordering block names (e.g. "latn", "digit" and "others") defined in the
Root Collation
. The type "REORDER_CODE" is used for locale extension key "kr" (colReorder). The value of type for "kr" is represented by one or more reordering block names such as "latn-digit". For more information, see
Collation:
Collation Reordering
RG_KEY_VALUE
The type name
"RG_KEY_VALUE"
is reserved for region codes in the format required by the "rg" key; this is a subdivision code with idStatus='unknown' or 'regular' from the idValidity data in common/validity/subdivision.xml.
SCRIPT_CODE
The type name
"SCRIPT_CODE"
is reserved for
unicode_script_subtag
values (e.g. "thai", "laoo"). The type "SCRIPT_CODE" is used for locale extension key "dx". The value of type for "dx" is represented by one or more SCRIPT_CODEs, such as "thai-laoo".
SUBDIVISION_CODE
The type name
"SUBDIVISION_CODE"
is reserved for subdivision codes in the format required by the "sd" key; this is a subdivision code from the idValidity data in common/validity/subdivision.xml, excluding those with idStatus='unknown'. Codes with idStatus='deprecated' should not be generated, and those with idStatus='private_use' are only to be used with prior agreement.
PRIVATE_USE
The type name
"PRIVATE_USE"
is reserved for private use types. A valid type value is composed of one or more subtags separated by hyphens and each subtag consists of three to eight ASCII alphanumeric characters. In the current version of CLDR,
"PRIVATE_USE"
is only used for transform extension "x0".
valueType
The
valueType
attribute indicates how many subtags are valid for a given key:
Value
Description
single
Either exactly one type value, or no type value (but only if the value of "true" would be valid). This is the default if no valueType attribute is present.
incremental
Multiple type values are allowed, but only if a prefix is also present, and the sequence is explicitly listed. Each successive type value indicates a refinement of its prefix. For example:

Thus
ca-islamic-umalqura
is valid. However,
ca-gregory-japanese
is not valid, because "gregory-japanese" is not listed as a type.
multiple
Multiple type values are allowed, but each may only occur once. For example:

any
Any number of type values are allowed, with none of the above restrictions. For example:

description
The description of the
key
type
or
attribute
element. There is also some informative text about certain keys and types in the
Key And Type Definitions
deprecated
The deprecation status of the
key
type
or
attribute
element. The value
"true"
indicates the element is deprecated and no longer used in the version of CLDR. The default value is
"false"
preferred
The preferred value of the deprecated
key
type
or
attribute
element. When a
key
type
or
attribute
element is deprecated, this attribute is used for specifying a new canonical form if available.
alias
(Not applicable to

The BCP 47 form is the canonical form, and recommended. Other aliases are included only for backwards compatibility.
Example:

The preferred term, and the only one to be used in BCP 47, is the name: in this example, "phonebk".
The alias is a key or type name used by Unicode locale extensions with the old
"@key=type" syntax
. The attribute value for type may contain multiple names delimited by ASCII space characters. Of those aliases, the first name is the preferred value.
since
The version of CLDR in which this key or type was introduced. Absence of this attribute value implies the key or type was available in CLDR 1.7.2.
Note: There are no values defined for the locale extension attribute in the current CLDR release.
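The four valueType settings described in the attribute list can be illustrated with a small checker. This is a sketch under stated assumptions: the function name `is_valid_type_value` and the sets of listed types are hypothetical, only literal type names are handled (not indirect ones such as CODEPOINTS), and an omitted value is modeled only for "single".

```python
def is_valid_type_value(value_type, listed, subtags):
    """Check the type subtags that follow a key against the key's valueType.

    listed  -- set of type names listed for the key (hypothetical data)
    subtags -- list of type subtags; empty means the value was omitted
    """
    if value_type == "single":
        if not subtags:
            return "true" in listed  # an omitted value stands for "true"
        return len(subtags) == 1 and subtags[0] in listed
    if not subtags:
        return False  # omitted values are not modeled for other valueTypes
    if value_type == "incremental":
        # Every prefix, and the full sequence, must be explicitly listed.
        prefixes = ["-".join(subtags[:n]) for n in range(1, len(subtags) + 1)]
        return all(prefix in listed for prefix in prefixes)
    if value_type == "multiple":
        # Each value must be listed and may occur only once.
        return len(set(subtags)) == len(subtags) and all(s in listed for s in subtags)
    if value_type == "any":
        return all(s in listed for s in subtags)
    raise ValueError(f"unknown valueType: {value_type!r}")


# Hypothetical subset of calendar types.
CA_TYPES = {"gregory", "japanese", "islamic", "islamic-umalqura"}
print(is_valid_type_value("incremental", CA_TYPES, ["islamic", "umalqura"]))
print(is_valid_type_value("incremental", CA_TYPES, ["gregory", "japanese"]))
```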
For example,

...

...

The data above indicates:
type "pinyin" is valid for key "co", thus "u-co-pinyin" is a valid Unicode locale extension.
type "pinyin" is not valid for key "ka", thus "u-ka-pinyin" is not a valid Unicode locale extension.
type "pinyin" has no
alias
, so "zh@collation=pinyin" is a valid Unicode locale identifier according to the old syntax.
type "noignore" has an alias attribute, so "en@colAlternate=noignore" is not a valid Unicode locale identifier according to the old syntax.
type "aumel" is valid for key "tz", supported by CLDR 1.7.2 (default value) or later versions.
type "aumqi" is valid for key "tz", supported by CLDR 1.8.1 or later versions.
It is strongly recommended that all API methods accept all possible aliases for keywords and types, but generate the canonical form. For example, "ar-u-ca-islamicc" would be equivalent to "ar-u-ca-islamic-civil" on input, but the latter should be output. The one exception is where an alias would only be well-formed with the old syntax, such as "gregorian" (for "gregory").
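The "accept aliases, generate the canonical form" recommendation might look like the following for a single key/type pair. The alias table is a hypothetical two-entry subset built from the examples in this section, and `canonicalize_type` is an illustrative name rather than an API of any library.

```python
# Hypothetical subset of alias data, keyed by (key, alias type).
TYPE_ALIASES = {
    ("ca", "islamicc"): "islamic-civil",
    ("ca", "gregorian"): "gregory",  # alias that is only well-formed in the old syntax
}


def canonicalize_type(key, type_value):
    """Return the canonical type for a key, accepting known aliases on input."""
    key = key.lower()
    type_value = type_value.lower()  # subtags are case-insensitive
    return TYPE_ALIASES.get((key, type_value), type_value)


print(canonicalize_type("ca", "islamicc"))
print(canonicalize_type("ca", "islamic-civil"))
```

Input in either form produces the same output, so downstream code only ever sees the canonical type.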
In the Unicode locale extension 'u' data files,

element has an optional attribute below:
iana
This attribute is used by
tz
types for specifying preferred zone ID in the IANA time zone database.
Subdivision Codes
The subdivision codes designate a subdivision of a country or region. They are called various names, such as a
state
in the United States, or a
province
in Canada. The codes in CLDR are based on ISO 3166-2 subdivision codes. The ISO codes have a region code followed by a hyphen, then a suffix consisting of 1..3 ASCII letters or digits.
The CLDR codes are designed to work in a
unicode_locale_id
(BCP 47), and are thus all lowercase, with no hyphen. For example, the following are valid, and mean “English as used in California, USA”.
en-u-sd-
usca
en-US-u-sd-
usca
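The relationship between the two code formats can be sketched as a simple conversion: lowercase the ISO 3166-2 code and drop the hyphen. The function name `iso_to_cldr_subdivision` is hypothetical, and the sketch does not account for codes where CLDR deliberately diverges from ISO for stability.

```python
def iso_to_cldr_subdivision(iso_code):
    """Convert an ISO 3166-2 style code such as "US-CA" to the CLDR form "usca"."""
    region, hyphen, suffix = iso_code.partition("-")
    if not (region and hyphen and suffix):
        raise ValueError(f"expected REGION-SUFFIX, got {iso_code!r}")
    return (region + suffix).lower()


print(iso_to_cldr_subdivision("US-CA"))
print(iso_to_cldr_subdivision("GB-SCT"))
```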
CLDR has additional subdivision codes. These may start with a 3-digit region code or use a suffix of 4 ASCII letters or digits, so they will not collide with the ISO codes. Subdivision codes for unknown values are the region code plus "zzzz", such as "uszzzz" for an unknown subdivision of the US. Other codes may be added for stability.
Like BCP 47, CLDR requires stable codes, which are not guaranteed for ISO 3166-2 (nor have the ISO 3166-2 codes been stable in the past). If an ISO 3166-2 code is removed, it remains valid (though marked as deprecated) in CLDR. If an ISO 3166-2 code is reused (for the same region), then CLDR will define a new equivalent code using these as 4-character suffixes.
Validity
unicode_subdivision_id
is only valid when it is present in the subdivision.xml file as described in
Validity Data
. The data is in a compressed form, and thus needs to be expanded before such a test is made.
Examples:
usca
is valid — there is an
id
element
… usca …
ussct
is invalid — there is no
id
element
… ussct …
If a
unicode_locale_id
contains both a
unicode_region_subtag
and a
unicode_subdivision_id
, it is only valid if the
unicode_subdivision_id
starts with the
unicode_region_subtag
(case-insensitively).
It is recommended that a
unicode_locale_id
contain a
unicode_region_subtag
if it contains a
unicode_subdivision_id
and the region would not be added by adding likely subtags. That produces better behavior if the
unicode_subdivision_id
is ignored by an implementation or if the language tag is truncated.
Examples:
en-
US
-u-sd-
us
ca is valid — the region "US" matches the first part of "usca"
en-u-sd-
us
ca is valid — it still works after adding likely subtags.
en-
CA
-u-sd-
gb
sct is invalid — the region "CA" does not match the first part of "gbsct". An implementation should disregard the subdivision id (or return an error).
en-u-sd-
gb
sct is valid but not recommended — an implementation that ignores the
unicode_subdivision_id
can get the wrong fallback behavior, or could add likely subtags and get the invalid en-
Latn-US
-u-sd-
gb
sct
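The region/subdivision consistency rule illustrated by these examples can be sketched as below. The parser is deliberately simplified and is an assumption for illustration: it expects hyphen separators, treats the first two-letter or three-digit subtag after the language as the region, and only looks at a "u" extension that is the first extension in the tag. The function name `subdivision_matches_region` is hypothetical.

```python
import re


def subdivision_matches_region(locale_id):
    """Return False only when an explicit region contradicts the "sd" value."""
    subtags = locale_id.lower().split("-")
    # The language identifier part ends at the first single-character subtag.
    ext_start = next((i for i, s in enumerate(subtags) if len(s) == 1), len(subtags))
    region = next(
        (s for s in subtags[1:ext_start] if re.fullmatch(r"[a-z]{2}|[0-9]{3}", s)),
        None,
    )
    sd = None
    if ext_start < len(subtags) and subtags[ext_start] == "u":
        ext = subtags[ext_start + 1:]
        for i, subtag in enumerate(ext):
            if len(subtag) == 1:
                break  # start of another extension
            if subtag == "sd" and i + 1 < len(ext):
                sd = ext[i + 1]
                break
    if region is None or sd is None:
        return True  # nothing to compare
    return sd.startswith(region)  # comparison is case-insensitive after lower()


print(subdivision_matches_region("en-US-u-sd-usca"))
print(subdivision_matches_region("en-CA-u-sd-gbsct"))
```

This checks only the prefix rule; whether the subdivision code itself is valid still depends on the validity data.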
In version 28.0, the subdivisions in the validity files used the ISO format, uppercase with a hyphen separating two components, instead of the BCP 47 format.
Unicode BCP 47 T Extension
The Unicode Consortium has registered and is the maintaining authority for two BCP 47 language tag extensions: the extension 'u' for Unicode locale extension [
RFC6067
] and extension 't' for transformed content [
RFC6497
]. The Unicode BCP 47 extension data defines the complete list of valid subtags. While the title of the RFC is “Transformed Content”, the abstract makes it clear that the scope is broader than the term "transformed" might indicate to a casual reader: “including content that has been transliterated, transcribed, or translated, or
in some other way influenced by the source. It also provides for additional information used for identification.”
The -t- Extension.
The syntax of 't' extension subtags is defined by the rule
transformed_extensions
in
Unicode locale identifier
, except that the subtag separator
sep
must always be a hyphen '-' when the extension is used as part of a BCP 47 language tag. For information about the registration process, meaning, and usage of the 't' extension, see [
RFC6497
].
These subtags are all in lowercase (the canonical casing for these subtags); however, subtags are case-insensitive and casing does not carry any specific meaning. All subtags within the Unicode extensions consist of two to eight alphanumeric characters and meet the rule
extension
in the [
BCP47
].
The following keys are defined for the -t- extension.
Well-formed values match
tvalue
Keys
Description
Valid Values in latest release
m0
Transform extension mechanism:
to reference an authority or rules for a type of transformation
transform.xml
s0, d0
Transform source/destination:
for non-languages/scripts, such as fullwidth-halfwidth conversion.
transform-destination.xml
i0
Input Method Engine transform:
Used to indicate an input method transformation, such as one used by a client-side input method. The first subfield in a sequence would typically be a 'platform' or vendor designation.
transform_ime.xml
k0
Keyboard transform:
Used to indicate a keyboard transformation, such as one used by a client-side virtual keyboard. The first subfield in a sequence would typically be a 'platform' designation, representing the platform that the keyboard is intended for. The keyboard might or might not correspond to a keyboard mapping shipped by the vendor for the platform. One or more subsequent fields may occur, but are only added where needed to distinguish from others.
transform_keyboard.xml
t0
Machine Translation:
Used to indicate content that has been machine translated, or a request for a particular type of machine translation of content. The first subfield in a sequence would typically be a 'platform' or vendor designation.
transform_mt.xml
h0
Hybrid Locale Identifiers:
h0 with the value 'hybrid' indicates that the -t- value is a language that is mixed into the main language tag to form a hybrid. For more information, and examples, see
Hybrid Locale Identifiers
transform_hybrid.xml
x0
Private use transform
transform_private_use.xml
T Extension Data Files
The overall structure of the data files is similar to that of the U Extension, with the following exceptions.
In the transformed content 't' data file, the
name
attribute in a

element defines a valid field separator subtag. The
name
attribute in an enclosed

element defines a valid field subtag for the field separator subtag. For example:

The data above indicates:
"m0" is a valid field separator for the transformed content extension 't'.
field subtag "ungegn" is valid for field separator "m0".
field subtag "ungegn" was introduced in CLDR 21.
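Reading such a data file might be sketched as follows. The XML string is hypothetical: it is reconstructed from the three statements above, its wrapper element and description text are assumptions, and it is not copied from transform.xml. The defaults used for a missing extension attribute ("u") and a missing since attribute ("1.7.2") follow the descriptions in U Extension Data Files.

```python
import xml.etree.ElementTree as ET

# Hypothetical data reconstructed from the prose; not copied from transform.xml.
SAMPLE = """
<keyword>
  <key extension="t" name="m0" description="Transform extension mechanism">
    <type name="ungegn" description="hypothetical description" since="21"/>
  </key>
</keyword>
"""


def load_keys(xml_text):
    """Map (extension, key name) to a dict of type name -> CLDR version introduced."""
    table = {}
    for key in ET.fromstring(xml_text.strip()).iter("key"):
        extension = key.get("extension", "u")  # "u" is the default extension
        types = {t.get("name"): t.get("since", "1.7.2") for t in key.iter("type")}
        table[(extension, key.get("name"))] = types
    return table


print(load_keys(SAMPLE))
```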
The attributes are:
name
The name of the mechanism, limited to 3-8 characters (or sequences of them). Any indirect type names are listed in 3.6.4
U Extension Data Files
description
A description of the name, with all and only that information necessary to distinguish one name from others with which it might be confused. Descriptions are not intended to provide general background information.
since
Indicates the first version of CLDR where the name appears. (Required for new items.)
alias
Alternative name, not limited in number of characters. Aliases are intended for compatibility, not to provide all possible alternate names or designations.
(Optional)
For information about the registration process, meaning, and usage of the 't' extension, see [
RFC6497
].
Compatibility with Older Identifiers
LDML versions before 1.7.2 used a slightly different syntax for variant subtags and locale extensions. Implementations of LDML may provide backward compatible identifier support as described in the following sections.
Old Locale Extension Syntax
The LDML 1.7 and older specifications used a different syntax for representing Unicode locale extensions. The previous definition of Unicode locale extensions had the following structure:
EBNF
old_unicode_locale_extensions
= "@" old_key "=" old_type
(";" old_key "=" old_type)*
The new specification mandates keys to be two alphanumeric characters and types to be three to eight alphanumeric characters. As a result, new codes were assigned to all existing keys and some types. For example, a new key "co" replaced the previous key "collation", and a new type "phonebk" replaced the previous type "phonebook". However, the existing collation type "big5han" already satisfied the new requirement, so no new type code was assigned to the type. All new keys and types introduced after LDML 1.7 satisfy the new requirement, so they do not have aliases dedicated for the old syntax, except time zone types. The conversion between old types and new types can be done regardless of key, with one known exception (old type "traditional" is mapped to new type "trad" for collation and "traditio" for numbering system), and this relationship will be maintained in future versions unless otherwise noted.
The new specification introduced a new field
attribute
in addition to key/type pairs in the Unicode locale extension. When it is necessary to map a new Unicode locale identifier with
attribute
field to a well-formed old locale identifier, a special key name
attribute
with the value of entire
attribute
subtags in the new identifier is used. For example, a new identifier
ja-u-xxx-yyy-ca-japanese
is mapped to an old identifier
ja@attribute=xxx-yyy;calendar=japanese
The chart below shows some example mappings between the new syntax and the old syntax.
Table:
Locale Extension Mappings
Old (LDML 1.7 or older)
New
de_DE@collation=phonebook
de_DE_u_co_phonebk
zh_Hant_TW@collation=big5han
zh_Hant_TW_u_co_big5han
th_TH@calendar=gregorian;numbers=thai
th_TH_u_ca_gregory_nu_thai
en_US_POSIX@timezone=America/Los_Angeles
en_US_u_tz_uslax_va_posix
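The rows of the table can be reproduced with a small converter. This is a sketch under stated assumptions: the alias dictionaries are a hypothetical subset covering only the rows above (real data lives under common/bcp47), the function name `old_to_new` is illustrative, keys are emitted in alphabetical order as in the table, and the "traditional" exception is not handled.

```python
# Hypothetical subset of the old-to-new key and type mappings.
KEY_ALIASES = {"collation": "co", "calendar": "ca", "numbers": "nu", "timezone": "tz"}
TYPE_ALIASES = {
    "phonebook": "phonebk",
    "gregorian": "gregory",
    "America/Los_Angeles": "uslax",
}


def old_to_new(old_id):
    """Convert an old-syntax locale ID to the new syntax, in underscore form."""
    base, _, extension = old_id.partition("@")
    parts = base.split("_")
    fields = {}
    if parts[-1] == "POSIX":
        # The legacy POSIX variant becomes the "va" key with type "posix".
        parts.pop()
        fields["va"] = "posix"
    for pair in filter(None, extension.split(";")):
        old_key, _, old_type = pair.partition("=")
        fields[KEY_ALIASES.get(old_key, old_key)] = TYPE_ALIASES.get(old_type, old_type)
    if fields:
        parts.append("u")
        for key in sorted(fields):
            parts += [key, fields[key]]
    return "_".join(parts)


print(old_to_new("th_TH@calendar=gregorian;numbers=thai"))
print(old_to_new("en_US_POSIX@timezone=America/Los_Angeles"))
```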
Where the old API is supplied the bcp47 language code, or vice versa, the recommendation is to:
Have all methods that take the old syntax also take the new syntax, interpreted correctly. For example, "zh-TW-u-co-pinyin" and "zh_TW@collation=pinyin" would both be interpreted as meaning the same.
Have all methods (both for old and new syntax) accept all possible aliases for keywords and types. For example, "ar-u-ca-islamicc" would be equivalent to "ar-u-ca-islamic-civil".
The one exception is where an alias would only be well-formed with the old syntax, such as "gregorian" (for "gregory").
Where an API cannot successfully accept the alternate syntax, throw an exception (or otherwise indicate an error) so that people can detect that they are using the wrong method (or wrong input).
Provide a method that tests a purported locale ID string to determine its status:
well-formed
- syntactically correct
valid
- well-formed and only uses registered language subtags, extensions, keywords, types...
canonical
- valid and no deprecated codes or structure.
Legacy Variants
Older versions of the LDML specification allowed codes other than registered [
BCP47
] variant subtags used in Unicode language and locale identifiers for representing variations of locale data. Unicode locale identifiers including such variant codes can be converted to the new [
BCP47
] compatible identifiers by following the descriptions below:
Table:
Legacy Variant Mappings
Variant Code
Description
AALAND
Åland, variant of "
sv
" Swedish used in Finland. Use
sv_AX
to indicate this.
BOKMAL
Bokmål, variant of "
no
" Norwegian. Use primary language subtag "
nb
" to indicate this.
NYNORSK
Nynorsk, variant of "
no
" Norwegian. Use primary language subtag "
nn
" to indicate this.
POSIX
POSIX variation of locale data. Use Unicode locale extension
-u-va-posix
to indicate this.
POLYTONI
Polytonic, variant of "
el
" Greek. Use [
BCP47
] variant subtag
polyton
to indicate this.
SAAHO
The Saaho variant of Afar. Use primary language subtag "
ssy
" to indicate this.
When converting to old syntax, the Unicode locale extension "
-u-va-posix
" should be converted to the "
POSIX
" variant,
not
to old extension syntax like "
@va=posix
". This is an exception: The other mappings above should not be reversed.
Examples:
en_US_POSIX
en-US-u-va-posix
en_US_POSIX@colNumeric=yes
en-US-u-kn-va-posix
en-US-POSIX-u-kn-true
en-US-u-kn-va-posix
en-US-POSIX-u-kn-va-posix
en-US-u-kn-va-posix
👉 Note that the mapping between
en_US_POSIX
and
en-US-u-va-posix
is a conversion process, not a canonicalization process.
Relation to OpenI18n
The locale id format generally follows the description in the
OpenI18N Locale Naming Guideline [
NamingGuideline
], with some enhancements. The main differences from those guidelines are that the locale id:
does not include a charset (since the data in LDML format always provides a representation of all Unicode characters. The repository is stored in UTF-8, although that can be transcoded to other encodings as well.)
adds the ability to have a variant, as in Java
adds the ability to discriminate the written language by script (or script variant).
is a superset of [
BCP47
] codes.
Transmitting Locale Information
In a world of on-demand software components, with arbitrary connections between those components, it is important to get a sense of where localization should be done, and how to transmit enough information so that it can be done at that appropriate place. End-users need to get messages localized to their languages, messages that not only contain a translation of text, but also contain variables such as date, time, number formats, and currencies formatted according to the users' conventions. The strategy for doing the so-called
JIT localization
is made up of two parts:
Store and transmit
neutral-format
data wherever possible.
Neutral-format data is data that is kept in a standard format, no matter what the local user's environment is. Neutral-format is also (loosely) called
binary data
, even though it actually could be represented in many different ways, including a textual representation such as in XML.
Such data should use accepted standards where possible, such as for currency codes.
Textual data should also be in a uniform character set (Unicode/10646) to avoid possible data corruption problems when converting between encodings.
Localize that data as "
close
" to the end-user as possible.
There are a number of advantages to this strategy. The longer the data is kept in a neutral format, the more flexible the entire system is. On a practical level, if transmitted data is neutral-format, then it is much easier to manipulate the data, debug the processing of the data, and maintain the software connections between components.
Once data has been localized into a given language, it can be quite difficult to programmatically convert that data into another format, if required. This is especially true if the data contains a mixture of translated text and formatted variables. Once information has been localized into, say, Romanian, it is much more difficult to localize that data into, say, French. Parsing is more difficult than formatting, and may run up against different ambiguities in interpreting text that has been localized, even if the original translated message text is available (which it may not be).
Moreover, the closer we are to the end-user, the more we know about that user's preferred formats. If we format dates, for example, at the user's machine, then it can easily take into account any customizations that the user has specified. If the formatting is done elsewhere, either we have to transmit whatever user customizations are in play, or we only transmit the user's locale code, which may only approximate the desired format. Thus the closer the localization is to the end user, the less we need to ship all of the user's preferences around to all the places that localization could possibly need to be done.
Even though localization should be done as close to the end-user as possible, there will be cases where different components need to be aware of whatever settings are appropriate for doing the localization. Thus information such as a locale code or time zone needs to be communicated between different components.
Message Formatting and Exceptions
Windows (
FormatMessage,
String.Format
), Java (
MessageFormat
) and ICU (
MessageFormat,
umsg
) all provide methods of formatting variables (dates, times, etc) and inserting them at arbitrary positions in a string. This avoids the manual string concatenation that causes severe problems for localization. The question is, where to do this? It is especially important since the original code site that originates a particular message may be far down in the bowels of a component, and passed up to the top of the component with an exception. So we will take that case as representative of this class of issues.
There are circumstances where the message can be communicated with a language-neutral code, such as a numeric error code or mnemonic string key, that is understood outside of the component. If there are arguments that need to accompany that message, such as a number of files or a datetime, those need to accompany the numeric code so that when the localization is finally done at some point, the full information can be presented to the end-user. This is the best case for localization.
More often, the exact messages that could originate from within the component are not known outside of the component itself; or at least they may not be known by the component that is finally displaying text to the user. In such a case, the information as to the user's locale needs to be communicated in some way to the component that is doing the localization. That locale information does not necessarily need to be communicated deep within the component; ideally, any exceptions should bundle up some language-neutral message ID, plus the arguments needed to format the message (for example, datetime), but not do the localization at the throw site. This approach has the advantages noted above for JIT localization.
In addition, exceptions are often caught at a higher level; they do not end up being displayed to any end-user at all. Avoiding localization at the throw site also avoids the cost of doing formatting when that formatting is not really necessary. In fact, in many running programs most of the exceptions that are thrown at a low level never end up being presented to an end-user, so this can have considerable performance benefits.
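The pattern described above can be sketched as follows. This is a minimal illustration, not an API from any of the libraries mentioned: the `LocalizableError` class, the `CATALOGS` table, and the message ID `files_missing` are all hypothetical, and real message formatting would also need plural and number handling.

```python
from string import Template

# Hypothetical message catalogs: locale -> language-neutral message ID -> pattern.
CATALOGS = {
    "en": {"files_missing": "$count files could not be found."},
    "fr": {"files_missing": "$count fichiers sont introuvables."},
}


class LocalizableError(Exception):
    """Carries a message ID plus raw arguments; no text is formatted at the throw site."""

    def __init__(self, message_id, **arguments):
        super().__init__(message_id)
        self.message_id = message_id
        self.arguments = arguments


def localize(error, locale):
    """Format the message only when (and if) it is actually shown to a user."""
    pattern = CATALOGS[locale][error.message_id]
    return Template(pattern).substitute(error.arguments)


def low_level_operation():
    # Throw site: no locale knowledge and no formatting cost.
    raise LocalizableError("files_missing", count=3)


try:
    low_level_operation()
except LocalizableError as err:
    # Only the component that displays text needs to know the user's locale.
    message = localize(err, "fr")
```

If the exception is caught and handled without being displayed, `localize` is never called and no formatting work is done.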
Unicode Language and Locale IDs
People have very slippery notions of what distinguishes a language code from a locale code. The problem is that both are somewhat nebulous concepts.
In practice, many people use [
BCP47
] codes to mean locale codes instead of strictly language codes. It is easy to see why this came about; because [
BCP47
] includes an explicit region (territory) code, for most people it was sufficient for use as a locale code as well. For example, when typical web software receives a [
BCP47
] code, it will use it as a locale code. Other typical software will do the same: in practice, language codes and locale codes are treated interchangeably. Some people recommend distinguishing on the basis of "-" versus "_" (for example,
zh-TW
for language code,
zh_TW
for locale code), but in practice that does not work because of the free variation out in the world in the use of these separators. Notice that Windows, for example, uses "-" as a separator in its locale codes. So pragmatically one is forced to treat "-" and "_" as equivalent when interpreting either one on input.
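A minimal sketch of treating the two separators as equivalent on input is shown below. The function name is hypothetical, and mapping to "-" rather than "_" is an arbitrary choice for illustration.

```python
def normalize_separators(code: str) -> str:
    """Map '_' to '-' so that both separator styles compare equal."""
    return code.replace("_", "-")


# Both spellings identify the same thing once normalized.
same = normalize_separators("zh_TW") == normalize_separators("zh-TW")
```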
Another reason for the conflation of these codes is that
very
little data in most systems is distinguished by region alone; currency codes and measurement systems being some of the few. Sometimes date or number formats are mentioned as regional, but that really does not make much sense. If people see the sentence "You will have to adjust the value to १,२३४.५६७ from ૭૧,૨૩૪.૫૬" (using Indic digits), they would say that sentence is simply not English. Number format is far more closely associated with language than it is with region. The same is true for date formats: people would never expect to see intermixed a date in the format "2003年4月1日" (using Kanji) in text purporting to be purely English. There are regional differences in date and number format — differences which can be important — but those are different in kind than other language differences between regions.
As far as we are concerned —
as a completely practical matter
— two languages are different if they require substantially different localized resources. Distinctions according to spoken form are important in some contexts, but the written form is by far and away the most important issue for data interchange. Unfortunately, this is not the principle used in [
ISO639
], which has the fairly unproductive notion (for data interchange) that only spoken language matters (it is also not completely consistent about this, however).
BCP47
can
express a difference if the use of written languages happens to correspond to region boundaries expressed as [
ISO3166
] region codes, and has recently added codes that allow it to express some important cases that are not distinguished by [
ISO3166
] codes. These written languages include simplified and traditional Chinese (both used in Hong Kong S.A.R.); Serbian in Latin script; Azerbaijani in Arabic script, and so on.
Notice also that
currency codes
are different than
currency localizations
. The currency localizations should largely be in the language-based resource bundles, not in the territory-based resource bundles. Thus, the resource bundle
en
contains the localized mappings in English for a range of different currency codes: USD → US$, RUR → Rub, AUD → $A and so on. Of course, some currency symbols are used for more than one currency, and in such cases specializations appear in the territory-based bundles. Continuing the example,
en_US
would have USD → $, while
en_AU
would have AUD → $. (In protocols, the currency codes should always accompany any currency amounts; otherwise the data is ambiguous, and software is forced to use the user's territory to guess at the currency. For some informal discussion of this, see
JIT Localization
.)
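The split between language-based and territory-based bundles can be sketched as a lookup that consults the territory bundle first and then the language bundle. The `BUNDLES` table is a hypothetical excerpt that reuses only the mappings given in the text above; the function name and the final fallback to the bare currency code are assumptions.

```python
# Hypothetical excerpt of currency localizations, taken from the examples in the text.
BUNDLES = {
    "en": {"USD": "US$", "RUR": "Rub", "AUD": "$A"},
    "en_US": {"USD": "$"},
    "en_AU": {"AUD": "$"},
}


def currency_symbol(locale: str, currency_code: str) -> str:
    """Look in the territory bundle first, then fall back to the language bundle."""
    chain = [locale]
    while "_" in chain[-1]:
        chain.append(chain[-1].rsplit("_", 1)[0])
    for candidate in chain:
        symbol = BUNDLES.get(candidate, {}).get(currency_code)
        if symbol is not None:
            return symbol
    return currency_code  # assumption: show the code itself when nothing is localized
```

With this data, `en_US` shows USD as "$" but AUD as "$A", while `en_AU` shows AUD as "$" but USD as "US$".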
Written Language
Criteria for what makes a written language should be purely pragmatic;
what would copy-editors say?
If one gave them text like the following, they would respond that it is far from acceptable English for publication, and ask for it to be redone:
A. "Theatre Center News: The date of the last version of this document was 2003年3月20日. A copy can be obtained for $50,0 or 1.234,57 грн. We would like to acknowledge contributions by the following authors (in alphabetical order): Alaa Ghoneim, Behdad Esfahbod, Ahmed Talaat, Eric Mader, Asmus Freytag, Avery Bishop, and Doug Felt."
So one would change it to either B or C below, depending on which orthographic variant of English was the target for the publication:
B. "Theater Center News: The date of the last version of this document was 3/20/2003. A copy can be obtained for $50.00 or 1,234.57 Ukrainian hryvni. We would like to acknowledge contributions by the following authors (in alphabetical order): Alaa Ghoneim, Ahmed Talaat, Asmus Freytag, Avery Bishop, Behdad Esfahbod, Doug Felt, Eric Mader."
C. "Theatre Centre News: The date of the last version of this document was 20/3/2003. A copy can be obtained for $50.00 or 1,234.57 Ukrainian hryvni. We would like to acknowledge contributions by the following authors (in alphabetical order): Alaa Ghoneim, Ahmed Talaat, Asmus Freytag, Avery Bishop, Behdad Esfahbod, Doug Felt, Eric Mader."
Clearly there are many acceptable variations on this text. For example, copy editors might still quibble with the use of first versus last name sorting in the list, but clearly the first list was
not
acceptable English alphabetical order. And in quoting a name, like "Theatre Centre News", one may leave it in the source orthography even if it differs from the publication target orthography. And so on. However, just as clearly, there are limits on what is acceptable English, and "2003年3月20日", for example, is
not.
Note that the language of locale data may differ from the language of localized software or web sites, when those latter are not localized into the user's preferred language. In such cases, the kind of incongruous juxtapositions described above may well appear, but this situation is usually preferable to forcing unfamiliar date or number formats on the user as well.
Hybrid Locale Identifiers
Hybrid locales have intermixed content from 2 (or more) languages, often with one language's grammatical structure applied to words in another. These are commonly referred to with portmanteau words such as
Franglais,
Spanglish
or
Denglish
. Hybrid locales do
not
reference text simply containing two languages: a book of parallel text containing English and French, such as the following, is not Franglais:
On the 24th of May, 1863, my uncle, Professor Liedenbrock, rushed into his little house, No. 19 Königstrasse, one of the oldest streets in the oldest portion of the city of Hamburg…
Le 24 mai 1863, un dimanche, mon oncle, le professeur Lidenbrock, revint précipitamment vers sa petite maison située au numéro 19 de Königstrasse, l’une des plus anciennes rues du vieux quartier de Hambourg…
While text in a document can be tagged as partly in one language and partly in another, that is not the same as having a hybrid locale. There is a difference between having a Spanglish document, and a Spanish document that has some passages quoted in English. Fine-grained tagging doesn't handle grammatical combinations like Tanglish “Enna matteru?” (
What’s the matter?
), which is neither standard Tamil nor standard English. More importantly, it doesn’t work for the very common use case for a
unicode_locale_id
: locale selection.
To communicate requests for localized content and internationalization services, locales are used. When people pick a language from a menu, internally they are picking a locale (en-GB, es-419, etc.). To allow an application to support Spanglish or Hinglish locale selection,
unicode_locale_id
s can represent hybrid locales using the T Extension key-value 'h0-hybrid'. (For more information on the T extension, see
Unicode BCP 47 T Extension.)
However, if users typically expect their language in a non-default script to contain a significant amount of text due to lexical borrowing, then the -t- and hybrid subtags may be omitted. An example of this is Hindi written in Latin script: since Romanized Hindi typically contains a significant amount of English text, ‘hi-Latn’ can be used instead of ‘hi-Latn-t-en-h0-hybrid’.
This tends to work better in implementations that don't yet handle the -t- extension.
Examples:
| Locale ID | Base script | Hybrid name | Description |
|---|---|---|---|
| hi-t-en-h0-hybrid | Deva | Hinglish | Hindi-English hybrid where the script is Devanagari* |
| hi-Latn-t-en-h0-hybrid | Latin | Hinglish | Hindi-English hybrid where the script is Latin* |
| hi-Latn | Latin | Hinglish | Hindi written in Latin script; in practice usually a hybrid with English |
| ta-t-en-h0-hybrid | Tamil | Tanglish | Tamil-English hybrid where the script is Tamil* |
| ... | | | |
| en-t-hi-h0-hybrid | Latin | Hinglish | English-Hindi hybrid where the script is Latin* |
| en-t-zh-h0-hybrid | Latin | Chinglish | English-Chinese hybrid where the script is Latin* |
| ... | | | |
* When used as a request for international services (such as date formatting), the request is for everything to be in the base script if possible. When used to tag arbitrary content on a coarse level, the expectation is that it be the predominant script — that is, there may be certain passages or phrases that are in the other script but are not tagged on a fine-grained level.
Note: The
unicode_language_id
should be the language used as the ‘scaffold’: for the fallback locale for internationalization services, typically used for more of the core vocabulary/structure in the content. Thus where Hindi is the scaffold, Hinglish should be represented as hi-t-en-h0-hybrid (when written in Devanagari script) or hi-Latn-t-en-h0-hybrid (when written in Latin characters). Where English is the scaffold, Hinglish should be represented as en-t-hi-h0-hybrid (or possibly en-Deva-t-hi-h0-hybrid).
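As a rough sketch, a hybrid locale ID of the shapes shown above can be assembled from the scaffold language, an optional base script, and the source language. The function and parameter names are hypothetical; lowercasing the -t- value simply follows the way the examples in this section are written, and no validation of the subtags is attempted.

```python
def hybrid_locale_id(scaffold, source, script=None):
    """Build '<scaffold>[-<script>]-t-<source>-h0-hybrid'.

    scaffold: language used as the scaffold, for example 'hi'
    source:   language mixed into the scaffold, for example 'en' or 'en-GB'
    script:   explicit base script, only when it is not the default for the scaffold
    """
    base = scaffold if script is None else f"{scaffold}-{script}"
    return f"{base}-t-{source.lower()}-h0-hybrid"
```

For example, `hybrid_locale_id("hi", "en")` gives the Devanagari Hinglish identifier, and passing `script="Latn"` gives the Latin-script one.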
The value of -t- is a full
unicode_language_id
, and can contain a subtag for the region where it is important to include it, as in the following. The value can also include the script, although that is not normally included: the only instance where it should be is where the content of the source text varies by script. So because zh-Hant has different vocabulary and expressions, it could make sense to have en-t-zh-hant to make that distinction.
Note: The default script for the language is computed without reference to the hybrid subtags. Thus the default script for 'ru' is “Cyrl”, no matter what the source is in the -t- tag.
| Locale ID | Base script | Hybrid name | Description |
|---|---|---|---|
| ru-t-en-h0-hybrid | Cyrillic | Runglish | Russian with an admixture of American English |
| ru-t-en-gb-h0-hybrid | Cyrillic | Runglish | Russian with an admixture of British English |
| ru-Latn-t-en-gb-h0-hybrid | Latin | Runglish | Russian with an admixture of British English |
| en-t-zh-h0-hybrid | Latin | Chinglish | American English with an admixture of Chinese (Simplified Mandarin Chinese) |
| en-t-zh-hant-h0-hybrid | Latin | Chinglish | American English with an admixture of Chinese (Traditional Mandarin Chinese) |
Should there ever be strong need for hybrids of more than two languages or for other purposes such as hybrid languages as the source of translated content, additional structure could be added.
Validity Data

The directory
common/validity
contains machine-readable data for validating the language, region, script, and variant subtags, as well as currency, subdivisions and measure units. Each file contains a number of subtags with the following
idStatus
values:
regular
— the standard codes used for the specific type of subtag
special
— certain exceptional language codes like 'mul'
(languages only)
unknown
— the code used to indicate the "unknown", "undetermined" or "invalid" values. For more information, see
Unknown or Invalid Identifiers.
macroregion
— the standard codes that are macroregions
(for regions only).
Note that some two-letter region codes are macroregions, and (in the future) some three-digit codes may be regular codes.
For details as to which regions are contained within which macroregions, see the

element of the supplemental data.
deprecated
— codes that should not be used. The

element in the supplementalMeta file contains more information about these codes, and which codes should be used instead.
private_use
— codes that, for CLDR, are considered private use. Note that some private-use codes in a source standard such as BCP 47 have defined CLDR semantics, and are considered regular codes. For more information, see
Private Use Codes.
reserved
— codes that are private use in a source standard, but are reserved for future use as regular codes by CLDR.
The list of subtags for each idStatus uses a compact format: a space-delimited list of StringRanges, as defined in
Section String Range.
The separator for each StringRange is a "~".
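A sketch of expanding such a list is given below. The expansion rule used here is an assumption based on the String Range section, which is not reproduced in this passage: the end string replaces the trailing code points of the start string, and each trailing position ranges independently. The function names are hypothetical.

```python
from itertools import product


def expand_string_range(string_range: str, sep: str = "~") -> list[str]:
    """Expand 'X~Y' into the list of strings it stands for (assumed interpretation)."""
    if sep not in string_range:
        return [string_range]
    start, end = string_range.split(sep, 1)
    if not 0 < len(end) <= len(start):
        raise ValueError(f"invalid string range: {string_range!r}")
    split_at = len(start) - len(end)
    prefix, suffix = start[:split_at], start[split_at:]
    # Each trailing position runs from its code point in X to its code point in Y.
    position_ranges = [
        [chr(cp) for cp in range(ord(low), ord(high) + 1)]
        for low, high in zip(suffix, end)
    ]
    return [prefix + "".join(chars) for chars in product(*position_ranges)]


def expand_subtag_list(text: str) -> list[str]:
    """Expand a space-delimited list of StringRanges."""
    result = []
    for item in text.split():
        result.extend(expand_string_range(item))
    return result
```

Under this interpretation, `ab~d` stands for the three subtags `ab`, `ac` and `ad`.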
Each measure unit is a sequence of subtags, such as “angle-arc-minute”. The first subtag provides a general “category” of the unit.
In version 28.0, the subdivisions in the validity files used the ISO format, uppercase with a hyphen separating two components, instead of the BCP 47 format.
Locale Inheritance and Matching
The XML format relies on an inheritance model, whereby the resources are collected into
bundles
, and the bundles organized into a tree. Data for the many Spanish locales does not need to be duplicated across all of the countries having Spanish as a national language. Instead, common data is collected in the Spanish language locale, and territory locales only need to supply differences. The parent of all of the language locales is a generic locale known as
root
. Wherever possible, the resources in the root are language & territory neutral. For example, the collation (sorting) order in the root is based on the [
DUCET
] (see
Root Collation
). Since English language collation has the same ordering as the root locale, the 'en' locale data does not need to supply any collation data, nor do 'en_US', 'en_GB', or any of the various other locales that use English.
Given a particular locale id "en_US_someVariant", the default search chain for a particular resource is the following.
en_US_someVariant
en_US
en
root
The inheritance is often not simple truncation, as will be seen later in this section.
The default search chain is slightly different for multiple variants.
In that case, the inheritance chain covers all combinations of variants, with the largest number of variants first, and otherwise in alphabetical order.
For example, where the requested locale ID is en_GB_fonipa_scouse, the inheritance chain is as follows:
en_GB_fonipa_scouse
en_GB_scouse_fonipa // extra step, only needed if not canonical
en_GB_fonipa
en_GB_scouse // extra step
en_GB
en
If the data for the implementation performing the inheritance doesn't require canonical locale identifiers, then extra locale IDs need to be inserted in the chain.
That is indicated in the example above, marked with "only needed if not canonical".
These would include all combinations of variants that are not in canonical order, inserted in alphabetical order.
Note that the order of multiple variants in canonical locale identifiers is alphabetical, as per
5. Canonicalizing Syntax
in
Annex C. LocaleId Canonicalization.
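The multiple-variant chain can be sketched as below. This is an illustration under stated assumptions: the base is reduced by simple truncation (ignoring any parentLocale overrides), the final step to root is left implicit, and non-canonical orderings are interleaved alphabetically within each group of the same size. The function name and parameters are hypothetical.

```python
from itertools import permutations


def variant_chain(base, variants, canonical_only=True):
    """Return the search chain for a base locale such as 'en_GB' plus variant subtags."""
    ordered = sorted(variants)
    chain = []
    for size in range(len(ordered), 0, -1):
        # permutations() of a sorted input yields tuples in alphabetical order.
        for combo in permutations(ordered, size):
            is_canonical = list(combo) == sorted(combo)
            if is_canonical or not canonical_only:
                chain.append("_".join([base, *combo]))
    # Continue with simple truncation of the base itself.
    while True:
        chain.append(base)
        if "_" not in base:
            break
        base = base.rsplit("_", 1)[0]
    return chain
```

Calling `variant_chain("en_GB", ["scouse", "fonipa"], canonical_only=False)` reproduces the six-step example above, and the default call omits the step marked "only needed if not canonical".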
If a type and key are supplied in the locale id, then logically the chain from that id to the root is searched for a resource tag with a given type, all the way up to root. If no resource is found with that tag and type, then the chain is searched again without the type.
Thus the data for any given locale will only contain resources that are different from the parent locale. For example, most territory locales will inherit the bulk of their data from the language locale: "en" will contain the bulk of the data: "en_IE" will only contain a few items like currency. All data that is inherited from a parent is presumed to be valid, just as valid as if it were physically present in the file. This provides for much smaller resource bundles, and much simpler (and less error-prone) maintenance. At the script or region level, the "primary" child locale will be empty, since its parent will contain all of the appropriate resources for it. For more information see
CLDR Information:
Default Content.
Certain data items depend only on the region specified in a locale id (by a
unicode_region_subtag
or an “rg”
Region Override
key), and are obtained from supplemental data rather than through locale resources. For example:
The currency for the specified region (see Supplemental Currency Data)
The measurement system for the specified region (see Measurement System Data)
The week conventions for the specified region (see Week Data)
(For more information on the specific items handled this way, see
Territory-Based Preferences
.) These items will be correct for the specified region regardless of whether a locale bundle actually exists with the same combination of language and region as in the locale id. For example, suppose data is requested for the locale id "fr_US" and there is no bundle for that combination. Data obtained via locale inheritance, such as currency patterns and currency symbols, will be obtained from the parent locale "fr". However, currency amounts would be formatted by default using US dollars, just displayed in the manner governed by the locale "fr". When a locale id does not specify a region, the region-specific items such as those above are obtained from the likely region for the locale (obtained via
Likely Subtags
).
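The "fr_US" case can be sketched as two independent lookups: one through locale inheritance for the bundle, and one keyed purely by region. The tables below are hypothetical excerpts of supplemental and likely-subtags data, the function name is an assumption, and only identifiers of the form language or language_region are handled (no script, variants, or "rg" key).

```python
# Hypothetical excerpts of region-keyed supplemental data and likely-subtags data.
REGION_CURRENCY = {"US": "USD", "FR": "EUR"}
LIKELY_REGION = {"fr": "FR", "en": "US"}
AVAILABLE_BUNDLES = {"root", "fr", "fr_FR", "en"}


def resolve(locale_id):
    """Return the bundle used for inherited data and the region-derived currency."""
    parts = locale_id.split("_")
    language = parts[0]
    # Region-specific items come from the explicit region, else the likely region.
    region = parts[1] if len(parts) > 1 else LIKELY_REGION[language]
    # Inherited items come from the first bundle that exists on the truncation chain.
    candidate = locale_id
    while candidate not in AVAILABLE_BUNDLES:
        candidate = candidate.rsplit("_", 1)[0] if "_" in candidate else "root"
    return {"bundle": candidate, "currency": REGION_CURRENCY[region]}
```

Here `resolve("fr_US")` takes patterns and symbols from the "fr" bundle but still selects US dollars as the default currency.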
For the relationship between Inheritance, DefaultContent, LikelySubtags, and LocaleMatching, see
Inheritance vs Related Information.
Lookup
If a language has more than one script in customary modern use, then the CLDR file structure in common/main follows the following model:
lang
lang_script
lang_script_region
lang_region (aliases to lang_script_region based on likely subtags)
Bundle vs Item Lookup
There are actually two different kinds of inheritance fallback:
resource bundle lookup
and
resource item lookup
. For the former, a process is looking to find the first, best resource bundle it can; for the latter, it is fallback within bundles on individual items, like the translated name for the region "CN" in Breton.
These are closely related, but distinct, processes. They are illustrated in the table
Lookup Differences
, where "key" stands for zero or more key/type pairs. Logically speaking, when looking up an item for a given locale, you first do a resource bundle lookup to find the best bundle for the locale, then you do an inherited item lookup starting with that resource bundle.
The table
Lookup Differences
uses the naïve resource bundle lookup for illustration. More sophisticated systems will get far better results for resource bundle lookup if they use the algorithm described in
Language Matching
. That algorithm takes into account both the user’s desired locale(s) and the application’s supported locales, in order to get the best match.
If the naïve resource bundle lookup is used, the desired locale needs to be canonicalized using 4.3
Likely Subtags
and the supplemental alias information, so that locales that CLDR considers identical are treated as such. Thus eng-Latn-GB should be mapped to en-GB, and cmn-TW mapped to zh-Hant-TW.
The initial bundle accessed during resource bundle lookup should not contain a script subtag unless, according to likely subtags, the script is required to disambiguate the locale. For example,
zh-Hant-TW
should start lookup at
zh-TW
(since
zh-TW
implies
Hant
), and
de-Latn-LI
should start at
de-LI
(since
de
implies
Latn
and
de-LI
does not have its own entry in likely subtags).
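A sketch of choosing the initial bundle is shown below. The `LIKELY_SUBTAGS` table is a small hypothetical excerpt of the likely-subtags data, and the function name is an assumption; the script is dropped only when the language and region (or the language alone) already imply it.

```python
# Hypothetical excerpt of likely-subtags data: partial tag -> (language, script, region).
LIKELY_SUBTAGS = {
    "zh": ("zh", "Hans", "CN"),
    "zh-TW": ("zh", "Hant", "TW"),
    "de": ("de", "Latn", "DE"),
}


def initial_bundle(language, script, region):
    """Drop the script subtag unless it is needed to disambiguate the locale."""
    likely = LIKELY_SUBTAGS.get(f"{language}-{region}") or LIKELY_SUBTAGS.get(language)
    implied_script = likely[1] if likely else None
    if script == implied_script:
        return f"{language}-{region}"
    return f"{language}-{script}-{region}"
```

With this excerpt, `zh-Hant-TW` starts at `zh-TW` and `de-Latn-LI` starts at `de-LI`, while a combination such as `zh-Hant-US` keeps its script because `zh` alone implies `Hans`.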
For the purposes of CLDR, everything with the

dtd is treated logically as if it is one resource bundle, even if the implementation separates data into separate physical resource bundles. For example, suppose that there is a main XML file for Nama (naq), but there are no

elements for it because the units are all inherited from root. If the

elements are separated into a separate data tree for modularity in the implementation, the Nama

resource bundle would be empty. However, for purposes of resource-bundle lookup the resource bundle lookup still stops at naq.xml.
Table:
Lookup Differences
| Lookup Type | Example | Comments |
|---|---|---|
| Resource bundle lookup | se-FI → se → default‑locale* → root | * The default-locale may have its own inheritance chain; for example, it may be "en-GB → en". In that case, the chain is expanded by inserting the chain, resulting in: se-FI → se → fi → en-GB → en → root |
| Inherited item lookup | se-FI+key → se+key → root_alias*+key → root+key | * If there is a root_alias to another key or locale, then insert that entire chain. For example, suppose that months for another calendar system have a root alias to Gregorian months. In that case, the root alias would change the key, and retry from se-FI downward. This can happen multiple times. se-FI+key → se+key → root_alias*+key → se-FI+key2 → se+key2 → root_alias*+key2 → root+key2 |
Both the resource bundle inheritance and the inherited item inheritance use the parentLocale data, where available, instead of simple truncation.
The fallback is a bit different for these two cases; internal aliases and keys are not involved in the bundle lookup, and the default locale is not involved in the item lookup. If the default-locale were used in the resource-item lookup, then strange results would occur. For example, suppose that the default locale is Swedish, and there is a Nama locale but no specific inherited item for collation. If the default-locale were used in resource-item lookup, it would produce odd and unexpected results for Nama sorting.
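The two kinds of lookup can be sketched side by side. This is a simplified illustration using naïve truncation (no parentLocale data, no language matching); the function names, the shape of the `data` dictionary, and the `root_aliases` mapping from one key to another are all assumptions made for the example.

```python
def truncation_chain(locale):
    """'se-FI' -> ['se-FI', 'se'] (root is added by the callers)."""
    chain = [locale]
    while "-" in chain[-1]:
        chain.append(chain[-1].rsplit("-", 1)[0])
    return chain


def bundle_lookup(requested, available, default_locale=None):
    """Find the first available bundle; the default locale takes part in this search."""
    chain = truncation_chain(requested)
    if default_locale is not None:
        chain += truncation_chain(default_locale)
    return next((loc for loc in chain if loc in available), "root")


def item_lookup(bundle_locale, key, data, root_aliases):
    """Find one item; the default locale is never consulted, but root aliases are."""
    chain = truncation_chain(bundle_locale)
    followed = set()
    while True:
        for locale in chain:
            if key in data.get(locale, {}):
                return data[locale][key]
        if key in root_aliases and key not in followed:
            followed.add(key)          # guard against alias cycles
            key = root_aliases[key]    # change the key and retry from the start
            continue
        return data.get("root", {}).get(key)
```

Passing `default_locale=None` to `bundle_lookup` corresponds to falling back directly to root.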
The default locale is not even always used in resource bundle inheritance. For the following services, the fallback is always directly to the root locale rather than through default locale.
collation
break iteration
case mapping
transliteration
The lookup for transliteration is yet more complicated because of the interplay of source and target locales: see
Part 2 General,
Inheritance.
Thus if there is no Akan locale, for example, asking for a collation for Akan should produce the root collation,
not the Swedish collation.
The inherited item lookup must remain stable, because the resources are built with a certain fallback in mind; changing the core fallback order can render the bundle structure incoherent.
Resource bundle lookup, on the other hand, is more flexible; changes in the view of the "best" match between the input request and the output bundle are more tolerable when they represent overall improvements for users. For more information, see
A.1 Element fallback.
Where the LDML inheritance relationship does not match a target system, such as POSIX, the data logically should be fully resolved in converting to a format for use by that system, by adding
all
inherited data to each locale data set.
For a more complete description of how inheritance applies to data, and the use of keywords, see
Inheritance.
The locale data does not contain general character properties that are derived from the
Unicode Character Database [
UAX44
]. That data being common across locales, it is not duplicated in the bundles. Constructing a POSIX locale from the CLDR data requires use of UCD data. In addition, POSIX locales may also specify the character encoding, which requires the data to be transformed into that target encoding.
Warning:
If a locale has a different script than its parent (for example, sr_Latn), then special attention must be paid to make sure that all inheritance is covered. For example, auxiliary exemplar characters may need to be empty ("[]") to block inheritance.
Empty Override:
There is one special value reserved in LDML to indicate that a child locale is to have no value for a path, even if the parent locale has a value for that path. That value is "∅∅∅". For example, if there is no phrase for "two days ago" in a language, that can be indicated with:

∅∅∅
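In lookup code, the empty override acts as a stop sign: the search ends at the child and the parent is not consulted. A minimal sketch, with a hypothetical data layout, hypothetical locale names, and a hypothetical function name:

```python
EMPTY_OVERRIDE = "∅∅∅"


def lookup_with_override(path, chain, data):
    """Walk the inheritance chain, stopping if a locale explicitly has no value."""
    for locale in chain:
        value = data.get(locale, {}).get(path)
        if value == EMPTY_OVERRIDE:
            return None  # explicitly no value; do not inherit from the parent
        if value is not None:
            return value
    return None
```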
Lateral Inheritance
Lateral Inheritance
is where resources are inherited from within the same locale,
before inheriting from the parent
. This is used for the following element@attribute instances:
| Element @Attribute | Source | Context |
|---|---|---|
| currency @pattern | currencyFormat | numberSystem defaultNumberingSystem, unless otherwise specified*; currencyFormatLength type=none, unless otherwise specified; currencyFormat type="standard", unless otherwise specified |
| currency @decimal | symbols @decimal | numberSystem defaultNumberingSystem, unless otherwise specified |
| currency @group | symbols @group | numberSystem defaultNumberingSystem, unless otherwise specified |
* The "unless otherwise specified" clause is for when an API or other context indicates a different choice, such as currencyFormat type="accounting".
For example, with /currency [@type="CVE"], the decimal symbol for almost all locales is the value from symbols/decimal, but for pt_CV it is explicitly
$
The following attributes use lateral inheritance for
all elements
with the DTD root = ldml, except where otherwise noted. The process is applied recursively.
| Attribute | Fallback | Exception Elements |
|---|---|---|
| alt | no alt attribute | none |
| case | "nominative" → ∅ | caseMinimalPairs |
| gender | default_gender(locale) → ∅ | genderMinimalPairs |
| count | plural_rules(locale, x) → "other" → ∅ | minDays, pluralMinimalPairs |
| ordinal | plural_rules(locale, x) → "other" → ∅ | ordinalMinimalPairs |
The gender fallback is to neuter if the locale has a neuter gender, otherwise masculine. This may be extended in the future if necessary. See also
Part 2, Grammatical Features.
For example, if there is no value for a path, and that path has a [@count="x"] attribute and value, then:
If "x" is numeric, the path falls back to the path with [@count=«the plural rules category for x for that locale»], within the same locale.
For example, [@count="0"] for English falls back to [@count="other"], while for French it falls back to [@count="one"].
If "x" is anything but "other", it falls back to a path [@count="other"], within the same locale.
If "x" is "other", it falls back to the path that is completely missing the count item, within the same locale.
If there is no value for that path in the same locale, the same process is used for the
original path
in the parent locale.
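The steps above can be sketched as a function that lists the candidate (locale, path) pairs in the order they would be tried. The `plural_category` function is a hypothetical, heavily simplified stand-in for real plural rules (it covers only integers and only the English-like and French cases mentioned above), and the other names are likewise assumptions.

```python
def plural_category(locale, number):
    """Hypothetical, simplified plural rules for integers only."""
    language = locale.split("_")[0]
    if language == "root":
        return "other"
    if language == "fr":
        return "one" if number in (0, 1) else "other"
    return "one" if number == 1 else "other"


def count_candidates(locale_chain, base_path, count):
    """List (locale, path) pairs in lateral-then-parent order for a count value."""
    candidates = []
    for locale in locale_chain:
        steps = [count]
        if count.isdigit():
            # Numeric count: fall back to the plural category for this locale.
            steps.append(plural_category(locale, int(count)))
        if steps[-1] != "other":
            steps.append("other")
        for step in steps:
            candidates.append((locale, f'{base_path}[@count="{step}"]'))
        # Finally, the path with no count attribute at all.
        candidates.append((locale, base_path))
    return candidates
```

Every lateral step is exhausted within one locale before the original path is retried in the parent.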
A path may have multiple attributes with lateral inheritance. In such a case, all of the combinations are tried, and in the order supplied above. For example (this is an extreme case):
/compoundUnitPattern1[@count="few"][@gender="feminine"][@case="accusative"] →
/compoundUnitPattern1[@count="few"][@gender="feminine"][@case="nominative"] →
/compoundUnitPattern1[@count="few"][@gender="feminine"] →
/compoundUnitPattern1[@count="few"][@gender="neuter"][@case="accusative"] →
/compoundUnitPattern1[@count="few"][@gender="neuter"][@case="nominative"] →
/compoundUnitPattern1[@count="few"][@gender="neuter"] →
/compoundUnitPattern1[@count="few"][@case="accusative"] →
/compoundUnitPattern1[@count="few"][@case="nominative"] →
/compoundUnitPattern1[@count="few"] →

/compoundUnitPattern1[@count="other"][@gender="feminine"][@case="accusative"] →
/compoundUnitPattern1[@count="other"][@gender="feminine"][@case="nominative"] →
/compoundUnitPattern1[@count="other"][@gender="feminine"] →
/compoundUnitPattern1[@count="other"][@gender="neuter"][@case="accusative"] →
/compoundUnitPattern1[@count="other"][@gender="neuter"][@case="nominative"] →
/compoundUnitPattern1[@count="other"][@gender="neuter"] →
/compoundUnitPattern1[@count="other"][@case="accusative"] →
/compoundUnitPattern1[@count="other"][@case="nominative"] →
/compoundUnitPattern1[@count="other"] →

/compoundUnitPattern1[@gender="feminine"][@case="accusative"] →
/compoundUnitPattern1[@gender="feminine"][@case="nominative"] →
/compoundUnitPattern1[@gender="feminine"] →
/compoundUnitPattern1[@gender="neuter"][@case="accusative"] →
/compoundUnitPattern1[@gender="neuter"][@case="nominative"] →
/compoundUnitPattern1[@gender="neuter"] →
/compoundUnitPattern1[@case="accusative"] →
/compoundUnitPattern1[@case="nominative"] →
/compoundUnitPattern1
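The combined ordering can be generated mechanically: each attribute has its own fallback list, and the combinations are enumerated with count varying slowest and case varying fastest. The sketch below assumes a non-numeric count value and a locale whose default gender is neuter; the function and parameter names are hypothetical.

```python
from itertools import product


def lateral_paths(element, count, gender, case, default_gender="neuter"):
    """Enumerate lateral-inheritance paths for count, gender and case, in lookup order."""

    def fallbacks(value, default):
        # value -> default (if different) -> attribute absent
        values = [value]
        if default != value:
            values.append(default)
        values.append(None)
        return values

    names = ("count", "gender", "case")
    options = [
        fallbacks(count, "other"),
        fallbacks(gender, default_gender),
        fallbacks(case, "nominative"),
    ]
    paths = []
    # product() varies the last attribute fastest, matching the order listed above.
    for combo in product(*options):
        attrs = "".join(f'[@{name}="{value}"]' for name, value in zip(names, combo) if value is not None)
        paths.append(element + attrs)
    return paths
```

For the extreme case above, this yields 27 candidate paths, ending with the bare element path.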
Examples:
Table:
Count Fallback: normal
| Locale | Path |
|---|---|
| fr-CA | //ldml/units/unitLength[@type="narrow"]/unit[@type="mass-gram"]/unitPattern[@count="x"] |
| fr-CA | //ldml/units/unitLength[@type="narrow"]/unit[@type="mass-gram"]/unitPattern[@count="other"] |
| fr | //ldml/units/unitLength[@type="narrow"]/unit[@type="mass-gram"]/unitPattern[@count="x"] |
| fr | //ldml/units/unitLength[@type="narrow"]/unit[@type="mass-gram"]/unitPattern[@count="other"] |
| root | //ldml/units/unitLength[@type="narrow"]/unit[@type="mass-gram"]/unitPattern[@count="x"] |
| root | //ldml/units/unitLength[@type="narrow"]/unit[@type="mass-gram"]/unitPattern[@count="other"] |
Note that there may also be an alias in root that changes the path and starts again from the requested locale, such as:

Table:
Count Fallback: currency
| Locale | Path |
|---|---|
| fr-CA | //ldml/numbers/currencies/currency[@type="CAD"]/displayName[@count="x"] |
| fr-CA | //ldml/numbers/currencies/currency[@type="CAD"]/displayName[@count="other"] |
| fr-CA | //ldml/numbers/currencies/currency[@type="CAD"]/displayName |
| fr | //ldml/numbers/currencies/currency[@type="CAD"]/displayName[@count="x"] |
| fr | //ldml/numbers/currencies/currency[@type="CAD"]/displayName[@count="other"] |
| fr | //ldml/numbers/currencies/currency[@type="CAD"]/displayName |
| root | //ldml/numbers/currencies/currency[@type="CAD"]/displayName[@count="x"] |
| root | //ldml/numbers/currencies/currency[@type="CAD"]/displayName[@count="other"] |
| root | //ldml/numbers/currencies/currency[@type="CAD"]/displayName |
Inheritance Marker
There is a special
Inheritance Marker
used in the main repository, which has the value ↑↑↑. For example:
↑↑↑
It is created during data submission to record that the inherited value has been verified for the current locale and path.
For example, the above was used in de_CH to indicate that the following was not only correct for de, but also for de_CH.
Abchasisch
It is not needed or used in the released data, because conformant implementations produce the inherited value whether the element is present with a value of ↑↑↑, or is completely absent.
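For an implementation that reads the main repository directly, the marker can simply be treated as if the element were absent. A minimal sketch, with a hypothetical data layout and function name:

```python
INHERITANCE_MARKER = "↑↑↑"


def resolve_value(path, chain, data):
    """Return the first real value on the chain, skipping inheritance markers."""
    for locale in chain:
        value = data.get(locale, {}).get(path)
        if value is not None and value != INHERITANCE_MARKER:
            return value
    return None
```

The result is the same whether de_CH carries the marker or has no entry at all.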
Parent Locales

When the component does not occur, that is referred to as the ‘main’ component.
Otherwise the component value typically corresponds to elements and their children, such as ‘collations’ or ‘plurals’.
There may be more than one component value (space separated):
in that case the information applies to all the components listed.
The basic inheritance model for locales of the form lang_script_region_variant1_…variantN is to truncate from the end.
That is, remove the _u and _t extensions, then remove the last _ and following tag, then restore the extensions.
For example
sr_Cyrl_ME
sr_Cyrl
sr
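A minimal sketch of the truncation model follows. The function names are hypothetical, and the sketch assumes underscore-separated locale IDs whose only extensions are _u and _t; it ends each chain at root and ignores parentLocale overrides.

```python
import re

# Start of a _u or _t extension (a singleton subtag followed by more subtags).
_EXTENSION_RE = re.compile(r"_(?:u|t)_", re.IGNORECASE)


def truncation_parent(locale):
    """Return the truncation parent of a locale ID, or "root" for a bare language."""
    match = _EXTENSION_RE.search(locale)
    if match:
        base, extensions = locale[:match.start()], locale[match.start():]
    else:
        base, extensions = locale, ""
    if "_" not in base:
        return "root"
    # Remove the last _ and following tag, then restore the extensions.
    return base.rsplit("_", 1)[0] + extensions


def truncation_chain(locale):
    chain = [locale]
    while chain[-1] != "root":
        chain.append(truncation_parent(chain[-1]))
    return chain


print(truncation_chain("sr_Cyrl_ME"))
```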
In some cases, the normal truncation inheritance does not function well.
For example, if the truncation algorithm changes script,
then a mixture of child and parent textual data is a mishmash of different scripts.
Thus there are two cases where the truncation inheritance needs to be overridden:
When the parent locale would have a different script, and text would be mixed.
In certain exceptional circumstances where the 'truncation' parent needs to be adjusted.
The parentLocale element is used to override the normal inheritance when accessing CLDR data.
For case 1, there is a special attribute and value, localeRules="nonlikelyScript", which specifies all locales of the form lang_script wherever the script is not the likely script for lang.
For migration, the previous short list of locales (a subset of the nonlikelyScript locales) is retained,
but those locales are slated for removal in the future.
For example, ru_Latn is not included in the short list but is included (programmatically) in the rule.
The localeRules attribute is used for the main component, for example.
It is not used for components where text is not mixed, such as the collations component or the plurals component.
For case 2, the children and parent share the same primary language, but the region is changed.
For example:

There are certain components that require addenda to the common parent fallback rules.
For a locale like zh_Hant in the example above, the parentLocale element would dictate the parent as root when referring to main locale data, but for collation data, the parent locale should still be zh, even though the parentLocale element is present for that locale.
To address this, components can have their own fallback rules that inherit from the common rules
and add additional parents that supplement or override the common rules:

Note: When components were first introduced, the component-specific parent locales were merged with the main parent locales.
This was determined to be an error, and the component-specific parent locales are now not merged,
but instead are treated as stand-alone.
Since parentLocale information is not localizable on a per locale basis, the parentLocale information is contained in CLDR’s supplemental data.
When a parentLocale element is used to override normal inheritance, the following guidelines apply in most cases:
If X is the parentLocale of Y, then either X is the root locale, or X has the same base language code as Y. For example, the parent of en cannot be fr, and the parent of en_YY cannot be fr or fr_XX.
If X is the parentLocale of Y, Y must not be a base language locale. For example, the parent of en cannot be en_XX.
There may be specific exceptions to these for certain closely-related languages or language-script combinations, for example:
no may be the parent of nb and nn
en_IN may be the parent of hi_Latn (the parent is one of the languages for a child that is effectively a hybrid of two languages in Latn script)
There are certain invariants that must always be true:
The parent must either be the root locale or have the same script as the child. This rule applies to component=main.
There must never be cycles, such as: X parent of Y ... parent of X.
Following the inheritance path, using parentLocale where available and otherwise truncating the locale, must always lead eventually to the root locale.
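The last two invariants can be checked while walking the inheritance path. The sketch below is illustrative only: `PARENT_LOCALES` is a two-entry excerpt built from examples in this document (en_AU and zh_Hant), the function names are hypothetical, and extensions are not handled.

```python
# Illustrative excerpt only; the real data lives in CLDR's supplemental data.
PARENT_LOCALES = {
    "en_AU": "en_001",  # case 2: same language, adjusted region
    "zh_Hant": "root",  # case 1: script is not the likely script for zh
}


def parent(locale, overrides=PARENT_LOCALES):
    """Use parentLocale data where available, otherwise truncate."""
    if locale in overrides:
        return overrides[locale]
    if "_" not in locale:
        return "root"
    return locale.rsplit("_", 1)[0]


def inheritance_chain(locale, overrides=PARENT_LOCALES):
    """Follow parents up to root, raising ValueError if a cycle is found."""
    chain = [locale]
    seen = {locale}
    while chain[-1] != "root":
        nxt = parent(chain[-1], overrides)
        if nxt in seen:
            raise ValueError(f"cycle in parent locales at {nxt!r}")
        seen.add(nxt)
        chain.append(nxt)
    return chain


print(inheritance_chain("en_AU"))
print(inheritance_chain("zh_Hant_HK"))
```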
Region-Priority Inheritance
Certain data may be more appropriate to store with the region as the primary key instead of language. This is often needed for regional user preferences, such as week info, calendar system, and measurement system. All resources matched by an entry in

should use this type of inheritance.
The default search chain for region-priority inheritance removes the language subtag before the region subtag, as follows:
en_US_someVariant
en_US
US
001
Equivalently as BCP-47:
en-US-variant
en-US
und-US
und
Before running region-priority inheritance, the locale should be normalized as follows:
1. If the locale contains the -u-rg Unicode BCP-47 locale extension, the region subtag should be set to the -u-rg region. For example, en-US-u-rg-gbzzzz should normalize to en-GB when running region-priority inheritance.
2. If, after performing step 1, the locale is missing the region subtag (language or language_script), the region subtag should be filled in from likely subtags data. For example, en should become en-US before running region-priority inheritance.
Note that region-priority inheritance does not currently make use of parent locales or territory containment, but it may in the future.
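The normalization and the default search chain can be sketched as follows, using BCP-47 syntax. This is a sketch under stated assumptions: `LIKELY_REGION` is a hypothetical one-entry stand-in for likely subtags data, the -u-rg value is assumed to start with a two-letter region code, and a script subtag (if present) is kept next to the language and dropped together with it, which the text above does not specify.

```python
# Hypothetical excerpt of likely subtags data: language -> most likely region.
LIKELY_REGION = {"en": "US"}


def _is_region(subtag):
    return (len(subtag) == 2 and subtag.isalpha()) or (len(subtag) == 3 and subtag.isdigit())


def region_priority_chain(tag, likely_region=LIKELY_REGION):
    subtags = tag.split("-")
    # Extensions start at the first single-character subtag after the language.
    ext_start = next((i for i, s in enumerate(subtags) if i > 0 and len(s) == 1), len(subtags))
    core, extensions = subtags[:ext_start], subtags[ext_start:]

    prefix = [core[0].lower()]
    rest = core[1:]
    if rest and len(rest[0]) == 4 and rest[0].isalpha():
        prefix.append(rest.pop(0).title())  # script stays with the language (assumption)
    region = rest.pop(0).upper() if rest and _is_region(rest[0]) else None
    variants = rest

    # Step 1: a -u-rg- value such as "gbzzzz" overrides the region subtag.
    for key, value in zip(extensions, extensions[1:]):
        if key.lower() == "rg":
            region = value[:2].upper()

    # Step 2: fill in a missing region from likely subtags data.
    if region is None:
        region = likely_region[prefix[0]]

    # Drop variants one at a time, then the language, then the region.
    chain = ["-".join(prefix + [region] + variants[:n]) for n in range(len(variants), -1, -1)]
    return chain + [f"und-{region}", "und"]


print(region_priority_chain("en-US-variant"))
```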
Inheritance and Validity
The following describes in more detail how to determine the exact inheritance of elements, and the validity of a given element in LDML.
Definitions
Ordered elements are those whose sequence in the XML file is important; that is, changing the order of those elements can make a difference in the interpretation of the data. These are marked with the @ORDERED annotation in the dtd file. For example, consider the following in ldmlSupplemental.dtd:

In the file languageInfo.xml, we find the following.

The ordering among the languageMatch items is important, because the *_* must only be matched after all the explicit scripts have been.
The ordered elements also block inheritance in files governed by ldml.dtd. That is, because the elements are ordered, there is no way to tell where an inherited element from a parent locale would be in that sequence.
Attributes that serve to distinguish multiple elements at the same level are called distinguishing attributes. For example, the type attribute distinguishes different elements in lists of translations, such as:
Afar
Abkhazian
Distinguishing attributes affect inheritance; two elements with different distinguishing attributes are treated as different for purposes of inheritance. For more information, see Valid Attribute Values. Other attributes are called value attributes. Value attributes do not affect inheritance, and elements with value attributes may not have child elements (see XML Format).
Non-distinguishing attributes are identified by DTD Annotations such as @VALUE.
For any element in an XML file, an element chain is a resolved [XPath] leading from the root to an element, with attributes on each element in alphabetical order. So in a given locale file we may have:

Αραβικά
...
Which gives the following element chains (among others):
//ldml/identity/version[@number="1.1"]
//ldml/localeDisplayNames/languages/language[@type="ar"]
An element chain A is an extension of an element chain B if B is equivalent to an initial portion of A. For example, #2 below is an extension of #1. (Equivalent, depending on the tree, may not be "identical to". See below for an example.)
//ldml/localeDisplayNames
//ldml/localeDisplayNames/languages/language[@type="ar"]
An LDML file can be thought of as an ordered list of element pairs: <element chain, data>, where the element chains are all the chains for the end-nodes. (This works because of restrictions on the structure of LDML, including that it does not allow mixed content.) The ordering is the ordering that the element chains are found in the file, and thus determined by the DTD.
For example, some of those pairs would be the following. Notice that the first has the null string as element contents.
//ldml/identity/version[@number="1.1"]
""
//ldml/localeDisplayNames/languages/language[@type="ar"]
"Αραβικά"
Note: There are two exceptions to this:
Blocking nodes and their contents are treated as a single end node.
In terms of computing inheritance, the element pair consists of the element chain plus all distinguishing attributes; the value consists of the value (if any) plus any nondistinguishing attributes.
Thus instead of the element pair being (a) below, it is (b):
//ldml/dates/calendars/calendar[@type='gregorian']/week/weekendStart[@day='sun'][@time='00:00']
""
//ldml/dates/calendars/calendar[@type='gregorian']/week/weekendStart
[@day='sun'][@time='00:00']
Two LDML element chains are equivalent when they would be identical if all attributes and their values were removed, except for distinguishing attributes. Thus the following are equivalent:
//ldml/localeDisplayNames/languages/language[@type="ar"]
//ldml/localeDisplayNames/languages/language[@type="ar"][@draft="unconfirmed"]
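Equivalence and extension can be tested by first dropping the non-distinguishing attributes from each chain. In this minimal sketch, the set `NON_DISTINGUISHING` contains only draft (taken from the example above) and is an assumption; a real implementation would derive it from the DTD annotations. The function names are hypothetical.

```python
import re

# Assumption: an illustrative subset of the non-distinguishing attributes.
NON_DISTINGUISHING = {"draft"}

# One attribute predicate such as [@type="ar"] or [@day='sun'].
_ATTR_RE = re.compile(r"""\[@([A-Za-z_:][\w:.-]*)=(["'])(.*?)\2\]""")


def canonical_chain(chain, non_distinguishing=NON_DISTINGUISHING):
    """Remove every non-distinguishing attribute from an element chain."""
    def keep(match):
        return "" if match.group(1) in non_distinguishing else match.group(0)
    return _ATTR_RE.sub(keep, chain)


def equivalent(a, b):
    return canonical_chain(a) == canonical_chain(b)


def is_extension(a, b):
    """True if b is equivalent to an initial portion of a."""
    ca, cb = canonical_chain(a), canonical_chain(b)
    return ca == cb or ca.startswith(cb + "/")


plain = '//ldml/localeDisplayNames/languages/language[@type="ar"]'
draft = '//ldml/localeDisplayNames/languages/language[@type="ar"][@draft="unconfirmed"]'
print(equivalent(plain, draft))
```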
For any locale ID, a locale chain is an ordered list starting with the root and leading down to the ID. For example:

Resolved Data File
To produce a fully resolved locale data file from CLDR for a locale ID L, you start with L, and successively add unique items from the parent locales until you get up to root. More formally, this can be expressed as the following procedure.
1. Let Result be initially L.
2. For each Li in the locale chain for L, starting at L and going up to root:
   1. Let Temp be a copy of the pairs in the LDML file for Li.
   2. Replace each alias in Temp by the resolved list of pairs it points to.
      1. The resolved list of pairs is obtained by recursively applying this procedure.
      2. That alias now blocks any inheritance from the parent. (See Common Elements for an example.)
   3. For each element pair P in Temp:
      1. If P does not contain a blocking element, and Result does not have an element pair Q with an equivalent element chain, add P to Result.
Notes:
When adding an element pair to a result, it has to go in the right order for it to be valid according to the DTD.
The identity element and its children are unaffected by resolution.
The LDML data must be constructed so as to avoid circularity in step 2.2.
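The core of the procedure can be sketched with dictionaries that map element chains to values. This is a simplified illustration, not a full implementation: aliases, blocking elements, and DTD ordering are omitted, element chains are assumed to be already reduced to their distinguishing attributes, and the sample path for the de and de_CH data is an assumption. The ↑↑↑ marker is skipped so that it yields the inherited value, as described under Inheritance Marker.

```python
INHERITANCE_MARKER = "↑↑↑"


def resolve(locale_chain, files):
    """Merge element pairs along a chain that runs from the locale up to root.

    files maps a locale ID to a dict of {element chain: value}.
    """
    result = {}
    for locale in locale_chain:
        for chain, value in files.get(locale, {}).items():
            if value == INHERITANCE_MARKER:
                continue  # treated exactly as if the element were absent
            # Add the pair only if no equivalent chain is already present.
            result.setdefault(chain, value)
    return result


# Illustrative data; the path is an assumption made for this example.
path = '//ldml/localeDisplayNames/languages/language[@type="ab"]'
files = {
    "de_CH": {path: INHERITANCE_MARKER},
    "de": {path: "Abchasisch"},
    "root": {},
}
print(resolve(["de_CH", "de", "root"], files)[path])
```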
Valid Data
The attribute draft="x" in LDML means that the data has not been approved by the subcommittee. (For more information, see Process). However, some data that is not explicitly marked as draft may be implicitly draft, either because it inherits it from a parent, or from an enclosing element.
Example 2.
Suppose that new locale data is added for af (Afrikaans). To indicate that all of the data is unconfirmed, the attribute can be added to the top level.

...
...

Any data can be added to that file, and the status will all be draft="unconfirmed". Once an item is vetted (whether it is inherited or explicitly in the file), its status can be changed to approved. This can be done either by leaving draft="unconfirmed" on the enclosing element and marking the child with draft="approved", such as:

...
...

However, normally the draft attributes should be canonicalized, which means they are pushed down to leaf nodes as described in Canonical Form. If an LDML file does have draft attributes that are not on leaf nodes, the file should be interpreted as if it were the canonicalized version of that file.
More formally, here is how to determine whether data for an element chain E is implicitly or explicitly draft, given a locale L. Sections 1, 2, and 4 are simply formalizations of what is in LDML already. Item 3 adds the new element.
Checking for Draft Status
Parent Locale Inheritance
Walk through the locale chain until you find a locale ID L' with a data file D. (L' may equal L).
Produce the fully resolved data file D' for D.
In D', find the first element pair whose element chain E' is either equivalent to or an extension of E.
If there is no such E', return true.
If E' is not equivalent to E, truncate E' to the length of E.
Enclosing Element Inheritance
Walk through the elements in E', from back to front.
If you ever encounter draft=x, return x.
If L' = L, return false.
Missing File Inheritance
Otherwise, walk again through the elements in E', from back to front.
If you encounter a validSubLocales attribute (deprecated):
If L is in the attribute value, return false.
Otherwise return true.
Otherwise
Return true.
The validSubLocales in the most specific (farthest from root file) locale file "wins" through the full resolution step (data from more specific files replacing data from less specific ones).
Keyword and Default Resolution
When accessing data based on keywords, the following process is used. Consider the following example:
The locale 'de' has collation types A, B, C, and no default collation element.
The locale 'de_CH' has a default collation element that specifies type B.
Here are the searches for various combinations.
| User Input | Lookup in Locale | For | Comment |
| --- | --- | --- | --- |
| de_CH (no keyword) | de_CH | default collation type | finds "B" |
| | de_CH | collation type=B | not found |
| | de | collation type=B | found |
| de (no keyword) | de | default collation type | not found |
| | root | default collation type | finds "standard" |
| | de | collation type=standard | not found |
| | root | collation type=standard | found |
| de_u_co_A | de | collation type=A | found |
| de_u_co_standard | de | collation type=standard | not found |
| | root | collation type=standard | found |
| de_u_co_foobar | de | collation type=foobar | not found |
| | root | collation type=foobar | not found, starts looking for default |
| | de | default collation type | not found |
| | root | default collation type | finds "standard" |
| | de | collation type=standard | not found |
| | root | collation type=standard | found |
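The searches in the table above follow a regular pattern: look for the requested type up the parent chain, fall back to the default type if it is not found, and then look for that default type starting again from the requested locale. The following sketch reproduces the table for the 'de' and 'de_CH' example; the data structures and function names are hypothetical.

```python
# Example data from the text: 'de' has types A, B, C and no default;
# 'de_CH' only sets the default to B; root has "standard" as its default.
PARENT = {"de_CH": "de", "de": "root", "root": None}
COLLATION = {
    "root": {"default": "standard", "types": {"standard"}},
    "de": {"default": None, "types": {"A", "B", "C"}},
    "de_CH": {"default": "B", "types": set()},
}


def chain(locale):
    while locale is not None:
        yield locale
        locale = PARENT[locale]


def find_type(locale, ctype, trace):
    for loc in chain(locale):
        found = ctype in COLLATION[loc]["types"]
        trace.append((loc, f"collation type={ctype}", "found" if found else "not found"))
        if found:
            return loc
    return None


def find_default(locale, trace):
    for loc in chain(locale):
        default = COLLATION[loc]["default"]
        trace.append((loc, "default collation type",
                      f'finds "{default}"' if default else "not found"))
        if default:
            return default
    return None


def resolve_collation(locale, requested=None):
    """Return (locale where found, collation type, list of lookups performed)."""
    trace = []
    if requested is not None:
        loc = find_type(locale, requested, trace)
        if loc is not None:
            return loc, requested, trace
    ctype = find_default(locale, trace)
    return find_type(locale, ctype, trace), ctype, trace


for step in resolve_collation("de", "foobar")[2]:
    print(*step, sep=" | ")
```

Because the default in root must always name a type that exists in root, the final search cannot fail for well-formed data.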
Examples of "search" collator lookup; 'de' has a language-specific version, but 'en' does not:
| User Input | Lookup in Locale | For | Comment |
| --- | --- | --- | --- |
| de_CH_u_co_search | de_CH | collation type=search | not found |
| | de | collation type=search | found |
| en_US_u_co_search | en_US | collation type=search | not found |
| | en | collation type=search | not found |
| | root | collation type=search | found |
Examples of lookup for Chinese collation types. Note:
All of the Chinese-specific collation types are provided in the 'zh' locale
For 'zh' the default collation element specifies "pinyin"; for 'zh_Hant' the default collation element specifies "stroke". However, any of the available Chinese collation types can be explicitly requested for any Chinese locale.
| User Input | Lookup in Locale | For | Comment |
| --- | --- | --- | --- |
| zh_Hant (no keyword) | zh_Hant | default collation type | finds "stroke" |
| | zh_Hant | collation type=stroke | not found |
| | zh | collation type=stroke | found |
| zh_Hant_HK_u_co_pinyin | zh_Hant_HK | collation type=pinyin | not found |
| | zh_Hant | collation type=pinyin | not found |
| | zh | collation type=pinyin | found |
| zh (no keyword) | zh | default collation type | finds "pinyin" |
| | zh | collation type=pinyin | found |
Note: It is an invariant that the default in root for a given element must always be a value that exists in root. So you cannot have the following in root:

...
...

For identifiers, such as language codes, script codes, region codes, variant codes, types, keywords, currency symbols or currency display names, the default value is the identifier itself whenever no value is found in the root. Thus if there is no display name for the region code 'QA' in root, then the display name is simply 'QA'.
Inheritance vs Related Information
There are related types of data and processing that are easy to confuse:
Inheritance
Part of the internal mechanism used by CLDR to organize and manage locale data. This is used to share common resources, and ease maintenance, and provide the best fallback behavior in the absence of data.
Should not be used for locale matching or likely subtags.
Example:
parent(en_AU) ⇒ en_001
parent(en_001) ⇒ en
parent(en) ⇒ root
Data:
supplementalData.xml
Spec:
Section
4.2 Inheritance and Validity
DefaultContent
Part of the internal mechanism used by CLDR to manage locale data. A particular sublocale is designated the defaultContent for a parent, so that the parent exhibits consistent behavior.
Should not be used for locale matching or likely subtags.
Example:
addLikelySubtags(sr-ME) ⇒ sr-Latn-ME, minimize(de-Latn-DE) ⇒ de
Data:
supplementalMetadata.xml
Spec:
Part 6: Section 9.3
Default Content
LikelySubtags
Provides the most likely full subtags (script and region) in the absence of other information. A core component of LocaleMatching.
Example:
addLikelySubtags(zh) ⇒ zh-Hans-CN
addLikelySubtags(zh-TW) ⇒ zh-Hant-TW
addLikelySubtags(zh-Hant) ⇒ zh-Hant-TW
minimize(zh-Hans-CN, favorRegion|favorScript) ⇒ zh
minimize(zh-Hant-TW, favorRegion) ⇒ zh-TW
minimize(zh-Hant-TW, favorScript) ⇒ zh-Hant
Data:
likelySubtags.xml
Spec:
Section
4.3 Likely Subtags
LocaleMatching
Provides the best match for the user’s language(s) among an application’s supported languages.
Example:
bestLocale(userLangs=, appLangs=) ⇒ fr-CA
Data:
languageInfo.xml
Spec:
Section
4.4 Language Matching
Likely Subtags

There are a number of situations where it is useful to be able to find the most likely language, script, or region. For example, given the language "zh" and the region "TW", what is the most likely script? Given the script "Thai" what is the most likely language or region? Given the region TW, what is the most likely language and script?
Conversely, given a locale, it is useful to find out which fields (language, script, or region) may be superfluous, in the sense that they contain the likely tags. For example, "en_Latn" can be simplified down to "en" since "Latn" is the likely script for "en"; "ja_Jpan_JP" can be simplified down to "ja".
The likelySubtag supplemental data provides default information for computing these values. This data is based on the default content data, the population data, and the suppress-script data in [BCP47]. It is heuristically derived, and may change over time.
For the relationship between Inheritance, DefaultContent, LikelySubtags, and LocaleMatching, see
Inheritance vs Related Information
To look up data in the table, see if a locale matches one of the from attribute values. If so, fetch the corresponding to attribute value. For example, the Chinese data looks like the following:

So looking up "zh_TW" returns "zh_Hant_TW", while looking up "zh" returns "zh_Hans_CN".
In more detail, the data is designed to be used in the following operations.
Like other CLDR operations, these operations can also be used with language tags having [BCP47] syntax, with the appropriate changes to the data.
An implementation may choose to exclude language tags with the language subtag "und" from the following operation. In such a case, only the canonicalization is done. An implementation can declare that it is doing the exclusion, or can take a parameter that controls whether or not to do it.
Add Likely Subtags:
Given a source locale X, return a locale Y where the empty subtags have been filled in by the most likely subtags.
This is written as X ⇒ Y ("X maximizes to Y").
A subtag is called empty if it is a missing script or region subtag, or it is a base language subtag with the value "und". In the description below, a subscript on a subtag x indicates which tag it is from: xₛ is in the source, xₘ is in a match, and xᵣ is in the final result.
This operation is performed in the following way.
Canonicalize.
Make sure the input locale is in canonical form: uses the right separator, and has the right casing.
Replace any deprecated subtags with their canonical values using the

data in supplemental metadata. Use the first value in the replacement list, if it exists.
Language tag replacements may have multiple parts, such as "sh" ➞ "sr_Latn" or "mo" ➞ "ro_MD". In such a case, the original script and/or region are retained if there is
one. Thus "sh_Arab_AQ" ➞ "sr_Arab_AQ", not "sr_Latn_AQ".
There are certain exceptions to this: some implementations still use three obsolete language subtags: iw, in, and yi.
The likely subtags data currently supports those implementations by providing elements that handle them,
with the deprecated code on both sides:

Such implementations may refrain from replacing those deprecated tags.
If the tag is a legacy language tag (marked as “Type: grandfathered” in BCP 47; see

in the supplemental data), then return it.
Remove the script code 'Zzzz' and the region code 'ZZ' if they occur.
Get the components of the cleaned-up source tag (languageₛ, scriptₛ, and regionₛ), plus any variants and extensions.
If the language is not 'und' and the other two components are not empty, return the language tag composed of languageₛ_scriptₛ_regionₛ + variants + extensions.
Lookup.
Look up each of the following in order, and stop on the first match:
languageₛ_scriptₛ_regionₛ
languageₛ_scriptₛ
languageₛ_regionₛ
languageₛ
Return
If there is no match, signal an error and stop.
Otherwise there is a match = languageₘ_scriptₘ_regionₘ
Let xᵣ = xₛ if xₛ is neither empty nor 'und', and xₘ otherwise.
Return the language tag composed of languageᵣ_scriptᵣ_regionᵣ + variants + extensions.
Signalling an error can be done in various ways, depending on the most consistent approach for APIs in the module. For example:
raise an exception
return an error value (such as null)
return the input (with missing fields)
return the input, but "Zzzz", and/or "ZZ" substituted for empty fields.
"und"
One by-product of this algorithm is that an element such as

would be misleading: the 'fr' can never be replaced by 'en'.
The only subtags that can be replaced are deprecated ones, empty, und, Zzzz, and ZZ.
The lookup can be optimized. For example, if any of the tags in Step 2 are the same as previous ones in that list, they do not need to be tested.
Example1:
Input is ZH-ZZZZ-SG.
Normalize to zh_SG.
Look up in table. No match.
Look up zh, and get the match (zh_Hans_CN). Substitute SG, and return zh_Hans_SG.
To find the most likely language for a country, or language for a script, use "und" as the language subtag. For example, looking up "und_TW" returns zh_Hant_TW.
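The Add Likely Subtags steps can be sketched as follows. This is a partial illustration: `LIKELY_SUBTAGS` is a four-entry excerpt taken from the examples in this section, canonicalization covers only separators, casing, and removal of Zzzz and ZZ (no deprecated-subtag replacement or legacy tags), and a failed lookup is signaled by raising `LookupError`, which is one of several acceptable choices.

```python
# Excerpt built from the examples in this section; real data is much larger.
LIKELY_SUBTAGS = {
    "zh": "zh_Hans_CN",
    "zh_TW": "zh_Hant_TW",
    "zh_Hant": "zh_Hant_TW",
    "und_TW": "zh_Hant_TW",
}


def _is_region(subtag):
    return (len(subtag) == 2 and subtag.isalpha()) or (len(subtag) == 3 and subtag.isdigit())


def add_likely_subtags(tag, likely=LIKELY_SUBTAGS):
    # Canonicalize (partially): separator, casing, and removal of Zzzz / ZZ.
    parts = tag.replace("-", "_").split("_")
    language = parts[0].lower()
    rest = parts[1:]
    script = rest.pop(0).title() if rest and len(rest[0]) == 4 and rest[0].isalpha() else ""
    region = rest.pop(0).upper() if rest and _is_region(rest[0]) else ""
    if script == "Zzzz":
        script = ""
    if region == "ZZ":
        region = ""

    def compose(lang, scr, reg):
        # rest holds any variants and extensions, passed through unchanged.
        return "_".join([p for p in (lang, scr, reg) if p] + rest)

    if language != "und" and script and region:
        return compose(language, script, region)

    # Lookup: stop on the first match.
    for candidate in ((language, script, region), (language, script, ""),
                      (language, "", region), (language, "", "")):
        match = likely.get("_".join(p for p in candidate if p))
        if match:
            m_language, m_script, m_region = match.split("_")
            break
    else:
        raise LookupError(f"no likely subtags match for {tag!r}")

    # Return: keep non-empty source subtags, fill the rest from the match.
    r_language = m_language if language == "und" else language
    return compose(r_language, script or m_script, region or m_region)


print(add_likely_subtags("ZH-ZZZZ-SG"))
```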
A general goal of the algorithm is that any non-empty field present in the 'from' field is also present in the 'to' field, so a non-empty input field will not change in the "Add Likely Subtags" operation.
That is, when X ⇒ Y, and X' results from replacing an empty subtag in X by the corresponding subtag in Y, then X' ⇒ Y.
For example, if und_AF ⇒ fa_Arab_AF, then:
fa_Arab_AF ⇒ fa_Arab_AF
und_Arab_AF ⇒ fa_Arab_AF
fa_AF ⇒ fa_Arab_AF
There are a few exceptions to this goal:
A 'denormalized' subtag changes to the normalized form, except for certain denormalized language subtags such as 'iw' (for 'he' = Hebrew) which may occur in both the 'from' and 'to' fields of the data.
This allows for implementations that use those denormalized subtags to use the data with only minor changes to the operations.
A macroregion (such as West Africa = 011) may change to a specific country (Nigeria = NG).
Remove Likely Subtags:
Given a locale, remove any fields that Add Likely Subtags would add.
The reverse operation removes fields that could be added by the first operation.
1. First get max = AddLikelySubtags(inputLocale). If an error is signaled in AddLikelySubtags, signal that same error and stop.
2. Remove the variants and extensions from max.
3. Get the components of the max (languageₘₐₓ, scriptₘₐₓ, regionₘₐₓ).
4. Then for trial in {languageₘₐₓ, languageₘₐₓ_regionₘₐₓ, languageₘₐₓ_scriptₘₐₓ}:
   If AddLikelySubtags(trial) = max, then return trial + variants + extensions.
5. If there is no match, return max + variants + extensions.
Example:
Input is zh_Hant or zh_TW.
Maximize to get zh_Hant_TW.
zh => zh_Hans_CN. No match, so continue.
zh_TW => zh_Hant_TW. Matches, so return zh_TW.
Remove Likely Subtags, favoring script:
Given a locale, remove any fields that Add Likely Subtags would add, but favor script over region.
A variant of this favors the script over the region, thus using {language, language_script, language_region} in step #4 above.
This variant is much less commonly used, only when the script relationship is more significant to users.
Here is the difference:
Example:
Input is zh_Hant or zh_TW.
Maximize to get zh_Hant_TW.
zh => zh_Hans_CN. No match, so continue.
zh_Hant => zh_Hant_TW. Matches, so return zh_Hant.
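Both variants of Remove Likely Subtags can be sketched with one function. The sketch assumes that the input has no variants or extensions and that the caller supplies a `maximize` function implementing Add Likely Subtags; here a small lookup table (`MAXIMIZED`, hypothetical) stands in for it. The parameter name `favor_script` is also an assumption.

```python
def remove_likely_subtags(tag, maximize, favor_script=False):
    """Return the shortest trial tag that maximizes to the same result as tag."""
    max_tag = maximize(tag)  # any error raised here propagates to the caller
    language, script, region = max_tag.split("_")[:3]
    trials = [language, f"{language}_{region}", f"{language}_{script}"]
    if favor_script:
        # Favoring script: {language, language_script, language_region}.
        trials[1], trials[2] = trials[2], trials[1]
    for trial in trials:
        try:
            if maximize(trial) == max_tag:
                return trial
        except LookupError:
            continue  # a trial that cannot be maximized is not a match
    return max_tag


# Stand-in for Add Likely Subtags, covering only the tags used in the examples.
MAXIMIZED = {
    "zh": "zh_Hans_CN",
    "zh_TW": "zh_Hant_TW",
    "zh_Hant": "zh_Hant_TW",
    "zh_Hant_TW": "zh_Hant_TW",
}
maximize = MAXIMIZED.__getitem__

print(remove_likely_subtags("zh_Hant", maximize))
print(remove_likely_subtags("zh_TW", maximize, favor_script=True))
```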
Language Matching

Implementers are often faced with the issue of how to match the user's requested languages with their product's supported languages. For example, suppose that a product supports {ja-JP, de, zh-TW}. If the user understands written American English, German, French, Swiss German, and Italian, then de would be the best match; if the user understands only Chinese (zh), then zh-TW would be the best match.
The standard truncation-fallback algorithm does not work well when faced with the complexities of natural language. The language matching data is designed to fill that gap. Stated in those terms, language matching can have the effect of a more complex fallback, such as:
sr-Cyrl-RS
sr-Cyrl
sr-Latn-RS
sr-Latn
sr
hr-Latn
hr
Language matching is used to find the best supported locale ID given a requested list of languages. The requested list could come from different sources, such as the user's list of preferred languages in the OS Settings, or from a browser Accept-Language list. For example, if my native tongue is English, I can understand Swiss German and German, my French is rusty but usable, and Italian basic, ideally an implementation would allow me to select {gsw, de, fr} as my preferred list of languages, skipping Italian because my comprehension is not good enough for arbitrary content.
Language Matching can also be used to get fallback data elements. In many cases, there may not be full data for a particular locale. For example, for a Breton speaker, the best fallback if data is unavailable might be French. That is, suppose we have found a Breton bundle, but it does not contain a translation for the key "CN" (for the country China). It is best to return "chine", rather than falling back to the value from a default language such as Russian and getting "Китай". The language matching data can be used to get the closest fallback locales (of those supported) to a given language.
For the relationship between Inheritance, DefaultContent, LikelySubtags, and LocaleMatching, see
Inheritance vs Related Information
When such fallback is used for inherited item lookup, the normal order of inheritance is used, except that before using any data from root, the data for the fallback locales would be used if available. Language matching does not interact with the fallback of resources within the locale-parent chain. For example, suppose that we are looking for the value for a particular path P in nb-NO. In the absence of aliases, normally the following lookup is used.
nb-NO
nb
root
That is, we first look in nb-NO. If there is no value for P there, then we look in nb. If there is no value for P there, we return the value for P in root (or a code value, if there is nothing there). Remember that if there is an alias element along this path, then the lookup may restart with a different path in nb-NO (or another locale).
However, suppose that nb-NO has the fallback values [nn da sv en], derived from language matching. In that case, an implementation may progressively look up each of the listed locales, with the appropriate substitutions, returning the first value that is not found in root. This follows roughly the following pseudocode:
value = lookup(P, nb-NO); if (locationFound != root) return value;
value = lookup(P, nn-NO); if (locationFound != root) return value;
value = lookup(P, da-NO); if (locationFound != root) return value;
value = lookup(P, sv-NO); if (locationFound != root) return value;
value = lookup(P, en-NO); return value;
The locales in the fallback list are not used recursively. For example, for the lookup of a path in nb-NO, if fr were a fallback value for da, it would not matter for the above process. Only the original language matters.
The language matching data is intended to be used according to the following algorithm. This is a logical description, and can be optimized for production in many ways. In this algorithm, the languageMatching data is interpreted as an ordered list.
Distances between a given pair of subtags can be larger or smaller than the typical distances. For example, the distance between en and en-GB can be greater than that between en-GB and en-IE. In some cases, language and/or script differences can be as small as the typical region difference (for example, sr-Latn vs. sr-Cyrl).
The distances resulting from the table are not linear, but are rather chosen to produce expected results. So a distance of 10 is not necessarily twice as "bad" as a distance of 5. Implementations may want to have a mode where script distances should swamp language distances. The tables are built such that this can be accomplished by multiplying the language distance by 0.25.
The language matching algorithm takes a list of a user’s desired languages, and a list of the application’s supported languages.
Set the best weighted distance BWD to ∞
Set the best desired language BD to null
Set the best supported language BS to null
For each desired language D
Compute a demotion value F, based on the position in the list.
This demotion value is up to the implementation, but is typically a positive value that increases according to how far D is from the start of the desired language list.
For each supported language S
Find the matching distance MD as described below.
Compute the weighted distance WD as F + MD
If WD < BWD
BWD = WD
BD = D
BS = S
If the BWD is less than a threshold, return <BD, BS>
The threshold is implementation-defined, typically set to greater than a default region difference, and less than a default script difference.
Otherwise BD = the default supported language (like English); return
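The outer loop of the algorithm can be sketched as follows. The matching distance is supplied by the caller; `toy_distance` and its values (0, 4, 80), the linear demotion `position * demotion_step`, and the return value `(None, default)` when the threshold is not met are all assumptions made for this illustration.

```python
import math


def best_match(desired, supported, distance, demotion_step, threshold, default):
    """Return (best desired, best supported), or (None, default) if nothing is close enough."""
    bwd, bd, bs = math.inf, None, None
    for position, d in enumerate(desired):
        f = position * demotion_step  # demotion value F grows down the list
        for s in supported:
            wd = f + distance(d, s)  # weighted distance WD = F + MD
            if wd < bwd:
                bwd, bd, bs = wd, d, s
    if bwd < threshold:
        return bd, bs
    return None, default


def toy_distance(d, s):
    """Toy stand-in for the matching distance: 0 same tag, 4 same language, 80 otherwise."""
    if d == s:
        return 0
    if d.split("-")[0] == s.split("-")[0]:
        return 4
    return 80


supported = ["de", "fr", "ja"]
print(best_match(["de-AT", "fr"], supported, toy_distance, 5, 40, "en"))
print(best_match(["de-AT", "fr"], supported, toy_distance, 1, 40, "en"))
```

With a demotion step of 5 (larger than the toy region difference of 4), the first call selects de for de-AT; with a step of 1, the second call selects fr instead, which shows why the step size matters.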
To find the matching distance MD between any two languages, perform the following steps.
Maximize each language using Likely Subtags.
und is a special case: see below.
Set the match-distance MD to 0
For each subtag in {language, script, region}
If respective subtags in each language tag are identical, remove the subtag from each (logically) and continue.
Traverse the languageMatching data until a match is found.
* matches any field.
If the oneway flag is false, then the match is symmetric; otherwise only match one direction.
For region matching, use the mechanisms in Enhanced Language Matching.
Add the distance attribute value to MD.
This used to be a percent attribute value, which was 100 minus the distance attribute value.
Remove the subtag from each (logically)
Return MD
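The matching distance steps can be sketched over an ordered rule list. Everything in `RULES` is illustrative: the distances 80, 50, and 4 and the nn/nb entry are assumptions chosen to agree with the 96% figures used in this section, not values copied from the data. The sketch also assumes that both inputs are already maximized (language, script, region) tuples, that a rule with n fields applies to the first n subtags, and it omits the region mechanisms of Enhanced Language Matching.

```python
# (desired pattern, supported pattern, distance, oneway); order matters.
RULES = [
    (("nn",), ("nb",), 4, False),
    (("*",), ("*",), 80, False),
    (("*", "*"), ("*", "*"), 50, False),
    (("*", "*", "*"), ("*", "*", "*"), 4, False),
]


def _matches(pattern, subtags):
    # "*" matches any field.
    return all(p == "*" or p == s for p, s in zip(pattern, subtags))


def matching_distance(desired, supported, rules=RULES):
    md = 0
    for level in range(3):  # language, then script, then region
        if desired[level] == supported[level]:
            continue  # identical subtags contribute no distance
        d_key, s_key = desired[:level + 1], supported[:level + 1]
        for d_pat, s_pat, dist, oneway in rules:
            if len(d_pat) != level + 1:
                continue
            forward = _matches(d_pat, d_key) and _matches(s_pat, s_key)
            # Unless the rule is oneway, it also matches in the other direction.
            backward = not oneway and _matches(d_pat, s_key) and _matches(s_pat, d_key)
            if forward or backward:
                md += dist
                break
    return md


print(matching_distance(("nn", "Latn", "DE"), ("nb", "Latn", "FR")))
```

Under these illustrative rules, nn-Latn-DE and nb-Latn-FR are 4 apart for the language and 4 apart for the region, for a total distance of 8.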
It is typically useful to set the demotion value F so that the step between successive elements of the desired languages list is slightly greater than the default region difference. That avoids the following problem:
Supported languages:
"de, fr, ja"
User's desired languages:
"de-AT, fr"
This user would expect to get "de", not "fr". In practice, when a user selects a list of preferred languages, they don't include all the regional variants ahead of their second base language. Yet while the user's desired languages don't really tell us the priority ranking among their languages, normally the fall-off between the user's languages is substantially greater than that between regional variants. But unless F is greater than the distance between de-AT and de-DE, the user’s second-choice language would be returned.
The base language subtag "und" is a special case. Suppose we have the following situation:
desired languages: {und, it}
supported languages: {en, it}
resulting language: en
Part of this is because 'und' has a special function in BCP 47; it stands in for 'no supplied base language'. To prevent this from happening, if the desired base language is und, the language matcher should not apply likely subtags to it.
Examples:
For example, suppose that nn-DE and nb-FR are being compared. They are first maximized to nn-Latn-DE and nb-Latn-FR, respectively. The list is searched. The first match is with "*-*-*", for a match of 96%. The languages are truncated to nn-Latn and nb-Latn, then to nn and nb. The first match is also for a value of 96%, so the result is 92%.
Note that language matching is orthogonal to how closely two languages are related linguistically. For example, Breton is more closely related to Welsh than to French, but French is the better match (because it is more likely that a Breton reader will understand French than Welsh). This also illustrates that the matches are often asymmetric: it is not likely that a French reader will understand Breton.
The "*" acts as a wild card, as shown in the following example:

When the language+region is not matched, and there is otherwise no reason to pick among the supported regions for that language, then some measure of geographic "closeness" can be used. The results may be more understandable by users. Looking for en-SK, for example, should fall back to something within Europe (eg en-GB) in preference to something far away and unrelated (eg en-SG). Such a closeness metric does not need to be exact; a small amount of data can be used to give an approximate distance between any two regions. However, any such data must be used carefully; although Hong Kong is closer to India than to the UK, it is unlikely that en-IN would be a better match to en-HK than en-GB would.
Enhanced Language Matching
The enhanced format for language matching adds structure to enable better matching of languages. It is distinguished by having a suffix "_new" on the type, as in the example below. The extended structure allows matching to take into account broad similarities that would give better results. For example, for English the regions that are or inherit from US (AS|GU|MH|MP|PR|UM|VI|US) form a “cluster”. Regions in that cluster should be closer to one another than to any region outside it. And a region outside the cluster should be closer to another region outside that cluster than to one inside. This issue arises with the “world languages” like English, Spanish, Portuguese, Arabic, etc.
Example:

The matchVariable allows for a rule to match to multiple regions, as illustrated by $maghreb. The syntax is simple: it allows for + for union and - for set difference, but no precedence. So A+B-A+D is interpreted as (((A+B)-A)+D), not as (A+B)-(A+D). The variable id has a value of the form [$][a-zA-Z0-9]+. If $X is defined, then $!X automatically means all those regions that are not in $X.
When the set is interpreted, then macroregions are (logically) transformed into a list of their contents, so “053+GB” → “AU+GB+NF+NZ”. This is done recursively, so 009 → “053+054+057+061+QO” → “AU+NF+NZ+FJ+NC+PG+SB+VU...”. Note that we use 019 for all of the Americas in the variables above, because en-US should be in the same cluster as es-419 and its contents.
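The left-to-right evaluation and the recursive expansion of region groupings can be sketched as follows. This is a minimal illustration under stated assumptions: the `CONTAINS` table holds only the groupings spelled out in the text (the contents of 057, 061, and QO are elided there, so they are left unexpanded here), and the function names are hypothetical.

```python
import re

# Partial containment data taken from the text's examples (illustration only).
# 057, 061 and QO have no entries, so they are returned unexpanded.
CONTAINS = {
    "009": ["053", "054", "057", "061", "QO"],
    "053": ["AU", "NF", "NZ"],
    "054": ["FJ", "NC", "PG", "SB", "VU"],
}


def expand(code):
    """Recursively replace a grouping code with the regions it contains."""
    if code not in CONTAINS:
        return {code}
    regions = set()
    for child in CONTAINS[code]:
        regions |= expand(child)
    return regions


def evaluate(expression):
    """Evaluate '+' (union) and '-' (set difference) strictly left to right."""
    tokens = re.findall(r"[+-]|[A-Za-z0-9]+", expression)
    result = expand(tokens[0])
    # Operators sit at odd positions, operands at the even positions after them.
    for op, term in zip(tokens[1::2], tokens[2::2]):
        if op == "+":
            result |= expand(term)
        else:
            result -= expand(term)
    return result


print(sorted(evaluate("053+GB")))   # ['AU', 'GB', 'NF', 'NZ']
print(sorted(evaluate("A+B-A+D")))  # ['B', 'D']
```

The second call shows why the absence of precedence matters: read as (((A+B)-A)+D) the result keeps D, whereas (A+B)-(A+D) would leave only B.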
In the rules, the percent value (100..0) is replaced by a distance value, which is the inverse (0..100).
These new variables and rules divide up the world into clusters, where items in the same clusters (for specific languages) get the normal regional difference, and items in different clusters get different weights.
Each cluster can have one or more associated paradigmLocales. These are locales that are preferred within a cluster. So when matching desired=[en-SA] against [en-GU en en-IN en-GB], the value en-GB is returned. Both of {en-GU en} are in a different cluster. While {en-IN en-GB} are in the same cluster, and the same distance from en-SA, the preference is given to en-GB because it is in the paradigm locales. It would be possible to express this in rules, but using this mechanism handles these very common cases without bulking up the tables.
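The en-SA example can be reproduced with a small sketch in which paradigm locales act purely as a tie-breaker among equally distant candidates. The distance values, the treatment of a bare "en" as en-US, the two-entry paradigm set, and the function names are assumptions for illustration; only the US cluster membership is taken from the text.

```python
# Cluster membership from the text; everything else below is hypothetical.
US_CLUSTER = {"AS", "GU", "MH", "MP", "PR", "UM", "VI", "US"}
PARADIGM_LOCALES = {"en", "en-GB"}
SAME_CLUSTER_DISTANCE = 4   # made-up value
CROSS_CLUSTER_DISTANCE = 5  # made-up value


def region_of(locale):
    """Return the region subtag; a bare 'en' is assumed to mean en-US."""
    parts = locale.split("-")
    return parts[1] if len(parts) > 1 else "US"


def distance(desired, supported):
    """Regional distance: 0 if equal, smaller within a cluster than across."""
    d_region, s_region = region_of(desired), region_of(supported)
    if d_region == s_region:
        return 0
    same_cluster = (d_region in US_CLUSTER) == (s_region in US_CLUSTER)
    return SAME_CLUSTER_DISTANCE if same_cluster else CROSS_CLUSTER_DISTANCE


def best_match(desired, supported):
    """Smallest distance wins; paradigm locales win ties."""
    return min(
        supported,
        key=lambda s: (distance(desired, s), s not in PARADIGM_LOCALES),
    )


print(best_match("en-SA", ["en-GU", "en", "en-IN", "en-GB"]))  # en-GB
```

Sorting on the pair (distance, not-paradigm) keeps the paradigm preference from ever overriding a genuinely closer candidate, which matches the role the text gives it.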
The paradigmLocales also allow matching to macroregions. For example, desired=[es-419] should match to {es-MX} more closely than to {es}, and vice versa: {es-MX} should match more closely to {es-419} than to {es}. But es-MX should match more closely to es-419 than to any of the other es-419 sublocales. In general, in the absence of other distance data, there is a ‘paradigm’ in each cluster that the others should match more closely to: en(-US), en-GB, es(-ES), es-419, ru(-RU)...
XML Format
There are two kinds of data that can be expressed in LDML: language-dependent data and supplementary data. In either case, data can be split across multiple files, which can be in multiple directory trees.
For example, the language-dependent data for Japanese in CLDR is present in the following files:
common/collation/ja.xml
common/main/ja.xml
common/rbnf/ja.xml
common/segmentations/ja.xml
Data for cased languages such as French are in files like:
common/casing/fr.xml
The status of the data is the same, whether or not data is split. That is, for the purpose of validation and lookup, all of the data for the above ja.xml files is treated as if it were in a single file. These files have the <ldml> root element and use ldml.dtd. The file name must match the identity element. For example, the file pa_Arab_PK.xml must contain the following elements: