Glossary
Glossary
Glossary of Unicode Terms
P-Q
X-Y
This glossary is updated periodically to stay synchronized with changes to various standards maintained by the Unicode Consortium.
See
About
Unicode Terminology
for translations of various terms. There is also
an
FAQ
section on the website.
Abjad
. A writing system in which only
consonants are indicated. The term “abjad” is derived from the first
four letters of the traditional order of the Arabic script:
alef,
beh, jeem, dal
. (See
Section 6.1, Writing Systems
.)
Abstract Character
A unit of information used for the organization, control, or
representation of textual data. (See definition D7 in
Section 3.4,
Characters and Encoding
.)
Abstract Character Sequence
An ordered sequence of one or more abstract characters. (See
definition D8 in
Section 3.4,
Characters and Encoding
.)
Abugida
. A writing system in which
consonants are indicated by the base letters that have an inherent
vowel, and in which other vowels are indicated by additional
distinguishing marks of some kind modifying the base letter. The
term “abugida” is derived from the first four letters of the
Ethiopic script in the Semitic order:
alf, bet, gaml, dant
. (See
Section 6.1, Writing Systems
.)
Accent Mark
. A
mark placed above, below, or to the side of a character to alter its
phonetic value. (See also
diacritic
.)
Acrophonic
. Denoting letters or numbers by the first letter of their
name. For example, the Greek acrophonic numerals are variant forms
of such initial letters.
Aksara
. (1) In Sanskrit grammar, the term for “letter” in general,
as opposed to consonant (
vyanjana
) or vowel (
svara
). Derived from
the first and last letters of the traditional ordering of Sanskrit
letters—“a” and “ksha”. (2) More generally, in Indic writing
systems,
aksara
refers to an
orthographic syllable
Algorithm
. A term used in a broad sense in the Unicode Standard, to
mean the logical description of a process used to achieve a
specified result. This does not require the actual procedure
described in the algorithm to be followed; any implementation is
conformant as long as the results are the same.
Alphabet
. A writing system in which both consonants and vowels are
indicated. The term “alphabet” is derived from the first two letters
of the Greek script:
alpha, beta
. (See
Section 6.1, Writing Systems
.)
Alphabetic Property
. Informative property of the primary units of
alphabets and/or syllabaries. (See
Section 4.10, Letters, Alphabetic, and Ideographic
.)
Alphabetic Sorting
. (See
collation
.)
AMTRA
. Acronym for
Arabic Mark Transient Reordering Algorithm
. (See
Unicode Standard Annex #53, “Unicode Arabic Mark Rendering.”
Annotation
. The association of secondary textual content with a
point or range of the primary text. (The value of a particular
annotation
is
considered to be a part of the “content” of the text.
Typical examples include glossing, citations, exemplification,
Japanese yomi, and so on.)
ANSI
. (1) The American National Standards Institute. (2) The
Microsoft collective name for all Windows code pages. Sometimes used
specifically for code page 1252, which is a superset of ISO/IEC
8859-1.
Apparatus Criticus
. Collection of conventions used by editors to
annotate and comment on text.
Arabic Digits
. The term "Arabic digits"
may mean either the digits in the Arabic script (see
Arabic-Indic digits
) or the
ordinary ASCII digits in contrast to Roman numerals (see
European digits
). When the term
"Arabic digits" is used in Unicode specifications, it means
Arabic-Indic digits. See
Terminology for Digits
for additional information on terminology related to digits.
Arabic-Indic Digits
Forms of decimal digits used in most parts of the
Arabic world (for instance, U+0660, U+0661, U+0662, U+0663). Although
European digits
(1, 2, 3,…)
derive historically from these forms, they are visually distinct and
are coded separately. (Arabic-Indic digits are sometimes called Indic
numerals; however, this nomenclature leads to confusion with the
digits currently used with the scripts of India.) Variant forms of Arabic-Indic digits used
chiefly in Iran and Pakistan are referred to as
Eastern Arabic-Indic
digits
. (See
Section 9.2, Arabic
.) See
Terminology for Digits
for additional information on terminology related to digits.
ASCII
. (1) The American Standard Code for Information Interchange, a
7-bit coded character set for information interchange. It is the
U.S. national variant of ISO/IEC 646 and is formally the U.S.
standard ANSI X3.4. It was proposed by ANSI in 1963 and finalized in
1968. (2) The set of 128 Unicode characters from U+0000 to U+007F,
including control codes as well as graphic characters. (3) ASCII has
been incorrectly used to refer to various 8-bit character encodings
that include ASCII characters in the first 128 code points.
ASCII digits
. The digit characters U+0030 to U+0039. Also known as
European digits
. See
Terminology for Digits
for additional information on terminology related to digits.
Assigned Character
. A code point that is assigned to an abstract character.
This refers to graphic, format, control, and private-use characters
that have been encoded in the Unicode Standard. (See
Section 2.4, Code Points and Characters
.)
Assigned Code Point
. (See
designated code point
.)
Atomic Character
. A character that is not decomposable. (See
decomposable character
.)
Base Character
. Any graphic
character except for those with the General Category of Combining
Mark (M). (See definition D51 in
Section 3.6, Combination
.) In a
combining character sequence, the base character is the initial
character, which the combining marks are applied to.
Basic Multilingual Plane
Plane 0, abbreviated as BMP.
Bicameral
. A script that
distinguishes between two cases. (See
case
.)
Most often used in the context of Latin-based alphabets of Europe
and elsewhere in the world.
Bidi
. Abbreviation of bidirectional,
in reference to mixed left-to-right and right-to-left text.
Bidirectional Display
The process or result of mixing left-to-right text and right-to-left
text in a single line. (See
Unicode Standard
Annex #9, “Unicode Bidirectional Algorithm.”
Big-endian
. A computer
architecture that stores multiple-byte numerical values with the
most significant byte (MSB) values first.
Binary Files
. Files containing
nontextual information.
Block
. A grouping of characters
within the Unicode encoding space used for organizing code charts. Each block is a uniquely named, continuous, non-overlapping range of code points, containing a multiple of 16 code points, and starting at a location that is a multiple of 16. A block may contain unassigned
code points, which are reserved.
BMP
. Acronym for
Basic Multilingual
Plane
BMP Character
A Unicode encoded
character having a BMP code point. (See
supplementary character
.)
BMP Code Point
. A Unicode code
point between U+0000 and U+FFFF. (See
supplementary code point
.)
BNF
. Acronym for
Backus-Naur Form
a formal meta-syntax for describing context-free syntaxes. (For
details, see
Appendix A, Notational Conventions
.)
BOCU-1
. Acronym for Binary Ordered
Compression for Unicode. A Unicode compression scheme that is
MIME-compatible (directly usable for e-mail) and preserves binary
order, which is useful for databases and sorted lists.
BOM
. Acronym for
byte order mark
Bopomofo
. An alphabetic script used
primarily in the Republic of China (Taiwan) to write the sounds of
Mandarin Chinese and some other dialects. Each symbol corresponds to
either the syllable-initial or syllable-final sounds; it is
therefore a subsyllabic script in its primary usage. The name is
derived from the names of its first four elements. More properly
known as
zhuyin zimu
or
zhuyin fuhao
in Mandarin
Chinese.
Boustrophedon
. A pattern of
writing seen in some ancient manuscripts and inscriptions, where
alternate lines of text are laid out in opposite directions, and
where right-to-left lines generally use glyphs mirrored from their
left-to-right forms. Literally, “as the ox turns,” referring to the
plowing of a field.
Braille
. A writing system using a
series of raised dots to be read with the fingers by people who are
blind or whose eyesight is not sufficient for reading printed
material. (See
Section 21.1, Braille
.)
Braille Pattern
. One of the
64 (for six-dot Braille) or 256 (for eight-dot Braille) possible
tangible dot combinations.
Byte
. (1) The minimal unit of
addressable storage for a particular computer architecture. (2) An
octet. Note that many early computer architectures used bytes larger
than 8 bits in size, but the industry has now standardized almost
uniformly on 8-bit bytes. The Unicode Standard follows the current
industry practice in equating the term
byte
with
octet
and using the more familiar term
byte
in all contexts. (See
octet
.)
Byte Order Mark
. The Unicode
character U+FEFF when used to indicate the byte order of a text.
(See
Section 2.13, Special Characters and Noncharacters
, and
Section 23.8, Specials
.)
Byte Serialization
. The
order of a series of bytes determined by a computer architecture.
Byte-Swapped
. Reversal of the
order of a sequence of bytes.
Camelcase
. A casing convention for compound terms or identifiers, in which the letters are mostly lowercased, but component words or abbreviations may be capitalized. For example, "ThreeWordTerm" or "threeWordTerm".
Canonical
. (1) Conforming to the
general rules for encoding—that is, not compressed, compacted, or in
any other form specified by a higher protocol. (2) Characteristic of
a normative mapping and form of equivalence specified in
Chapter
3, Conformance
Canonical Composition
. A step in the algorithm for Unicode Normalization Forms, during which decomposed sequences are replaced by primary composites, where possible. (See definition D115 in
Section 3.11, Normalization Forms
.)
Canonical Decomposable Character
A character that is not identical to its canonical decomposition.
(See definition D69 in
Section 3.7, Decomposition
.)
Canonical Decomposition
Mapping to an inherently equivalent sequence—for example, mapping ä
to a + combining umlaut. (For a full, formal definition, see definition D68 in
Section 3.7, Decomposition
.)
Canonical Equivalence
The relation between two character sequences whose
full canonical decompositions are identical. (See definition
D70 in
Section 3.7, Decomposition
.)
Canonical Equivalent
Two character sequences are said to be canonical equivalents if
their full canonical decompositions are identical. (See definition
D70 in
Section 3.7, Decomposition
.)
Canonical Ordering
. The
order of a
combining character sequence
that results from the application of
the Canonical Ordering Algorithm, a step in the process of
normalization
of strings.
See definition
D109 in Section 3.11, Normalization Forms
Cantillation Mark
. A mark
that is used to indicate how a text is to be chanted or sung.
Capital Letter
Synonym for
uppercase letter
. (See
case
.)
Case
. (1) Feature of certain
alphabets where the letters have two distinct forms. These variants,
which may differ markedly in shape and size, are called the
uppercase
letter (also known as
capital
or
majuscule
) and the
lowercase
letter (also known as
small
or
minuscule
). (2) Normative
property of characters, consisting of uppercase, lowercase, and titlecase (Lu, Ll, and Lt). (See
Section 4.2, Case
.)
Case Folding
. The mapping of
strings to a particular case form, to facilitate searching and sorting of text.
Case foldings may be simple, when the case mappings are required not
to change the length of the strings to compare, or full, when the case mappings
may change the length of the strings to compare. (See
Section 3.13.3, Default Case Folding
.)
Case Mapping
. The association of
the uppercase, lowercase, and titlecase forms of a letter. (See
Section 5.18, Case Mappings
.)
Case-Ignorable
. A character C is defined to be
case-ignorable
if C has the value MidLetter (ML), MidNumLet (MB), or Single_Quote (SQ) for the Word_Break property or its General_Category is one of Nonspacing_Mark (Mn), Enclosing_Mark (Me), Format (Cf), Modifier_Letter (Lm), or Modifier_Symbol (Sk). (See definition D136 in
Section 3.13, Default Case Algorithms
.)
Case-Ignorable Sequence
. A sequence of zero or more case-ignorable
characters. (See definition D137 in
Section 3.13, Default Case Algorithms
.)
CCC
. Short name for the Canonical_Combining_Class property,
usually lowercased: ccc.
CCS
. (1) Acronym for
coded character set
. (2) Also used as an acronym for
combining character sequence
Cedilla
. A mark originally placed
beneath the letter c in French, Portuguese, and Spanish to indicate
that the letter is to be pronounced as an
s,
as in
façade
Obsolete Spanish diminutive of
ceda
, the letter
CEF
. Acronym for
character encoding form
CES
. Acronym for
character encoding scheme
Character
. (1) The smallest
component of written language that has semantic value; refers to the
abstract meaning and/or shape, rather than a specific shape (see
also
glyph
), though in code tables some form of visual
representation is essential for the reader’s understanding. (2)
Synonym for
abstract character
. (3) The basic unit of
encoding for the Unicode character encoding. (4) The English name
for the ideographic written elements of Chinese origin. [See
ideograph
(2).]
Character Block
. (See
block
.)
Character Class
. A set of characters sharing a particular set of
properties.
Character Encoding Form
Mapping from a character set definition to the actual code units
used to represent the data.
Character Encoding Scheme
character encoding form
plus byte serialization. There are
seven character encoding schemes in Unicode: UTF-8, UTF-16,
UTF-16BE, UTF-16LE, UTF-32, UTF-32BE, and UTF-32LE.
Character Entity
. Expression of the form & for "&" or for the no-break space. These are found in markup language files like HTML or XML. There are also numerically defined character entities. (See also
character escape
.)
Character Escape
. A numerical expression of the form \uXXXX, \xXXXX or XXXX; where X is a hex digit, or dddd; where d is a decimal digit. These are found in programming source code or markup language files (such as HTML or XML).
Character Name
. A unique string
used to identify each abstract character encoded in the standard.
(See definition D4 in
Section 3.3, Semantics
.)
Character Name Alias
. An additional unique string identifier, other
than the character name, associated with an encoded character in the
standard. (See definition D5 in
Section 3.3, Semantics
.)
Character Properties
. A
set of property names and property values associated with individual
characters. (See
Chapter
4, Character Properties
.)
Character Repertoire
The collection of characters included in a character set.
Character Sequence
Synonym for
abstract character sequence
Character Set
A collection of elements used to represent textual information.
Charset
. (See
coded character set
.)
Chillu
. Abbreviation for
chilaaksharam
(singular) (
cillakṣaram
).
Refers to any of a set of sonorant consonants in Malayalam, when
appearing in syllable-final position with no inherent vowel.
Choseong
. A sequence of one or more leading consonants in Korean.
Chu Hán
. The name for Han characters used in Vietnam; derived from
hànzì
Chu Nôm
. A demotic script of Vietnam
developed from components of Han characters. Its creators used
methods similar to those used by the Chinese in creating Han
characters.
CJK
. Acronym for Chinese, Japanese, and
Korean. A variant,
CJKV
means Chinese, Japanese, Korean, and Vietnamese.
CJK Unified Ideograph
A Han character that has undergone the process of Han unification (conducted primarily by
the Ideographic Research Group) and been encoded as a single ideograph with one or more
clearly identified CJK source mappings. CJK unified ideographs have no decomposition
mappings, and the set of them in the Unicode Standard is normatively specified by
the Unified_Ideograph property.
CLDR
(See
Unicode Common Locale Data Repository
.)
Coded Character
. (See
encoded character
.)
Coded Character Representation
. Synonym for
coded character
sequence
Coded Character Sequence
An ordered sequence of one or more code points. Normally, this
consists of a sequence of encoded characters, but it may also
include noncharacters or reserved code points. (See
definition D12 in
Section 3.4,
Characters and Encoding
.)
Coded Character Set
. A
character set in which each character is assigned a numeric code
point. Frequently abbreviated as
character set, charset
, or
code set
; the acronym
CCS
is also used.
Code Page
. A coded character set,
often referring to a coded character set used by a personal
computer—for example, PC code page 437, the default coded character
set used by the U.S. English version of the DOS operating system.
Code Point
. (1) Any value in the
Unicode codespace; that is, the range
of integers from 0 to 10FFFF
16
. (See definition D10 in
Section 3.4,
Characters and Encoding
.) Not all code points are assigned to encoded characters. See
code point type
. (2) A value, or position, for a
character, in any coded character set.
Code Point Type
. Any of the seven fundamental classes of code points in the standard: Graphic, Format, Control, Private-Use, Surrogate, Noncharacter, Reserved.
(See definition D10a in
Section 3.4,
Characters and Encoding
.)
Code Position
. Synonym for
code point
. Used in ISO character encoding standards.
Code Set
. (See
coded character set
.)
Codespace
. (1) A range of numerical values available for encoding
characters. (2) For the Unicode Standard, a range of integers from 0
to 10FFFF
16
. (See definition D9 in
Section 3.4,
Characters and Encoding
.)
Code Unit
. The minimal bit combination that can represent a unit of
encoded text for processing or interchange. The Unicode Standard
uses 8-bit code units in the UTF-8 encoding form, 16-bit code units
in the UTF-16 encoding form, and 32-bit code units in the UTF-32
encoding form. (See definition D77 in
Section 3.9, Unicode Encoding Forms
.)
Code Value
. Obsolete synonym for
code unit
Codomain
For a mapping, the codomain is the set of code points or sequences that it maps to, while the domain
is the set of values that are mapped. For example, a canonical decomposition is a mapping from a
set of code points to a set of sequences; the codomain is the set of canonical equivalent mappings. (See also
domain
.)
Collation
. The process of
ordering units of textual information. Collation is usually specific
to a particular language. Also known as
alphabetizing
or
alphabetic sorting
Unicode Technical
Standard #10, “Unicode Collation Algorithm,"
defines a complete, unambiguous, specified
ordering for all characters in the Unicode Standard.
Combining Character
. A
character with the General Category of Combining Mark (M). (See
definition D52 in
Section 3.6, Combination
.) (See also
nonspacing mark
.)
Combining Character Sequence
. A maximal character sequence consisting of either
a base character followed by a sequence of one or more characters
where each is a combining character,
zero width joiner
, or
zero width non-joiner
or a sequence of one or more characters where each is a combining
character,
zero width joiner
or
zero width non-joiner
(See definition D56 in
Section 3.6, Combination
.)
Combining Class
. A numeric
value in the range 0..254 given to each Unicode code point, formally
defined as the property Canonical_Combining_Class. (See definition
D104 in
Section 3.11, Normalization Forms
.)
Combining Mark
. A commonly used synonym for
combining character
Compatibility
. (1) Consistency with existing practice or preexisting character encoding standards. (2) Characteristic of a normative mapping and form of equivalence specified in
Section 3.7, Decomposition
Compatibility Character
A character that would not have been
encoded except for compatibility and round-trip convertibility with
other standards. (See
Section 2.3, Compatibility Characters
.)
Compatibility Composite Character
. Synonym for
compatibility
decomposable character
Compatibility Decomposable Character
A character whose compatibility decomposition is not identical to
its canonical decomposition. (See definition D66 in
Section 3.7, Decomposition
.)
Compatibility Decomposition
. Mapping to a roughly equivalent sequence
that may differ in style. (For a full, formal definition, see
definition D65 in
Section 3.7, Decomposition
.)
Compatibility Equivalence
The relation between two character sequences whose
full compatibility decompositions are identical. (See definition
D67 in
Section 3.7, Decomposition
.)
Compatibility Equivalent
Two character sequences are said to be compatibility equivalents if
their full compatibility decompositions are identical. (See definition
D67 in
Section 3.7, Decomposition
.)
Compatibility Ideograph
. A Han character encoded for compatibility with some East Asian character encoding, but which is
not encoded as a
CJK unified ideograph
. Instead, each
compatibility ideograph has a
canonical
decomposition
mapping to a particular CJK unified ideograph.
Compatibility Precomposed Character
. Synonym for
compatibility
decomposable character
Compatibility Variant
. A character that generally can be remapped to
another character without loss of information other than formatting.
Composite Character
. (See
decomposable character
.)
Composite Character Sequence
. (See
combining character sequence
.)
Composition Exclusion
. A Canonical Decomposable Character which has the property value Composition_Exclusion=True. (Used in the definition of Unicode Normalization Forms.) (See
definition D112 in
Section 3.11, Normalization Forms
.)
Conformance
. Adherence to a specified set of criteria for use of a
standard. (See
Chapter
3, Conformance
.)
Confusable
. Of similar or identical appearance.
When referring to characters in strings, the appearance of confusable characters
can make different identifiers hard or impossible to distinguish. (See also
Unicode Technical Standard #39, "Unicode Security Mechanisms"
.)
Conjunct Form
. A ligated form representing a
consonant conjunct
Consonant Cluster
. A sequence of two or more consonantal sounds.
Depending on the writing system, a consonant cluster may be
represented by a single character or by a sequence of characters.
(Contrast
digraph
.)
Consonant Conjunct
. A sequence of two or more adjacent consonantal
letterforms, consisting of a sequence of one or more dead consonants
followed by a normal, live consonant letter. A consonant conjunct
may be ligated into a single conjunct form, or it may be represented
by graphically separable parts, such as subscripted forms of the
consonant letters. Consonant conjuncts are associated with the
Brahmi family of Indic scripts. (See
Section 12.1, Devanagari
.)
Contextual Variant
. A text element can have a presentation form that
depends on the textual context in which it is rendered. This
presentation form is known as a
contextual variant
Contributory Property
. A simple property defined merely to make the statement of a rule defining a derived property more compact or general.
(See definition D35a in
Section 3.5, Properties
.)
Control Codes
. The 65 characters in the ranges U+0000..U+001F and
U+007F..U+009F. Also known as
control characters
Core Specification
. The central part of the Unicode Standard–the portion which up until Version 5.0 was published as a separate book. Starting with Version 5.2, this part of the standard has been published online only, rather than as a book. The core specification consists of the general introduction and framework for the standard, the formal conformance requirements, many implementation guidelines, and extensive chapters providing information about all the encoded characters, organized by script or by significant classes of characters. Formally, a version of the Unicode Standard is defined by an edition of this core specification, together with the
Code Charts
Unicode Standard Annexes
, and the
Unicode Character Database
Cursive
. Writing where the letters of a word are connected.
Dasia
. Greek term for rough breathing mark, used in polytonic Greek character names.
DBCS
. Acronym for
double-byte character set
Dead Consonant
. An Indic
consonant character followed by a
virama
character. This
sequence indicates that the consonant has lost its inherent vowel.
(See
Section 12.1, Devanagari
.)
Decimal Digits
. Digits that
can be used to form decimal-radix numbers.
Decomposable Character
A character that is equivalent to a sequence of one or more other
characters, according to the decomposition mappings found in the
Unicode Character Database, and those described in
Section 3.12, Conjoining Jamo Behavior
. It may also be known as a
precomposed
character or a
composite
character. (See definition D63 in
Section 3.7, Decomposition
.)
Decomposition
. (1) The process
of separating or analyzing a text element into component units.
These component units may not have any functional status, but may be
simply formal units—that is, abstract shapes. (2) A sequence of one
or more characters that is equivalent to a decomposable character.
(See definition D64 in
Section 3.7, Decomposition
.)
Decomposition Mapping
. A
mapping from a character to a sequence of one or more characters
that is a canonical or compatibility equivalent and that is listed
in the character names list or described in
Section 3.12, Conjoining Jamo Behavior
. (See definition D62 in
Section 3.7, Decomposition
.)
Default Ignorable
. Default ignorable code points are those that should be ignored by default in rendering unless explicitly supported. They have no visible glyph or advance width in and of themselves, although they may affect the display, positioning, or adornment of adjacent or surrounding characters.
(See
Section 5.21, Ignoring Characters in Processing
.)
Defective Combining Character Sequence
. A combining character
sequence that does not start with a base character. (See definition
D57 in
Section 3.6, Combination
.)
Demotic Script
. (1) A script
or a form of a script used to write the vernacular or common speech
of some language community. (2) A simplified form of the ancient
Egyptian hieratic writing.
Dependent Vowel
. A symbol or
sign that represents a vowel and that is attached or combined with
another symbol, usually one that represents a consonant. For
example, in writing systems based on Arabic, Hebrew, and Indic
scripts, vowels are normally represented as dependent vowel signs.
Deprecated
. Of a coded character
or a character property, strongly discouraged from use. (Not the
same as
obsolete
.)
Deprecated Character
. A
coded character whose use is strongly discouraged. Such characters
are retained in the standard, indefinitely but should not be used. (See
definition D13 in
Section 3.4,
Characters and Encoding
.)
Designated Code Point
Any code point that has either been assigned to an abstract
character (
assigned characters
) or that has otherwise been
given a normative function by the standard (surrogate code points
and noncharacters). This definition excludes reserved code points.
Also known as
assigned code point
. (See
Section 2.4 Code Points and Characters
.)
Deterministic Comparison
. A string comparison in which strings that do not have identical contents will compare as unequal. There are two main varieties, depending on the sense of "identical:" (a) binary equality, or (b) canonical equivalence. This is a property of the comparison mechanism, and not of the sorting algorithm. Also known as
stable
(or
semi-stable
comparison
Deterministic Sort
. A sort algorithm which returns exactly the same output each time it is applied to the same input. This is a property of the sorting algorithm, and not of the comparison mechanism. For example, a randomized Quicksort (which picks a random element as the pivot element, for optimal performance) is not deterministic. Multiprocessor implementations of a sort algorithm may also not be deterministic.
Diacritic
. (1) A mark applied or
attached to a symbol to create a new symbol that represents a
modified or new value. (2) A mark applied to a symbol irrespective
of whether it changes the value of that symbol. In the latter case,
the diacritic usually represents an independent value (for example,
an accent, tone, or some other linguistic information). Also called
diacritical mark
or
diacritical
. (See also
combining character
and
nonspacing mark
.)
Diaeresis
. Two horizontal dots
over a letter, as in
naïve
. The diaeresis is not
distinguished from the
umlaut
in the Unicode character
encoding. (See
umlaut
.)
Dialytika
. Greek term for
diaeresis
or
trema
, used in Greek character names.
Digits
. (See
Arabic digits
European digits
, and
Indic digits
.) See
Terminology for Digits
for additional information on terminology related to digits.
Digraph
. A pair of signs or symbols
(two graphs), which together represent a single sound or a single
linguistic unit. The English writing system employs many digraphs
(for example,
th, ch, sh, qu,
and so on). The same two
symbols may not always be interpreted as a digraph (for example,
ca
th
ode
versus
ca
th
ouse
). When three signs
are so combined, they are called a
trigraph
. More than three
are usually called an
n-graph
Dingbats
. Typographical symbols and
ornaments.
Diphthong
. A pair of vowels that
are considered a single vowel for the purpose of phonemic
distinction. One of the two vowels is more prominent than the other.
In writing systems, diphthongs are sometimes written with one symbol
and sometimes with more than one symbol (for example, with a
digraph
).
Direction
. (See
paragraph direction
.)
Directionality Property
A property of every graphic character that determines its horizontal
ordering as specified in
Unicode Standard
Annex #9, “Unicode Bidirectional Algorithm.”
(See
Section 4.4,
Directionality
.)
Display Cell
. A rectangular
region on a display device within which one or more glyphs are
imaged.
Display Order
. The order of
glyphs presented in text rendering. (See
logical order
and
Section 2.2, Unicode Design Principles
.)
Domain
1. For a mapping, the domain is the set of code points or sequences that are mapped, while the codomain
is the set of values they are mapped to. For example, a canonical decomposition is a mapping from a
set of code points to a set of sequences; the domain is the entire Unicode codespace. (See also
codomain
.) 2. A realm of administrative autonomy, authority
or control in the Internet, identified by a domain name.
Domain Name
. The part of a network address that identifies it as belonging to a particular domain. (Oxford Languages definition.)
A domain name is a string of characters. The
rules for how Unicode characters can be used in domain names is the concern of
IDNA
and of
UTS #46, Unicode IDNA Compatibility Processing
Double-Byte Character Set
One of a number of character sets defined for representing Chinese,
Japanese, or Korean text (for example, JIS X 0208-1990). These
character sets are often encoded in such a way as to allow
double-byte character encodings to be mixed with single-byte
character encodings. Abbreviated
DBCS
. (See also
multibyte character set
.)
Ductility
. The ability of a cursive
font to stretch or compress the connective baseline to effect text
justification.
Dynamic Composition
Creation of composite forms such as accented letters or Hangul
syllables from a sequence of characters.
EBCDIC
. Acronym for Extended
Binary-Coded Decimal Interchange Code. A group of coded character
sets used on mainframes that consist of 8-bit coded characters.
EBCDIC coded character sets reserve the first 64 code points (x00
to x3F) for control codes, and reserve the range x41 to xFE for graphic characters. The English
alphabetic characters are in discontinuous segments with uppercase
at xC1 to xC9, xD1 to xD9, xE2 to xE9, and lowercase at x81 to x89,
x91 to x99, xA2 to xA9.
ECCS
Acronym for
extended combining character sequence
EGC
Acronym for
extended grapheme cluster
Embedding
. A concept relevant to
bidirectional behavior. (See
Unicode Standard
Annex #9, “Unicode Bidirectional Algorithm,”
for detailed terminology and definitions.)
Emoji
. (1) The Japanese word for "pictograph." (2) Certain pictographic and other symbols encoded in the Unicode Standard that are commonly given a colorful or playful presentation when displayed on devices. Many of the emoji in Unicode were originally encoded for compatibility with Japanese telephone symbol sets. (3) Colorful or playful symbols which are not encoded as characters but which are widely implemented as graphics. (See
pictograph
.)
Emoticon
. A symbol added to text to express emotional affect or reaction—for example, sadness, happiness, joking intent, sarcasm, and so forth. Emoticons are often expressed by a conventional kind of "ASCII art," using sequences
of punctuation and other symbols to portray likenesses of facial expressions. In Western contexts these are often turned sideways, as :-) to express a happy face; in East Asian contexts other conventions often portray a facial expression without turning, as ^-^. Rendering systems often recognize conventional emoticon sequences and display them as colorful or even animated glyphs in text. There is also a set of dedicated pictographic symbols—mostly representing different facial expressions—encoded as characters in the Unicode Standard. (See
pictograph
.)
Encapsulated Text
. (1) Plain text surrounded by formatting information. (2) Text recoded to pass through narrow transmission channels or to match communication protocols.
Enclosing Mark
. A nonspacing mark with the General Category of
Enclosing Mark (Me). (See definition D54 in
Section 3.6, Combination
.) Enclosing marks are a subclass of nonspacing marks
that surround a base character, rather than merely being placed
over, under, or through it.
Encoded Character
. An
association (or mapping) between an
abstract character
and a
code point
. (See definition D11 in
Section 3.4,
Characters and Encoding
.) By itself, an abstract character has no numerical
value, but the process of “encoding a character” associates a
particular code point with a particular abstract character, thereby
resulting in an “encoded character.”
Encoding Form
. (See
character encoding form
.)
Encoding Scheme
. (See
character encoding scheme
.)
Equivalence
. In the context of
text processing, the process or result of establishing whether two
text elements are identical in some respect.
Equivalent Sequence
. (See
canonical equivalent
.)
Escape Sequence
. A sequence
of bytes that is used for code extension. The first byte in the
sequence is
escape
(hex 1B).
EUDC
. Acronym for end-user defined character. A character defined by
an end user, using a private-use code point, to represent a
character missing in a particular character encoding. These are
common in East Asian implementations.
European Digits
. Forms of
decimal digits first used in Europe and now used worldwide.
Historically, these digits were derived from the Arabic digits; they
are sometimes called “Arabic numerals,” but this nomenclature leads
to confusion with the real
Arabic-Indic digits
. Also called "Western digits" and "Latin digits." See
Terminology for Digits
for additional information on terminology related to digits.
Extended Base
Any base character, or any standard Korean syllable block. (See definition D51a in
Section 3.6, Combination
.)
Extended Combining Character Sequence
A maximal character sequence consisting of either an extended base followed by a sequence of one or more characters where each is a combining character,
zero width joiner
, or
zero width non-joiner
; or a sequence of one or more characters where each is a combining character,
zero width joiner
, or
zero width non-joiner
. Abbreviated as
ECCS
. (See definition D56a in
Section 3.6, Combination
.)
Extended Grapheme Cluster
The text between extended grapheme cluster boundaries as specified by
Unicode Standard Annex #29, "Unicode Text Segmentation."
Abbreviated as
EGC
. (See definition D61 in
Section 3.6, Combination
.)
Fancy Text
. (See
rich text
.)
Fixed Position Class
. A subset of the range of numeric values for
combining classes—specifically, any value in the range 10..199. (See
definition D105 in
Section 3.11, Normalization Forms
.)
Floating
diacritic, accent, mark
). (See
nonspacing mark
.)
Folding
. An operation that maps similar characters to a common
target, such as uppercasing or lowercasing a string. Folding
operations are most often used to temporarily ignore certain
distinctions between characters.
Font
. A collection of glyphs used for
the visual depiction of character data. A font is often associated
with a set of parameters (for example, size, posture, weight, and serifness), which, when set
to particular values, generate a collection of imagable glyphs.
Format Character
. A character that is inherently invisible but that
has an effect on the surrounding characters.
Format Code
. Synonym for
format character
Format Control Character
. Synonym for
format character
Formatted Text
. (See
rich text
.)
FSS-UTF
. Acronym for
File System Safe UCS Transformation Format
published by the X/Open Company Ltd., and intended for the UNIX
environment. Now known as
UTF-8
Full Composition Exclusion
. A Canonical Decomposable Character which has the property value Full_Composition_Exclusion=True.
(Used in the definition of Unicode Normalization Forms.) (See
definition D113 in
Section 3.11, Normalization Forms
.)
Fullwidth
. Characters of East Asian
character sets whose glyph image extends across the entire character
display cell. In legacy character sets, fullwidth characters are normally encoded in two or
three bytes. The Japanese term for fullwidth characters is
zenkaku
FVS
. Acronym for
Mongolian Free Variation Selector
G11n
. (See
globalization
.)
GC
1. Acronym for
grapheme cluster
. 2. Short name for the General_Category property, usually lowercased: gc.
GCGID
. Acronym for Graphic Character
Global Identifier. These are listed in the IBM document
Character
Data Representation Architecture, Level 1, Registry SC09-1391
General Category
. Partition
of the characters into major classes such as letters, punctuation,
and symbols, and further subclasses for each of the major classes.
(See
Section 4.5, General Category
.)
Generative
. Synonym for
productive
Globalization
. (1) The overall process for internationalization and localization of software products. (2) a synonym for internationalization. Also known by the abbreviation "g11n". Note that the meaning of "globalization" which is relevant to software products should be distinguished from the more widespread use of "globalization" in the context of economics. (See
internationalization
localization
.)
Glyph
. (1) An abstract form that
represents one or more glyph images. (2) A synonym for
glyph image
. In displaying Unicode
character data, one or more glyphs may be selected to depict a
particular character. These glyphs are selected by a rendering
engine during composition and layout processing. (See also
character
.)
Glyph Code
. A numeric code that
refers to a glyph. Usually, the glyphs contained in a font are
referenced by their glyph code. Glyph codes may be local to a
particular font; that is, a different font containing the same
glyphs may use different codes.
Glyph Identifier
. Similar to
a glyph code, a glyph identifier is a label used to refer to a glyph
within a font. A font may employ both local and global glyph
identifiers.
Glyph Image
. The actual, concrete
image of a glyph representation having been rasterized or otherwise
imaged onto some display surface.
Glyph Metrics
. A collection of
properties that specify the relative size and positioning along with
other features of a glyph.
Grapheme
. (1) A minimally
distinctive unit of writing in the context of a particular writing
system. For example, ‹b› and ‹d› are distinct graphemes in English
writing systems because there exist distinct words like big and dig.
Conversely, a lowercase italiform letter
and a
lowercase Roman letter a are not distinct graphemes because no word
is distinguished on the basis of these two different forms. (2) What
a user thinks of as a character.
Grapheme Base
. A character
with the property Grapheme_Base, or any standard Korean syllable
block. (See definition D58 in
Section 3.6, Combination
.)
Grapheme Cluster
. The text between grapheme cluster boundaries as specified by
Unicode Standard Annex #29, "Unicode Text Segmentation."
(See definition D60 in
Section 3.6, Combination
.) A grapheme cluster represents a horizontally segmentable unit of text, consisting of some grapheme base (which may consist of a Korean syllable) together with any number of nonspacing marks applied to it.
Grapheme Extender
. A
character with the property Grapheme_Extend. (See definition D59 in
Section 3.6, Combination
.) Grapheme extender characters consist of
all nonspacing marks,
zero
width joiner
zero
width non-joiner
, and a small number of spacing marks.
Graphic Character
. A
character with the General Category of Letter (L), Combining Mark
(M), Number (N), Punctuation (P), Symbol (S), or Space Separator
(Zs). (See definition D50 in
Section 3.6. Combination
.)
Guillemet
. Punctuation marks
resembling small less-than and greater-than signs, used as quotation
marks in French and other languages. (See “Language-Based Usage of
Quotation Marks” in
Section 6.2, General Punctuation
.)
Halant
. A preferred Hindi synonym
for a
virama
. It literally means
killer
, referring to
its function of
killing
the inherent vowel of a consonant
letter. (See
virama
.)
Half-Consonant Form
. In
the Devanagari script and certain other scripts of the Brahmi family
of Indic scripts, a dead consonant may be depicted in the so-called
half-form. This form is composed of the distinctive part of a
consonant letter symbol without its vertical stem. It may be used to
create conjunct forms that follow a horizontal layout pattern. Also
known as
half-form
Halfwidth
. Characters of East Asian
character sets whose glyph image occupies half of the character
display cell. In legacy character sets, halfwidth characters are
normally encoded in a single byte. The Japanese term for halfwidth characters is
hankaku
Han Characters
. Ideographic
characters of Chinese origin. (See
Section 18.1, Han
.)
Hangul
. The name of the script used to
write the Korean language.
Hangul Syllable
. (1) Any of
the 11,172 encoded characters of the Hangul Syllables character
block, U+AC00..U+D7A3. Also called a
precomposed Hangul syllable
to clearly distinguish it from a Korean syllable block. (2) Loosely
speaking, a
Korean syllable block
Hanja
. The Korean name for Han
characters; derived from the Chinese word
hànzì
Hankaku
. (See
halfwidth
.)
Han Unification
. The process
of identifying Han characters that are in common among the writing
systems of Chinese, Japanese, Korean, and Vietnamese.
Hànzì
. The Mandarin Chinese name
for Han characters.
Harakat
Marks used in the Arabic script to indicate
vocalization with short vowels. A subtype of
tashkil
Hasant
. The Bangla name for
halant
. (See
virama
.)
Higher-Level Protocol
Any agreement on the interpretation of Unicode characters that
extends beyond the scope of this standard. Note that such an
agreement need not be formally announced in data; it may be implicit
in the context. (See definition D16 in
Section 3.4,
Characters and Encoding
.)
High-Surrogate Code Point
A Unicode code point in the range U+D800 to U+DBFF. (See definition
D71 in
Section 3.8, Surrogates
.)
High-Surrogate Code Unit
A 16-bit code unit in the range D800
16
to DBFF
16
used in UTF-16 as the leading code unit of a surrogate pair. Also
known as a
leading surrogate
. (See definition D72 in
Section 3.8, Surrogates
.)
Hiragana
(ひらがな). One of two standard
syllabaries associated with the Japanese writing system. Hiragana
syllables are typically used in the representation of native
Japanese words and grammatical particles, or are
used as a fallback representation of other words when the corresponding
kanji is either difficult to remember or obscure.
(See also
katakana
.)
Horizontal Extension
. This refers to the process of adding a new IRG source reference to an existing CJK unified ideograph, along with a new representative glyph for the code charts that shows how the character appears in its source. It does not involve encoding a new character, but rather just adding the source reference and new glyph to the code charts.
HTML
. HyperText Markup Language. A text
description language related to SGML; it mixes text format markup
with plain text content to describe formatted text. HTML is
ubiquitous as the source language for Web pages on the Internet.
Starting with HTML 4.0, the Unicode Standard functions as the
reference character set for HTML content. (See also
SGML
.)
I18n
. (See
internationalization
.)
IANA
. Acronym for Internet Assigned
Numbers Authority.
ICU
. Acronym for International Components for Unicode, an Open
Source set of C/C++ and Java libraries for Unicode and software
internationalization support. For information, see
Ideograph
(or
ideogram
). (1) Any symbol that primarily denotes an idea or concept in contrast to a sound or pronunciation—for example, ♻, which denotes the concept of recycling by a series of bent arrows. (2) A generic term for the unit of writing of a logosyllabic writing system. In this sense, ideograph (or ideogram) is not systematically distinguished from logograph (or logogram). (3) A term commonly used to refer specifically to Han characters, equivalent to the Chinese, Japanese, or Korean terms also sometimes used:
hànzì
kanji
, or
hanja
. (See
logograph
pictograph
sinogram
.)
Ideographic Property
Informative property of characters that are ideographs. (See
Section 4.10, Letters, Alphabetic, and Ideographic
.)
Ideographic Variation Sequence
. A
variation sequence
registered in the
Ideographic Variation Database
. The registration of ideographic variation sequences is subject to the rules specified in
Unicode Technical Standard #37, "Unicode Ideographic Variation Database."
The base character for an ideographic variation sequence must be an ideographic character, and it makes use of a
variation selector
in the range U+E0100..U+E01EF. The term ideographic variation sequence is sometimes abbreviated as "IVS".
IDN
. (See
Internationalized Domain Name
.)
IDNA
(1) The IDNA2008 protocol for
IDNs
defined in RFCs
5891
5892
5893
and
5894
. The protocol categorizes characters (for example as PVALID or DISALLOWED) based on Unicode properties as described in RFC
5892
. (For the range of valid code points for each Unicode version, see the data file for the derived
IDNA2008_Category
property.)
(2) The earlier IDNA2003 protocol. (See
IDNA Compatibility Processing
for differences between
IDNA2003
and
IDNA2008
.)
IDNA Compatibility Processing
. (See
Unicode Technical Standard #46, "Unicode IDNA Compatibility Processing"
.)
IDNA2003
. (See
IDNA
(2).)
IDNA2008
. (See
IDNA
(1).)
IICore
. A subset of common-use CJK unified ideographs, defined as
the fixed collection 370 IICore in ISO/IEC 10646. This subset
contains 9,810 ideographs and is intended for common use in East
Asian contexts, particularly for small devices that cannot support
the full range of CJK unified ideographs encoded in the Unicode
Standard.
Ijam
Diacritical marks applied to basic letter forms
to derive new (usually consonant) letters for extended Arabic
alphabets. For example, see the three dots below which appear in the letter peh:
Ijam
marks are not separately encoded as
combining marks in the Unicode Standard, but instead are integral
parts of each atomically encoded Arabic letter.
Contrast
tashkil
. See also
Section 9.2, Arabic
Ill-Formed Code Unit Sequence
A code unit sequence that does not follow the specification of a
Unicode encoding form. (See definition D84 in
Section 3.9, Unicode Encoding Forms
.)
Ill-Formed Code Unit Subsequence
A non-empty subsequence of a Unicode code unit sequence X which does not contain any code units which also belong to any minimal well-formed subsequence of X. (See definition D84a in
Section 3.9, Unicode Encoding Forms
.)
IME
. (See
Input Method Editor
.)
In-Band
. An in-band channel conveys
information about text by embedding that information within the text
itself, with special syntax to distinguish it. In-band information
is encoded in the same character set as the text, and is
interspersed with and carried along with the text data. Examples are
XML and HTML markup.
Independent Vowel
. In Indic
scripts, certain vowels are depicted using independent letter
symbols that stand on their own. This is often true when a word
starts with a vowel or a word consists of only a vowel.
Indic Digits
. Forms of decimal
digits used in various Indic scripts (for example, Devanagari: U+0966, U+0967, U+0968, U+0969).
Arabic digits (and, eventually, European digits) derive historically
from these forms. See
Terminology for Digits
for additional information on terminology related to digits.
Informative
. Information in this
standard that is not normative but that contributes to the correct
use and implementation of the standard.
Inherent Vowel
. In writing
systems based on a script in the Brahmi
family of Indic scripts, a consonant letter symbol normally has an
inherent vowel, unless otherwise indicated. The phonetic value of
this vowel differs among the various languages written with these
writing systems. An inherent vowel is overridden either by
indicating another vowel with an explicit vowel sign or by using
virama
to create a dead consonant.
Inner Caps
. Mixed case format
where an uppercase letter is in a position other than first in the
word—for example, “G” in the Name “McGowan.”
Input Method Editor
(IME)
. A UI-based method of inputting characters with more flexibility and range than possible for a traditional keyboard. An IME may use keystrokes, phonetic spelling or touch screen input to select characters from a candidate list, or to select from partial or complete candidate words.
Intellectual Property
. A work or invention that is the result of creativity, such as a manuscript or a design, to which one has rights and for which one may apply for a patent, copyright, trademark, etc. (Oxford Languages definition.) Note that the concept of
intellectual property
, while relevant to the Unicode Standard and its licensing, is distinct from the usual sense of
property
in the standard, as for
character properties
alphabetic property
ideographic property
script property
, etc.
Internationalization
. The process of designing and implementing a software product so that it can be easily localized, with few if any structural changes. Ideally, an internationalized software product can be localized simply by translating messages and other text displayed to a user, and by adapting icons and other visual elements. An "internationalized" software product is also known as a "localizable" product. Also known by the abbreviation "i18n" and the term "World-Readiness". (See
localization
globalization
.)
Internationalized Domain Name
(IDN)
. A domain name using at least one character outside the ASCII range. (See also
IDNA
.)
IPA
. (1) The International Phonetic Alphabet. (2) The International Phonetic Association, which defines and maintains the International Phonetic Alphabet.
IRG
. Acronym for Ideographic Research Group, a subgroup of ISO/IEC JTC1/SC2/WG2. (See
Appendix E, Han Unification History
.)
ISCII
. Acronym for Indian Script Code for Information Interchange.
ISO 10646
. (See
ISO/IEC 10646
.)
ISO/IEC 10646
. A character encoding standard maintained by ISO in synchronization with the Unicode Standard.
Jamo
. The Korean name for a single
letter of the
Hangul
script.
Jamos
are used to form Hangul syllables.
Joiner
. An invisible character that
affects the joining behavior of surrounding characters. (See
Section 9.2, Arabic
, and “Cursive Connection” in
Section 23.2, Layout Controls
.)
Jongseong
. A sequence of one or
more trailing consonants in Korean.
JTC1
. The Joint Technical Committee 1 of
the International Organization for Standardization and the
International Electrotechnical Commission responsible for
information technology standardization.
Jungseong
. A sequence of one or
more vowels in Korean.
Kana
A collective term for the two syllabic scripts used
(along with
kanji
and
romaji
) by the Japanese
writing system. The two forms are
hiragana
and
katakana
Kanji
(漢字). The Japanese name for Han characters; derived from the
Chinese word
hànzì
. Also romanized as
kanzi
Katakana
(カタカナ). One of two standard syllabaries associated with the
Japanese writing system. Katakana syllables are typically used in
representation of borrowed vocabulary (other than that of Chinese origin), some plant
or animal names, sound-symbolic interjections, emphasis, or for phonetic representation of
“difficult” kanji characters in Japanese.
(See also
hiragana
.)
Kerning
. (1) Changing the space between certain pairs of letters to
improve the appearance of the text. (2) The process of mapping from
pairs of glyphs to a positioning offset used to change the space
between letters.
Korean Syllable Block
A sequence of Korean jamos, consisting of one or more leading
consonants followed by one or more vowels followed by zero or more
trailing consonants, or any canonically equivalent sequence
including a precomposed Hangul syllable. In regular expression
notation:
L L* V V* T*
. Also called a
standard
Korean syllable block
. (See
Section 3.12, Conjoining Jamo Behavior
.)
L10n
. (See
localization
.)
Layout Control
. Synonym for
format character
. The term
layout control
is often used
when the format character(s) in question affect the overall layout or rendering of text.
See
Section 23.2, Layout Controls
LDML
. (See
Unicode Locale Data Markup Language
.)
Leading Consonant
. (1) In Korean, a jamo character with the
Hangul_Syllable_Type property value Leading_Jamo (in the range
U+1100..U+1159 or U+115F
hangul choseong filler
). Abbreviated as
. (See definition D122 in
Section 3.12, Conjoining Jamo Behavior
.) (2)
Any initial consonant in a syllable.
Leading Surrogate
. Synonym for
high-surrogate code unit
Letter
. (1) An element of an alphabet. In a broad sense, it includes
elements of syllabaries and ideographs. (2) Informative property of
characters that are used to write words.
Ligature
. A glyph representing a combination of two or more
characters. In the Latin script, there are only a few in modern use,
such as the ligatures between “f” and “i” or “f” and “l”. Other scripts make use of many ligatures, depending on the font
and style.
Little-endian
. A computer architecture that stores multiple-byte
numerical values with the least significant byte (LSB) values first.
Localization
. (1) The process of adapting a software product to use the languages and conventions suitable for a local market, such as adapting an English US software product to work in Spanish for Argentina. (2) The management of software product translation, which includes extraction of translatable text, management of translations, and generation of language resource modules. Also known by the abbreviation "L10n". Localization produces "localized" software products. (See
internationalization
globalization
.)
Logical Order
. The order in
which text is stored in the memory representation. For the most part, logical order
corresponds to the typing order and the phonetic order. (See
display order
and
Section 2.2, Unicode Design Principles
.)
Logical Store
. Memory representation.
Logograph
(or
logogram
). (1) Any symbol that primarily represents a word (or morpheme) in contrast to a sound or pronunciation. (2) A generic term for the unit of writing of a logosyllabic writing system. In this sense, logograph (or logogram) is not systematically distinguished from ideograph (or ideogram). (See
ideograph
pictograph
.)
Logosyllabary
. A writing system in which the units are used primarily to write words and/or morphemes of words, with some subsidiary usage to represent just syllabic sounds. The best example is the Han script.
Lowercase
. (See
case
.)
Low-Surrogate Code Point
A Unicode code point in the range U+DC00 to U+DFFF. (See definition
D73 in
Section 3.8, Surrogates
.)
Low-Surrogate Code Unit
A 16-bit code unit in the range DC00
16
to DFFF
16
, used in UTF-16 as
the trailing code unit of a surrogate pair. Also known as a
trailing
surrogate
. (See definition D74 in
Section 3.8, Surrogates
.)
LSB
. Acronym for
least significant
byte
LZW
. Acronym for
Lempel-Ziv-Welch
a standard algorithm widely used for compression of data.
Majuscule
. Synonym for
uppercase
. (See
case
.)
Mathematical Property
. Informative property of characters that are
used as operators in mathematical formulae.
Matra
. A dependent vowel in an Indic script. It is the name for
vowel letters that follow consonant letters in logical order. A
matra often has a completely different letterform from that for the
same phonological vowel used as an independent letter.
MBCS
. Acronym for
multibyte character set
MCM
. Acronym for modifier combining mark, a subset of combining marks used in the
Arabic Mark Transient Reordering Algorithm
. (See
Unicode Standard Annex #53, “Unicode Arabic Mark Rendering.”
MIME
. Multipurpose Internet Mail Extensions. MIME is a standard that
allows the embedding of arbitrary documents and other binary data of
known types (images, sound, video, and so on) into e-mail handled by
ordinary Internet electronic mail interchange protocols.
Minimal Well-Formed Code Unit Subsequence
A well-formed Unicode code unit sequence that maps to a single Unicode scalar value. (See definition D85a in
Section 3.9, Unicode Encoding Forms
.)
Minuscule
. Synonym for
lowercase
. (See
case
.)
Mirrored Property
. The
property of characters whose images are mirrored horizontally in
text that is laid out from right to left (versus from left to
right). (See
Section 4.7, Bidi Mirrored
.)
Missing Glyph
. (See
replacement glyph
.)
Modifier Letter
. A character
with the Lm General Category in the Unicode Character Database.
Modifier letters, which look like letters or punctuation, modify the
pronunciation of other letters (similar to diacritics). (See
Section 7.8, Modifier Letters
.)
Mongolian Free Variation Selector
. A subset of
variation selectors
, encoded in the range U+180B..U+180D, which are used specifically for the definition of
standardized variation sequences
for the Mongolian script. A Mongolian free variation selector always takes a Mongolian letter as its base, but is otherwise not architecturally distinct from the general-purpose variation selectors. Commonly abbreviated FVS1, FVS2, and FVS3.
Mongolian Variation Sequence
. A
standardized variation sequence
which has a Mongolian letter as its base character. All of the currently defined Mongolian variation sequences use the Mongolian free variation selectors in the range U+180B..U+180D, but no architectural constraint prevents such sequences from also using general-purpose variation selector characters in the range U+FE00..U+FE0D. Mongolian variation sequences are unusual in that many specify a positional context (for example, initial, medial, or final) for their use, as well as a particular glyph presentation.
Monotonic
. Modern Greek written with the basic accent, the
tonos
Mora
. A phonological term: the unit of
sound which determines syllable weight in some languages. Some
syllabaries have characteristics which reflect moraic structure more
or less exactly. In particular, the Japanese kana syllabaries
actually write one character per mora, rather than one character per
syllable. The Vai syllabary also counts final nasals as distinct
moras, and writes moras instead of syllables.
MSB
. Acronym for
most significant byte
Multibyte Character Set
. A character set encoded with a variable
number of bytes per character, often abbreviated as
MBCS
. Many large
character sets have been defined as MBCS so as to keep strict
compatibility with the
ASCII
subset and/or ISO/IEC 2022.
Named Unicode Algorithm
A Unicode algorithm that is specified in the Unicode Standard or in
other standards published by the Unicode Consortium and that is
given an explicit name for ease of reference. (See definition D18 in
Section 3.4,
Characters and Encoding
. See also Table 3-1, “Named
Unicode Algorithms,” for a list of named Unicode algorithms.)
Namespace
. (1) A set of names, no
two of which are identical. (2) A set of names together with name
matching rules, so that all names are distinct under the matching
rules. (See definition D6 in
Section 3.3, Semantics
.) Character
names are distinct if they do not match under the name matching
rules in effect for the standard.
Nekudot
. Marks that indicate vowels or other modifications of
consonantal letters in Hebrew.
Neutral Character
A character that can be written either right to
left or left to right, depending on context. (See
Unicode Standard
Annex #9, “Unicode Bidirectional Algorithm.”
NFC
. (See
Normalization Form C
.)
NFD
. (See
Normalization Form D
.)
NFKC
. (See
Normalization Form KC
.)
NFKD
. (See
Normalization Form KD
.)
Noncharacter
. A code point that
is permanently reserved and that will never be assigned to an abstract character.
Noncharacters consist of the values U+
FFFE and
U+
FFFF (where
is from 0 to 10
16
), and
the values U+FDD0..U+FDEF. See the FAQ on
Private-Use Characters, Noncharacters and Sentinels
Non-joiner
. An invisible character
that affects the joining behavior of surrounding characters. (See
Section 9.2, Arabic
, and “Cursive Connection” in
Section 23.2, Layout Controls
.)
Non-overridable
. A characteristic of a Unicode character property
that cannot be changed by a higher-level protocol.
Nonspacing Diacritic
. A diacritic that is a
nonspacing mark
Nonspacing Mark
. A combining
character with the General Category of Nonspacing Mark (Mn) or
Enclosing Mark (Me). (See definition D53 in
Section 3.6, Combination
.) The position of a nonspacing mark in presentation
depends on its base character. It generally does not consume space
along the visual baseline in and of itself. (See also
combining character
.)
Non-starter Decomposition
. A canonical decomposition mapping to a sequence of more than one character, for which the first character in that sequence is not a Starter.
(Used in the definition of Unicode Normalization Forms.) (See
definition D111 in
Section 3.11, Normalization Forms
.)
Normalization
. A process of
removing alternate representations of equivalent sequences from
textual data, to convert the data into a form that can be
binary-compared for equivalence. In the Unicode Standard,
normalization refers specifically to processing to ensure that
canonical-equivalent (and/or compatibility-equivalent) strings have
unique representations. For more information, see “Equivalent
Sequences” in
Section 2.2, Unicode Design Principles
, and
Section 3.11, Normalization Forms
Normalization Form
One of the four Unicode normalization forms
defined in
Section 3.11, Normalization Forms
—namely, NFC, NFD, NFKC, and NFKD.
For more information and examples, see Section 1.1, Canonical and
Compatibility Equivalence in
Unicode Standard Annex #15, "Unicode Normalization Forms."
Normalization Form C
NFC
A normalization form that erases any canonical differences, and
generally produces a composed result. For example, a + umlaut is
converted to ä in this form. This form most closely matches legacy
usage. The formal definition
is D120 in
Section 3.11, Normalization Forms
Normalization Form D
NFD
A normalization form that erases any canonical differences, and
produces a decomposed result. For example, ä is converted to a +
umlaut in this form. This form is most often used in internal
processing, such as in collation. The
formal definition is D118 in
Section 3.11, Normalization Forms
Normalization Form KC
NFKC
A normalization form that erases both canonical and compatibility
differences, and generally produces a composed result: for example,
the single dž character is converted to d + ž in this form. This form
is commonly used in matching. The
formal definition is D121 in
Section 3.11, Normalization Forms
Normalization Form KD
NFKD
A normalization form that erases both canonical and compatibility
differences, and produces a decomposed result: for example, the
single dž character is converted to d + z + caron in this form. The
formal definition is D119 in
Section 3.11, Normalization Forms
Normative
. Required for conformance with the Unicode Standard.
NSM
. Acronym for
nonspacing mark
Numeric Value Property
A property of characters used to represent numbers. (See
Section 4.6, Numeric Value
.)
Obsolete
. Applies to a character that is no longer in current use,
but that has been used historically. Whether a character is obsolete
depends on context: For example, the Cyrillic letter
big yus
is
obsolete for Russian, but is used in modern Bulgarian. (Not the same
as
deprecated
.)
Octet
. An ordered sequence of eight bits considered as a unit. The
Unicode Standard follows current industry practice in referring to
an octet as a
byte
. (See
byte
.)
Orthographic Syllable
. A two-dimensional visual arrangement of glyphs that forms a unit in
Brahmic scripts. Also known as
aksara
. At the core of an orthographic syllable is a base character, which can be a consonant, an independent vowel, or (in some scripts) a numeric
character, or a ligature formed from base characters and other characters. Attached to this
core may be dependent forms (such as half-forms, subjoined forms, repha forms, medial
forms) of consonants or independent vowels, as well as nukta marks, virama marks,
dependent vowel marks, register shifter marks, tone marks, final consonant marks, and
other marks. It is common for different components of orthographic syllables to form
ligatures. Orthographic syllables often don’t correspond to phonological syllables; it is common for the final consonants of phonological syllables to become the base characters, or sometimes dependent forms, of subsequent orthographic syllables.
Out-of-Band
An out-of-band channel conveys additional information
about text in such a way that the textual content, as encoded, is
completely untouched and unmodified. This is typically done by
separate data structures that point into the text.
Overridable
. A characteristic of a Unicode character property that
may be changed by a higher-level protocol to create desired
implementation effects.
Oxia
. Greek term for acute accent, used in polytonic Greek character names.
P-Q
Paragraph Direction
. The default direction (
left
or
right
) of the
text of a paragraph. This direction does not change the display
order of characters within an Arabic or English word. However, it
does
change the display order of adjacent Arabic and English words,
and the display order of neutral characters, such as punctuation and
spaces. For more details, see
Unicode Standard Annex #9, “Unicode
Bidirectional Algorithm,
” especially definitions BD2–BD5.
Paragraph Embedding Level
The embedding level that determines the default bidirectional
orientation of the text in that paragraph.
Perispomeni
. Greek term for circumflex accent, used in polytonic Greek character names.
Phoneme
. A minimally distinct sound in the context of a particular
spoken language. For example, in American English, /p/ and /b/ are
distinct phonemes because
pat
and
bat
are distinct; however, the two
different sounds of /t/ in
tick
and
stick
are not distinct in
English, even though they are distinct in other languages such as
Thai.
Pictograph
(or
pictogram
). Any symbol that denotes an object by means of a more or less conventional visual likeness—for example, ✈. (See
emoji
ideograph
logograph
.)
Pinyin
. Standard system for the romanization of Chinese on the basis of Mandarin pronunciation.
Pivot Conversion
. The use of a third character encoding to serve as
an intermediate step in the conversion between two other character
encodings. The Unicode Standard is widely used to support pivot
conversion, as its character repertoire is a superset of most other
coded character sets.
Plain Text
. Computer-encoded text that consists
only
of a sequence
of code points from a given standard, with no other formatting or
structural information. Plain text interchange is commonly used
between computer systems that do not share higher-level protocols.
(See also
rich text
.)
Plane
. A range of 65,536 (10000
16
contiguous Unicode code points, where the first code point is an
integer multiple of 65,536 (10000
16
). Planes are numbered
from 0 to 16, with the number being the first code point of the
plane divided by 65,536. Thus Plane 0 is U+0000..U+FFFF, Plane 1 is
U+
0000..U+
FFFF, ..., and Plane 16 (10
16
is U+
10
0000..
10
FFFF. (Note that ISO/IEC 10646 uses
hexadecimal notation for the plane numbers—for example, Plane B instead of
Plane 11). (See
Basic
Multilingual Plane
and
supplementary
planes
.)
Points
. (1) The nonspacing vowels and other signs of written Hebrew.
(2) A unit of measurement in typography.
Polytonic
. Ancient Greek written with several contrastive accents.
Precomposed Character
. (See
decomposable character
.)
Presentation Form
. A ligature or variant glyph that has been encoded
as a character for compatibility. (See also
compatibility character
(1).)
Primary Composite
. A
Canonical Decomposable Character which is not a Full Composition Exclusion. (Used in
the definition of Unicode Normalization Forms.) (See definition D114
in
Section 3.11, Normalization Forms
.)
Private Use
. Refers to designated code points in the Unicode
Standard or other character encoding standards whose interpretations
are not specified in those standards and whose use may be determined
by private agreement among cooperating users.
Private Use Area
(PUA)
. Any
one of the three blocks of private-use code points in the Unicode
Standard.
Private-Use Character
. A character that has been assigned to a
Private-use code point
by private agreement.
Private-Use Code Point
Code points in the ranges U+E000..U+F8FF, U+F0000..U+FFFFD, and
U+100000..U+10FFFD. (See definition D49 in
Section 3.5, Properties
.) These code points are designated in the
Unicode Standard for private use.
Productive
. Said of a feature or
rule that can be employed in novel combinations or circumstances,
rather than being restricted to a fixed list. In the Unicode
Standard, combining marks—particularly the accents—are productive.
In contrast, variation selectors are deliberately not productive.
Also known as
generative
Property
. (See
character properties
.)
Property Alias
. A unique
identifier for a particular Unicode character property. (See
definition D47 in
Section 3.5, Properties
.)
Property Value Alias
. A
unique identifier for a particular enumerated value for a particular
Unicode character property. (See definition D48 in
Section 3.5, Properties
.)
Prosgegrammeni
. Greek term for adscript iota, used in polytonic Greek character names.
Provisional
. A property or feature that is unapproved and tentative,
and that may be incomplete or otherwise not in a usable state.
Psili
. Greek term for smooth breathing mark, used in polytonic Greek character names.
PUA
. Acronym for
Private Use Area
Puḷḷi
. The Tamil name for
virama
. (See
virama
.)
Punycode
. Punycode is an algorithm, defined in
RFC 3492
, that compactly represents Unicode strings using 7-bit ASCII, while leaving the ASCII characters unmapped. Punycode strings can be identified by their characteristic prefix "xn--". Punycode is used to transform
IDNs
to an ASCII-compatible form.
Radical
. A structural component of a Han character conventionally
used for indexing. The traditional number of such radicals is 214.
Rendering
. (1) The process of selecting and laying out glyphs for
the purpose of depicting characters. (2) The process of making
glyphs visible on a display device.
Repertoire
. (See
character repertoire
.)
Replacement Character
A character used as a substitute for an uninterpretable character from another encoding. The Unicode
Standard uses U+FFFD
replacement character
for this function.
Replacement Glyph
. A glyph used to render a character that cannot be
rendered with the correct appearance in a particular font. It often
is shown as an open or black rectangle. Also known as a
missing
glyph
. (See
Section 5.3, Unknown and Missing Characters
.)
Reorderable Pair
. Two adjacent characters A and B in a coded character sequence
are a Reorderable Pair if and only if ccc(A) > ccc(B) > 0. (Used in the definition of Unicode Normalization Forms.) (See
definition D108 in
Section 3.11, Normalization Forms
.)
Reserved Code Point
Any code point of the Unicode Standard that is reserved for future assignment. Also known as an
unassigned code point
. (See definition D15 in
Section 3.4,
Characters and Encoding
, and
Section 2.4, Code Points and Characters
.)
RGI
. Acronym for Recommended for General Interchange. This is a term used in
Unicode Technical Standard #51, "Unicode Emoji"
, to refer to a subset of a larger set of emoji (or emoji sequences) that is intended to be widely supported across multiple platforms. See
use of RGI.
Rich Text
. Also known as
styled text
. The result of adding
information to plain text. Examples of information that can be added
include font data, color, formatting information, phonetic
annotations, interlinear text, and so on. The Unicode Standard does
not address the representation of rich text. It is expected that
systems and applications will implement proprietary forms of rich
text. Some public forms of rich text are available (for example, ODA,
HTML, and SGML). When everything except primary content is removed
from rich text, only plain text should remain.
Romaji
(ローマ字). The Japanese name for Roman characters, that is, the Latin script.
Latin characters are fairly commonly interspersed in the Japanese writing system, whether
as single letters, or for complete words, acronyms, or abbreviations borrowed from
other languages. (See also
kanji
and
kana
.)
Row
. A range of 256 contiguous Unicode code points, where the first
code point is an integer multiple of 256. Two code points are in the
same row if they share all but the last two hexadecimal digits. (See
plane
.)
SAM
. Acronym for Syriac abbreviation mark.
SBCS
. Acronym for
single-byte character set
. Any
one-byte character
encoding. This term is generally used in contrast with
DBCS
and/or
MBCS
Scalar Value
. (See
Unicode scalar value
.)
Script
. A collection of letters and
other written signs used to represent textual
information in one or more writing systems. For example, Russian is
written with a subset of the Cyrillic script; Ukrainian is written
with a different subset. The Japanese writing system uses several
scripts.
Script Code
. An identifier
referring to a specific script. ISO 15924, “Codes for the representation of names
of scripts,” is the ISO registration standard which specifies the framework for
both alphanumeric and numeric codes for script names. For example "Cher" is the
alphanumeric script code for the Cherokee script, and 445 is the corresponding
numeric script code. The Unicode Consortium is the registration authority and
maintains the
ISO 15924 code lists
Script Designator
. The
part of a character name (or block name), usually at the beginning of that name, which generally
identifies the script of the character (or that of most characters in the block). For example,
the substring "LATIN" in the character name "
LATIN
SMALL LETTER S" is a script designator.
Note that not
all character names and block names have script designators. The Script property in the
UCD should be used instead to determine the script of any particular character.
(See
Unicode Standard Annex #24, Unicode Script Property
.)
Scriptio Continua
. A writing
style without spaces or punctuation.
Script Property
. 1. An
enumerated character property that assigns every Unicode character a particular Script property
value. See
Unicode
Standard Annex #24, “Unicode Script Property.”
2. Informally, the term
script property
also refers to characteristics of the script as a whole, in such phrasings as, “Arabic
is a right-to-left script,” or “Lydian is an historic script.”
SCSU
. Acronym for Standard Compression
Scheme for Unicode. See
Unicode Technical
Standard #6, “A Standard Compression Scheme for Unicode.”
Semi-Stable Comparison
. (See
deterministic comparison
.)
SGML
. Standard Generalized Markup Language. A standard framework,
defined in ISO 8879, for defining particular text markup languages.
The SGML framework allows for mixing structural tags that describe
format with the plain text content of documents, so that fancy text
can be fully described in a plain text stream of data. (See also
HTML
XML
and
rich text
.)
Shaping Characters
. Characters that assume different glyphic forms
depending on the context.
Shift-JIS
A shifted encoding of the Japanese character encoding
standard, JIS X 0208, widely deployed in PCs.
Signature
. An optional code
sequence at the beginning of a stream of coded characters that
identifies the
character encoding scheme
used for the following
text. (See
Unicode signature
.)
Singleton Decomposition
. A canonical decomposition mapping from a character to a different single character.
(Used in the definition of Unicode Normalization Forms.) (See
definition D110 in
Section 3.11, Normalization Forms
.)
Sinogram
. A technical term for a Chinese character. In the Unicode Standard, sinograms are systematically referred to instead as CJK ideographs or Han ideographs. (See
ideograph
.)
SI Unit
. A unit of
the International System of Units, the modern form of the metric system.
This consists of the system of base units, such as the second, meter,
and kilogram, a set of derived units, such as newton, joule, and volt, and a set of modifying prefixes, such as milli-, nano-, pico-, mega-,
giga-, tera-, and so forth.
SJIS
. Acronym for
Shift-JIS
Small Letter
. Synonym for
lowercase letter
. (See
case
.)
Sorting
. (See
collation
.)
Spacing Mark
. A
combining character
that is not a
nonspacing mark. (See definition D55 in
Section 3.6, Combination
.) (See
nonspacing mark
.)
Stable Comparison
. (See
deterministic comparison
.)
Stable Sort
. A sort in which two records with a field that compares as equal will retain their relative order if sorted according to that field. This is a property of the sorting algorithm, and not of the comparison mechanism. For example, a bubble sort is stable, whereas a Quicksort is not.
Standard Korean Syllable Block
(See
Korean syllable block
.)
Standardized Variation Sequence
. A
variation sequence
defined in the
UCD
data file StandardizedVariants.txt. A standardized variation sequence cannot have an ideographic character as its base, and it makes use of a
variation selector
in the range U+FE00..U+FE0F or U+180B..U+180D. Note that U+FE0E and U+FE0F are reserved for special functions when applied to emoji base characters. See
Unicode Technical Standard #51, "Unicode Emoji."
The term standardized variation sequence is sometimes abbreviated as "SVS".
Starter
. Any code point (assigned or not) with combining class of zero (ccc=0).
(Used in the definition of Unicode Normalization Forms.) (See
definition D107 in
Section 3.11, Normalization Forms
.)
Static Form
. (See
decomposable character
.)
Styled Text
. (See
rich text
.)
Subtending Mark
. A format character whose graphic form extends under
a sequence of following characters—for example, U+0600
arabic number sign
Supplementary Character
. A Unicode encoded character having a
supplementary code point.
Supplementary Code Point
. A Unicode code point between U+10000 and
U+10FFFF.
Supplementary Planes
. Planes 1 through 16, consisting of the
supplementary code points.
Surrogate Character
. A misnomer. It would be an encoded character
having a surrogate code point, which is impossible. Do not use this
term.
Surrogate Code Point
. A
Unicode code point in the range U+D800..U+DFFF. Reserved for use by UTF-16, where a pair of
surrogate code units (a high surrogate followed by a low surrogate)
“stand in” for a supplementary code point.
Surrogate Code Unit
. This is a cover term for either a
high-surrogate code unit
or
low-surrogate code unit
Surrogate Pair
. A
representation for a single abstract character that consists of a
sequence of two 16-bit code units, where the first value of the pair
is a
high-surrogate code unit
, and the
second is a
low-surrogate code unit
(See definition D75 in
Section 3.8, Surrogates
.)
Syllabary
. A type of writing system
in which each symbol typically represents both a consonant and a
vowel, or in some instances more than one consonant and a vowel.
Syllable
. (1) An element of a syllabary. (2) A basic unit of
articulation that corresponds to a pulmonary pulse.
Syllable Block
. A sequence of
Korean characters that should be grouped into a single square cell
for display. (See
Section 3.12, Conjoining Jamo Behavior
.)
Symmetric Swapping
. The process of rendering a character with a
mirrored glyph when its resolved directionality is right-to-left in
a bidirectional context. (See
mirrored property
and
Unicode Standard
Annex #9, “Unicode Bidirectional Algorithm.”
Tagging
. The association of attributes of text with a point or range
of the primary text. The value of a particular tag is not generally
considered to be a part of the “content” of the text. A typical
example of tagging is to mark the language or the font for a portion
of text.
Tailorable
. A characteristic of an algorithm for which a
higher-level protocol may specify different results than those
specified in the algorithm. A tailorable algorithm without actual
tailoring is also known as a default algorithm, and the results of
an algorithm without tailoring are known as the default results.
Tanwin
Marks used in the Arabic script to indicate
vocalization with long vowels or postnasalization. A subtype of
tashkil
Tashkil
Marks used in the Arabic script to indicate
vocalization of text, as well as other types of phonetic guides
to correct pronunciation. A basic Arabic letter plus any of these types of marks is not encoded as a single, precomposed character, and should always be represented as a sequence of letter plus combining mark. Contrast
ijam
. See also
harakat
tanwin
, and
Section 9.2, Arabic
TES
. Acronym for
transfer encoding
syntax
TeX
. Computer language designed for use in typesetting—in
particular, for typesetting math and other technical material.
(According to Donald Knuth, TeX rhymes with the word
blecchhh
.)
Text Element
. A minimum unit of text in relation to a particular
text process, in the context of a given writing system. In general,
the mapping between text elements and code points is many-to-many.
(See
Chapter 2, General Structure
.)
Titlecase
. Uppercased initial letter followed by lowercase letters
in words. A casing convention often used in titles, headers, and
entries, as exemplified in this glossary.
Titlo Letter
. A superscripted letter (written above) used in Old Church Slavonic text.
TLD
. Acronym for
top-level domain
Tonal Sandhi
. A phonological
process whereby the tone associated with one syllable in a tonal
language influences the realization of a tone associated with a
neighboring syllable.
Tone Mark
. A
diacritic
or
nonspacing mark
that represents a phonemic
tone. Tone languages are common in Southeast Asia and Africa.
Because tones always accompany vowels (the syllabic nucleus), they
are most frequently written using functionally independent marks
attached to a vowel symbol. However, some writing systems such as
Thai place tone marks on consonant symbols; Chinese does not use
tone marks (except when it is written phonemically).
Tonemic
. Refers to the underlying,
distinctive units of a tonal system in a language. Tones of a tonal
language are often referred to by numbers (“tone 1,” “tone 2,” and
so on), and each tone has an idealized, specific tone level or
contour that is considered to be its tonemic value. The term was
created by analogy with
phonemic
Tonetic
. Refers to the surface,
actual pitch realization of tones in a tonal system. Tonetic values
are what can be directly measured by tracking pitch contours in
actual speech recordings. The term was created by analogy with
phonetic
Tonos
. The basic accent in modern Greek, having the form of an acute
accent.
Top-Level Domain
. The first-level set of
domain names
, consisting of generic top-level domains such as .com,
.edu, .org, and .gov, and country code top-level domains such as .jp, .cn, .de, and so forth. Often abbreviated as TLD.
Trailing Consonant
. (1)
In Korean, a jamo character with the Hangul_Syllable_Type property
value Trailing_Jamo (in the range U+11A8..U+11F9). Abbreviated as
. (See definition D128 in
Section 3.12, Conjoining Jamo Behavior
.) (2) Any final consonant in a syllable.
Trailing Surrogate
. Synonym for
low-surrogate code unit
Transcoding
. Conversion of character data between different
character sets.
Transfer Encoding Syntax
A reversible transformation applied to text and other data to allow
it to be transmitted—for example, Base64, uuencode.
Transformation Format
. A mapping from a coded character sequence to
a unique sequence of code units (typically bytes).
Triangulation
. (See
pivot conversion
.)
Typographic Interaction
Graphical application of one nonspacing mark in a position relative
to a grapheme base that is already occupied by another nonspacing
mark, so that some rendering adjustment must be done (such as
default stacking or side-by-side placement) to avoid illegible
overprinting or crashing of glyphs. (See definition D106 in
Section 3.11, Normalization Forms
.)
UAX
. Acronym for
Unicode Standard Annex
UBA
. Acronym for
Unicode Bidirectional Algorithm
UCA
. Acronym for
Unicode Collation Algorithm
UCD
. Acronym for
Unicode Character Database
. (See
Section 4.1, Unicode Character Database
.)
UCS
. Acronym for Universal Character
Set, which is specified by International Standard ISO/IEC 10646,
which is equivalent in repertoire to the Unicode Standard.
UCS-2
. ISO/IEC 10646 encoding form:
Universal Character Set coded in 2 octets, limited to the Basic
Multilingual Plane. (See
Appendix C, Relationship to ISO/IEC 10646
.)
UCS-4
. ISO/IEC 10646 encoding form:
Universal Character Set coded in 4 octets. (See
Appendix C, Relationship to ISO/IEC 10646
.)
Umlaut
. Two horizontal dots over a letter, as in German
Köpfe
. The
umlaut is not distinguished from the
diaeresis
in the Unicode
character encoding. (See
diaeresis
.)
Unassigned Character
. A code point that is not assigned to an abstract character. This refers to surrogate code points, noncharacters,
and reserved code points. (See
Section 2.4, Code Points and Characters
.)
Unassigned Code Point
. Synonym for
reserved code point
UNC
. Acronym for
Urgently Needed Character
Undesignated Code Point
. Synonym for
reserved code point
Unicameral
A script that has no
case
distinctions. Most often used
in the context of European alphabets.
Unicode
(1) The standard for digital representation of the characters used in writing all of the world's languages. Unicode provides a uniform means for storing, searching, and interchanging text in any language. It is used by all modern computers and is the foundation for processing text on the Internet. Unicode is developed and maintained by the Unicode Consortium:
. (2) A label applied to software internationalization and localization standards developed and maintained by the Unicode Consortium.
Unicode Algorithm
. The
logical description of a process used to achieve a specified result
involving Unicode characters. (See definition D17 in
Section 3.4,
Characters and Encoding
.)
Unicode Bidirectional Algorithm
The algorithm that specifies the positioning of characters in bidirectional text
containing characters that display from right to left, such as in Arabic or Hebrew. See
Unicode
Standard Annex #9, “Unicode Bidirectional Algorithm.”
Unicode Character Database
. A collection of files providing
normative and informative Unicode character properties and mappings.
(See
Chapter
4, Character Properties
, and the
Unicode Character Database
.)
Unicode Collation Algorithm
Tailorable text comparison mechanism used for searching, sorting,
and matching Unicode strings. See
Unicode Technical
Standard #10, “Unicode Collation Algorithm.”
Unicode Common Locale Data Repository
The repository of locale data in XML format maintained by the
Unicode Consortium (
).
This repository provides information needed in the localization of
software products into a wide variety of languages, supplying (among
other things): date, time, number, and currency formats; sorting,
searching, and matching information; and translated names for
languages, territories, scripts, currencies, and time zones. (See
also
Unicode Locale
Data Markup Language
.)
Unicode Consortium
. A standards development organization creating widely-used specifications related to character encoding, as well as for software internationalization and localization. Major projects are the Unicode Standard and the Unicode Locales Project, which defines repositories of standardized data needed
to develop software for particular regions and cultures. The Consortium was founded in 1991, and is headquartered in Mountain View, California. Its current members include major software corporations, governments, and academic institutions. See
Unicode Encoding Form
A character encoding form that assigns each Unicode scalar value to
a unique code unit sequence. The Unicode Standard defines three
standard Unicode encoding forms: UTF-8, UTF-16, and UTF-32. (See definition
D79 in
Section 3.9, Unicode Encoding Forms
.)
Unicode Encoding Scheme
. A specified byte serialization for a
Unicode encoding form, including the specification of the handling
of a
byte order mark
(BOM), if allowed. (See definition D94
in
Section 3.10, Unicode Encoding Schemes
.)
Unicode Locale Data Markup Language
The XML specification for the exchange of locale data, defined by
Unicode Technical Standard #35,
"Unicode Locale Data Markup Language (LDML)."
(See also
Unicode Common
Locale Data Repository
.)
Unicode Scalar Value
. Any Unicode
code point
except high-surrogate
and low-surrogate code points. In other words, the ranges of
integers 0 to D7FF
16
and E000
16
to 10FFFF
16
inclusive. (See definition D76 in
Section 3.9, Unicode Encoding Forms
.)
Unicode Signature
. An implicit marker to identify a file as
containing Unicode text in a particular encoding form. An initial
byte order mark
BOM
) may be used as a Unicode signature.
Unicode Standard Annex
An integral part of the Unicode Standard published as a separate
document.
Unicode String
. A code unit
sequence containing code units of a particular Unicode encoding form
(whether well-formed or not). (See definition D80 in
Section 3.9, Unicode Encoding Forms
.)
Unicode Technical Note
Informative publication containing information of possible interest
concerning the Unicode Standard or related topics.
Unicode Technical Report
Formally approved Unicode Consortium publication containing
informative technical analysis of a topic related to the Unicode
Standard.
Unicode Technical Standard
. Formally approved specification published by the
Unicode Consortium that is related to, but not part of, the Unicode
Standard.
Unicode Transformation Format
. An ambiguous synonym for either
Unicode encoding form
or
Unicode encoding scheme
. The latter terms
are now preferred.
Unification
. The process of identifying characters that are in
common among writing systems.
Unihan
. (1) The concept of unification of the encoding of CJK ideographs across multiple, distinct legacy East Asian character encoding standards. (2) Short for the Unicode Han Database. (See
Unicode Standard Annex #38, “Unicode Han Database (Unihan).”
Unikemet
. Short for the Unicode Egyptian Hieroglyph Database, a datafile containing source references and other extensive information about the hieroglyphs. (See
Unicode Standard Annex #57, “Unicode Egyptian Hieroglyph Database (Unikemet).”
UPA
. Acronym for Uralic Phonetic Alphabet.
Uppercase
. (See
case
.)
Urgently Needed Character
. This is a term of art used by the Ideographic Research Group (
IRG
). It refers to a CJK unified ideograph which has special urgency for encoding, typically because of administrative needs such as unencoded characters needed for geographic registries, and so forth. An urgently needed character (abbreviated as UNC) bypasses the large working sets that the IRG normally uses to process major CJK extensions, and instead is approved for encoding (by the UTC and SC2) more rapidly via small, individual proposals for encoding.
URO
. Acronym for Unified Repertoire and Ordering, the original set
of
CJK
unified ideographs used in the Unicode Standard.
User-Defined Character
(See
EUDC
.)
User-Perceived Character
What everyone thinks of as a character in their script.
UTF
. Acronym for
Unicode
(or
UCS
Transformation Format
UTF-2
. Obsolete name for
UTF-8
UTF-7
. Unicode (or UCS) Transformation Format, 7-bit encoding form,
specified by
RFC 2152
UTF-8
. A multibyte encoding for text that represents each Unicode character with 1 to 4 bytes, and which is backward-compatible with ASCII. UTF-8 is
the predominant form of
Unicode
in web pages. More technically: (1)
The
UTF-8 encoding form
. (2) The
UTF-8 encoding scheme
. (3) “UCS Transformation Format 8,” defined in Annex D of ISO/IEC 10646:2003, technically equivalent to the definitions in the Unicode Standard.
UTF-8 Encoding Form
. The
Unicode encoding form that assigns each Unicode scalar value to an
unsigned byte sequence of one to four bytes in length, as specified
in Table 3-6, "UTF-8 Bit Distribution." (See definition D92 in
Section 3.9, Unicode Encoding Forms
.)
UTF-8 Encoding Scheme
The Unicode encoding scheme that serializes a UTF-8 code unit
sequence in exactly the same order as the code unit sequence itself.
(See definition D95 in
Section 3.10, Unicode Encoding Schemes
.)
UTF-16
. A multibyte encoding for text that
represents each Unicode character with 2 or 4 bytes; it is not
backward-compatible with ASCII. It is the internal form of
Unicode
in
many programming languages, such as Java, C#, and JavaScript, and in
many operating systems. More technically: (1)
The
UTF-16 encoding
form
. (2) The
UTF-16 encoding scheme
. (3) “Transformation format for
16 planes of Group 00,” defined in Annex C of ISO/IEC 10646:2003;
technically equivalent to the definitions in the Unicode Standard.
UTF-16 Encoding Form
The Unicode encoding form that assigns each Unicode scalar value in
the ranges U+0000..U+D7FF and U+E000..U+FFFF to a single unsigned
16-bit code unit with the same numeric value as the Unicode scalar
value, and that assigns each Unicode scalar value in the range
U+10000..U+10FFFF to a surrogate pair, according to Table 3-5,
“UTF-16 Bit Distribution.” (See definition D91 in
Section 3.9, Unicode Encoding Forms
.)
UTF-16 Encoding Scheme
The UTF-16 encoding scheme that serializes a UTF-16 code unit
sequence as a byte sequence in either big-endian or little-endian
formats. (See definition D98 in
Section 3.10, Unicode Encoding Schemes
.)
UTF-16BE
. The Unicode encoding
scheme that serializes a UTF-16 code unit sequence as a byte
sequence in big-endian format. (See definition D96 in
Section 3.10, Unicode Encoding Schemes
.)
UTF-16LE
. The Unicode encoding
scheme that serializes a UTF-16 code unit sequence as a byte
sequence in little-endian format. (See definition D97 in
Section 3.10, Unicode Encoding Schemes
.)
UTF-32
. A multibyte encoding for text that
represents each
Unicode
character with 4 bytes; it is not
backward-compatible with ASCII. More technically: (1) The
UTF-32 encoding form
. (2) The
UTF-32 encoding scheme
UTF-32 Encoding Form
. The
Unicode encoding form that assigns each Unicode scalar value to a
single unsigned 32-bit code unit with the same numeric value as the
Unicode scalar value. (See definition D90 in
Section 3.9, Unicode Encoding Forms
.)
UTF-32 Encoding Scheme
The Unicode encoding scheme that serializes a UTF-32 code unit
sequence as a byte sequence in either big-endian or little-endian
formats. (See definition D101 in
Section 3.10, Unicode Encoding Schemes
.)
UTF-32BE
. The Unicode encoding scheme
that serializes a UTF-32 code unit sequence as a byte sequence in
big-endian format. (See definition D99 in
Section 3.10, Unicode Encoding Schemes
.)
UTF-32LE
. The Unicode encoding scheme
that serializes a UTF-32 code unit sequence as a byte sequence in
little-endian format. (See definition D100 in
Section 3.10, Unicode Encoding Schemes
.)
UTN
. Acronym for
Unicode Technical Note
UTR
. Acronym for
Unicode Technical
Report
UTS
. Acronym for
Unicode Technical
Standard
Varia
. Greek term for grave accent,
used in polytonic Greek character names.
Variation Selector
. Any of three ranges of Unicode characters designated for use in defining a
variation sequence
. Variation selectors in the range U+FE00..U+FE0F are known as
general-use
variation selectors and are used for
standardized variation sequences
. Two of these, U+FE0E and U+FE0F, have specialized functions when used with
emoji
base characters. Variation selectors in the range U+180B..U+180D are known as
Mongolian Free Variation Selector
; their use is limited to standardized variation sequences for the Mongolian script. Variation selectors in the range U+E0100..U+E01EF are known as
ideographic
variation selectors and are used for
ideographic variation sequences
. Variation selectors are all nonspacing combing marks (General_Category=Mn). They have no graphic shape of their own; instead they function to pick out a particular, defined subset of potential graphic presentations for the base character to which they are applied. All variation selectors are
default ignorable
code points (DICP=Yes), meaning that if they are not interpretable in combination with their base character, they should be ignored for display, rather than shown with a nondisplayable glyph box, for example. See
Section 23.4, Variation Selectors
. The term variation selector is sometimes abbreviated as "VS".
Variation Sequence
. A sequence consisting of precisely two code points: a
base character
(or a spacing mark [gc=Mc]) followed by a single
variation selector
. The two-character sequence is referred to as a
variant
of the base character or spacing mark. The function of the variation sequence is to pick out a particular, defined subset of potential graphic presentations of the base character (or spacing mark). Not all potential combinations of base characters and variation selectors have interpretations. Only the following variation sequences have valid interpretations: a
standardized variation sequence
, a registered
ideographic variation sequence
, and a variation sequence with an emoji base character defined in the data files associated with
Unicode Technical Standard #51, "Unicode Emoji"
Virama
. From Sanskrit. The name of a sign used in many Indic
and other Brahmi-derived scripts to suppress (or "kill") the inherent vowel of
the consonant to which it is applied, thereby generating a
dead
consonant
. (See
Section 12.1, Devanagari
.) The sign varies in shape
from script to script, and may be known by other names in various
languages. For example, in Hindi it is known as
hal
or
halant
, in
Bangla it is called
hasant
, and in Tamil it is called
pulli
The Unicode Standard uses the term
virama
in three specialized senses:
For a visible mark which suppresses the inherent vowel of the consonant to
which it is applied. This corresponds to characters with
Indic_Syllabic_Category=Pure_Killer.
For invisible formatting characters that create the conjunct form
of a subsequent or preceding consonant or independent vowel, or of the
preceding or subsequent consonants combined. This corresponds to characters
with Indic_Syllabic_Category=Invisible_Stacker.
For characters that can act either as a Pure_Killer or an Invisible_Stacker
at the discretion of the font engineer. This corresponds to characters
with Indic_Syllabic_Category=Virama.
Visual Ambiguity
. A situation arising from two characters (or
sequences of characters) being rendered indistinguishably.
Visual Order
. Characters ordered as they are presented for reading.
(Contrast with
logical order
.)
Vocalization
. Marks placed above, below, or within consonants to
indicate vowels or other aspects of pronunciation. A feature of
Middle Eastern scripts.
Vowel
. 1. In phonetics, a
sound characterized by open articulation, whose quality is then
dependent on the shape of the vocal tract and the resulting harmonics
in the sound wave. 2. In writing systems such as alphabets, a character
used to represent a vowel sound.
3. Specifically in Korean, a jamo character with
the Hangul_Syllable_Type property value Vowel_Jamo (in the range
U+1161..U+11A2 or U+1160
hangul jungseong filler
). Abbreviated as
(See definition D125 in
Section 3.12, Conjoining Jamo Behavior
.)
Vowel Mark
. In many scripts, a mark used to indicate a vowel or
vowel quality.
Vowel Sign
. A combining mark
representing a dependent vowel, attached to a base character that usually represents a consonant.
The term
vowel sign
is usually applicable in discussion of
abugida
writing systems. For Indic writing systems, in particular,
vowel sign
is often used
as a synonym for
matra
Vrachy
. Greek term for breve accent, used in polytonic Greek character names.
W3C
. Acronym for World Wide Web Consortium.
wchar_t
. The ANSI C defined
wide character
type, usually implemented
as either 16 or 32 bits. ANSI specifies that wchar_t be an integral
type and that the C language source character set be mappable by
simple extension (zero- or sign-extension).
Well-Formed Code Unit Sequence
A code unit sequence that follows the specification of a Unicode encoding form. (See definition D85 in
Section 3.9, Unicode Encoding Forms
.)
Writing Direction
. The direction or orientation of writing
characters within lines of text in a writing system. Three
directions are common in modern writing systems: left to right,
right to left, and top to bottom.
Writing System
. A set of rules for using one or more scripts to
write a particular language. Examples include the American English
writing system, the British English writing system, the French
writing system, and the Japanese writing system.
X-Y
XML
. eXtensible Markup Language. A subset of SGML constituting a
particular text markup language for interchange of structured data.
The Unicode Standard is the reference character set for XML content.
(See also
SGML
and
rich text
.) XML is a trademark of the World Wide
Web Consortium.
Ypogegrammeni
. Greek term for subscript iota, used in polytonic Greek character names.
Y-variant
. Two CJK unified ideographs with identical semantics and
non-unifiable shapes, for example, U+732B and U+8C93. (See
Z-variant
Z-variant
. Two CJK unified ideographs with identical semantics and
unifiable shapes, for example, U+8AAA and U+8AAC. (See
Y-variant
.)
Zenkaku
. (See
fullwidth
.)
Zero Width
. Characteristic of some spaces or format control
characters that do not advance text along the horizontal baseline.
(See
nonspacing mark
.)
US