UTF-8, UTF-16, UTF-32 & BOM
General questions, relating to UTF or Encoding Form
Q: Is Unicode a 16-bit encoding?
In its first version, from 1991 to 1995, Unicode was a 16-bit encoding, but starting with Unicode 2.0 (July, 1996), the Unicode Standard has encoded characters in the range U+0000..U+10FFFF, which amounts to a 21-bit code space. Depending on the encoding form you choose (UTF-8, UTF-16, or UTF-32), each character will then be represented either as a sequence of one to four 8-bit bytes, one or two 16-bit code units, or a single 32-bit code unit.
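As an illustration (not part of the original FAQ text), here is a minimal C sketch of how a single supplementary character, U+1F600 GRINNING FACE, looks in each encoding form; the byte and code unit values follow from the standard conversions.

#include <stdint.h>

/* U+1F600 GRINNING FACE expressed in each Unicode encoding form */
const uint8_t  utf8_units [4] = { 0xF0, 0x9F, 0x98, 0x80 };  /* four 8-bit code units */
const uint16_t utf16_units[2] = { 0xD83D, 0xDE00 };          /* a surrogate pair      */
const uint32_t utf32_units[1] = { 0x0001F600 };              /* a single 32-bit unit  */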
Q: Can Unicode text be represented in more than one way?
There are several possible representations of Unicode data, including UTF-8, UTF-16 and UTF-32. They are all able to represent all of Unicode, but they differ for example in the number of bits for their constituent code units. In addition, there are compression transformations such as the one described in UTS #6: A Standard Compression Scheme for Unicode (SCSU).
Q: What is a UTF?
A Unicode transformation format (UTF) is an algorithmic mapping from every Unicode code point (except surrogate code points) to a unique byte sequence. The ISO/IEC 10646 standard uses the term “UCS transformation format” for UTF; the two terms are merely synonyms for the same concept.
Each UTF is reversible, thus every UTF supports lossless round tripping: mapping from any Unicode coded character sequence S to a sequence of bytes and back will produce S again. To ensure round tripping, a UTF mapping must have a mapping for all code points (except surrogate code points). This includes reserved or unassigned code points and the 66 noncharacters (including U+FFFE and U+FFFF). In addition to being lossless, UTFs are unique: any given coded character sequence will always result in the same sequence of bytes for a given UTF.
The SCSU compression method, even though it is reversible, is not a UTF because the same string can map to very many different byte sequences, depending on the particular SCSU compressor.
[AF]
Q: Where can I get more information on encoding forms?
For the formal definition of UTFs see Section 3.9, Unicode Encoding Forms in The Unicode Standard. For more information on encoding forms see UTR #17: Unicode Character Encoding Model.
[AF]
Q: How do I write a UTF converter?
The freely available open source project International Components for Unicode (ICU) has UTF conversion built into it. The latest version may be downloaded from the ICU Project GitHub site. Many other libraries may have built-in converters, so you may not have to write your own.
[AF]
Q: Are there any byte sequences that are not generated by a UTF? How should I interpret them?
None of the UTFs can generate every arbitrary byte sequence. For example, in UTF-8 every byte of the form 110xxxxx must be followed by a byte of the form 10xxxxxx. A sequence such as <110xxxxx 0xxxxxxx> is illegal, and must never be generated. When faced with this illegal byte sequence while transforming or interpreting, a UTF-8 conformant process must treat the first byte 110xxxxx as an illegal termination error: for example, either signaling an error, filtering the byte out, or representing the byte with a marker such as U+FFFD REPLACEMENT CHARACTER. In the latter two cases, it will continue processing at the second byte 0xxxxxxx. A conformant process must not interpret illegal or ill-formed byte sequences as characters; however, it may take error recovery actions. No conformant process may use irregular byte sequences to encode out-of-band information.
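As a hedged illustration of the error-handling options described above (the function and names below are illustrative, not from the standard), this C sketch decodes only ASCII bytes and two-byte sequences, and substitutes U+FFFD for an ill-formed lead byte before continuing at the next byte. Overlong and other checks are omitted to keep the sketch short.

#include <stdint.h>
#include <stddef.h>

/* Scan a buffer, emitting decoded code points; ill-formed bytes become U+FFFD. */
void scan_two_byte_sequences(const uint8_t *s, size_t len, void (*emit)(uint32_t cp))
{
    for (size_t i = 0; i < len; ) {
        if (s[i] < 0x80) {                        /* plain ASCII byte */
            emit(s[i]);
            i += 1;
        } else if ((s[i] & 0xE0) == 0xC0) {       /* lead byte 110xxxxx */
            if (i + 1 < len && (s[i + 1] & 0xC0) == 0x80) {
                emit(((uint32_t)(s[i] & 0x1F) << 6) | (s[i + 1] & 0x3F));
                i += 2;
            } else {
                emit(0xFFFD);                     /* illegal termination: U+FFFD */
                i += 1;                           /* continue at the next byte   */
            }
        } else {
            emit(0xFFFD);                         /* longer sequences omitted in this sketch */
            i += 1;
        }
    }
}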
Q: Which of the UTFs do I need to support?
UTF-8 is most common on the web. UTF-16 is used by Java and Windows (.Net). UTF-8 and UTF-32 are used by Linux and various Unix systems. The conversions between all of them are algorithmically based, fast and lossless. This makes it easy to support data input or output in multiple formats, while using a particular UTF for internal storage or processing.
[AF]
Q: What are some of the differences between the UTFs?
The following table summarizes some of the properties of each of the UTFs.

Name                        UTF-8     UTF-16    UTF-16BE    UTF-16LE       UTF-32    UTF-32BE    UTF-32LE
Smallest code point         0000      0000      0000        0000           0000      0000        0000
Largest code point          10FFFF    10FFFF    10FFFF      10FFFF         10FFFF    10FFFF      10FFFF
Code unit size              8 bits    16 bits   16 bits     16 bits        32 bits   32 bits     32 bits
Byte order                  N/A       <BOM>     big-endian  little-endian  <BOM>     big-endian  little-endian
Fewest bytes per character  1         2         2           2              4         4           4
Most bytes per character    4         4         4           4              4         4           4

In the table <BOM> indicates that the byte order is determined by a byte order mark, if present at the beginning of the data stream, otherwise it is big-endian.
[AF]
Q: Why do some of the UTFs have a BE or LE in their label, such as UTF-16LE?
UTF-16 and UTF-32 use code units that are two and four bytes long respectively. For these UTFs, there are three sub-flavors: BE, LE and unmarked. The BE form uses big-endian byte serialization (most significant byte first), the LE form uses little-endian byte serialization (least significant byte first) and the unmarked form uses big-endian byte serialization by default, but may include a byte order mark at the beginning to indicate the actual byte serialization used.
[AF]
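To make the BE/LE distinction concrete, here is a small illustrative C sketch (not from the FAQ) that serializes one UTF-16 code unit both ways; for U+4E8C the big-endian bytes are 4E 8C and the little-endian bytes are 8C 4E.

#include <stdint.h>

/* Serialize a single UTF-16 code unit into two bytes, big-endian. */
void utf16_to_be(uint16_t unit, uint8_t out[2]) {
    out[0] = (uint8_t)(unit >> 8);     /* most significant byte first  */
    out[1] = (uint8_t)(unit & 0xFF);
}

/* Serialize a single UTF-16 code unit into two bytes, little-endian. */
void utf16_to_le(uint16_t unit, uint8_t out[2]) {
    out[0] = (uint8_t)(unit & 0xFF);   /* least significant byte first */
    out[1] = (uint8_t)(unit >> 8);
}

/* For U+4E8C: utf16_to_be gives { 0x4E, 0x8C }, utf16_to_le gives { 0x8C, 0x4E }. */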
Q: Is there a standard method to package a Unicode character so it fits an 8-bit ASCII stream?
There are several options for making Unicode fit into an 8-bit format:

a. Use UTF-8. This preserves ASCII, but not Latin-1, because the characters >127 are different from Latin-1. UTF-8 uses the bytes in the ASCII range only for ASCII characters. Therefore, it works well in any environment where ASCII characters have a significance as syntax characters, e.g. file name syntaxes, markup languages, etc., but where all other characters may use arbitrary bytes.
For example: U+015B LATIN SMALL LETTER S WITH ACUTE would be encoded as two bytes: C5 9B. (A sketch of this encoding appears after this list.)

b. Use Java or C style escapes, of the form \uXXXX or \xXXXX. This format is not standard for text files, but well defined in the framework of the languages in question, primarily for source files.
For example: The Polish word “wyjście” with character U+015B LATIN SMALL LETTER S WITH ACUTE in the middle (ś is one character) would look like: “wyj\u015Bcie”.

c. Use the &#xXXXX; or &#DDD; numeric character escapes as in HTML or XML. Again, these are not standard for plain text files, but well defined within the framework of these markup languages.
For example: “wyjście” would look like “wyj&#x015B;cie”.

d. Use Punycode for converting labels that are part of network identifiers into a form compatible with ASCII labels. This method is required as part of IDNA 2008 and earlier for Internationalized Domain Names (IDN). It re-encodes Unicode into a subset of ASCII characters containing only the letters and digits.
For example: the domain name “wyjście.com” would look like “xn--wyjcie-5ib.com”, with the “xn--” prefix marking it as punycode and with any ASCII characters collected at the front.

e. Use SCSU. This format compresses Unicode into 8-bit format, preserving most of ASCII, but using some of the control codes as commands for the decoder. However, while ASCII text will look like ASCII text after being encoded in SCSU, other characters may occasionally be encoded with the same byte values, making SCSU unsuitable for 8-bit channels that blindly interpret any of the bytes as ASCII characters.
For example: “wyjście” would look like “<0x12>wyjÛcie”, where <0x12> indicates the byte 0x12 and “Û” corresponds to byte 0xDB.
[AF]
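The sketch referred to in option (a) above: a minimal, hedged C example (the function name is illustrative) of encoding a code point in the range U+0080..U+07FF as a two-byte UTF-8 sequence. For U+015B it produces the bytes C5 9B mentioned in the example.

#include <stdint.h>

/* Encode a code point in the range U+0080..U+07FF as two UTF-8 bytes. */
void utf8_encode_2byte(uint32_t cp, uint8_t out[2]) {
    out[0] = (uint8_t)(0xC0 | (cp >> 6));     /* 110xxxxx: top 5 bits */
    out[1] = (uint8_t)(0x80 | (cp & 0x3F));   /* 10xxxxxx: low 6 bits */
}

/* utf8_encode_2byte(0x015B, out) yields out = { 0xC5, 0x9B }. */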
Q: Which method of packing Unicode characters into an 8-bit stream is the best?
The choice of approach depends on the circumstances:

UTF-8 is the most widely used ASCII-compatible encoding form for Unicode. It is designed to be used transparently, meaning that any part of the data that was in ASCII is still in ASCII (and without change in relative location) and no other parts are. It is also reasonably compact and independent of byte order issues. It has become the preferred form for Unicode text files. The downside of UTF-8 is that without converting it into a format that can be displayed on your system, you cannot tell which non-ASCII characters are in your data.

Character escapes or numeric character entities let you see which code point has been escaped, even if you are working in an ASCII environment or don't have the font to view the character, or even when you might be unable to recognize the character. In the right circumstances, they are appropriate in source code or source documents. Character escapes and entities use more space, which makes them unattractive except for occasional use.

Punycode is required as part of the domain name protocols, but not suitable for anything but short strings.

SCSU was designed for compression of short strings. It uses the least space, but cannot be used transparently in most 8-bit environments.
[AF]
Q: Which of these formats is the “most standard”?
All of the methods listed in the answer above require that the receiver can understand that format, but (a) is the ASCII-compatible choice among the three otherwise equivalent standard Unicode Encoding Forms. It is not only widely supported, but the default format in many cases. The use of pure ASCII together with (b) or (c) outside of their given context in protocols supporting them would definitely be considered non-standard, but could be a good solution for internal data transmission. The use of SCSU is technically a standard (for compressed data streams) but few general purpose receivers support SCSU, so it is again most useful in internal or protocol-specific data transmission.
[AF]
UTF-8 FAQ
Q: What is the definition of UTF-8?
UTF-8 is the byte-oriented encoding form of Unicode. For details of its definition, see Section 2.5, Encoding Forms and Section 3.9, Unicode Encoding Forms in The Unicode Standard. See, in particular, Table 3-6 UTF-8 Bit Distribution and Table 3-7 Well-formed UTF-8 Byte Sequences, which give succinct summaries of the encoding form. Make sure you refer to the latest version of the Unicode Standard, as the Unicode Technical Committee has tightened the definition of UTF-8 over time to more strictly enforce unique sequences and to prohibit encoding of certain invalid characters. There is an Internet RFC 3629 about UTF-8. UTF-8 is also defined in Annex D of ISO/IEC 10646. See also the question above, How do I write a UTF converter?
Q: Does it matter for the UTF-8 encoding scheme if the underlying processor is little endian or big endian?
Since UTF-8 is interpreted as a sequence of bytes, there is no endian problem as there is for encoding forms that use 16-bit or 32-bit code units. Where a BOM is used with UTF-8, it is only used as an encoding signature to distinguish UTF-8 from other encodings — it has nothing to do with byte order.
[AF]
Q: Is the UTF-8 encoding scheme the same irrespective of whether the underlying system uses ASCII or EBCDIC encoding?
There is only one definition of UTF-8. It is precisely the same, whether the data were converted from ASCII or EBCDIC based character sets. However, byte sequences from standard UTF-8 won’t interoperate well in an EBCDIC system, because of the different arrangements of control codes between ASCII and EBCDIC. UTR #16, UTF-EBCDIC defines a specialized UTF that will interoperate in EBCDIC systems.
[AF]
Q: How do I convert a UTF-16 surrogate pair such as <D800 DC00> to UTF-8? As one 4-byte sequence or as two separate 3-byte sequences?
The definition of UTF-8 requires that supplementary characters (those using surrogate pairs in UTF-16) be encoded with a single 4-byte sequence. However, there is a widespread practice of generating pairs of 3-byte sequences in older software, especially software which pre-dates the introduction of UTF-16 or that is interoperating with UTF-16 environments under particular constraints. Such an encoding is not conformant to UTF-8 as defined. See UTR #26, Compatibility Encoding Scheme for UTF-16: 8-bit (CESU-8) for a formal description of such a non-UTF-8 data format. When using CESU-8, great care must be taken that data is not accidentally treated as if it was UTF-8, due to the similarity of the formats.
[AF]
Q: How do I convert an unpaired UTF-16 surrogate to UTF-8?
A different issue arises if an unpaired surrogate is encountered when converting ill-formed UTF-16 data. By representing such an unpaired surrogate on its own as a 3-byte sequence, the resulting UTF-8 data stream would become ill-formed. While it faithfully reflects the nature of the input, Unicode conformance requires that encoding form conversion always results in a valid data stream. Therefore a converter must treat this as an error.
[AF]
UTF-16 FAQ
Q: What is UTF-16?
UTF-16 uses a single 16-bit code unit to encode over 60,000 of the most common characters in Unicode, and a pair of 16-bit code units, called surrogates, to encode the remainder of about 1 million characters in Unicode.

Originally, Unicode was designed as a pure 16-bit encoding, aimed at representing all modern scripts. (Ancient scripts were to be represented with private-use characters.) Over time, and especially after the addition of over 14,500 composite characters for compatibility with legacy sets, it became clear that 16 bits were not sufficient for the user community. Out of this arose UTF-16.
[AF]
Q: What are surrogates?
Surrogates are code points from two special ranges of Unicode values, reserved for use as the leading and trailing values of paired code units in UTF-16. Leading surrogates, also called high surrogates, are encoded from D800₁₆ to DBFF₁₆, and trailing surrogates, or low surrogates, from DC00₁₆ to DFFF₁₆. They are called surrogates, since they do not represent characters directly, but only as a pair.
Q: What’s the algorithm to convert from UTF-16 to code points?
While the Unicode Standard used to contain a short algorithm, now it provides only a bit distribution table that shows the relation between surrogates and the resulting supplementary code points. Here are three short code snippets that translate the information from the bit distribution table into C code that will convert to and from UTF-16.

Using the following type definitions

#include <stdint.h>

typedef uint16_t UTF16;   /* 16-bit code unit */
typedef uint32_t UTF32;   /* 32-bit code unit / code point */

the first snippet calculates the high (or leading) surrogate from a character code C.

const UTF16 HI_SURROGATE_START = 0xD800;

UTF16 X = (UTF16) C;
UTF32 U = (C >> 16) & ((1 << 5) - 1);
UTF16 W = (UTF16) U - 1;
UTF16 HiSurrogate = HI_SURROGATE_START | (W << 6) | (X >> 10);

where “X”, “U” and “W” correspond to the labels used in Table 3-5 UTF-16 Bit Distribution. The next snippet does the same for the low surrogate.

const UTF16 LO_SURROGATE_START = 0xDC00;

UTF16 X = (UTF16) C;
UTF16 LoSurrogate = (UTF16) (LO_SURROGATE_START | (X & ((1 << 10) - 1)));

Finally, the reverse, where hi and lo are the high and low surrogate, and “C” the resulting character:

UTF32 X = ((hi & ((1 << 6) - 1)) << 10) | (lo & ((1 << 10) - 1));
UTF32 W = (hi >> 6) & ((1 << 5) - 1);
UTF32 U = W + 1;
UTF32 C = (U << 16) | X;

A caller would need to ensure that “C”, “hi”, and “lo” are in the appropriate ranges.
[AF]
Q: Is there a simpler way to do the conversion from UTF-16 to code points?
There is a much simpler computation that does not try to follow the bit distribution table.
// constants
const UTF32 LEAD_OFFSET = 0xD800 - (0x10000 >> 10);
const UTF32 SURROGATE_OFFSET = 0x10000 - (0xD800 << 10) - 0xDC00;

// from a supplementary code point to a surrogate pair
UTF16 lead = LEAD_OFFSET + (codepoint >> 10);
UTF16 trail = 0xDC00 + (codepoint & 0x3FF);

// from a surrogate pair (lead, trail) back to a code point
UTF32 codepoint = (lead << 10) + trail + SURROGATE_OFFSET;
[MD]
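As a usage sketch (an illustrative example assuming the constants and formulas above), the following complete program round-trips U+10437 DESERET SMALL LETTER YEE through a surrogate pair and back.

#include <stdio.h>
#include <stdint.h>

typedef uint16_t UTF16;
typedef uint32_t UTF32;

int main(void) {
    const UTF32 LEAD_OFFSET      = 0xD800 - (0x10000 >> 10);
    const UTF32 SURROGATE_OFFSET = 0x10000 - (0xD800 << 10) - 0xDC00;

    UTF32 cp = 0x10437;                               /* DESERET SMALL LETTER YEE */

    UTF16 lead  = (UTF16)(LEAD_OFFSET + (cp >> 10));  /* 0xD801 */
    UTF16 trail = (UTF16)(0xDC00 + (cp & 0x3FF));     /* 0xDC37 */

    UTF32 back = ((UTF32)lead << 10) + trail + SURROGATE_OFFSET;

    printf("U+%04X -> <%04X %04X> -> U+%04X\n",
           (unsigned)cp, (unsigned)lead, (unsigned)trail, (unsigned)back);
    return 0;
}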
Q: What are some of the considerations for UTF-16?
UTF-16 sometimes requires two code units to represent a single character. It is therefore a variable-width encoding, and just like in some of the East Asian legacy character sets such as Shift-JIS (SJIS), code units alternate between two widths. People familiar with these character sets are well acquainted with the problems that variable-width codes can cause. However, there are some important differences between the mechanisms used in SJIS and UTF-16:

Overlap:

In SJIS, there is overlap in numerical values between the leading and trailing code units, and between the trailing and single code units. This causes a number of problems:
It causes false matches. For example, searching for an “a” (hex 0x61) may match against the trailing code unit of a Japanese character.
It prevents efficient random access. To know whether you are on a character boundary, you have to search backwards to find a known boundary.
It makes the text extremely fragile. If a unit is dropped from a leading-trailing code unit pair, many following characters can be corrupted.

In UTF-16, the code point ranges for high and low surrogates, as well as for single units, are all completely disjoint. None of these problems occur:
There are no false matches.
The location of the character boundary can be directly determined from each code unit value.
A dropped surrogate will corrupt only a single character.

Frequency:

The vast majority of SJIS characters require 2 units, but characters using single units occur commonly and often have special importance, for example in file names.

With UTF-16, relatively few characters require 2 units. The vast majority of characters in common use are single code units. Even in East Asian text, the incidence of surrogate pairs should be well less than 1% of all text storage on average. (Certain documents, of course, may have a higher incidence of surrogate pairs, just as phthisique is a fairly infrequent word in English, but may occur quite often in a particular scholarly text.)

The recent increased popularity of emoji means that the percentage of widely used supplementary characters has also increased, and with it the support for surrogate pairs.
[AF]
Q: Will UTF-16 ever be extended to more than a million characters?
No. Both Unicode and ISO 10646 have policies in place that formally limit future code assignment to the integer range that can be expressed with current UTF-16 (0 to 1,114,111). Even if other encoding forms (i.e. other UTFs) can represent larger integers, these policies mean that all encoding forms will always represent the same set of characters. Over a million possible codes is far more than enough for the goal of Unicode of encoding characters, not glyphs. Unicode is not designed to encode arbitrary data. If you wanted, for example, to give each “instance of a character on paper throughout history” its own code, you might need trillions or quadrillions of such codes; noble as this effort might be, you would not use Unicode for such an encoding.
[AF]
Q: Are there any 16-bit values that are invalid?
Unpaired surrogates are invalid in UTFs. These include any value in the range D800₁₆ to DBFF₁₆ not followed by a value in the range DC00₁₆ to DFFF₁₆, or any value in the range DC00₁₆ to DFFF₁₆ not preceded by a value in the range D800₁₆ to DBFF₁₆.
[AF]
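A hedged C sketch of the check described above (function and names are illustrative): it scans a buffer of 16-bit code units and reports whether any surrogate is unpaired.

#include <stdint.h>
#include <stddef.h>

/* Returns 1 if every D800..DBFF is followed by DC00..DFFF and every
   DC00..DFFF is preceded by D800..DBFF, 0 otherwise. */
int utf16_surrogates_well_formed(const uint16_t *s, size_t len) {
    for (size_t i = 0; i < len; i++) {
        if (s[i] >= 0xD800 && s[i] <= 0xDBFF) {              /* leading surrogate */
            if (i + 1 >= len || s[i + 1] < 0xDC00 || s[i + 1] > 0xDFFF)
                return 0;                                    /* unpaired leading  */
            i++;                                             /* skip its trailing */
        } else if (s[i] >= 0xDC00 && s[i] <= 0xDFFF) {
            return 0;                                        /* unpaired trailing */
        }
    }
    return 1;
}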
Q: What about noncharacters? Are they invalid?
Not at all. Noncharacters are valid in UTFs and must be properly converted. For more details on the definition and use of noncharacters, as well as their correct representation in each UTF, see the Noncharacters FAQ.
Q: Because most supplementary characters are uncommon, does that mean I can ignore them?
Most supplementary characters (expressed with surrogate pairs in UTF-16) are not too common. However, that does not mean that supplementary characters should be neglected. Among them are a number of individual characters that are very popular, as well as many sets important to East Asian procurement specifications. Among the notable supplementary characters are:

many popular emoji and emoticons
symbols used for interoperating with Wingdings and Webdings
numerous small sets of CJK characters important for procurement, including personal and place names
variation selectors used for all ideographic variation sequences
numerous minority scripts important for some user communities
some highly salient historic scripts, such as Egyptian hieroglyphics

Ken Lunde has an interesting presentation file on this topic, with a Top Ten list: Why Support Beyond-BMP Code Points?
Q: How should I handle supplementary characters in my code?
Compared with BMP characters as a whole, the supplementary characters occur less commonly in text. This remains true now, even though many thousands of supplementary characters have been added to the standard, and a few individual characters, such as popular emoji, have become quite common. The relative frequency of BMP characters, and of the ASCII subset within the BMP, can be taken into account when optimizing implementations for best performance: execution speed, memory usage, and data storage.

Such strategies are particularly useful for UTF-16 implementations, where BMP characters require one 16-bit code unit to process or store, whereas supplementary characters require two.

Strategies that optimize for the BMP are less useful for UTF-8 implementations, but if the distribution of data warrants it, an optimization for the ASCII subset may make sense, as that subset only requires a single byte for processing and storage in UTF-8.
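Returning to the UTF-16 case, here is a hedged C sketch of such a BMP-optimized loop (names are illustrative): BMP characters take the single-unit fast path, and only surrogate pairs require the extra combining step. The input is assumed to be well formed.

#include <stdint.h>
#include <stddef.h>

/* Iterate over well-formed UTF-16, passing each code point to emit(). */
void for_each_code_point(const uint16_t *s, size_t len, void (*emit)(uint32_t cp))
{
    for (size_t i = 0; i < len; i++) {
        uint16_t unit = s[i];
        if (unit >= 0xD800 && unit <= 0xDBFF && i + 1 < len) {
            /* surrogate pair: combine lead and trail into a supplementary code point */
            uint32_t cp = 0x10000
                        + (((uint32_t)(unit - 0xD800)) << 10)
                        + (uint32_t)(s[i + 1] - 0xDC00);
            emit(cp);
            i++;                      /* consumed two code units    */
        } else {
            emit(unit);               /* single-unit BMP fast path  */
        }
    }
}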
Q: What is the difference between UCS-2 and UTF-16?
UCS-2 is obsolete terminology which refers to a Unicode implementation up to Unicode 1.1, before surrogate code points and UTF-16 were added to Version 2.0 of the standard. This term should now be avoided.

UCS-2 does not describe a data format distinct from UTF-16, because both use exactly the same 16-bit code unit representations. However, UCS-2 does not interpret surrogate code points, and thus cannot be used to conformantly represent supplementary characters.

Sometimes in the past an implementation has been labeled “UCS-2” to indicate that it does not support supplementary characters and doesn’t interpret pairs of surrogate code points as characters. Such an implementation would not handle processing of character properties, code point boundaries, collation, etc. for supplementary characters, nor would it be able to support most emoji, for example.
[AF]
UTF-32 FAQ
Q: What is UTF-32?
Any Unicode character can be represented as a single 32-bit unit in UTF-32. This single 4-byte code unit corresponds to the Unicode scalar value, which is the abstract number associated with a Unicode character. UTF-32 is a subset of the encoding mechanism called UCS-4 in ISO 10646. For more information, see Section 3.9, Unicode Encoding Forms in The Unicode Standard.
[AF]
Q: Should I use UTF-32 (or UCS-4) for storing Unicode strings in memory?
It may seem compelling to use UTF-32 as your internal string format because it uses one code unit per code point. However, Unicode characters are rarely processed in complete isolation. Combining character sequences may need to be processed as a unit, for example. This issue not only affects complex scripts, but also seemingly simple things like emoji, many of which are defined as combining sequences.

Defining your APIs so they work primarily with strings and substrings, instead of characters and character offsets, will make it easier to correctly support combining character sequences. This will also make the distinction between working in UTF-32 and other encoding forms less relevant.

The downside of UTF-32 is that it forces you to use 32 bits for each character, when only 21 bits are ever needed. The number of significant bits needed for the average character in common texts is much lower, making the ratio effectively that much worse. Increasing the storage for the same number of characters does have its cost in applications dealing with large volumes of text data: it can mean exhausting cache limits sooner; it can result in noticeably increased read/write times or in reaching bandwidth limits; and it requires more space for storage. What a number of implementations do is to represent strings with UTF-8 or UTF-16, but individual character values with UTF-32. For example, an API to retrieve character properties might use UTF-32 code units as parameters.

If you frequently need to access APIs that require string parameters to be in UTF-32, it may be more convenient to work with UTF-32 strings all the time. In many situations that does not matter, but the convenience of having a fixed number of code units per character can be the deciding factor.

The chief selling point for Unicode is providing a representation for all the world’s characters, eliminating the need for juggling multiple character sets and avoiding the associated data corruption problems. These features were enough to swing industry to the side of using Unicode as in-memory format. While a UTF-32 representation does make the programming model somewhat simpler, the increased average storage size has real drawbacks, making a complete transition to UTF-32 less compelling.
[AF]
Q: How about using UTF-32 interfaces in my APIs?
Except in some environments that store text as UTF-32 in memory, most Unicode APIs are using UTF-16. With UTF-16 APIs the low-level indexing is at the storage or code unit level, with higher-level mechanisms for graphemes or words specifying their boundaries in terms of the code units. This provides efficiency at the low levels, and the required functionality at the high levels.

If it is ever necessary to locate the nth character, indexing by character can be implemented as a high-level operation. However, while converting from such a UTF-16 code unit index to a character index or vice versa is fairly straightforward, it does involve a scan through the 16-bit units up to the index point. In a test run, for example, accessing UTF-16 storage as characters, instead of code units, resulted in a 10× degradation. While there are some interesting optimizations that can be performed, it will always be slower on average. Therefore locating other boundaries, such as grapheme, word, line or sentence boundaries, proceeds directly from the code unit index, not indirectly via an intermediate character code index.

In situations where it is necessary to work with the “units” that the user interacts with, indexing by Unicode character gives only limited advantage over indexing by code unit: many times what users perceive as a single unit, an emoji for example, is represented as a combining or other character sequence, and it makes little difference in iterating over such “units” whether the underlying code uses 16-bit or 32-bit code units.
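A hedged sketch of the kind of scan mentioned above (the function name is illustrative): converting a UTF-16 code unit index into a code point (character) index requires walking the 16-bit units from the start of the string.

#include <stdint.h>
#include <stddef.h>

/* Count the code points preceding a given code unit index in well-formed UTF-16. */
size_t code_unit_index_to_code_point_index(const uint16_t *s, size_t unit_index) {
    size_t points = 0;
    for (size_t i = 0; i < unit_index; i++) {
        /* a trailing surrogate continues the code point started by its lead unit */
        if (!(s[i] >= 0xDC00 && s[i] <= 0xDFFF))
            points++;
    }
    return points;
}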
Q: Is having only UTF-16 string APIs restrictive, as opposed to having UTF-32 char APIs?
Almost all international functions (upper-, lower-, titlecasing, case folding, drawing, measuring, collation, transliteration, grapheme-, word-, linebreaks, etc.) should take string parameters in the API, not single code points (UTF-32). Single code-point APIs almost always produce the wrong results except for very simple languages, either because you need more context to get the right answer, or because you need to generate a sequence of characters to return the right answer, or both.

For example, any Unicode-compliant collation (see UTS #10, Unicode Collation Algorithm (UCA)) must be able to handle sequences of more than one code point, and treat that sequence as a single entity. Trying to collate by handling single code points at a time would get the wrong answer. The same will happen for drawing or measuring text a single code point at a time; because scripts like Arabic are contextual, the width of x plus the width of y is not equal to the width of xy. Once you get beyond basic typography, the same is true for English as well; because of kerning and ligatures, the width of “fi” in a font may be different than the width of “f” plus the width of “i”. Casing operations must return strings, not single code points. In particular, the titlecasing operation requires strings as input, not single code points at a time.

Storing a single code point in a struct or class instead of a string would exclude support for graphemes, such as “ch” for Slovak, where a single code point may not be sufficient, but a character sequence is needed to express what is required. In other words, most API parameters and fields of composite data types should not be defined as a character, but as a string. And if they are strings, it does not matter what the internal representation of the string is.

Given that any industrial-strength text and internationalization support API has to be able to handle sequences of characters, it makes little difference whether the string is internally represented by a sequence of UTF-16 code units, or by a sequence of code points (= UTF-32 code units). Both UTF-16 and UTF-8 are designed to make working with substrings easy by the fact that the sequence of code units for a given code point is unique.
[AF]
Q: Are there exceptions to the rule of exclusively using string parameters in APIs?
The main exceptions are very low-level operations such as getting character properties (e.g. General Category or Canonical Combining Class in the UCD). For those it is handy to have interfaces that convert quickly to and from UTF-16 and UTF-32, and that allow you to iterate through strings returning UTF-32 values (even though the internal format is UTF-16).
Q: How do I convert a UTF-16 surrogate pair such as <D800 DC00> to UTF-32? As one 4-byte sequence or as two 4-byte sequences?
The definition of UTF-32 requires that supplementary characters (those using surrogate pairs in UTF-16) be encoded with a single 4-byte sequence.
Q: How do I convert an unpaired UTF-16 surrogate to UTF-32?
If an unpaired surrogate is encountered when converting ill-formed UTF-16 data, any conformant converter must treat this as an error. By representing such an unpaired surrogate on its own, the resulting UTF-32 data stream would become ill-formed. While it faithfully reflects the nature of the input, Unicode conformance requires that encoding form conversion always results in a valid data stream.
[AF]
Byte Order Mark (BOM) FAQ
Q: What is a BOM?
A byte order mark (BOM) consists of the character code U+FEFF at the beginning of a data stream, where it can be used as a signature defining the byte order and encoding form, primarily of unmarked plain text files. Under some higher level protocols, use of a BOM may be mandatory (or prohibited) in the Unicode data stream defined in that protocol.
[AF]
Q: Where is a BOM useful?
A BOM is useful at the beginning of files that are typed as text, but for which it is not known whether they are in big or little endian format—it can also serve as a hint indicating that the file is in Unicode, as opposed to in a legacy encoding, and furthermore, it acts as a signature for the specific encoding form used.
[AF]
Q: What does ‘endian’ mean?
Data types longer than a byte can be stored in computer memory with the most significant byte (MSB) first or last. The former is called big-endian, the latter little-endian. When data is exchanged, bytes that appear in the "correct" order on the sending system may appear to be out of order on the receiving system. In that situation, a BOM would look like 0xFFFE, which is a noncharacter, allowing the receiving system to apply byte reversal before processing the data. UTF-8 is byte oriented and therefore does not have that issue. Nevertheless, an initial BOM might be useful to identify the datastream as UTF-8.
[AF]
Q: Is a BOM used only in 16-bit Unicode text?
A BOM can be used as a signature no matter how the Unicode text is transformed: UTF-16, UTF-8, or UTF-32. The exact bytes comprising the BOM will be whatever the Unicode character U+FEFF is converted into by that transformation format. In that form, the BOM serves to indicate both that it is a Unicode file, and which of the formats it is in. Examples:

Bytes         Encoding Form
00 00 FE FF   UTF-32, big-endian
FF FE 00 00   UTF-32, little-endian
FE FF         UTF-16, big-endian
FF FE         UTF-16, little-endian
EF BB BF      UTF-8
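Here is a hedged C sketch of BOM sniffing based on the table above (the function name is illustrative); note that the UTF-32 patterns must be tested before the UTF-16 ones, since a little-endian UTF-32 BOM begins with the same two bytes FF FE as a little-endian UTF-16 BOM.

#include <stdint.h>
#include <stddef.h>

/* Returns a label for the BOM found at the start of the buffer, or NULL if none. */
const char *sniff_bom(const uint8_t *b, size_t len) {
    if (len >= 4 && b[0] == 0x00 && b[1] == 0x00 && b[2] == 0xFE && b[3] == 0xFF)
        return "UTF-32, big-endian";
    if (len >= 4 && b[0] == 0xFF && b[1] == 0xFE && b[2] == 0x00 && b[3] == 0x00)
        return "UTF-32, little-endian";
    if (len >= 3 && b[0] == 0xEF && b[1] == 0xBB && b[2] == 0xBF)
        return "UTF-8";
    if (len >= 2 && b[0] == 0xFE && b[1] == 0xFF)
        return "UTF-16, big-endian";
    if (len >= 2 && b[0] == 0xFF && b[1] == 0xFE)
        return "UTF-16, little-endian";
    return NULL;   /* no BOM: the encoding must be determined some other way */
}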
Q: Can a UTF-8 data stream contain the BOM character (in UTF-8 form)? If yes, then can I still assume the remaining UTF-8 bytes are in big-endian order?
Yes, UTF-8 can contain a BOM. However, it makes no difference as to the endianness of the byte stream. UTF-8 always has the same byte order. An initial BOM is only used as a signature — an indication that an otherwise unmarked text file is in UTF-8. Note that some recipients of UTF-8 encoded data do not expect a BOM. Where UTF-8 is used transparently in 8-bit environments, the use of a BOM will interfere with any protocol or file format that expects specific ASCII characters at the beginning, such as the use of “#!” at the beginning of Unix shell scripts.
[AF]
Q: What should I do with U+FEFF in the middle of a file?
In the absence of a protocol supporting its use as a BOM and when not at the beginning of a text stream, U+FEFF should normally not occur. For backwards compatibility it should be treated as ZERO WIDTH NON-BREAKING SPACE (ZWNBSP), and is then part of the content of the file or string. The use of U+2060 WORD JOINER is strongly preferred over ZWNBSP for expressing word joining semantics since it cannot be confused with a BOM. When designing a markup language or data protocol, the use of U+FEFF can be restricted to that of Byte Order Mark. In that case, any U+FEFF occurring in the middle of a file can be treated as an unsupported character.
[AF]
Q: I am using a protocol that has BOM at the start of text. How do I represent an initial ZWNBSP?
Use U+2060 WORD JOINER instead.
Q: How do I tag data that does not interpret U+FEFF as a BOM?
Use the tag UTF-16BE to indicate big-endian UTF-16 text, and UTF-16LE to indicate little-endian UTF-16 text. If you do use a BOM, tag the text as simply UTF-16.
[MD]
Q: Why wouldn’t I always use a protocol that requires a BOM?
Where the data has an associated type, such as a field in a database, a BOM is unnecessary. In particular, if a text data stream is marked as UTF-16BE, UTF-16LE, UTF-32BE or UTF-32LE, a BOM is neither necessary nor permitted. Any U+FEFF would be interpreted as a ZWNBSP.

Do not tag every string in a database or set of fields with a BOM, since it wastes space and complicates string concatenation. Moreover, it also means two data fields may have precisely the same content, but not be binary-equal (where one is prefaced by a BOM).
Q: How should I deal with BOMs?
Here are some guidelines to follow:

1. A particular protocol (e.g. Microsoft conventions for .txt files) may require use of the BOM on certain Unicode data streams, such as files. When you need to conform to such a protocol, use a BOM.

2. Some protocols allow optional BOMs in the case of untagged text. In those cases:
Where a text data stream is known to be plain text, but of unknown encoding, BOM can be used as a signature. If there is no BOM, the encoding could be anything.
Where a text data stream is known to be plain Unicode text (but not which endian), then BOM can be used as a signature. If there is no BOM, the text should be interpreted as big-endian.

3. Some byte oriented protocols expect ASCII characters at the beginning of a file. If UTF-8 is used with these protocols, use of the BOM as encoding form signature should be avoided.

4. Where the precise type of the data stream is known (e.g. Unicode big-endian or Unicode little-endian), the BOM should not be used. In particular, whenever a data stream is declared to be UTF-16BE, UTF-16LE, UTF-32BE or UTF-32LE, a BOM must not be used. (See also Q: What is the difference between UCS-2 and UTF-16?)
[AF]