International Components for Unicode
From Wikipedia, the free encyclopedia
Software library
Developer: Unicode Consortium
Initial release: 1999
Stable release: 78.3 / 17 March 2026
Written in: C++ (C++11) and Java 8+
Operating system: Cross-platform
Type: Libraries for Unicode and internationalization
License: Unicode License
Website: icu.unicode.org
Repository: github.com/unicode-org/icu
International Components for Unicode (ICU) is an open-source project of mature C++ and Java libraries for Unicode support, software internationalization, and software globalization. ICU is widely portable to many operating systems and environments. It gives applications the same results on all platforms and between C, C++, and Java software. The ICU project is a technical committee of the Unicode Consortium and is sponsored, supported, and used by IBM and many other companies.
ICU has been included as a standard component with Microsoft Windows since Windows 10 version 1703.
ICU provides the following services: Unicode text handling, full character properties, and character set conversions; Unicode regular expressions; full Unicode sets; character, word, and line boundaries; language-sensitive collation and searching; normalization, upper and lowercase conversion, and script transliterations; comprehensive locale data and resource bundle architecture via the Common Locale Data Repository (CLDR); multiple calendars and time zones; and rule-based formatting and parsing of dates, times, numbers, currencies, and messages. ICU historically provided a complex text layout service for Arabic, Hebrew, Indic, and Thai, but it was deprecated in version 54 and completely removed in version 58 in favor of HarfBuzz.
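Three of the services listed above (normalization, word boundaries, and language-sensitive collation) can be illustrated with a minimal sketch. It deliberately uses the JDK's own java.text classes rather than ICU4J, so that it runs without ICU on the classpath; this is an illustration of the concepts, not of ICU's API. The class name TextServicesSketch and its helper methods are hypothetical names chosen for this example.

```java
import java.text.BreakIterator;
import java.text.Collator;
import java.text.Normalizer;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

// Hypothetical helper class; uses only java.text from the JDK, not ICU4J.
final class TextServicesSketch {

    /** Canonical composition (NFC): "e" followed by U+0301 becomes the single code point U+00E9. */
    static String toNfc(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFC);
    }

    /** Splits text at word boundaries and keeps only segments that start with a letter or digit. */
    static List<String> words(String text, Locale locale) {
        BreakIterator it = BreakIterator.getWordInstance(locale);
        it.setText(text);
        List<String> out = new ArrayList<>();
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            String piece = text.substring(start, end);
            if (Character.isLetterOrDigit(piece.codePointAt(0))) {
                out.add(piece);
            }
        }
        return out;
    }

    /** Returns a copy of the items sorted with the locale's collation rules instead of code point order. */
    static String[] sortFor(Locale locale, String... items) {
        String[] copy = items.clone();
        Arrays.sort(copy, Collator.getInstance(locale));
        return copy;
    }

    public static void main(String[] args) {
        System.out.println(toNfc("e\u0301").length());                  // 1
        System.out.println(words("Hello, world!", Locale.ENGLISH));     // [Hello, world]
        // Plain String.compareTo would put "Zebra" first, because 'Z' precedes 'a' in code point order.
        System.out.println(Arrays.toString(sortFor(Locale.ENGLISH, "Zebra", "apple"))); // [apple, Zebra]
    }
}
```

The collation call shows why language-sensitive comparison matters: a raw code point sort orders all uppercase ASCII letters before lowercase ones, which is rarely what a user expects.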
ICU provides more extensive internationalization facilities than the standard libraries for C and C++. ICU 75, planned for April 2024, raised the requirement to C++17 (up from C++11) or C11 (up from C99), depending on which language is used. ICU has historically used UTF-16, and still does exclusively for Java; for C and C++, UTF-8 is also supported, including the correct handling of "illegal UTF-8".
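The practical difference between UTF-16 and UTF-8 is the size of the code unit, so the same text has a different length in each encoding. The following is a small sketch using only the JDK (Java strings are UTF-16 internally); the class name EncodingUnitsSketch and its methods are hypothetical and are not part of ICU.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper class; JDK only, shown to contrast code units, code points, and bytes.
final class EncodingUnitsSketch {

    /** Number of 16-bit UTF-16 code units, which is what String.length() reports. */
    static int utf16Units(String s) {
        return s.length();
    }

    /** Number of Unicode code points (a surrogate pair counts once). */
    static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    /** Number of bytes when the same text is encoded as UTF-8. */
    static int utf8Bytes(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        // "a", U+00E9 (e with acute), and U+1F600 (an emoji outside the Basic Multilingual Plane).
        String sample = "a\u00e9\uD83D\uDE00";
        System.out.println(utf16Units(sample)); // 4
        System.out.println(codePoints(sample)); // 3
        System.out.println(utf8Bytes(sample));  // 7
    }
}
```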
ICU 73.2 made significant changes for GB18030-2022 compliance support, i.e. for Chinese (the updated Chinese GB18030 Unicode Transformation Format standard is slightly incompatible with the earlier version). It has "a modified character conversion table, mapping some GB18030 characters to Unicode characters that were encoded after GB18030-2005" and a number of other changes, such as improved Japanese and Korean short-text line breaking, and in "English, the name “Türkiye” is now used for the country instead of “Turkey” (the alternate spelling is also available in the data)."
ICU 74 "updates to Unicode 15.1, including new characters, emoji, security mechanisms, and corresponding APIs and implementations. [..] ICU 74 and CLDR 44 are major releases, including a new version of Unicode and major locale data improvements." Among the many changes are person name formatting, improved language support (e.g. for Low German), and a new spoof checker API that follows the Unicode 15.1.0 version of UTS #39: Unicode Security Mechanisms.
Older version details
ICU 72 updated to Unicode 15 (and ICU 74 to 15.1). "In many formatting patterns, ASCII spaces are replaced with Unicode spaces (e.g., a "thin space")." ICU4J now requires Java 8, but "Most of the ICU 72 library code should still work with Java 7 / Android API level 21, but we no longer test with Java 7."[10]
ICU 71 added, among other things, phrase-based line breaking for Japanese (earlier methods did not work well for short Japanese text, such as in titles and headings) and support for Hindi written in Latin letters (hi_Latn), also referred to as "Hinglish". ICU 70 added support for emoji properties of strings and can be built and used with C++20 compilers (and "ICU operator==() and operator!=() functions now return bool instead of UBool, as an adjustment for incompatible changes in C++20"),[11] and as of that version the minimum Windows version is Windows 7. ICU 67 handles the removal of Great Britain from the EU. ICU 64.2 added support for Unicode 12.1, i.e. the single new symbol for the current Japanese Reiwa era (support for it has also been backported to older ICU versions down to ICU 4.8.2). ICU 58 (with Unicode 9.0 support) is the last version to support older platforms such as Windows XP and Windows Vista. Support for AIX, Solaris, and z/OS may also be limited in later versions (i.e. building depends on compiler support).[12]
Origin and development
After Taligent became part of IBM in early 1996, Sun Microsystems decided that the new Java language should have better support for internationalization. Since Taligent had experience with such technologies and was close geographically, its Text and International group was asked to contribute the international classes to the Java Development Kit as part of the JDK 1.1 internationalization APIs.[13] A large portion of this code still exists in the java.text and java.util packages. Further internationalization features were added with each later release of Java.
The Java internationalization classes were then ported to C++ and C[14] as part of a library known as ICU4C ("ICU for C"). The ICU project also provides ICU4J ("ICU for Java"), which adds features not present in the standard Java libraries. ICU4C and ICU4J are very similar, though not identical; for example, ICU4C includes a Regular Expression API, while ICU4J does not. Both frameworks have been enhanced over time to support new facilities and new features of Unicode and the Common Locale Data Repository (CLDR).
ICU was released as an open-source project in 1999 under the name IBM Classes for Unicode. It was later renamed to International Components For Unicode.[15] In May 2016, the ICU project joined the Unicode Consortium as technical committee ICU-TC, and the library sources are now distributed under the Unicode license.[16]
MessageFormat
A part of ICU is the MessageFormat class, a formatting system that allows for any number of arguments to control the plural form (plural, selectordinal) or more general switch-case-style selection (select) for things like grammatical gender. These statements can be nested.[17] ICU MessageFormat was created by adding the plural and selection system to an identically-named system in Java SE.
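The Java SE class that ICU MessageFormat extends can be sketched as follows. This example uses java.text.MessageFormat from the JDK with its older choice sub-format, which selects text by numeric range; it does not understand ICU's plural, selectordinal, or select keywords, which require ICU itself. The class name ChoiceMessageSketch, the pattern, and the describe method are hypothetical and exist only for illustration.

```java
import java.text.MessageFormat;
import java.util.Locale;

// Hypothetical example class; uses the Java SE MessageFormat, not the ICU one.
final class ChoiceMessageSketch {

    // Argument {0} drives a numeric "choice": exactly 0, exactly 1, or greater than 1.
    // The last branch nests another format element for the same argument.
    static final String PATTERN =
        "{0,choice,0#No files|1#One file|1<{0,number,integer} files} on disk {1}.";

    static String describe(int count, String disk) {
        MessageFormat format = new MessageFormat(PATTERN, Locale.ENGLISH);
        return format.format(new Object[] { count, disk });
    }

    public static void main(String[] args) {
        System.out.println(describe(0, "C")); // No files on disk C.
        System.out.println(describe(1, "C")); // One file on disk C.
        System.out.println(describe(3, "C")); // 3 files on disk C.
    }
}
```

Numeric ranges like these only work for languages whose plural rules happen to match them, which is one motivation for the locale-aware plural and select system that ICU adds on top.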
Alternatives
An alternative to using ICU directly from C++ is Boost.Locale, a C++ wrapper for ICU (which also allows other backends[18]). The argument for using it rather than ICU directly is that ICU "is absolutely unfriendly to C++ developers. It ignores popular C++ idioms (the STL, RTTI, exceptions, etc), instead mostly mimicking the Java API."[19][20] Another claim, that ICU only supports UTF-16 (and should therefore be avoided), is no longer true, since ICU now also supports UTF-8 for C and C++.
See also
Apple Advanced Typography
Apple Type Services for Unicode Imaging
gettext
Graphite (smart font technology)
NetRexx (ICU license)
OpenType
Pango
Uconv
Uniscribe
References
1. unicode-org. "Release ICU 78.3 · unicode-org/icu". Retrieved 18 March 2026.
2. "ICU - International Components for Unicode". site.icu-project.org. Archived from the original on 2021-08-27. Retrieved 2011-11-14.
3. Chen, Raymond (27 May 2021). "How can I convert between IANA time zones and Windows registry-based time zones?". The Old New Thing. Microsoft.
4. "Layout Engine - ICU User Guide". userguide.icu-project.org.
5. "UTF-8". ICU Documentation. Retrieved 2022-05-24.
6. "UTF-8 - ICU User Guide". userguide.icu-project.org. Retrieved 2018-04-03.
7. "#13311 (change illegal-UTF-8 handling to Unicode "best practice")". bugs.icu-project.org. Retrieved 2018-04-03.
8. "ICU - International Components for Unicode - ICU 73". icu.unicode.org. Retrieved 2023-09-24.
9. "ICU - International Components for Unicode - ICU 74". icu.unicode.org. Retrieved 2023-11-29.
10. "ICU - International Components for Unicode - ICU 72". icu.unicode.org. Retrieved 2023-01-24.
11. "ICU - International Components for Unicode - ICU 70". icu.unicode.org. Retrieved 2023-01-24.
12. "Download ICU 64 - ICU - International Components for Unicode". site.icu-project.org. Retrieved 2019-10-20.
13. Laura Werner (1999). "Getting Java ready for the world: A brief history of IBM and Sun's internationalization efforts". Archived from the original on 2021-11-17. Retrieved 2007-05-23.
14. "ICU User Guide". userguide.icu-project.org.
15. "ICU Project Management Committee". Archived from the original on 2021-08-28. Retrieved 2012-08-17.
16. "ICU joins the Unicode Consortium". Unicode, Inc. 2016-05-16. Retrieved 2016-08-01.
17. "Formatting Messages". ICU User Guide.
18. "Boost.Locale: Using Localization Backends". www.boost.org. Retrieved 2022-05-24.
19. "Boost.Locale: Design Rationale". www.boost.org. Retrieved 2022-05-24.
20. "ICU vs Boost Locale in C++". Stack Overflow. Retrieved 2022-05-24.
External links
Official website
International Components for Unicode transliteration services
ICU Editor with Visual Preview