UTN #6: BOCU-1

UTN #6: BOCU-1
Technical Notes
Unicode Technical Note #6
BOCU-1:
MIME-Compatible Unicode Compression
Version
Authors
Markus W.
Scherer
Mark
Davis
Date
2006-02-04
This Version
Previous Version
Latest Version
Summary
This document describes a Unicode compression scheme that is MIME-compatible (directly usable
for emails) and preserves binary order (for databases and sorted lists). BOCU-1 is a MIME-compatible application of the
Binary
Ordered Compression for Unicode
BOCU
] base algorithm.
Status
This document is a
Unicode
Technical Note
. It is supplied purely for informational purposes and
publication does not imply any endorsement by the Unicode
Consortium. For general information on Unicode Technical Notes, see
Contents
Introduction
Features of BOCU-1
Compression
Binary Order
MIME Compatibility
Details
Signature Byte Sequence
Encoding Algorithm
Sample C Code
BOCU-1 Performance
SCSU vs. BOCU-1
Test Setup
Intellectual Property
References
Modifications
Introduction
The most popular encoding for Unicode text data exchange is UTF-8. It is
relatively simple and widely applicable: MIME/email/HTML/XML, in-process use in
some systems, etc. However, UTF-8 uses more bytes per code point for non-Latin
scripts than language-specific codepages.
In some markets where scripts other than Latin are used, the high bytes/code
point ratio of UTF-8 (and of UTF-16 for scripts other than Latin and CJK
ideographs) has been criticized and used as a motivation to not use Unicode for
the exchange of documents in some languages. Larger document sizes are also
a problem in low-bandwidth environments such as wireless networks for handheld
systems.
The Standard Compression Scheme for Unicode [
SCSU
] was created as a Unicode compression scheme with a byte/code point ratio
similar to language-specific codepages. It has not been widely adopted although
it fulfills the criteria for an IANA charset and is registered as one. [
SCSU
] is not suitable for MIME
“text” media types, i.e.,
it cannot be used directly in emails and similar protocols. [
SCSU
] requires a
complicated encoder design for good performance.
BOCU-1 combines the wide applicability of UTF-8 with the compactness of [
SCSU
].
It is useful for short strings and maintains code point order.
Features of BOCU-1
2.1
Compression
The byte/code point ratio is 1 for runs of code points from the same block of
0x80 code points (and for Hiragana), and 2 for runs of CJK Unihan code points, as with [
SCSU
]. The
startup overhead is very low (similar to [
SCSU
]), which makes it useful for very short strings like
individual names. The maximum number of bytes per code point is four.
With BOCU-1, texts in languages with Latin-based scripts take about the same amount of space as
with UTF-8, while texts in all other languages take about 25%-60% less space. Compared to UTF-16,
texts in all languages with small character repertoires take approximately half as much space in
BOCU-1. For large character sets, i.e., Chinese/Japanese/Korean, texts are about the same size for
UTF-16 and BOCU-1.
2.2
Binary Order
The lexical order of BOCU-1 bytes is the same
as the code point order of the original text — like UTF-8 but unlike [
SCSU
which allows the compression of large, sorted lists of strings. This makes BOCU-1 suitable for use
in databases to reduce storage requirements as described above.
2.3
MIME Compatibility
The C0 control codes NUL, CR and LF and nine others are encoded with the same
byte values as in US-ASCII, and those byte values are used
only
for the
encoding of these twelve C0 control codes. This makes BOCU-1 suitable for MIME “text” media types, directly usable in emails and generally
“friendly” for
ASCII-based tools. The SUB control and its byte value 1A is included in this set
to avoid problems in DOS/Windows/OS/2 systems where 1A is interpreted as an
end-of-file marker in text mode.
2.4
Details
BOCU-1 is byte-oriented and stateful. Its
inter-character state consists of a single integer. It is deterministic, i.e.,
the
same complete input text is always encoded the same way by all encoders (unlike with [
SCSU
]).
Like [
SCSU
] it can be
classified as a Character Encoding Scheme (CES) or as a Transfer Encoding Syntax (TES). It is a “charset” according to
IANA, and it is suitable for
MIME “text” media types.
The state is reset at each C0 control (U+0000..U+001F, includes CR, LF, TAB). CRLF-separated lines do not affect each
other’s encoding. Together with BOCU-1
being deterministic, this allows line-based file comparisons (diff) and makes BOCU-1 usable
with RCS, CVS and other source-code control systems (unlike [
SCSU
]). This also allows some
limited random access.
Byte values for single-byte codes and lead bytes overlap with trail bytes.
So unlike UTF-8, character boundaries cannot be determined in random access, except
by backing up to a reset point. Byte values 7F..9F (DEL and C1 control codes) are used as lead and trail bytes.
US-ASCII characters (code points U+0021..U+007E) are not encoded with the
same bytes as in US-ASCII. Therefore, the charset must be specified with a
signature byte sequence or in a higher-level protocol.
As a single/lead byte, byte value FF is used as a special “reset-state-only” command.
It does not encode any code point. FF is also a normal trail byte. Having a “reset only” command allows simple concatenation of BOCU-1 byte streams.
(All other BOCU-1 byte sequences would append some code point.)
Using FF to reset the state breaks the ordering!
The use of FF resets is discouraged.
Byte stream concatenation without resetting with FF requires to scan back to
a C0 control whose byte value is not used for trail bytes
(last known reset to initial state);
then decode to the end of the first stream and encode the first non-U+0020 code point
of the second stream according to the current state;
then append the rest of the second stream.
The same procedure could be used to remove an FF reset command.
Signature Byte Sequence
An initial U+FEFF is encoded in BOCU-1 with the three bytes FB EE 28.
Encoding Algorithm
The basic algorithm is as described in
Binary
Ordered Compression for Unicode
BOCU
].
BOCU-1 differs from the generic algorithm by using a different set of byte
value ranges and by encoding U+0000..U+0020 directly with byte values 00..20. In
addition, the space character U+0020 does not affect the state. This is to avoid large difference values at the beginning and end of
each word.
Partial pseudo-code for a per-code point encoding function is as follows:
encode(int &prev, int c) {
if(c<=0x20) {
output (byte)c;
if(c!=0x20) {
prev=0x40;
} else {
int diff=c-prev;
// encode diff in 1..4 bytes and output them

// adjust prev
if(c is Hiragana) {
prev=middle of Hiragana;
} else if(c is CJK Unihan) {
prev=middle of CJK Unihan;
} else if(c is Hangul) {
prev=middle of Hangul;
} else {
prev=(c&~0x7f)+0x40;
Sample C Code
The sample C code serves as the full specification of
BOCU-1. Every conformant encoder and decoder must generate equivalent output and
detect any illegal input code points and illegal input byte sequences. Recovery
from illegal input is not specified. Single surrogates are encoded if present in
the input (e.g., unmatched single surrogate code units in UTF-16). Proper input
of supplementary code points (e.g., matched surrogate pairs in UTF-16) must be
encoded by code points.
This code uses
International Components for Unicode
ICU
] standard
headers and the one implementation file
icu/source/common/utf_impl.c
(It is not necessary to link the entire [
ICU
] common library.) This is for
convenience in the surrounding test functions and not necessary for the core
BOCU-1 functions. These headers and implementation file provide the following:
Include inttypes.h or define its types.
Define UChar for UTF-16 as an unsigned 16-bit type (wchar_t or uint16_t).
Define UTF* macros to handle reading and writing of in-process UTF-8/16
strings.
This code is under the
license (ICU version)
ICULicense
].
Files:
bocu1.h
(constants and macros)
bocu1.c
(encoder and decoder functions)
bocu1tst.c
(test code with
main()
function, see below)
A complete, compiled sample
executable for Windows®
from this source code is available for download. Aside from basic implementation
and consistency tests, this also provides file conversion between UTF-8 and
BOCU-1. Use a command-line argument of “?” or “-h” for
usage.
BOCU-1 Performance
This is a performance comparison between BOCU-1,
UTF-8, and [
SCSU
] when implemented as [
ICU
] converters,
which convert to and from UTF-16. The time is for roundtrip conversions from the internal UTF-16 form to the encoding and back.
Values are relative to UTF-8.
BOCU-1
SCSU
Languages
Size of text
Time to convert
to/from UTF-16
Size of text
Time to convert
to/from UTF-16
English/French
100%
160..170%
100%
125%
Greek/Russian/Arabic/Hebrew
60%
65..70%
55%
70%
Hindi
40%
45%
40%
45%
Thai (
see below
45%
60%
40%
55%
Japanese
60%
150%
55%
110%
Korean
75%
155%
85%
70%
Chinese
75%
165%
70%
65%
(Smaller percentages are better. Percentages are rounded to the nearest 5%.)
The compression ratio is smaller for web pages (lots of ASCII in HTML).
The performance difference tends to be smaller for smaller buffers.
When the text is transmitted between machines (emails, web pages),
then the transmission time may swamp the conversion time.
Smaller text will then transmit faster.
6.1
SCSU vs. BOCU-1
SCSU
] and BOCU-1 have similar compression characteristics.
SCSU
] tends to be faster than BOCU-1.
BOCU-1 is applicable in
many
more environments
(it is MIME-friendly, CVS-friendly, and has the same binary order as UTF-8).
The optimized [
ICU
] converter for BOCU-1 is somewhat less complex than the one for [
SCSU
],
and the output is deterministic, unlike with [
SCSU
].
6.2
Test Setup
The texts are the
“What is
Unicode” pages from www.unicode.org
, except for Thai.
Note that english.html contains non-ASCII characters in the index sidebar. The
Thai
text,
th18057.txt
has a different structure: It is a Thai word list from [
ICU
]’s test data, with one Thai word on each line.
Compared to the other texts, it contains only a few characters between CRLF.
This comparison uses full-fledged [
ICU
] converters for UTF-8, [
SCSU
] and BOCU-1.
“Full-fledged [
ICU
] converter” means that this is with the [
ICU
conversion API,
designed for external encodings, as used e.g. by an XML parser or web browser.
The [
ICU
] converter code for [
SCSU
] and BOCU-1 that is tested here
is part of [
ICU
] 2.2.
The [
SCSU
] converter was optimized slightly (conversion function variants without offset handling).
The BOCU-1 converter is optimized compared to the reference code
in the design document. The test machine is a Pentium 2 laptop.
Intellectual Property
Letter regarding licensing of IBM Technology. Hardcopy is
on file with the Chair of the UTC.
For information, see the
Unicode
Consortium Patent Policy
January 23, 2006
Ms. Lisa Moore, UTC Chair
The Unicode Consortium
P.O. Box 391476
Mountain View, CA 94039-1476
VIA: US Mail and e-mail
SUBJECT: Binary Ordered Compression for Unicode (BOCU)
IBM would like to bring to your attention, US Patent 6737994
"Binary-Ordered Compression For Unicode", which may contain claims
necessary to, or which may facilitate the implementation of, BOCU-1. Because
we believe that access to this patent may be important to the successful
implementation of BOCU-1, and although not required to do so by the Unicode
Consortium’s IP policies, IBM would like to offer a royalty free license to
this patent upon request to implementers of a fully compliant version of
BOCU-1.
A party wishing to request a license should contact:
IBM Director of Licensing
North Castle Drive
Armonk, NY 10504
Sincerely,
Marcia Courtemanche
Program Director
Intellectual Property and Standards
References
BOCU
Binary Ordered Compression for Unicode
Describes the base algorithm, of which BOCU-1 is a profile. It
serves as background information and is not necessary for the
specification of BOCU-1.
SCSU
Unicode Technical Standard #6: A Standard Compression Scheme for Unicode
ICULicense
The X license, modified for ICU
ICU
International Components for Unicode
Open-source C/C++ and Java libraries for Unicode and I18N support.
Modifications
The following summarizes modifications from the previous version of this
document.
2008.10.01
Updated stale links in version 2
Added Intellectual Property section and updated URLs
Initial version
© 2002-2006 Markus W. Scherer, Mark Davis. This publication is
protected by copyright, and permission must be obtained from the author and
Unicode, Inc. prior to any reproduction, modification, or other use not
permitted by the
Terms of
Use
Use of this publication is governed by the Unicode
. The
authors, contributors, and publishers have taken care in the preparation of
this publication, but make no express or implied representation or warranty
of any kind and assume no responsibility or liability for errors or
omissions or for consequential or incidental damages that may arise
therefrom. This publication is provided “AS-IS” without charge as a
convenience to users.
Unicode and the Unicode Logo are registered trademarks
of Unicode, Inc. in the United States and other countries.