UTF : Java Glossary

UTF
UTF (Unicode Transformation Format) is a family of encodings for storing Unicode data on disk. They store either 16-bit or 32-bit Unicode characters.

UTF-8 is especially compact when most of your characters are in the range 0..0x7f (ordinary 7-bit ASCII (American Standard Code for Information Interchange)). It uses a mixture of 8, 16, 24 and 32-bit codes. UTF-8 and ISO-8859-1 encode 7-bit characters, 0x00..0x7f, identically, but after that they are quite different. To the casual glance, UTF-8 looks like ISO-8859-1 sprinkled with odd combinations of three glyphs that stand for characters like “. You need a modern text editor that handles UTF-8 to view it properly. RFC 3629 officially describes the UTF-8 format.

If you’re viewing a file and it contains bits of gibberish beginning with â€, chances are the file is encoded in UTF-8, but your viewer thinks it is in ISO-8859-1.

UTF-16 normally uses purely 16-bit codes, either big or little endian. It can be extended to also encode 32-bit Unicode.

UTF-32 uses purely 32-bit codes, either big or little endian.

UTF-7 encodes 16-bit Unicode using only 7-bit ASCII characters.

Byte Order Marks
How UTF-8 Works
UTF-8 Encoding
UTF-8 Decoding
UTF-8 Fine Points
UTF-7
UTF-16
UTF-32
DataOutputStream.writeUTF
Exploring
32-bit Unicode
Debugging
Notepad UTF
Links

Byte Order Marks

There are two different standards: Unicode, which assigns glyphs to numbers, and UTF, which describes how you encode these numbers in a file. Byte order marks are part of the UTF standard, not the Unicode standard. BOMs (Byte Order Marks) are special characters at the beginning of a Unicode file that indicate whether it is big or little endian, in other words, whether the high or low order byte comes first. These codes also tell whether the encoding is 8, 16 or 32-bit. You can recognise Unicode files by their starting byte order marks, and by the way UTF-16 files of mostly ASCII text are half zeroes and UTF-32 files are three-quarters zeroes.
UTF BOM (Byte Order Mark) Unicode-encoding Endian Indicators

0xfeff BOM as it appears encoded    Description
ef bb bf                            UTF-8. Endianness, strictly speaking, does not apply, though it uses a big-endian, most-significant-byte-first representation.
fe ff                               UTF-16 for 16-bit internal UCS-2, big endian, Java network order.
ff fe                               UTF-16 for 16-bit internal UCS-2, little endian, Intel/Microsoft order. Note you must examine subsequent bytes to tell this apart from a UTF-32 BOM since they both start ff fe.
00 00 fe ff                         UTF-32 for 32-bit internal UCS-4, big endian, Java network order.
ff fe 00 00                         UTF-32 for 32-bit internal UCS-4, little endian, Intel/Microsoft order.

The actual Unicode character encoded in all cases is 0xfeff.
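The BOM byte patterns described here can be checked mechanically. The following is only a minimal sketch of that idea, not a definitive detector: the class name BomSniffer and the method sniff are hypothetical names invented for this illustration. It tests the four-byte UTF-32 marks before the two-byte UTF-16 marks, because ff fe 00 00 also begins with ff fe. One assumption to be aware of: a little-endian UTF-16 file whose first character after the BOM is 0x0000 would be misreported as UTF-32.

```java
/** Sketch: guess the encoding of a file from its leading BOM bytes. Hypothetical helper. */
final class BomSniffer {

    /** Returns an encoding name, or null if the bytes do not start with a known BOM. */
    static String sniff(byte[] head) {
        // Check the 4-byte UTF-32 marks first: ff fe 00 00 also starts with
        // the 2-byte UTF-16 little-endian mark ff fe.
        if (startsWith(head, 0x00, 0x00, 0xfe, 0xff)) return "UTF-32BE";
        if (startsWith(head, 0xff, 0xfe, 0x00, 0x00)) return "UTF-32LE";
        if (startsWith(head, 0xef, 0xbb, 0xbf)) return "UTF-8";
        if (startsWith(head, 0xfe, 0xff)) return "UTF-16BE";
        if (startsWith(head, 0xff, 0xfe)) return "UTF-16LE";
        return null;
    }

    /** True if head begins with the given unsigned byte values. */
    private static boolean startsWith(byte[] head, int... mark) {
        if (head.length < mark.length) return false;
        for (int i = 0; i < mark.length; i++) {
            if ((head[i] & 0xff) != mark[i]) return false;
        }
        return true;
    }

    public static void main(String[] args) {
        // ff fe followed by 'h' in little-endian 16-bit form.
        byte[] sample = {(byte) 0xff, (byte) 0xfe, 0x68, 0x00};
        System.out.println(sniff(sample)); // prints UTF-16LE
    }
}
```

In practice you would read only the first four bytes of the file and fall back to some default encoding when sniff returns null.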

There are also variants of these encodings that have an implied endian marker.

How UTF-8 Works

RFC 3629 officially describes the UTF-8 format.
How UTF-8 Encoding Works

Use                             Range                     Unicode Bit Assignment                 UTF-8 Bit Assignment                            bytes required in UTF-8   bits required internally
ASCII                           0 .. 0x007f               00000000 0xxxxxxx                      0xxxxxxx                                        1                         7
Latin, Greek, Hebrew, Arabic    0x0080 .. 0x07ff          00000yyy yyxxxxxx                      110yyyyy 10xxxxxx                               2                         11
Asian languages, symbols        0x0800 .. 0xffff          zzzzyyyy yyxxxxxx                      1110zzzz 10yyyyyy 10xxxxxx                      3                         16
Ugaritic, musical symbols (*)   0x1_0000 .. 0x1f_ffff     00000000 000aaazz zzzzyyyy yyxxxxxx    11110aaa 10zzzzzz 10yyyyyy 10xxxxxx             4                         21
future use: not yet assigned    0x20_0000 .. 0x3ff_ffff   000000bb aaaaaazz zzzzyyyy yyxxxxxx    111110bb 10aaaaaa 10zzzzzz 10yyyyyy 10xxxxxx    5                         26

(*) CodePoints required to access this range.

For example:
é Unicode 0x00e9 in UTF-8 is 0xc3a9.
ï Unicode 0x00ef in UTF-8 is 0xc3af.
€ Unicode 0x20ac in UTF-8 is 0xe282ac.

Sample Code for UTF-8 Encoding

UTF-8 is normally encoded simply with String.getBytes( "UTF-8" ) or with an OutputStreamWriter, but this is roughly what goes on under the hood, if you ever need to write your own encoder for some non-Java platform, or you are just curious how it works. This code does not handle 32-bit code points embedded in the String. If ever you need to handle 32-bit code points (for example, to handle Ugaritic cuneiform) you can extend this code or pay me to do it.
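As a rough illustration of the bit assignments in the encoding table, here is a minimal sketch of a hand-rolled encoder. It is not the original listing from this page: the class name Utf8EncodeSketch is hypothetical, and like the code described above it treats each 16-bit char independently, so surrogate pairs are not combined into 4-byte sequences.

```java
import java.io.ByteArrayOutputStream;

/** Sketch: encode 16-bit chars to UTF-8 by hand. Does not handle 32-bit code points. */
final class Utf8EncodeSketch {

    static byte[] encode(String s) {
        ByteArrayOutputStream out = new ByteArrayOutputStream(s.length() * 3);
        for (int i = 0; i < s.length(); i++) {
            int c = s.charAt(i);
            if (c < 0x80) {
                // 7 bits: 0xxxxxxx
                out.write(c);
            } else if (c < 0x800) {
                // 11 bits: 110yyyyy 10xxxxxx
                out.write(0xc0 | (c >> 6));
                out.write(0x80 | (c & 0x3f));
            } else {
                // 16 bits: 1110zzzz 10yyyyyy 10xxxxxx
                out.write(0xe0 | (c >> 12));
                out.write(0x80 | ((c >> 6) & 0x3f));
                out.write(0x80 | (c & 0x3f));
            }
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        // e-acute (0x00e9) followed by the euro sign (0x20ac)
        for (byte b : encode("\u00e9\u20ac")) {
            System.out.printf("%02x ", b & 0xff);
        }
        System.out.println(); // prints: c3 a9 e2 82 ac
    }
}
```

For Strings without surrogate pairs the result should match String.getBytes with the UTF-8 charset, which is the call to prefer in real Java code.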

Sample Code for UTF-8 Decoding

UTF-8 is normally decoded simply with new String( byte[], "UTF-8" ) or with an InputStreamReader, but this is roughly what goes on under the hood, if you ever need to write your own decoder for some non-Java platform, or you are just curious how it works. This code does not handle 32-bit code points embedded in the String. If ever you need to handle 32-bit code points, you can extend this code or pay me to do it.
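Going the other way, a decoder only has to look at the lead byte to know how many continuation bytes follow. The following is a minimal sketch under stated assumptions, not the original listing: the class name Utf8DecodeSketch is hypothetical, the input is assumed to be well-formed and complete, and 4-byte sequences (32-bit code points) are rejected rather than decoded.

```java
/** Sketch: decode 1-, 2- and 3-byte UTF-8 sequences by hand. Assumes well-formed input. */
final class Utf8DecodeSketch {

    static String decode(byte[] bytes) {
        StringBuilder sb = new StringBuilder(bytes.length);
        int i = 0;
        while (i < bytes.length) {
            int b0 = bytes[i++] & 0xff;
            if (b0 < 0x80) {
                // 0xxxxxxx: plain ASCII
                sb.append((char) b0);
            } else if ((b0 & 0xe0) == 0xc0) {
                // 110yyyyy 10xxxxxx
                int b1 = bytes[i++] & 0x3f;
                sb.append((char) (((b0 & 0x1f) << 6) | b1));
            } else if ((b0 & 0xf0) == 0xe0) {
                // 1110zzzz 10yyyyyy 10xxxxxx
                int b1 = bytes[i++] & 0x3f;
                int b2 = bytes[i++] & 0x3f;
                sb.append((char) (((b0 & 0x0f) << 12) | (b1 << 6) | b2));
            } else {
                // Stray continuation byte, or a 4-byte lead this sketch does not support.
                throw new IllegalArgumentException(
                        "unsupported lead byte 0x" + Integer.toHexString(b0));
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        byte[] euro = {(byte) 0xe2, (byte) 0x82, (byte) 0xac};
        System.out.println(Integer.toHexString(decode(euro).charAt(0))); // prints 20ac
    }
}
```

A production decoder would also check that each continuation byte really has the form 10xxxxxx and would reject overlong encodings; this sketch skips those checks to keep the bit manipulation visible.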

UTF-8 Fine Points

Every purely ASCII String is also a valid UTF-8 String. UTF-8 has an optional BOM marker at the front of the file: ef bb bf. Endianness does not apply to 8-bit encodings, though they can have a BOM.

The UTF8Encoder/UTF8Decoder example classes above do not handle 32-bit characters (aka code points). The IETF (Internet Engineering Task Force)’s RFC 2279 (obsolete, but has easy-to-understand bit diagrams) and RFC 3629 explain the UTF-8 format.

You can edit or create UTF-8 files with Windows Notepad.

UTF-7

UTF-7 is encoded like this, I kid you not:

UTF-16

How UTF-16 Encoding Works

Unicode                       UTF-16                                 bytes required to represent the character   Notes
00000000 yyyyyyyy xxxxxxxx    yyyyyyyy xxxxxxxx                      2                                           see (1)
000zzzzz yyyyyyyy xxxxxxxx    110110zz zzyyyyyy 110111yy xxxxxxxx    4                                           see (2)

(1) For numbers in the range 0x0000 to 0xffff, just encode them as they are in 16 bits.

(2) For numbers above 16 bits, in the range 0x10000 to 0x10ffff, you have 21 bits to encode. This is reduced to 20 bits by subtracting 0x10000. The high order bits are encoded as a 16-bit base 0xd800 + high order 10 bits, and the low order bits are encoded as a 16-bit base 0xdc00 + low order 10 bits. The resulting pair of 16-bit characters are in the so-called high-half zone or high surrogate area (the 2^10 = 1024-wide band 0xd800-0xdbff) and low-half zone or low surrogate area (the 2^10 = 1024-wide band 0xdc00-0xdfff). Characters with values greater than 0x10ffff cannot be encoded in UTF-16. Values in the bands 0xd800-0xdbff and 0xdc00-0xdfff are specifically reserved for use with UTF-16 for encoding high characters, and don’t have any characters assigned to them.
16-bit Unicode encoding comes in big-endian and little-endian forms, with the endianness marked or implied. UTF-16 is big-endian and must be marked as such with fe ff. UTF-16BE is big-endian unmarked. UTF-16LE is little-endian unmarked. UTF-16 is officially defined in Annex Q of ISO/IEC 10646-1. (Copies of ISO (International Organization for Standardization) standards are quite expensive.) It is also described in the Unicode Consortium’s Unicode Standard, as well as in the IETF’s RFC 2781.

You can edit or create UTF-16 files with Windows Notepad.

Here is how you would encode 32-bit Unicode to UTF-16
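The subtract-and-split arithmetic for surrogate pairs can be sketched in a few lines. This is only an illustration, not the original listing: the class name SurrogateSketch and the method toUtf16 are hypothetical names. In real Java code the library method Character.toChars does the same job.

```java
/** Sketch: turn a code point in 0..0x10ffff into one or two UTF-16 chars. */
final class SurrogateSketch {

    static char[] toUtf16(int codePoint) {
        if (codePoint < 0 || codePoint > 0x10ffff) {
            throw new IllegalArgumentException("not a Unicode code point: " + codePoint);
        }
        if (codePoint < 0x10000) {
            // Fits in 16 bits: encode as is.
            return new char[] {(char) codePoint};
        }
        int v = codePoint - 0x10000;                 // now at most 20 bits
        char high = (char) (0xd800 + (v >> 10));     // top 10 bits
        char low = (char) (0xdc00 + (v & 0x3ff));    // bottom 10 bits
        return new char[] {high, low};
    }

    public static void main(String[] args) {
        for (char c : toUtf16(0x1d504)) {
            System.out.printf("%04x ", (int) c);
        }
        System.out.println(); // prints: d835 dd04
    }
}
```

So the C-style code point \U0001d504 would be written in a Java String literal as the pair \ud835\udd04.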

UTF-32

32-bit Unicode encoding comes in big-endian and little-endian forms, with the endianness marked or implied. UTF-32 is big-endian and must be marked as such with 00 00 fe ff. UTF-32BE is big-endian unmarked. UTF-32LE is little-endian unmarked.

UTF-32 is not of much practical use since any file using it is mostly zeroes. It is perhaps 3 times as bulky as UTF-8 with nothing special to recommend it. Perhaps it will catch on with the conspicuous consumers who designed XML.

Java does not have 32-bit String literals, like C-style code points, e.g. \U0001d504. Note the capital U vs the usual \ud504. I wrote the SurrogatePair applet to convert C-style code points to arcane surrogate pairs to let you use 32-bit Unicode glyphs in your programs.

DataOutputStream.writeUTF

Sun uses its own variant of UTF-8 for DataOutputStream.writeUTF. The output of writeUTF consists of a 16-bit unsigned big-endian length, measured in bytes (the length of the encoding, not the length in characters), followed by a modified UTF-8 byte encoding of the String. There is no terminating null. Somebody at Sun goofed. This scheme allows a maximum encoded String length of only 65,535 bytes, with no escape mechanism to handle longer Strings. The maximum String length can be as little as 1/3 of that, 21,845 characters! [Perhaps it is not too late for Sun to add a to-be-continued mechanism.]

Further, 0x00 is encoded as 0xc0 0x80 instead of 0x00, to keep C from getting confused reading such a file and thinking the 00 meant end-of-string.

But the biggest difference in Sun’s writeUTF variant is in the handling of 32-bit codepoints. Most Java programs don’t use 32-bit codepoints, but if you do, beware! UTF-8 codes them as 4-byte sequences. Sun is coding them as 6-byte sequences! For example, consider the encoding of 0x10302. Standard UTF-8 gives:

f0 90 8c 82
whereas under Sun’s writeUTF scheme it encodes as:
ed a0 80 ed bc 82
What is going on? Internally Sun encodes 32-bit codepoints as a surrogate pair of 16-bit chars, effectively using UTF-16 encoding internally. Instead of undoing the UTF-16 encoding before applying the UTF-8 transform, Sun applies it directly on the surrogate pairs. Surrogate pairs are in the bands 0xd800-0xdbff and 0xdc00-0xdfff. Treated as ordinary characters, these take 3 bytes each to encode each character separately in UTF-8.
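You can see the difference for yourself with a few lines of Java. This is a small sketch, not part of the original page: the class name WriteUtfDemo and its helper methods are hypothetical names. It encodes the code point 0x10302 once with the standard UTF-8 charset and once with DataOutputStream.writeUTF, and dumps both in hex.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

/** Sketch: compare standard UTF-8 with the writeUTF variant for a 32-bit code point. */
final class WriteUtfDemo {

    /** Formats bytes as space-separated two-digit hex. */
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("%02x", b & 0xff));
        }
        return sb.toString();
    }

    /** Returns everything writeUTF emits: the 2-byte length, then the modified UTF-8. */
    static byte[] viaWriteUTF(String s) {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buf)) {
            out.writeUTF(s);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buf.toByteArray();
    }

    public static void main(String[] args) {
        String s = new String(Character.toChars(0x10302)); // one code point, two chars
        System.out.println(hex(s.getBytes(StandardCharsets.UTF_8))); // f0 90 8c 82
        System.out.println(hex(viaWriteUTF(s)));                     // 00 06 ed a0 80 ed bc 82
    }
}
```

The leading 00 06 in the second line is the byte count that writeUTF prepends; the six bytes after it are the two surrogates, each encoded separately as a 3-byte sequence.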

Why would Sun use such an inefficient encoding? I believe it is to be backward compatible with datastreams written prior to the introduction of codepoints, where the surrogate pairs were treated as just ordinary data characters. Sun has to be able to read files written by earlier versions of Java. At some point, Sun will have to deprecate writeUTF and invent something else that properly encodes 32-bit codepoints, and has a scheme to handle arbitrarily long Strings, using a variable length count. Alternatively, they could use the high-order bit of the count field as an indicator of the new format.

The writeUTF variant shows up in Serialized Objects, RMI (Remote Method Invocation) streams, class file formats…

Exploring Java’s UTF Support

Java supports UTF in four main ways.
  1. DataOutputStream.writeUTF / DataInputStream.readUTF: modified UTF-8.
  2. OutputStreamWriter / InputStreamReader: any encoding including UTF-8, UTF-16, UTF-16BE, UTF-16LE.
  3. String.getBytes / new String: any encoding including UTF-8, UTF-16, UTF-16BE, UTF-16LE.
  4. CharBuffer & CharsetEncoder.encode / ByteBuffer & CharsetDecoder.decode: any encoding including UTF-8, UTF-16, UTF-16BE, UTF-16LE.
I wrote this program to discover if Java conformed with the UTF standards. It demonstrates these four ways of writing and reading UTF. It turns out Java conforms, with the exception of writeUTF.
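For orientation, here is a compact sketch of the four approaches in the list, each doing a write-then-read round trip in memory with UTF-8. It is not the program referred to above: the class name FourWays and its method names are hypothetical, and it only checks that each route gives back the original String rather than testing conformance.

```java
import java.io.*;
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.StandardCharsets;

/** Sketch: round-trip a String through the four UTF routes Java offers. */
final class FourWays {

    // 1. DataOutputStream.writeUTF / DataInputStream.readUTF (modified UTF-8)
    static String viaDataStreams(String s) throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buf)) {
            out.writeUTF(s);
        }
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(buf.toByteArray()))) {
            return in.readUTF();
        }
    }

    // 2. OutputStreamWriter / InputStreamReader
    static String viaWriterReader(String s) throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (Writer w = new OutputStreamWriter(buf, StandardCharsets.UTF_8)) {
            w.write(s);
        }
        StringBuilder sb = new StringBuilder();
        try (Reader r = new InputStreamReader(
                new ByteArrayInputStream(buf.toByteArray()), StandardCharsets.UTF_8)) {
            int c;
            while ((c = r.read()) != -1) sb.append((char) c);
        }
        return sb.toString();
    }

    // 3. String.getBytes / new String
    static String viaString(String s) {
        return new String(s.getBytes(StandardCharsets.UTF_8), StandardCharsets.UTF_8);
    }

    // 4. CharsetEncoder.encode / CharsetDecoder.decode
    static String viaCharset(String s) throws IOException {
        ByteBuffer bytes = StandardCharsets.UTF_8.newEncoder().encode(CharBuffer.wrap(s));
        return StandardCharsets.UTF_8.newDecoder().decode(bytes).toString();
    }

    /** True if all four routes reproduce the input exactly. */
    static boolean allRoundTrip(String s) {
        try {
            return s.equals(viaDataStreams(s)) && s.equals(viaWriterReader(s))
                    && s.equals(viaString(s)) && s.equals(viaCharset(s));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        String sample = "caf\u00e9 \u20ac " + new String(Character.toChars(0x1d504));
        System.out.println(allRoundTrip(sample));
    }
}
```

Swapping StandardCharsets.UTF_8 for UTF_16, UTF_16BE or UTF_16LE in routes 2 to 4 exercises the other encodings; route 1 has no such choice.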

Here is the output of the program if you are curious, but not so curious that you feel compelled to run the program yourself:


32-Bit Unicode

Unicode-4 is currently defined for code points 0 .. 0x10ffff, a shade over 20 bits, which is room for 1,114,112 code points. Last revised/verified: 2006-02-26

I bought an early book on Unicode and marvelled at the extravagant number of symbols for every imaginable purpose. I thought surely no font would ever support all this. I thought, they won’t be going beyond 16 bits until rendering technology catches up to let us use the 64,000 symbols they have already provided, which was already over 100 times bigger than fonts of the time were supporting. But before long, the slots 0..0xffff were used up and Unicode had to be expanded to 32 bits.

Personally, I don’t see the point of any great rush to support 32-bit Unicode. The new symbols will be rarely used. Consider what’s there. The only ones I would conceivably use are musical symbols and Mathematical Alphanumeric symbols (especially the German black letters so favoured in real analysis). The rest I can’t imagine ever using unless I took up a career in anthropology, i.e. Linear B syllabary (I have not a clue what it is), Linear B ideograms (looks like symbols for categorising cave petroglyphs), Aegean Numbers (counting with stones and sticks), Old Italic (looks like Phoenician), Gothic (medieval script), Ugaritic (cuneiform), Deseret (Mormon), Shavian (George Bernard Shaw’s phonetic script), Osmanya (Somalian), Cypriot syllabary, Byzantine music symbols (looks like Arabic), Musical Symbols, Tai Xuan Jing Symbols (truncated I-Ching), CJK (Chinese-Japanese-Korean) extensions and tags (letters with blank price tags).

I think 32-bit Unicode becomes a matter of the tail wagging the dog, spurred by the technical challenge rather than a practical necessity. In the process, ordinary 16-bit character handling is turned into a bleeding mess, for almost no benefit.

I think programmers should for the most part simply ignore 32-bit and continue using the String class as we always have, presuming every character is 16 bits.

Various ingenious and convoluted schemes have been invented to allow gradual and partial migration to 32-bit Unicode.

To allow 32-bit code points in 16-bit Java internal Strings, Sun encoded them using UTF-16, so that chars in the range 0..0xffff (exclusive of the reserved low and high surrogate bands) are encoded in 16 bits, and the characters above 0xffff are encoded in 32 bits as a surrogate pair.

To allow 32-bit code points in UTF-8 streams, UTF-8 was extended to handle them. If you want the details, see the IETF’s RFC 2279 (obsolete, but has easy-to-understand bit diagrams) and RFC 3629, which explain the extended UTF-8 format.

You have to laugh at what a Rube Goldberg machine the process of 32-bit encoding and decoding becomes. In order for Java’s classes to encode an internal String in UTF-8, they must watch out for embedded 32-bit characters encoded with UTF-16, decode them, and then encode them again with 32-bit extended UTF-8. To decode, Java must decode the extended UTF-8 to 32 bits internally, then re-encode to 16 bits.

Perhaps the implementation of String will change at some point in future so that Strings internally are all pure 8-bit, 16-bit or 32-bit characters, rather than containing a variable number of bytes per character as they do now.

Java does not have a 32-bit String literal, like C-style code points, e.g. \U0001d504. I wrote the SurrogatePair applet to convert C-style code points to arcane surrogate pairs to let you use 32-bit Unicode glyphs in your programs.

Debugging

To debug problems with UTF-8 or UTF-16 files you need two types of tools, and you need to know what the hex in the file should look like. See above.

You need a tool to see what the codes in the file actually look like. Try:
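If you have no hex viewer handy, a few lines of Java will do in a pinch. The following is a bare-bones sketch, not one of the tools the page originally listed: the class name HexDumpSketch is hypothetical. It prints 16 bytes per line and takes the file name as its first command-line argument, falling back to a built-in sample when none is given.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

/** Sketch: dump the raw bytes of a file in hex, 16 bytes per line. */
final class HexDumpSketch {

    static String dump(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < bytes.length; i++) {
            if (i > 0) sb.append(i % 16 == 0 ? '\n' : ' ');
            sb.append(String.format("%02x", bytes[i] & 0xff));
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        byte[] bytes = args.length > 0
                ? Files.readAllBytes(Paths.get(args[0]))          // file named on the command line
                : "caf\u00e9".getBytes(StandardCharsets.UTF_8);   // sample: 63 61 66 c3 a9
        System.out.println(dump(bytes));
    }
}
```

Compare the first few bytes of the dump against the BOM table, and any bytes of 0x80 or above against the UTF-8 bit patterns.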

You also need a tool to validate the encoding:
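Java itself can serve as a rough validator. By default the String constructor quietly replaces bad bytes, but a CharsetDecoder can be told to report them instead. This is a minimal sketch, not one of the tools the page originally listed: the class name Utf8Validator is hypothetical, and it only answers yes or no without saying where the problem is.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

/** Sketch: check whether a byte array is well-formed standard UTF-8. */
final class Utf8Validator {

    static boolean isValidUtf8(byte[] bytes) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)      // throw instead of substituting
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        byte[] good = "caf\u00e9".getBytes(StandardCharsets.UTF_8);
        byte[] latin1 = "caf\u00e9".getBytes(StandardCharsets.ISO_8859_1); // lone e9 byte
        System.out.println(isValidUtf8(good));   // true
        System.out.println(isValidUtf8(latin1)); // false
    }
}
```

The same approach with StandardCharsets.UTF_16 will catch unpaired surrogates in UTF-16 data.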

available on the web at:

http://mindprod.com/jgloss/utf.html