BOM : Java Glossary


BOM
BOMs (Byte Order Marks) are special characters at the beginning of a Unicode file that indicate whether it is big- or little-endian; in other words, whether the high- or low-order byte comes first. These codes also tell whether the encoding is 8, 16 or 32-bit. You can recognise Unicode files by their starting byte order marks and, for text that is mostly ASCII, by the way UTF-16 files are about half zeroes and UTF-32 files are about three-quarters zeroes.
UTF (Unicode Transformation Format) BOM (Byte Order Mark) Unicode-Encoding Endian Indicators

0xfeff BOM as it appears encoded    Description
ef bb bf                            UTF-8. Endianness, strictly speaking, does not apply, though UTF-8 uses a big-endian, most-significant-bytes-first representation.
fe ff                               UTF-16 for 16-bit internal UCS-2, big-endian, Java network order.
ff fe                               UTF-16 for 16-bit internal UCS-2, little-endian, Intel/Microsoft order. Note you must examine subsequent bytes to tell this apart from the little-endian UTF-32 BOM, since both start ff fe.
00 00 fe ff                         UTF-32 for 32-bit internal UCS-4, big-endian, Java network order.
ff fe 00 00                         UTF-32 for 32-bit internal UCS-4, little-endian, Intel/Microsoft order.
The actual Unicode character encoded in all cases is 0xfeff.
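
The table above can be turned into a small sniffing routine. What follows is only a minimal sketch, not code from this page: the class name BomSniffer and the method detect are hypothetical, and the returned strings are ordinary charset names chosen for illustration. It checks the 4-byte UTF-32 marks before the 2-byte UTF-16 marks because ff fe 00 00 also begins with ff fe.

```java
/** Hypothetical helper: guesses an encoding from a leading BOM. */
final class BomSniffer {

    /**
     * Returns the charset name implied by the BOM at the start of head,
     * or null if none of the marks in the table is present.
     */
    static String detect(byte[] head) {
        // Test the 4-byte UTF-32 marks first: ff fe 00 00 also starts
        // with the UTF-16 little-endian mark ff fe.
        if (startsWith(head, 0x00, 0x00, 0xFE, 0xFF)) return "UTF-32BE";
        if (startsWith(head, 0xFF, 0xFE, 0x00, 0x00)) return "UTF-32LE";
        if (startsWith(head, 0xEF, 0xBB, 0xBF)) return "UTF-8";
        if (startsWith(head, 0xFE, 0xFF)) return "UTF-16BE";
        if (startsWith(head, 0xFF, 0xFE)) return "UTF-16LE";
        return null;
    }

    /** True if data begins with the given unsigned byte values. */
    private static boolean startsWith(byte[] data, int... prefix) {
        if (data.length < prefix.length) return false;
        for (int i = 0; i < prefix.length; i++) {
            if ((data[i] & 0xFF) != prefix[i]) return false;
        }
        return true;
    }

    public static void main(String[] args) {
        // "Hi" in little-endian UTF-16, preceded by its BOM.
        byte[] sample = {(byte) 0xFF, (byte) 0xFE, 'H', 0, 'i', 0};
        System.out.println(detect(sample));
    }
}
```

One known weakness of this sketch: a little-endian UTF-16 file whose first real character is U+0000 would be reported as UTF-32LE. That is rare in text files, but it is the reason the guess should be treated as a hint rather than proof.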

There are also variants of these encodings, such as UTF-16BE and UTF-16LE, where the byte order is implied by the encoding name rather than signalled by a marker.

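Unfortunately, applications, even javac.exe, often choke on these byte order marks. Java Readers generally don’t filter them out automatically: the UTF-8, UTF-16BE and UTF-16LE decoders pass the mark through as the character 0xfeff (the generic UTF-16 decoder, which consumes the mark, is the exception). There is not much you can do but remove them yourself.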

Avoiding BOMs

How can you get rid of these pesky BOMs? Here are some ideas, with the ones I consider best or simplest near the top.
  1. Use UTF-8. It does not need a BOM, and Java does not write one when encoding UTF-8. You can use native2ascii.exe (shipped with JDKs before Java 9) to convert your given encoding to UTF-8 in two passes: first to ASCII with \uxxxx escapes, then back with -reverse -encoding UTF-8.
  2. Write a utility that reads the first character of a file. If it is a BOM, copy the rest of the file to a temp file, then delete the original and rename the temp to the original, effectively permanently chopping off the leading BOM. Unfortunately, this discards the useful information about the encoding of the file. It will be more efficient if you don’t use Readers, but use byte-based InputStreams instead.
  3. Write an encoding UTF-16HIDEBOM that wraps itself around UTF-16 and install it as one of the official encodings.
  4. Write a FilterInputStream that discards BOMs. And use it in your apps.
  5. Lobby Oracle to provide a solution.
  6. Look for the character in your application code and ignore it. This technique is very clumsy and will seriously interfere with your application logic.
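
Ideas 2 and 4 both come down to inspecting the first few bytes and discarding them if they form a BOM. The following is a minimal sketch of that step, assuming only the UTF-8 and UTF-16 marks need handling; the class BomSkipper and its method skipBom are hypothetical names, not part of the JDK. It uses PushbackInputStream (itself a FilterInputStream) so that bytes that turn out not to be a BOM can be put back.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.PushbackInputStream;

/** Hypothetical helper: drops a leading UTF-8 or UTF-16 BOM from a byte stream. */
final class BomSkipper {

    /** Wraps in so that a leading UTF-8 or UTF-16 BOM, if present, is not returned. */
    static InputStream skipBom(InputStream in) throws IOException {
        PushbackInputStream pin = new PushbackInputStream(in, 3);
        byte[] head = new byte[3];

        // Read up to 3 bytes; a single read may return fewer than requested.
        int n = 0;
        while (n < head.length) {
            int r = pin.read(head, n, head.length - n);
            if (r < 0) break;
            n += r;
        }

        int b0 = n > 0 ? head[0] & 0xFF : -1;
        int b1 = n > 1 ? head[1] & 0xFF : -1;
        int b2 = n > 2 ? head[2] & 0xFF : -1;

        int skip = 0;
        if (b0 == 0xEF && b1 == 0xBB && b2 == 0xBF) {
            skip = 3;                       // UTF-8 mark
        } else if ((b0 == 0xFE && b1 == 0xFF) || (b0 == 0xFF && b1 == 0xFE)) {
            skip = 2;                       // UTF-16 mark, either byte order
        }

        // Put back everything that was read but is not part of the BOM.
        if (n > skip) {
            pin.unread(head, skip, n - skip);
        }
        return pin;
    }

    public static void main(String[] args) throws IOException {
        byte[] data = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, 'A', 'B'};
        InputStream in = skipBom(new ByteArrayInputStream(data));
        int b;
        while ((b = in.read()) != -1) {
            System.out.print((char) b);
        }
        System.out.println();
    }
}
```

As idea 2 warns, stripping the mark this way throws away what it said about the encoding, so a caller would normally note which mark was found before discarding it. UTF-32 marks are deliberately not handled here; a little-endian UTF-32 file would be mistaken for UTF-16.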

TestBOM

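This program tests how Java handles BOMs. It discovers that Java does not insert BOMs and does not remove them on its own. The notable exception is the generic UTF-16 charset, which writes a big-endian mark when encoding and consumes a mark when decoding. Otherwise, you have to bypass, insert and delete them explicitly.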
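
The listing below is not the TestBOM program itself, only a minimal sketch of the same kind of experiment. The class name BomProbe and its hex helper are hypothetical. It encodes the single letter A with several standard charsets and prints the bytes, so you can see for yourself which charsets put a mark in front, and it decodes a UTF-8 byte sequence that starts with a BOM to see whether the mark survives as the character 0xfeff. On the JDKs I am aware of, only the generic UTF-16 charset writes a mark when encoding.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

/** Hypothetical probe: shows which charsets write or keep a BOM. */
final class BomProbe {

    /** Formats bytes as space-separated two-digit hex, e.g. "fe ff 00 41". */
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("%02x", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        Charset[] charsets = {
            StandardCharsets.UTF_8,
            StandardCharsets.UTF_16,
            StandardCharsets.UTF_16BE,
            StandardCharsets.UTF_16LE,
        };

        // Encoding: is anything written in front of the letter A?
        for (Charset cs : charsets) {
            System.out.println(cs.name() + " encodes A as: " + hex("A".getBytes(cs)));
        }

        // Decoding: does a leading UTF-8 BOM come through as the character 0xfeff?
        byte[] utf8WithBom = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, 'A'};
        String decoded = new String(utf8WithBom, StandardCharsets.UTF_8);
        System.out.println("UTF-8 decoder keeps the BOM: " + (decoded.charAt(0) == '\uFEFF'));
    }
}
```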

Encodings Matter

You would think that if there is a BOM at the start of a file, Java could tell on its own whether the file is UTF-8, UTF-16BE or UTF-16LE encoded. However, Java is not that clever. You must get the encoding right in the InputStreamReader, or you will just read gibberish, and you will not get an error message.
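
To make the "gibberish without an error message" point concrete, here is a minimal sketch, assuming a six-byte sample that holds the text Hi in little-endian UTF-16 with its BOM. The class WrongCharsetDemo and its helper methods are hypothetical names. The same bytes are read through an InputStreamReader twice, once with the right encoding and once with UTF-8. The second read completes normally; the undecodable bytes are quietly replaced with the replacement character 0xfffd.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

/** Hypothetical demo: a wrong encoding produces gibberish, not an exception. */
final class WrongCharsetDemo {

    /** Reads all of data through an InputStreamReader using the given charset. */
    static String decode(byte[] data, Charset cs) {
        StringBuilder sb = new StringBuilder();
        try (Reader r = new InputStreamReader(new ByteArrayInputStream(data), cs)) {
            int c;
            while ((c = r.read()) != -1) {
                sb.append((char) c);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sb.toString();
    }

    /** Shows each char as a hex code so invisible characters can be seen. */
    static String codes(String s) {
        StringBuilder sb = new StringBuilder();
        for (char c : s.toCharArray()) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("%04x", (int) c));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // BOM ff fe followed by "Hi" in little-endian UTF-16.
        byte[] data = {(byte) 0xFF, (byte) 0xFE, 'H', 0, 'i', 0};

        System.out.println("as UTF-16LE: " + codes(decode(data, StandardCharsets.UTF_16LE)));
        System.out.println("as UTF-8:    " + codes(decode(data, StandardCharsets.UTF_8)));
    }
}
```

Note that even the correct UTF-16LE read still delivers the BOM as the first character, so the caller has to skip it explicitly.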

Here is how I discovered this:


This page is posted on the web at: http://mindprod.com/jgloss/bom.html

Canadian Mind Products
Contact Roedy. Please feel free to link to this page without explicit permission.
