BOMs (Byte Order Marks)
are special characters at the beginning of a Unicode file that indicate whether it is
big or little endian, in other words whether the high or low order byte comes first.
These codes also tell whether the encoding is 8, 16 or 32-bit. You can recognise Unicode files by their starting byte order
marks and, for mostly ASCII text, by the way Unicode-16 files are half zeroes and Unicode-32 files are
three-quarters zeroes.
UTF BOM (Byte Order Mark) Unicode-encoding Endian Indicators
The actual Unicode character encoded in all cases is 0xfeff.
| BOM as it appears encoded | Encoding |
| ef bb bf | UTF-8. Endianness, strictly speaking, does not apply, though it uses a big-endian, most-significant-byte-first representation. |
| fe ff | UTF-16 for 16-bit internal UCS-2, big endian, Java network order. |
| ff fe | UTF-16 for 16-bit internal UCS-2, little endian, Intel/Microsoft order. Note you must examine subsequent bytes to tell this apart from a UTF-32 BOM, since they both start ff fe. |
| 00 00 fe ff | UTF-32 for 32-bit internal UCS-4, big endian, Java network order. |
| ff fe 00 00 | UTF-32 for 32-bit internal UCS-4, little endian, Intel/Microsoft order. |
There are also variants of these encodings that have an implied endian marker.
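The table above can be turned into a small lookup. What follows is only a sketch: the class name BomSniffer and the method detect are hypothetical, not part of any Java library, and the returned strings are just labels. It assumes you have already read the first few bytes of the file into an array.

```java
/** Hypothetical helper: guesses an encoding label from a leading BOM. */
final class BomSniffer {
    private BomSniffer() {}

    /** Returns a label for the BOM at the start of head, or null if there is none. */
    static String detect(byte[] head) {
        // Test the 4-byte UTF-32 marks first, because ff fe 00 00 also
        // begins with the UTF-16 little-endian mark ff fe.
        // Caveat: a UTF-16LE file whose first character is U+0000 looks the same.
        if (startsWith(head, 0x00, 0x00, 0xFE, 0xFF)) return "UTF-32BE";
        if (startsWith(head, 0xFF, 0xFE, 0x00, 0x00)) return "UTF-32LE";
        if (startsWith(head, 0xEF, 0xBB, 0xBF)) return "UTF-8";
        if (startsWith(head, 0xFE, 0xFF)) return "UTF-16BE";
        if (startsWith(head, 0xFF, 0xFE)) return "UTF-16LE";
        return null;
    }

    /** True if data begins with the given unsigned byte values. */
    private static boolean startsWith(byte[] data, int... prefix) {
        if (data.length < prefix.length) return false;
        for (int i = 0; i < prefix.length; i++) {
            if ((data[i] & 0xFF) != prefix[i]) return false;
        }
        return true;
    }

    public static void main(String[] args) {
        byte[] sample = {(byte) 0xFF, (byte) 0xFE, 'h', 0};
        System.out.println(detect(sample)); // UTF-16LE
    }
}
```

The order of the tests matters: checking ff fe before ff fe 00 00 would report every little-endian UTF-32 file as UTF-16.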
Unfortunately, applications, even Javac.exe, often choke
on these byte order marks. Java Readers don’t
automatically filter them out. There is not much you can do but manually remove them.
Avoiding BOMs
How can you get rid of these pesky BOMs? Here are some ideas,
with the ones I consider best/simplest near the top.
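- Use UTF-8. It does not need them, though some programs write one (ef bb bf) anyway. You can use
native2ascii.exe to convert your given encoding to UTF-8 in two passes: first to \uxxxx escapes,
then back with -reverse -encoding UTF-8.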
- Write a utility that reads the first character of a file. If it is a
BOM, copy the rest of the file to a temp file, then delete the original and rename the temp to the
original, effectively permanently chopping off the leading BOM. Unfortunately,
this discards the useful information about the encoding of the file. It will be
more efficient if you don’t use Readers, but use
byte-based InputStreams instead.
- Write an encoding UTF-16HIDEBOM that wraps itself
around UTF-16 and install it as one of the official encodings.
- Write a FilterInputStream that discards
BOMs. And use it in
- Lobby Oracle to provide a solution.
- Look for the character in your application code and ignore it. This technique
is very clumsy and will seriously interfere with your application logic.
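Two of the ideas in the list, the BOM-chopping utility and the BOM-discarding stream, can share one piece of code. The sketch below is one possible approach, not a standard API: the class BomSkipper and its methods skipBom and stripFile are hypothetical names. It relies on java.io.PushbackInputStream, which is itself a FilterInputStream, to peek at up to four bytes and put back whatever is not a BOM.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.PushbackInputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

/** Hypothetical helper that discards a leading BOM, working on bytes rather than Readers. */
final class BomSkipper {
    private BomSkipper() {}

    /** Wraps in so that a leading UTF-8, UTF-16 or UTF-32 BOM is skipped. */
    static InputStream skipBom(InputStream in) throws IOException {
        PushbackInputStream pin = new PushbackInputStream(in, 4);
        byte[] head = new byte[4];
        int n = 0;
        while (n < head.length) {            // read up to 4 bytes, fewer at end of stream
            int r = pin.read(head, n, head.length - n);
            if (r < 0) break;
            n += r;
        }
        int bom = bomLength(head, n);
        if (n > bom) pin.unread(head, bom, n - bom); // put back everything after the BOM
        return pin;
    }

    /** Length in bytes of the BOM at the start of h (n valid bytes), or 0 if none. */
    private static int bomLength(byte[] h, int n) {
        int b0 = n > 0 ? h[0] & 0xFF : -1, b1 = n > 1 ? h[1] & 0xFF : -1;
        int b2 = n > 2 ? h[2] & 0xFF : -1, b3 = n > 3 ? h[3] & 0xFF : -1;
        // UTF-32 first: ff fe 00 00 also starts with the UTF-16LE mark.
        if (b0 == 0x00 && b1 == 0x00 && b2 == 0xFE && b3 == 0xFF) return 4;
        if (b0 == 0xFF && b1 == 0xFE && b2 == 0x00 && b3 == 0x00) return 4;
        if (b0 == 0xEF && b1 == 0xBB && b2 == 0xBF) return 3;
        if ((b0 == 0xFE && b1 == 0xFF) || (b0 == 0xFF && b1 == 0xFE)) return 2;
        return 0;
    }

    /** Copies file to a temp file without its BOM, then renames the temp over the original. */
    static void stripFile(Path file) throws IOException {
        Path temp = Files.createTempFile(file.toAbsolutePath().getParent(), "nobom", ".tmp");
        try (InputStream raw = Files.newInputStream(file)) {
            Files.copy(skipBom(raw), temp, StandardCopyOption.REPLACE_EXISTING);
        }
        Files.move(temp, file, StandardCopyOption.REPLACE_EXISTING);
    }

    public static void main(String[] args) throws IOException {
        byte[] withBom = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, 'h', 'i'};
        InputStream in = skipBom(new ByteArrayInputStream(withBom));
        System.out.println(new String(in.readAllBytes(), StandardCharsets.UTF_8)); // hi
    }
}
```

As the list warns, stripFile throws away the only hint about the file's encoding, so record the encoding somewhere else before using it.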
This program tests how Java handles BOMs.
It discovers that Java never inserts BOMs and it never
removes them on its own. You have to bypass, insert and delete them explicitly.
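A much smaller check than the program described here, and not a substitute for it, is sketched below for UTF-8 only. The class name BomRoundTrip is hypothetical. It encodes a string and looks for a BOM in the bytes, then decodes bytes that start with ef bb bf and looks for a leading 0xfeff character. Other charsets are not covered; in particular, Java's generic UTF-16 charset is documented to treat BOMs differently.

```java
import java.nio.charset.StandardCharsets;

/** Hypothetical check of how Java's UTF-8 charset treats a BOM. */
final class BomRoundTrip {
    private BomRoundTrip() {}

    /** Encodes text with Java's UTF-8 charset. */
    static byte[] encode(String text) {
        return text.getBytes(StandardCharsets.UTF_8);
    }

    /** Decodes bytes with Java's UTF-8 charset. */
    static String decode(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Writing: one ASCII letter becomes one byte, so no BOM was inserted.
        System.out.println(encode("A").length); // 1

        // Reading: the BOM bytes survive decoding as the character 0xfeff.
        byte[] withBom = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, 'A'};
        String read = decode(withBom);
        System.out.println(read.length());             // 2
        System.out.println(read.charAt(0) == '\uFEFF'); // true
    }
}
```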
You would think that if there is a BOM
at the start of a file, Java could tell all on its own how the file was encoded.
However, Java is not that clever.
You must get the encoding right in the InputStreamReader,
or you will just read gibberish and you will not get an error message.
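To make the "gibberish without an error" point concrete, here is a minimal sketch. The class WrongEncodingDemo and its method readAll are hypothetical names. It reads the same little-endian UTF-16 bytes twice through an InputStreamReader, once with the wrong charset and once with the right one. Neither read throws.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

/** Hypothetical demo: a wrong charset in InputStreamReader yields garbage, not an exception. */
final class WrongEncodingDemo {
    private WrongEncodingDemo() {}

    /** Reads every character from bytes using the given charset. */
    static String readAll(byte[] bytes, Charset charset) {
        try (Reader reader = new InputStreamReader(new ByteArrayInputStream(bytes), charset)) {
            StringBuilder sb = new StringBuilder();
            int c;
            while ((c = reader.read()) != -1) {
                sb.append((char) c);
            }
            return sb.toString();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // "Hi" in little-endian UTF-16, preceded by the BOM ff fe.
        byte[] utf16le = {(byte) 0xFF, (byte) 0xFE, 'H', 0, 'i', 0};

        String wrong = readAll(utf16le, StandardCharsets.UTF_8);
        String right = readAll(utf16le, StandardCharsets.UTF_16LE);

        System.out.println(wrong.equals("Hi"));   // false: silently garbled
        System.out.println(right.endsWith("Hi")); // true
    }
}
```

If you need a failure instead of silent damage, one option is to build the reader from a CharsetDecoder configured with CodingErrorAction.REPORT, which makes malformed input raise an exception.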
Here is how I discovered this: