Converting binary data into printable gibberish so that data transport
systems will not corrupt it. You see it used often in certificates, email, and
HTTP communications.
There are many data transport systems that either ignore, act on or otherwise
meddle with control characters embedded in the data. They may trim trailing
blanks, change line end characters, convert tabs to spaces etc. etc. Any of
these actions would totally corrupt binary data. To pass binary data through
such a meddlesome channel, e.g. the email system, it must first be armoured,
i.e. converted to use only safe printable characters that will not be meddled
with, e.g. a-z A-Z 0-9 and the vanilla punctuation. I sometimes refer to
characters that need special processing to pass through a channel as awkward.
MIME email and email attachments have a configurable encoding scheme, controlled
via the Content-Transfer-Encoding MIME header, often base64 or Quoted-Printable.
Unfortunately, armouring typically bulks the message up by 33% to 200%,
depending on the technique you use. The other end has to recognise the armouring
technique and do the reverse to get the binary back.
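As a rough illustration of armouring at one end and reversing it at the other, here is a minimal Java sketch using the standard java.util.Base64 MIME encoder and decoder. The class name MimeRoundTrip and its method names are made up for this example; it shows only the base64 case, not how a mail program selects an encoding.

```java
import java.util.Arrays;
import java.util.Base64;

final class MimeRoundTrip {
    /** Armour raw bytes as MIME-style base64: lines of at most 76 characters, separated by CRLF. */
    static String armour(byte[] raw) {
        return Base64.getMimeEncoder().encodeToString(raw);
    }

    /** Reverse the armouring. The MIME decoder skips line breaks and other non-alphabet characters. */
    static byte[] dearmour(String armoured) {
        return Base64.getMimeDecoder().decode(armoured);
    }

    public static void main(String[] args) {
        // Every possible byte value, including all the control characters a channel might meddle with.
        byte[] raw = new byte[256];
        for (int i = 0; i < raw.length; i++) {
            raw[i] = (byte) i;
        }
        String armoured = armour(raw);
        System.out.println(armoured);
        System.out.println("round trip intact: " + Arrays.equals(raw, dearmour(armoured)));
    }
}
```

The armoured text contains only letters, digits, + / = and line breaks, so trimming blanks or changing line ends does not damage the payload.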
When 8-bit data are encoded in printable characters, the more printable
characters used in the representation, generally the more efficient the protocol.
However, the more characters used, the greater the odds one of the characters
used will be interfered with by your communication channel.
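The efficiency side of this trade-off can be put in numbers. With an alphabet of N printable characters, each character can carry at most log2(N) bits, so at least 8 / log2(N) characters are needed per byte. The Java sketch below computes that lower bound; the class and method names are illustrative only, and real schemes may do worse than the bound because of padding, line breaks and headers.

```java
final class AlphabetEfficiency {
    /** Theoretical minimum number of output characters per input byte for a given alphabet size. */
    static double charsPerByte(int alphabetSize) {
        double bitsPerChar = Math.log(alphabetSize) / Math.log(2);
        return 8.0 / bitsPerChar;
    }

    public static void main(String[] args) {
        // 16 = hexadecimal, 64 = base64 or UUEncode, 94 = every printable ASCII character except space.
        for (int n : new int[] {16, 64, 94}) {
            System.out.printf("%d characters: at least %.3f chars per byte%n", n, charsPerByte(n));
        }
    }
}
```

Hexadecimal (16 characters) needs 2 characters per byte and a 64-character alphabet needs 4 characters per 3 bytes; going beyond 64 characters gains comparatively little while using more of the characters a channel might interfere with.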
Armouring Schemes
Unfortunately, there is a plethora of techniques. It is not always obvious just
from looking which one was used to encode the data:
- base64: common in certificates, passwords, email, email attachments, cookies
and HTTP POSTs. Used to armour bytes or anything that can be converted to bytes,
e.g. via serialized ObjectStreams. Base64 uses a small cast of characters to
convert 8-bit data into printable characters: a to z, A to Z, 0 to 9, + / and =.
You might do this to convert any binary data to printable. This makes base64
suitable for encoding binary data as SQL strings that will work no matter what
the encoding. Unfortunately + / and = all have special meaning in URLs. See
Base64 for free Java source code. Every three bytes in the original fluff up to
four characters in the encoded form. This 33% increase in size occurs
independent of what bytes appear in your data. At the receiving end you convert
the printable characters back to the 8-bit data.
- url-encoded. See the separate entry on it.
- base64u: A variant of Base64 that avoids the + / and
= characters that have special meaning in URLs, GET and POST. You can treat its
output either as not needing URLEncoding, or as already URLEncoded. Used to
armour bytes or anything that can be converted to bytes, e.g. via serialized ObjectStreams.
- the Transporter which optionally
handles serialising/reconstituting, compression/decompression, signing/verifying,
heavy duty encryption/decryption and Base64u armouring/dearmouring all with
light weight classes.
Use it when you want to include arbitrary Java Objects in your CGI GETs and
POSTs.
- Quoted-Printable (RFC 2045): used in newsgroup messages and email. It uses the
following set of characters to convert 8-bit data into printable characters:
space, a to z, A to Z, 0 to 9, the punctuation in the ranges ! to < and > to ~
(i.e. everything printable except =), and = itself as the escape character. It
converts unsafe characters into =XX where XX is the two-digit hex value of the
byte, e.g. =FF. In the best case, your message is the same size as the
original. In a pathological case, your message can balloon up to three times the
original size.
- hexadecimal: two characters per byte, each 0..9 or A..F. The result is
always exactly double the size of the original. This is one of the easiest
schemes to write code for.
- binhex: a hex variant used on the Macintosh.
- UUEncode: similar to Base64 in that they both use 64
ASCII characters to represent 6 bits in the printable representation, but they
are not compatible. Base64 uses upper case, lower case, digits and only three
punctuation symbols. UUEncode uses upper case letters, digits and 28 punctuation
symbols, but no lower case letters. Also, the uuencode command has a structure to its output, with a
header containing a file name and permissions, line-length encoding characters,
and a footer, none of which are part of Base64.
- CMP Encode: dates back to 1985. Very efficient for text that is mostly
printable already. CMP Encode uses the full 95 ASCII printable characters
excluding space. Printable characters it leaves as is. It encodes control
characters with a lead ^, e.g. code 3 becomes ^C. High bit chars are encoded
with a lead `. It has a simple compression scheme for repeating character
strings. In the best case, your message can be even smaller than the original.
In a pathological case, your message can balloon up to twice the original size.
Unfortunately, Java code for this algorithm is not currently available. Pascal
source and executable are available. This algorithm is not a recognised official
MIME encoding.
- CMP Encrypt: dates back to 1985. Like CMP Encode, but also encrypted with a
theoretically uncrackable one-time pad. Pascal source and executable are
available.
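The size behaviour of several schemes in the list above can be compared with a short Java sketch. It rests on a few assumptions: java.util.Base64's URL-safe encoder (the RFC 4648 alphabet with - and _) stands in for base64u, and the author's own Base64u class may choose different replacement characters; the Quoted-Printable method is a simplification that ignores the 76-character line limit and the special rule for trailing blanks; the class and method names are made up for this example.

```java
import java.util.Arrays;
import java.util.Base64;

final class ArmourSizes {
    /** Standard base64: A-Z a-z 0-9 + / with = padding. */
    static String base64(byte[] raw) {
        return Base64.getEncoder().encodeToString(raw);
    }

    /** URL-safe base64 without padding (stand-in for base64u, see the assumptions above). */
    static String base64Url(byte[] raw) {
        return Base64.getUrlEncoder().withoutPadding().encodeToString(raw);
    }

    /** Hexadecimal: always two characters per byte. */
    static String hex(byte[] raw) {
        StringBuilder sb = new StringBuilder(raw.length * 2);
        for (byte b : raw) {
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    /** Simplified Quoted-Printable: safe characters pass through, everything else becomes =XX. */
    static String quotedPrintable(byte[] raw) {
        StringBuilder sb = new StringBuilder();
        for (byte b : raw) {
            int c = b & 0xFF;
            boolean safe = c == ' ' || (c >= '!' && c <= '~' && c != '=');
            if (safe) {
                sb.append((char) c);
            } else {
                sb.append(String.format("=%02X", c));
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Pathological input for Quoted-Printable: every byte needs escaping.
        byte[] raw = new byte[300];
        Arrays.fill(raw, (byte) 0xFF);
        System.out.println("base64:           " + base64(raw).length());
        System.out.println("base64 URL-safe:  " + base64Url(raw).length());
        System.out.println("hexadecimal:      " + hex(raw).length());
        System.out.println("quoted-printable: " + quotedPrintable(raw).length());
    }
}
```

For 300 bytes of 0xFF the four lengths are 400, 400, 600 and 900 characters, matching the 33% growth, the doubling and the three-fold worst case described in the list. Plain ASCII text would pass through the Quoted-Printable method almost unchanged.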