- Think of float and double as representing physical measurements. No one would
complain if their cabinet maker made a desk 6.000000000001 feet long.
Analogously, don’t complain about the inevitable tiny errors in floating
point arithmetic results, e.g. Math.cos( Math.toRadians( 90 ) ) not coming out
bang on zero. (If you want perfection, use int, long, BigInteger or BigDecimal.)
- To interconvert int, long,
float, double, Float
and Double see the Conversion
Amanuensis.
- The computer floating point unit works internally in base 2, binary. The decimal
fraction 0.1 cannot be precisely represented in binary. It is a repeater
fraction 0.00011001100110… It is like the repeater fraction 1/3 = 0.33333…
in base 10. When you add 0.333333… to 0.666666… why are you not
astonished to get 0.999999… rather than 1.0, even though you just added 1/3
+ 2/3 to get 1? Yet, with Java floating point you are astonished when you add
0.1 + 0.2 and get something other than 0.3. The same fundamental mathematical
cause is at work. It is God’s fault for not making decimal 0.1 a perfect
fraction in binary notation. Mike Cowlishaw, the creator of NetRexx, blames
computer hardware makers for using binary floating point. He figures computer
chips should use decimal notation like humans. 1/3 is also a repeater in binary,
0.01010101010… so it too will misbehave in Java just the way it does with
decimal arithmetic.
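If you want to see the repeater for yourself, here is a minimal sketch. The class and method names are invented for illustration. It relies on the fact that new BigDecimal( double ) converts the stored binary value exactly, so it shows what the computer really holds when you write 0.1.

```java
import java.math.BigDecimal;

public class BinaryRepeaterDemo {
    /** Returns the exact decimal expansion of the double nearest to 0.1. */
    public static String exactTenth() {
        // new BigDecimal(double) converts the binary value exactly, with no rounding.
        return new BigDecimal(0.1).toString();
    }

    /** True if 0.1 + 0.2 lands exactly on the double nearest to 0.3. */
    public static boolean tenthPlusFifthIsThreeTenths() {
        return 0.1 + 0.2 == 0.3;
    }

    public static void main(String[] args) {
        System.out.println(exactTenth());   // 0.1000000000000000055511151231257827021181583404541015625
        System.out.println(0.1 + 0.2);      // 0.30000000000000004
        System.out.println(tenthPlusFifthIsThreeTenths());   // false
    }
}
```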
- Floating point is by its nature inexact. It is probably best if you imagine
that after every floating point operation, a little demon comes in and adds or
subtracts a tiny number to fuzz the low order bits of your result. This is, of
course, not true, but if you as a newbie act as if it were true, this will lead you
to code conservatively and will keep you out of trouble.
What actually happens is the computer only has 64 bits to work with in a double. It has to
throw away the low order part of any result after every operation. On every
calculation you accumulate a little more roundoff error. In Java versions 1.2 through 16
the computer is permitted to secretly retain a few extra bits of accuracy during
a short string of calculations, so sometimes results come out more accurate than you
would theoretically expect. (They almost never come out more accurate than a
newbie expects.) To discourage use of these guard bits,
and get reproducibly less accurate results, you use the strictfp keyword; the StrictMath
library gives similarly reproducible results for functions such as sin and exp.
From Java 17 on, all floating point arithmetic is strict and strictfp is redundant.
Unless you really know what you are doing, you must presume the results
are never precisely bang on. Don’t count on results that in theory should
be integers coming out precisely as integers. Never compare with == or !=; check
within a tolerance. Keep in mind when you compare with > >=
< <= that the values you are comparing may, as a side effect of
calculation, have drifted just above or just below your test limits. Sometimes
you may want to include some slop/tolerance in your limits. Java is somewhat
better than other languages since it specifies strict IEEE rules. Fortunately,
in Java, if a number has a perfect integer floating point representation, and
you divide it by another such number that is a factor, the result is guaranteed
to be a perfect integer representation.
The other source of the fuzz is accumulating roundoff error from cascaded
operations. Further there are errors in trig functions not being bang on 0 or 1
as you expect. This is a result of accumulated roundoff error in evaluating
polynomials and in the polynomial approximations themselves used internally to
compute the trig functions.
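A small sketch of the fuzz described above, assuming nothing beyond java.lang.Math. The class and method names are made up for the example. It shows accumulated roundoff from cascaded adds, a trig result that is tiny but not zero, and one case where the result really is guaranteed exact.

```java
public class RoundoffDemo {
    /** Adds 0.1 to itself n times, picking up a rounding error on each step. */
    public static double sumOfTenths(int n) {
        double sum = 0.0;
        for (int i = 0; i < n; i++) {
            sum += 0.1;
        }
        return sum;
    }

    public static void main(String[] args) {
        // Ten tenths do not quite add up to one.
        System.out.println(sumOfTenths(10));          // 0.9999999999999999
        System.out.println(sumOfTenths(10) == 1.0);   // false

        // cos of 90 degrees is tiny but not zero: pi/2 has no exact binary form.
        System.out.println(Math.cos(Math.toRadians(90)));

        // Dividing an exact integer by an exact factor stays exact.
        System.out.println(1_000_000.0 / 8.0 == 125_000.0);   // true
    }
}
```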
- When numbers are converted from float or double
to String for display they may be truncated. Further
the process of converting from binary to decimal introduces even more errors
that were not in the original computation result, possibly because of repeaters —
fractions that cannot be represented exactly in binary or decimal.
- How can you bypass this fundamental inaccuracy?
- Do everything with float/double, but whenever you do a compare, realise there
will be some slop in the answer. So instead of asking
if ( f == 100.00 )
say
if ( f > 99.995 )
- You can do testing for floating point equality like this:
if ( Math.abs( value - target ) < epsilon )
or, more verbosely (and possibly faster, since it avoids a method call):
if ( value >= target - epsilon && value <= target + epsilon )
When the order of magnitude of target is unknown, you might scale the tolerance
with some slower code like this, presuming target is positive:
if ( Math.abs( value - target ) < epsilon * target )
or
if ( value >= target * ( 1 - epsilon ) && value <= target * ( 1 + epsilon ) )
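The tolerance tests above could be wrapped up as a pair of helpers, roughly like this. The names nearlyEqual and nearlyEqualRelative are hypothetical, not part of any Java library, and the choice of epsilon is yours to make for your own problem. This sketch uses Math.abs( target ) so the relative test also copes with a negative target.

```java
public class ToleranceCompare {
    /** Absolute test: true if value is within epsilon of target. */
    public static boolean nearlyEqual(double value, double target, double epsilon) {
        return Math.abs(value - target) <= epsilon;
    }

    /** Relative test: epsilon is a fraction of the magnitude of target. */
    public static boolean nearlyEqualRelative(double value, double target, double epsilon) {
        // Math.abs(target) lets this work for negative targets too.
        return Math.abs(value - target) <= epsilon * Math.abs(target);
    }

    public static void main(String[] args) {
        double sum = 0.1 + 0.2;
        System.out.println(sum == 0.3);                                   // false
        System.out.println(nearlyEqual(sum, 0.3, 1e-9));                  // true
        System.out.println(nearlyEqualRelative(1e12 + 0.5, 1e12, 1e-9));  // true
    }
}
```

Note that the relative test only matches an exact hit when target is zero, so fall back to the absolute test near zero.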
- Use a mixture of ints and float/doubles. Use the ints for your loop counting.
- Instead of incrementing a floating point variable, recreate it from an int loop
variable. e.g. instead of:
f += 0.001;
code
f = i * 0.001;
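To see why recreating the value from the int counter helps, here is a rough sketch comparing the two approaches. The method names are invented for the example. The increment version rounds on every pass through the loop; the multiply version rounds once.

```java
public class LoopCounterDemo {
    /** Repeatedly adds step; the rounding error of each addition accumulates. */
    public static double byIncrement(int steps, double step) {
        double f = 0.0;
        for (int i = 0; i < steps; i++) {
            f += step;
        }
        return f;
    }

    /** Recomputes from the int counter; only one rounding, from the multiply. */
    public static double byMultiply(int steps, double step) {
        return steps * step;
    }

    public static void main(String[] args) {
        System.out.println(byIncrement(10, 0.1));   // 0.9999999999999999
        System.out.println(byMultiply(10, 0.1));    // 1.0

        // The same comparison with the step used in the text.
        System.out.println(byIncrement(1000, 0.001));
        System.out.println(byMultiply(1000, 0.001));
    }
}
```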
- Use double instead of float.
This helps but does not totally solve the problem. All integers up to 2^53 (roughly
16 digits) are exactly representable in double. This does not mean that
results of a calculation that theoretically should be an integer will actually
be one, unless the IEEE spec demands it.
- When you want exact results, you must use ints, longs, BigDecimal
(arbitrary precision decimal fractions) or BigIntegers
(arbitrary precision Integers). Currency is best handled by storing pennies, and
adding a decorative decimal point on display.
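As an illustration of the exact approaches, here is a minimal sketch. formatPennies and addExactly are hypothetical helpers written for this example, not library methods. The BigDecimal values are built from Strings because constructing them from a double would carry the binary error along.

```java
import java.math.BigDecimal;

public class ExactMoneyDemo {
    /** Formats a count of pennies with a decorative decimal point, e.g. 1999 gives 19.99. */
    public static String formatPennies(long pennies) {
        long abs = Math.abs(pennies);   // sketch only: ignores Long.MIN_VALUE
        long cents = abs % 100;
        String sign = pennies < 0 ? "-" : "";
        return sign + (abs / 100) + "." + (cents < 10 ? "0" : "") + cents;
    }

    /** Adds two decimal amounts exactly; the String constructor avoids binary rounding. */
    public static BigDecimal addExactly(String a, String b) {
        return new BigDecimal(a).add(new BigDecimal(b));
    }

    public static void main(String[] args) {
        long total = 1999 + 1;   // 19.99 plus 0.01, held as pennies
        System.out.println(formatPennies(total));         // 20.00
        System.out.println(addExactly("0.10", "0.20"));   // 0.30
    }
}
```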
- 32-bit floats can accurately represent ints up to 24 bits; 64-bit doubles
represent longs up to 53 bits; 80-bit extended reals (when guard bits are
permitted) represent longs up to 64 bits. This does not necessarily mean that
every operation where you expect an integer result will always give one.
Arithmetic is guaranteed to be rounded to the nearest representable floating point
number. However, as I have said before, not every number is
representable, and in particular the fractions 0.1, 0.01 etc. are not. It is best to
presume a result could be a tiny bit off, unless you fully understand the IEEE rules,
and can predict when you can count on accurate results. Happily Math.sqrt
will give precise integer results when they can be precisely represented.
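The 24-bit and 53-bit limits are easy to check for yourself. This is a sketch with invented helper names; it round-trips an integer through float or double and sees whether it comes back unchanged. It is meant for values near the limits, not for the extremes of int and long, where the cast back saturates.

```java
public class ExactIntegerLimits {
    /** True if the int survives a round trip through float unchanged. */
    public static boolean survivesFloat(int n) {
        return (int) (float) n == n;
    }

    /** True if the long survives a round trip through double unchanged. */
    public static boolean survivesDouble(long n) {
        return (long) (double) n == n;
    }

    public static void main(String[] args) {
        System.out.println(survivesFloat(1 << 24));           // true
        System.out.println(survivesFloat((1 << 24) + 1));     // false: needs 25 bits
        System.out.println(survivesDouble(1L << 53));         // true
        System.out.println(survivesDouble((1L << 53) + 1));   // false: needs 54 bits

        // A perfect square within range gives an exact integer root.
        System.out.println(Math.sqrt(49.0) == 7.0);           // true
    }
}
```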
- To get at the raw IEEE bits of a float or double, and to go back again:
int intBits = Float.floatToIntBits( f );
float f = Float.intBitsToFloat( intBits );
long longBits = Double.doubleToLongBits( d );
double d = Double.longBitsToDouble( longBits );
- The rest of the world says y=sin(x);. Java (and
JavaScript) insists that you say y=Math.sin(x); unless
you are using JDK 1.5+ and you say:
import static java.lang.Math.sin;
at the top.
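For completeness, a tiny sketch of the static import in use. The class and the wave method are invented for the example.

```java
import static java.lang.Math.PI;
import static java.lang.Math.sin;

public class StaticImportDemo {
    /** Height of a unit sine wave at the given angle in radians. */
    public static double wave(double radians) {
        return sin(radians);   // no Math. prefix needed
    }

    public static void main(String[] args) {
        System.out.println(wave(0.0));      // 0.0
        System.out.println(wave(PI / 2));
    }
}
```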
- The unusual thing about Java in this area is the high precision of its default String
conversion for floating point values. Most languages by default round to some
reasonable number of digits. When you convert an internal binary double number
for display with:
String Double.toString( double )
Java wants to preserve every drop of precision it has, so that if you convert it
back with:
double Double.parseDouble( String )
you will get back to precisely the same binary representation. This is all very
well, but from the point of view of humans, the display looks wrong, as if the
result were slightly inexact and had way too many digits. This also makes the
rounding error distressingly visible. To fix it, and create something that looks
pleasingly rounded to humans, use java.text.DecimalFormat
to limit the number of digits displayed to what you actually want/need.
If you are merely trying to round for internal purposes use Math.round,
Math.floor, Math.ceil and Math.rint.
Math.round gives a long
result, the rest double. You can create variants by
adding .5 and multiplying and dividing by powers of ten. There is generally no
need to convert to int or long
and back.
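Here is a rough sketch of the difference between the full-precision display, a DecimalFormat display and internal rounding. The helper names are invented. The sketch pins the decimal separator with Locale.ROOT so the output is predictable; a real program would more likely use the user's locale.

```java
import java.text.DecimalFormat;
import java.text.DecimalFormatSymbols;
import java.util.Locale;

public class DisplayRoundingDemo {
    /** Formats for display with exactly two decimal places and a '.' separator. */
    public static String twoPlaces(double value) {
        DecimalFormatSymbols symbols = DecimalFormatSymbols.getInstance(Locale.ROOT);
        return new DecimalFormat("0.00", symbols).format(value);
    }

    /** Rounds to two decimal places for internal use, staying in double. */
    public static double roundToHundredths(double value) {
        return Math.rint(value * 100.0) / 100.0;
    }

    public static void main(String[] args) {
        double sum = 0.1 + 0.2;
        String full = Double.toString(sum);
        System.out.println(full);                              // 0.30000000000000004
        System.out.println(Double.parseDouble(full) == sum);   // true: the round trip is lossless
        System.out.println(twoPlaces(sum));                    // 0.30
        System.out.println(roundToHundredths(sum) == 0.3);     // true
    }
}
```

Keep in mind that Math.round and Math.rint treat exact halves differently: Math.round( 2.5 ) gives 3, where Math.rint( 2.5 ) gives 2.0.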
- Not all implementations conform to the accuracy specs.
In most computers, floating point arithmetic is usually much slower than integer
arithmetic, though on the Intel Pentium it is usually faster because the integer
unit was not given the same care as the floating point unit.
Comparison of Pentium Floating Point and Integer Speeds
| Operation | Floating Point Clocks | Integer Clocks |
| --- | --- | --- |
| add | 1-3 | 1-3 |
| multiply | 1-3 | 10-11 |
| division | 39-42 | 22-46 |
| convert | 6 (double to long) | 3 (long to double) |
Floating point is slower still when you consider the overhead of converting
between the combined int/floating point stack in the JVM and the separate stacks
in the Pentium hardware.
- Double precision arithmetic has very little speed penalty on modern CPUs.
Normally you should use double in preference to float.
It gives you 15 to 16 significant digits where float
gives only 6 to 7. The only advantage to float is compactness. In comparison, a
typical scientific pocket calculator will give you 10 significant digits, and
will automatically round for display. When do you use float
and when double? It depends how much precision you
need. Because modern floating point hardware is all built on double,
normally the only time you use float is for float[]
or for float fields in plentiful Objects
to conserve space when you don’t need the precision.
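A quick sketch of what the precision difference looks like in practice. The class and method names are made up for illustration, and Float.BYTES and Double.BYTES assume Java 8 or later.

```java
public class FloatVersusDouble {
    /** Widens the float nearest 0.1 to double, exposing the digits float could not hold. */
    public static double widenedTenth() {
        return 0.1f;
    }

    public static void main(String[] args) {
        System.out.println(1.0f / 3.0f);      // 0.33333334
        System.out.println(1.0 / 3.0);        // 0.3333333333333333
        System.out.println(widenedTenth());   // 0.10000000149011612
        System.out.println(Float.BYTES + " vs " + Double.BYTES + " bytes");   // 4 vs 8 bytes
    }
}
```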
- If you divide 0.0 by 0.0 in floating point, you won’t get an exception. You
will get a special Not a Number marker value called Double.NaN
or Float.NaN. (Overflowing the maximum representable number gives an
infinity, not a NaN.) Use Double.isNaN()
not == Double.NaN
to test for it. NaNs are strange beasts; they were
included in the IEEE 754 standard so that the arithmetic would be algebraically
complete, that is, so every operation on any set of arguments would have some
IEEE 754 value as a result. NaNs therefore don’t
follow the same rules as numeric values, even infinities. Having
NaN != NaN be
true gives an easy and definitive way to test for a NaN. Infinities are much
more familiar mathematically; there wouldn’t be a benefit to having
Double.POSITIVE_INFINITY != Double.POSITIVE_INFINITY.
Other odd results you can get from floating point operations include
Double.NEGATIVE_INFINITY, Double.POSITIVE_INFINITY and positive and negative 0.
There is a similar method for checking for infinity, Double.isInfinite.
0./0. gives NaN, 1./0.
gives POSITIVE_INFINITY and -1./0.
gives NEGATIVE_INFINITY.
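The special values can be poked at directly. This is a small sketch; isOrdinaryNumber is a hypothetical helper, not a library method. Note in passing that overflow produces an infinity.

```java
public class SpecialValuesDemo {
    /** True if x is neither NaN nor an infinity. */
    public static boolean isOrdinaryNumber(double x) {
        return !Double.isNaN(x) && !Double.isInfinite(x);
    }

    public static void main(String[] args) {
        double nan = 0.0 / 0.0;
        double posInf = 1.0 / 0.0;
        double overflow = Double.MAX_VALUE * 2.0;

        System.out.println(nan == Double.NaN);                      // false: NaN equals nothing
        System.out.println(nan != nan);                             // true
        System.out.println(Double.isNaN(nan));                      // true
        System.out.println(Double.isInfinite(posInf));              // true
        System.out.println(overflow == Double.POSITIVE_INFINITY);   // true
        System.out.println(0.0 == -0.0);                            // true
        System.out.println(1.0 / -0.0);                             // -Infinity
        System.out.println(isOrdinaryNumber(42.0));                 // true
    }
}
```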
- To study the IEEE format, you can use Double.doubleToLongBits
and Double.longBitsToDouble.
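One way to study the format is to pull the three IEEE 754 fields out of the long. This is a sketch for ordinary (normalised) values only; zero, subnormals, infinities and NaN use reserved exponent patterns that it does not decode. The method names are invented.

```java
public class IeeeFieldsDemo {
    /** The sign bit: 0 for positive, 1 for negative. */
    public static int sign(double d) {
        return (int) (Double.doubleToLongBits(d) >>> 63);
    }

    /** The 11-bit exponent field with the bias of 1023 removed. */
    public static int unbiasedExponent(double d) {
        long bits = Double.doubleToLongBits(d);
        return (int) ((bits >>> 52) & 0x7FF) - 1023;
    }

    /** The low 52 bits: the fraction that follows the implied leading 1. */
    public static long fraction(double d) {
        return Double.doubleToLongBits(d) & 0x000F_FFFF_FFFF_FFFFL;
    }

    public static void main(String[] args) {
        double d = -6.5;   // -1.625 * 2^2, and 1.625 is 1.101 in binary
        System.out.println(sign(d));                         // 1
        System.out.println(unbiasedExponent(d));             // 2
        System.out.println(Long.toHexString(fraction(d)));   // a000000000000
    }
}
```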
- Math.pow in general
works by taking a natural logarithm, multiplying and then taking the exp (anti-log).
The log and exp are each computed, in CPU floating point microcode or in library
code, with a fat polynomial approximation (a great mess of multiplies and adds).
This means if you want to compute a cube, you are best to use:
double y = x * x * x;
rather than:
double y = Math.pow( x, 3 );
Ditto for squaring. Further, if x is an integer, you will get a precise result
using only simple arithmetic. If you use Math.pow
you may lose precision. Also use the specialised
double y = Math.sqrt( x );
in preference to the general
double y = Math.pow( x, 0.5 );
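A small sketch comparing the two styles. The cube helper is invented for the example. Any difference between the two results shows up only in the last bits, so the example prints the difference rather than promising a particular value.

```java
public class PowVersusMultiply {
    /** Cubes x with two multiplies instead of a call to Math.pow. */
    public static double cube(double x) {
        return x * x * x;
    }

    public static void main(String[] args) {
        double x = 1.1;
        System.out.println(cube(3.0));                   // 27.0
        System.out.println(cube(x) - Math.pow(x, 3));    // difference, if any, is in the last bits
        System.out.println(Math.sqrt(2.0));              // 1.4142135623730951
        System.out.println(Math.sqrt(2.0) - Math.pow(2.0, 0.5));
    }
}
```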