For Strings, Hashtables work pretty much as you would expect if you are familiar with hashing. The only catch
is remembering to import java.util.Hashtable (with a small t in table). Here is how you
use
The key point for novices to remember is the capacity of the table has to be somewhat larger than the number of
elements you plan to look up. The more spare capacity you give, the faster the lookup. There are of course
diminishing returns. Typically, you specify about 25% more capacity than you have
elements and a load factor of 0.75 which controls automatic growth of the lookup table when it gets too full.
More details on that later.
The capacity of a table is not the number of elements you plan to add! If you
specify a capacity of 100 and a load factor of .75, Hashtable will allocate only 100
slots, not the 133 you need for efficient access. The load factor kicks in only when deciding when it is time to
double the size of the table, in this case, just after you have added 75 elements, not in deciding its initial
size. It is up to you to leave some breathing room.
Tips
- Hashtables work fastest if the capacity is a prime number since it reduces collisions. Your hashcodes are
unlikely to have a frequency distribution pattern based on modulo a large prime number, where they easily could
have one based on a power of two. When you use a power of two, you are effectively masking off all but the low
order bits. In Java 1.4.1+, for HashMap there no longer a payback for specifying
prime table lengths because HashMap rounds whatever capacity you specify up to a
power of two, to avoid the cost of a division (they can do an AND bit mask instead) and to help keep memory
from being fragmented. However, Hashtable, which is synchronized, uses the value you
give it, so giving it a prime is still worth while.
Now it is more important than ever to have a high quality hashCode function
that has good distribution, especially in the low order bits. It must not favour one bit pattern over
another, e.g. not tend to generate hashcodes all ending in binary 0000. In JDK (Java Development Kit) 1.4.1+ there in a rehash function used inside HashMap to improve the quality of your provided hashCode by
redistributing the bits. rehash has to be quick, or they might as well have used
tables whose lengths are multiples of primes and used the modulus operator.
In theory though, you should round your capacity up to prime number. However, avoid 31 or multiples of 31
since the new hashCode algorithm for Strings inJava version 1.2 or later uses 31 as a magic
number in the hashing formula. In a Hashtable, you are only using 4 bytes per additional entry. Here are some
primes just larger than various powers of two that would be good choices.
| 11 |
17 |
37 |
67 |
131 |
257 |
521 |
1031 |
2053 |
4099 |
8209 |
16411 |
32771 |
65537 |
131101 |
262147 |
524309 |
1048583 |
2097169 |
4194319 |
8388617 |
- You cannot have elements with duplicate keys. If you add a new element that duplicates the key of an
existing element, it will replace it.
- The initial capacity is not the number of elements you plan to put in the table. If you specified a
capacity of 101 and a load factor of .75f, then the initial table could hold only 75 elements, before it had to
grow. This means you must specify plenty of slop in your capacity setting. If you used a load factor of 1.0,
your performance would be slow because you would have many collisions rather than direct lookups.
- Hashtables will automatically grow when you add too many elements. However, growing requires copying,
rehashing and rechaining, quite an involved process. That is why you want an accurate estimate for the initial
capacity, slightly on the high side.
- A load factor of .75f means that the table will be expanded when it gets three quarters full. Hashing works
faster when the table is not full. A value of .25f would use more RAM (Random Access Memory) but would work faster. A value of 1.0f
would use minimal RAM but would run extremely slowly. The reason you want a decent supply of unusued slots is
to reduce collisions, two entries that hash to the same slot. They have to be resolved by chasing a linear
chain. If there are very few empty slots left, and new entries are assigend effectively random slots, the odds
are they will be assigned a slot already taken. If there is almost no room left, the chains of elements that
hash to the same slot rapidly grow longer. That is why you have a load factor, to grow the entire structure
when space gets tight, and keep the chains of elements that collide on the same slot as short as possible.
- HashCodes are usually not unique. If you think about it, how could they be? Consider Strings of length 10
characters (20 bytes, 160 bits). How many different possible strings are there?
2^160. HashCodes are only 32 bits long. How many possible different hashCodes
are there? 2^32. You can see there are way more strings that HashCodes, so strings would have to share
HashCodes.
- If two keys hash to the same hashCode, or to the same hashCode modulo the size of the table, all still
works. It just takes a little longer to look up that key, since get() has to chase a chain of clashing keys to find the match. Keys are matched with
the equals() method, not ==, or simply by comparing hashCodes.
- Let me repeat this another way since it is a point people seem to miss on first reading. Collisions happen
not only when two of the HashCodes are identical, (quite infrequent) but when two of
the Hashcodes modulo the initial capacity are equal (which almost always happens at least a few times).
Collisions are unavoidable. The Hashtable class deals with them automatically. They
slow down lookup, but do not derail it. The smaller the load factor, the more RAM you waste, but the fewer the
collisions, so the faster the lookup. Using prime numbers for the divisor/capacity also tends to reduce
collisions.
- This is the same math that comes up when you ask how many people do you need to have in a room for it to be
likely that two people have the same birthday. There’s a > 50% chance
with 23 people which is a bit more than the square root of 365.
- The source code for Hashtable is very short. It takes an order of magnitude more
English to describe how it works than does the Java code itself. I suggest you peruse it if you are in the
least curious how it works inside. Hashing is something every programmer should understand. You might
also study String.hashCode().
- Hashtables use generic objects or both keys and values. You need to cast these
back to Strings, or whatever you are using.
- If you want to use something other than Strings as your keys, make sure you implement an equals() and hashCode()method for those key objects to override the default Object equals() and hashCode() methods. The default equals() method just checks that the two objects are one in the same (using ==), not
that they consist of the same bit patterns.
- There are two methods for enumerating the Hashtable, keys() which gives you the keys, and elements() which gives you the values. You could step through both enumerations
simultaneously instead of using get() as I showed. JDK
1.2 HashMap (the unsynchronized replacement for
Hashtable) lets you iterate over keys, values or key-value pairs. Note, the enumeration is not sorted in
any particular order, not even the order you added your elements. You need either keyset().iterator(), values().iterator() or entryset().iterator() for pairs. For sorted Hashtables, you need to sort the output yourself
with Arrays.sort( hashtable.values().toArray( new X[hashtable.size()]) ) or use a
TreeMap.
- The size method keeps track of how many unique keys are in the Hashtable, not
the size of the table including empty slots. Hashtables never contain items with duplicate keys.
- Wherever possible use Vectors or ArrayLists in
preference to Hashtables when you can assign a unique dense int key. Vectors are much faster to look up when you know the
index. The Vector.get index lookup process does not scan through other objects,
(dragging them from backing store into real RAM). Thus Vectors even moreso are
better than Hashtables when your virtual RAM is not backed totally by real RAM.
Vector.contains in contrast does a linear search, comparing each object in turn.
This is much slower than a Hashtable lookup except for very small tables. If you are
single thread, use ArrayList in preference to Vector. You
can build a free slot chain through the embedded unused Vector slots to rapidly
recycle numbers.
- Never disturb fields in the Hashtable/HashMap/
HashSet key objects that would change the hashCode or the results of equals.
Hashtable requires these to sit still for it to work. You can change them, but only when the Objects are not
actively being used as keys.
- Hashtable and HashMap do not cache a recent
get. So if you frequently fetch the same item over and over, you might consider
keeping a copy of the previous get result.
- When you serialize a Hashtable, it just outputs key-entry pairs, none of the lookup mechanism. When your load it back again
it rehashes and rebuilds the lookup structure.
- Serializing and generics do not mesh well. To get rid of all warning messages and to
ensure type safety you are best to output an array and reconstitute the Hashtable yourself.
- Hashtable is a Map, not a Collection,
though it is considerd part of the Java Collections Framework.
However, the result of Hashtable. values and Hashtable. keySet is a
Collection. You can use Collection. toArray to export either to an array.
Dealing With Duplicates
Hashtables don’t permit duplicates. What do you do if some of your objects have duplicate keys? When you
have duplicates, what do you want to find when you do a lookup? Just the first, just the last? both? If you are
clever, you ca get all those behaviours with a Hashtable.
To get the first duplicate, before you put, do a check to see if the element is
already in the Hashtable, if so, do nothing.
To get the last duplicate, just put. It will replace what was there before. To get
both, store an ArrayList as the value. Store each of the duplicates in one of the
slots of the ArrayList. This is a bit inefficient for non-duplicates. For them, you
could store just the plain Object. Any you might handle the special case of just one
duplicate with an array. Your logic might look a bit like this:
Of course are free to simplify the algorithm, going straight to ArrayLists even for
singles. My complex recipe is designed to conserve RAM on large lists.
Oracle’s Javadoc on
Hashtable class : available: