For Strings, Hashtables work pretty much as you would expect if you are familiar with hashing. The only catch is
remembering to import java.util.Hashtable (with a small t in table). Here is how you use a Hashtable:
The key point for novices to remember is the capacity of the table has to be somewhat larger than the number of elements
you plan to look up. The more spare capacity you give, the faster the lookup. There are of course diminishing returns.
Typically, you specify about 25% more capacity than you have elements and a load factor
of 0.75 which controls automatic growth of the lookup table when it gets too full. More details on that later.
The capacity of a table is not the number of elements you plan to add! If you specify a
capacity of 100 and a load factor of .75, Hashtable will allocate only 100 slots, not the
133 you need for efficient access. The load factor kicks in only when deciding when it is time to double the size of the
table, in this case, just after you have added 75 elements, not in deciding its initial size. It is up to you to leave
some breathing room.
Tips
- Hashtables work fastest if the capacity is a prime number since it reduces collisions. Your hashcodes are unlikely to
have a frequency distribution pattern based on modulo a large prime number, where they easily could have one based on a
power of two. When you use a power of two, you are effectively masking off all but the low order bits. In Java 1.4.1+,
for HashMap there no longer a payback for specifying prime table lengths because HashMap
rounds whatever capacity you specify up to a power of two, to avoid the cost of a division (they can do an AND bit mask
instead) and to help keep memory from being fragmented. However, Hashtable, which is
synchronized, uses the value you give it, so giving it a prime is still worth while.
Now it is more important than ever to have a high quality hashCode function that has good
distribution, especially in the low order bits. It must not favour one bit pattern over another, e.g. not tend to
generate hashcodes all ending in binary 0000. In JDK 1.4.1+ there in a rehash function used
inside HashMap to improve the quality of your provided hashCode
by redistributing the bits. rehash has to be quick, or they might as well have used tables
whose lengths are multiples of primes and used the modulus operator.
In theory though, you should round your capacity up to prime number. However, avoid 31 or multiples of 31 since the new
hashCode algorithm for Strings in JDK 1.2+ uses 31 as a magic number in the hashing formula. In a Hashtable, you are
only using 4 bytes per additional entry. Here are some primes just larger than various powers of two that would be good
choices.
| 11 |
17 |
37 |
67 |
131 |
257 |
521 |
1031 |
2053 |
4099 |
8209 |
16411 |
32771 |
65537 |
131101 |
262147 |
524309 |
1048583 |
2097169 |
4194319 |
8388617 |
- You cannot have elements with duplicate keys. If you add a new element that duplicates the key of an existing element,
it will replace it.
- The initial capacity is not the number of elements you plan to put in the table. If you specified a capacity of
101 and a load factor of .75f, then the initial table could hold only 75 elements, before it had to grow. This means you
must specify plenty of slop in your capacity setting. If you used a load factor of 1.0, your performance would be slow
because you would have many collisions rather than direct lookups.
- Hashtables will automatically grow when you add too many elements. However, growing requires copying, rehashing and
rechaining, quite an involved process. That is why you want an accurate estimate for the initial capacity, slightly on
the high side.
- A load factor of .75f means that the table will be expanded when it gets three quarters full. Hashing works faster when
the table is not full. A value of .25f would use more RAM but would work faster. A value of 1.0f would use minimal RAM
but would run extremely slowly. The reason you want a decent supply of unusued slots is to reduce collisions, two
entries that hash to the same slot. They have to be resolved by chasing a linear chain. If there are very few empty
slots left, and new entries are assigend effectively random slots, the odds are they will be assigned a slot already
taken. If there is almost no room left, the chains of elements that hash to the same slot rapidly grow longer. That is
why you have a load factor, to grow the entire structure when space gets tight, and keep the chains of elements that
collide on the same slot as short as possible.
- HashCodes are usually not unique. If you think about it, how could they be? Consider Strings of length 10 characters (20
bytes, 160 bits). How many different possible strings are there? 2^160. HashCodes are only 32 bits long. How many
possible different hashCodes are there? 2^32. You can see there are way more strings that HashCodes, so strings would
have to share HashCodes.
- If two keys hash to the same hashCode, or to the same hashCode modulo the size of the table, all still works. It just
takes a little longer to look up that key, since get() has to
chase a chain of clashing keys to find the match. Keys are matched with the equals() method, not ==, or simply by
comparing hashCodes.
- Let me repeat this another way since it is a point people seem to miss on first reading. Collisions happen not only when
two of the HashCodes are identical, (quite infrequent) but when two of the Hashcodes modulo
the initial capacity are equal (which almost always happens at least a few times). Collisions are unavoidable. The Hashtable
class deals with them automatically. They slow down lookup, but do not derail it. The smaller the load factor, the more
RAM you waste, but the fewer the collisions, so the faster the lookup. Using prime numbers for the divisor/capacity also
tends to reduce collisions.
- This is the same math that comes up when you ask how many people do you need to have in a room for it to be likely that
two people have the same birthday. There’s a > 50% chance with 23 people which
is a bit more than the square root of 365.
- The source code for Hashtable is very short. It takes an order of magnitude more English to
describe how it works than does the Java code itself. I suggest you peruse it if you are in the least curious how it
works inside. Hashing is something every programmer should understand. You might also study String.hashCode().
- Hashtables use generic objects or both keys and values. You need to cast these back to
Strings, or whatever you are using.
- If you want to use something other than Strings as your keys, make sure you implement an equals()
and hashCode()method for those key objects to override the
default Object equals() and hashCode()
methods. The default equals() method just checks that the two
objects are one in the same (using ==), not that they consist of the same bit patterns.
- There are two methods for enumerating the Hashtable, keys()
which gives you the keys, and elements() which gives you the
values. You could step through both enumerations simultaneously instead of using get()
as I showed. JDK 1.2 HashMap (the unsynchronized replacement for Hashtable) lets you iterate
over keys, values or key-value pairs. Note, the enumeration is not sorted in any particular order, not even the
order you added your elements. You need either keyset().iterator(), values().iterator()
or entryset().iterator() for pairs. For sorted Hashtables, you need to sort the output
yourself with Arrays.sort( hashtable.values().toArray( new X[hashtable.size()]) ) or use a TreeMap.
- The size method keeps track of how many unique keys are in the Hashtable, not the size of
the table including empty slots. Hashtables never contain items with duplicate keys.
- Wherever possible use Vectors or ArrayLists in preference to Hashtables
when you can assign a unique dense int key. Vectors are much
faster to look up when you know the index. The Vector.get index lookup process does not
scan through other objects, (dragging them from backing store into real RAM). Thus Vectors
even moreso are better than Hashtables when your virtual RAM is not backed totally by real
RAM. Vector.contains in contrast does a linear search, comparing each object in turn. This
is much slower than a Hashtable lookup except for very small tables. If you are single
thread, use ArrayList in preference to Vector. You can build a
free slot chain through the embedded unused Vector slots to rapidly recycle numbers.
- Never disturb fields in the Hashtable/HashMap/ HashSet
key objects that would change the hashCode or the results of equals. Hashtable requires these to sit still for it to
work. You can change them, but only when the Objects are not actively being used as keys.
- Hashtable and HashMap do not cache a recent get.
So if you frequently fetch the same item over and over, you might consider keeping a copy of the previous get
result.
- Hashtable is a Map, not a Collection.
However, the result of Hashtable. values and Hashtable.
keySet is a Collection. You can use Collection.
toArray to export either to an array.
Dealing With Duplicates
Hashtables don’t permit duplicates. What do you do if some of your objects have duplicate keys? When you have
duplicates, what do you want to find when you do a lookup? Just the first, just the last? both? If you are clever, you
ca get all those behaviours with a Hashtable.
To get the first duplicate, before you put, do a check to see if the element is already in
the Hashtable, if so, do nothing.
To get the last duplicate, just put. It will replace what was there before. To get both,
store an ArrayList as the value. Store each of the duplicates in one of the slots of the ArrayList.
This is a bit inefficient for non-duplicates. For them, you could store just the plain Object.
Any you might handle the special case of just one duplicate with an array. Your logic might look a bit like this:
Of course are free to simplify the algorithm, going straight to ArrayLists even for singles.
My complex recipe is designed to conserve RAM on large lists.
Sun’s Javadoc on the
HashTable class : available: