I have a list (List<T>) and I want to index its objects by their ids using a map (HashMap<Integer, T>). I always use list.size() as the initial capacity of the HashMap. Is this the best way to do it?
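In code, the setup looks roughly like this (a sketch; getId() stands for whatever accessor returns the object's id):

Map<Integer, T> map = new HashMap<Integer, T>(list.size()); // is list.size() the right capacity?
for (T object : list) {
    map.put(object.getId(), object);
}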
The rule of thumb, if you don't know the load factor/capacity internals:
initialCapacityToUse = (Expected No. of elements in map / 0.75) + 1
With this initial capacity, no rehash will occur while storing the expected number of elements in the map.
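As a minimal sketch in Java (the helper name is mine, taken from the formula above):

static int initialCapacityToUse(int expectedSize) {
    // divide by the default load factor (0.75) and add 1, per the rule of thumb
    return (int) (expectedSize / 0.75) + 1;
}

Map<Integer, String> map = new HashMap<>(initialCapacityToUse(100)); // capacity 134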
What you're doing is fine. This way you're sure that the hash map has at least enough capacity for the initial values. If you have more information regarding the usage patterns of the hash map (for example: is it updated frequently? are many new elements added frequently?), you might want to set a bigger initial capacity (for instance, list.size() * 2), but never lower. Use a profiler to determine whether the initial capacity falls short too soon.
UPDATE
Thanks to @PaulBellora for suggesting that the initial capacity should be set to (int) Math.ceil(list.size() / loadFactor) (typically, the default load factor is 0.75) in order to avoid an initial resize.
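Applied to this question, and assuming the default load factor (T is the element type from the question):

Map<Integer, T> map = new HashMap<>((int) Math.ceil(list.size() / 0.75));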
Guava's Maps.newHashMapWithExpectedSize uses this helper method to calculate the initial capacity for the default load factor of 0.75, based on some expected number of values:
/**
 * Returns a capacity that is sufficient to keep the map from being resized as
 * long as it grows no larger than expectedSize and the load factor is >= its
 * default (0.75).
 */
static int capacity(int expectedSize) {
    if (expectedSize < 3) {
        checkArgument(expectedSize >= 0);
        return expectedSize + 1;
    }
    if (expectedSize < Ints.MAX_POWER_OF_TWO) {
        return expectedSize + expectedSize / 3;
    }
    return Integer.MAX_VALUE; // any large value
}
reference: source
From the newHashMapWithExpectedSize documentation:
Creates a HashMap instance, with a high enough "initial capacity" that it should hold expectedSize elements without growth. This behavior cannot be broadly guaranteed, but it is observed to be true for OpenJDK 1.6. It also can't be guaranteed that the method isn't inadvertently oversizing the returned map.
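With Guava on the classpath, usage for this question might look like this (getId() is assumed, as in the question's setup):

import com.google.common.collect.Maps;

Map<Integer, T> map = Maps.newHashMapWithExpectedSize(list.size());
for (T object : list) {
    map.put(object.getId(), object);
}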
The term 'capacity' is misleading: a HashMap never actually holds as many entries as the capacity you supply before resizing. By default the 'load factor' of a HashMap is 0.75, which means that when the number of entries in the HashMap reaches 75% of the supplied capacity, it will resize the array and rehash.
For example if I do:
Map<Integer, Integer> map = new HashMap<>(100);
When I am adding the 75th entry, the map will resize the entry table to 2 * table.length. So we can either accept that resize or size the map up front; the best option is the latter. Let me explain what's going on here:
list.size() / 0.75
This returns list.size() plus one third of list.size(); for example, if my list had a size of 100 it would return 133 (truncated from 133.33). We then add 1 to it, since the integer division truncates and the map is resized once the number of entries reaches 75% of the initial capacity. So if we had a list with a size of 100 we would set the initial capacity to 134, which means that adding all 100 entries from the list would not incur any resize of the map.
End result:
Map<Integer, Integer> map = new HashMap<>((int) (list.size() / 0.75) + 1);
According to the reference documentation of java.util.HashMap:
The expected number of entries in the map and its load factor should be taken into account when setting its initial capacity, so as to minimize the number of rehash operations. If the initial capacity is greater than the maximum number of entries divided by the load factor, no rehash operations will ever occur.
This means that if you know in advance how many entries the HashMap should store, you can prevent rehashing by choosing an appropriate initial capacity and load factor. However:
As a general rule, the default load factor (.75) offers a good tradeoff between time and space costs. Higher values decrease the space overhead but increase the lookup cost (reflected in most of the operations of the HashMap class, including get and put).
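Following that guidance, one way to size the map for this question while keeping the default load factor (a sketch, not the only option):

int expected = list.size();
float loadFactor = 0.75f;
// an initial capacity greater than expected / loadFactor means no rehash, per the docs above
Map<Integer, T> map = new HashMap<>((int) (expected / loadFactor) + 1, loadFactor);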
If you wish to avoid rehashing the HashMap, and you know that no other elements will be placed into the HashMap, then you must take into account the load factor as well as the initial capacity. The load factor for a HashMap defaults to 0.75.
The calculation to determine whether rehashing is necessary occurs whenever a new entry is added, e.g. put places a new key/value. So if you specify an initial capacity of list.size() and a load factor of 1, then it will rehash after the last put. So to prevent rehashing, use a load factor of 1 and a capacity of list.size() + 1.
EDIT
Looking at the HashMap source code, it will rehash if the old size meets or exceeds the threshold, so it won't rehash on the last put. So it looks like a capacity of list.size() should be fine.
HashMap<Integer, T> map = new HashMap<Integer, T>(list.size(), 1.0f);
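Note the trade-off quoted above: a load factor of 1 decreases space overhead but can increase lookup cost, so this is only worthwhile when the map truly never grows beyond the initial contents.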
Here's the relevant piece of HashMap source code:
void addEntry(int hash, K key, V value, int bucketIndex) {
    Entry<K,V> e = table[bucketIndex];
    table[bucketIndex] = new Entry<>(hash, key, value, e);
    // size is compared *before* the increment: the put that brings the map
    // exactly to the threshold does not resize; the next put does
    if (size++ >= threshold)
        resize(2 * table.length);
}