I was reading the javadocs on HashSet when I came across the interesting statement:
This class offers constant time performance for the basic operations (add, remove, contains and size), assuming the hash function disperses the elements properly among the buckets.
The number of buckets is dynamic, and is approximately 2n, where n is the number of elements in the set.
Note that HashSet gives amortized and average time performance of O(1), not worst case. This means we can suffer an O(n) operation from time to time.
So, when the bins get too packed, we just create a new, bigger array and copy the elements into it. This costs n operations, and is done when the number of elements in the set exceeds 2n/2 = n; since roughly n insertions happen between consecutive resizes, the average cost of this operation is bounded by n/n = 1, which is a constant.
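To make the resize policy concrete, here is a minimal sketch of a chained hash table that keeps roughly 2n buckets by doubling. It illustrates the scheme described above; it is not the real java.util.HashSet implementation:

```java
import java.util.LinkedList;

// Minimal illustrative hash set with separate chaining.
// Not java.util.HashSet; just a sketch of the doubling policy above.
class SimpleHashSet<E> {
    private LinkedList<E>[] buckets;
    private int size = 0;

    @SuppressWarnings("unchecked")
    SimpleHashSet() {
        buckets = new LinkedList[2];
    }

    public boolean add(E e) {
        if (contains(e)) return false;
        // Resize when elements exceed half the bucket count,
        // keeping roughly 2n buckets for n elements.
        if (size + 1 > buckets.length / 2) resize();
        bucketFor(e, buckets).add(e);
        size++;
        return true;
    }

    public boolean contains(E e) {
        return bucketFor(e, buckets).contains(e);
    }

    @SuppressWarnings("unchecked")
    private void resize() {
        LinkedList<E>[] bigger = new LinkedList[buckets.length * 2];
        // The O(n) maintenance step: rehash every element into the bigger array.
        for (LinkedList<E> bucket : buckets) {
            if (bucket == null) continue;
            for (E e : bucket) bucketFor(e, bigger).add(e);
        }
        buckets = bigger;
    }

    private LinkedList<E> bucketFor(E e, LinkedList<E>[] table) {
        int index = Math.floorMod(e.hashCode(), table.length);
        if (table[index] == null) table[index] = new LinkedList<>();
        return table[index];
    }
}
```

The O(n) cost sits entirely inside resize(), which is why a single add can occasionally be expensive even though the amortized cost stays constant.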
Additionally, the number of collisions a HashMap encounters is also constant on average.
Assume you are adding an element x. The probability of the bucket h(x) already containing one element is ~n/2n = 1/2. The probability of it already containing two elements is ~(n/2n)^2 = 1/4 (for large values of n), and so on. This gives an average running time of 1 + 1/2 + 1/4 + 1/8 + .... Since this sum converges to 2, this operation takes constant time on average.
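If you want to see this balls-into-bins intuition empirically, here is a quick throwaway simulation (my own sketch, not anything from the JDK) that drops n random keys into 2n buckets and measures the average chain length an insert has to walk; it stays near a small constant, well below the bound of 2 derived above:

```java
import java.util.Random;

// Drops n random keys into 2n buckets and measures the average
// number of probes per insert; it should stay near a small constant.
public class ChainLengthDemo {
    public static void main(String[] args) {
        Random rnd = new Random(42);
        for (int n : new int[]{1_000, 100_000, 1_000_000}) {
            int bucketCount = 2 * n;          // ~2n buckets, as above
            int[] chains = new int[bucketCount];
            long probes = 0;
            for (int i = 0; i < n; i++) {
                int bucket = rnd.nextInt(bucketCount);
                probes += chains[bucket] + 1; // walk the chain, then append
                chains[bucket]++;
            }
            System.out.printf("n=%,d  average probes per insert: %.3f%n",
                    n, (double) probes / n);
        }
    }
}
```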
What I know about hashed structures is that, to keep O(1) complexity for insertion and removal, you need a good hash function to avoid collisions, and the structure should not be full (if the structure is full, you will have collisions).
Normally, hashed structures define a fill limit (the load factor), for example 70%. When the number of objects fills the structure beyond this limit, you should extend its size to stay below the limit and guarantee performance. Generally you double the size of the structure when reaching the limit, so that the structure grows faster than the number of elements, which reduces the number of resize/maintenance operations to perform.
This resize is a kind of maintenance operation that consists of rehashing all elements contained in the structure to redistribute them across the resized structure. Of course this has a cost, whose complexity is O(n) with n the number of elements stored in the structure, but this cost is not reflected in the complexity of the add function in general: it is paid only by the occasional add that makes the maintenance operation necessary.
I think this is what disturbs you.
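A toy calculation may make this concrete. With the defaults Java's HashMap actually uses (initial capacity 16, load factor 0.75, doubling on resize), the points at which that O(n) rehash fires spread out exponentially:

```java
// Prints the resize schedule implied by a 0.75 fill limit and doubling.
// The capacities match java.util.HashMap's defaults; the loop is just an
// illustration of the policy described above.
public class ResizeSchedule {
    public static void main(String[] args) {
        int capacity = 16;
        double loadFactor = 0.75;
        for (int i = 0; i < 6; i++) {
            System.out.printf("capacity=%d -> rehash after %d elements%n",
                    capacity, (int) (capacity * loadFactor));
            capacity *= 2; // doubling makes each rehash twice as rare
        }
    }
}
```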
I also learned that the hash function generally depends on the size of the structure, which is used as a parameter (there was something about the structure size being a prime number to reduce the probability of collisions, or something like that), meaning that you don't change the hash function itself, you just change one of its parameters.
To answer your comment: there is no guarantee that, if buckets 0 or 1 were filled, new elements will go into buckets 3 and 4 when you resize to 4. Perhaps resizing to 4 makes elements A and B end up in buckets 0 and 3.
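To make that concrete, here is a toy example with made-up hash values (and plain modulo indexing, rather than the bit masking a real HashMap uses) showing that the same unchanged hash lands in a different bucket once the capacity parameter grows:

```java
// Same hash values, different capacity: the hash function is unchanged,
// only the size parameter used for indexing changes on resize.
public class BucketMoveDemo {
    public static void main(String[] args) {
        int[] hashes = {6, 9};              // hypothetical hash codes for A and B
        for (int capacity : new int[]{2, 4}) {
            for (int h : hashes) {
                System.out.printf("hash=%d, capacity=%d -> bucket %d%n",
                        h, capacity, h % capacity);
            }
        }
    }
}
```

Here A moves from bucket 0 to bucket 2 while B stays in bucket 1; where an element ends up after a resize depends only on the new modulus.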
For sure, all of the above is theoretical, and in real life you don't have infinite memory, you can have collisions, maintenance has a cost, etc. That's why you need an idea of the number of objects you will store, and make a trade-off with available memory to choose an initial size for the hashed structure that will limit the number of maintenance operations and let you stay within O(1) performance.
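In practice, that trade-off is what the capacity/load-factor constructors are for. For example, if you expect to store about a million elements, presizing like this (the sizing formula is the usual rule of thumb, not something mandated by the API) avoids all intermediate rehashes:

```java
import java.util.HashSet;
import java.util.Set;

public class PresizedSet {
    public static void main(String[] args) {
        int expected = 1_000_000; // your estimate of the number of objects
        // A capacity of at least expected/loadFactor keeps the set under
        // its fill limit, so no resize/rehash happens while populating it.
        Set<Integer> set = new HashSet<>((int) (expected / 0.75f) + 1, 0.75f);
        for (int i = 0; i < expected; i++) {
            set.add(i);
        }
        System.out.println(set.size());
    }
}
```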