I'm thinking about filling a collection with a large number of unique objects. How does the cost of an insert into a Set (say HashSet) compare to a List (say ArrayList)?
You have to compare concrete implementations (for example HashSet with ArrayList), because the abstract interfaces Set/List don't really tell you anything about performance.
Inserting into a HashSet is a pretty cheap operation, as long as the hashCode() of the object to be inserted is sane. It will still be slightly slower than inserting into an ArrayList, because for an ArrayList an insertion is a simple write into an array (assuming you insert at the end and there's still free space; I don't factor in resizing the internal array, because the same cost applies to HashSet as well).
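A minimal sketch of that comparison (the class name, element count, and use of Integer elements are my own choices for illustration; the timings are indicative only, not a rigorous benchmark):

```java
import java.util.ArrayList;
import java.util.HashSet;

public class InsertCost {
    // Fill an ArrayList: each add is a write into the backing array
    // (amortized O(1), with an occasional resize).
    static ArrayList<Integer> fillList(int n) {
        ArrayList<Integer> list = new ArrayList<>();
        for (int i = 0; i < n; i++) list.add(i);
        return list;
    }

    // Fill a HashSet: each add hashes the element and places it in a
    // bucket (also amortized O(1), but with hashing overhead and an
    // occasional rehash on resize).
    static HashSet<Integer> fillSet(int n) {
        HashSet<Integer> set = new HashSet<>();
        for (int i = 0; i < n; i++) set.add(i);
        return set;
    }

    public static void main(String[] args) {
        int n = 1_000_000;
        long t0 = System.nanoTime();
        fillList(n);
        long t1 = System.nanoTime();
        fillSet(n);
        long t2 = System.nanoTime();
        System.out.printf("list: %d ms, set: %d ms%n",
                (t1 - t0) / 1_000_000, (t2 - t1) / 1_000_000);
    }
}
```

Both loops are linear overall; the difference you'd measure is the constant factor of hashing versus a plain array write.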
I don't think you can make this judgement simply on the cost of building the collection. Other things you need to take into account (how you will later search, iterate, and modify the data, for example) can all affect your choice of data structure.
There is no "duplicate elimination" in the sense of comparing against all existing elements. If you insert into a hash set, it's really a dictionary of items indexed by hash code; duplicates are only checked against items that already have the same hash code. Given a reasonable (well-distributed) hash function, that's not bad at all.
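To illustrate why a well-distributed hashCode() matters, here is a sketch with a deliberately bad key class (BadKey is my own illustration, not from the answer): every instance lands in the same bucket, so each insert must compare against every existing element there, degrading toward O(n) per insert.

```java
import java.util.HashSet;

public class HashQuality {
    // A key whose hashCode maps every instance to the same bucket.
    static final class BadKey {
        final int id;
        BadKey(int id) { this.id = id; }
        @Override public boolean equals(Object o) {
            return o instanceof BadKey && ((BadKey) o).id == id;
        }
        @Override public int hashCode() { return 42; } // constant hash: worst case
    }

    public static void main(String[] args) {
        HashSet<BadKey> set = new HashSet<>();
        // Still correct (duplicates are rejected via equals), just slow:
        for (int i = 0; i < 2_000; i++) set.add(new BadKey(i));
        System.out.println(set.size()); // 2000
    }
}
```

With a proper hashCode (e.g. returning id), the same loop does roughly one equality check per insert instead of thousands.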
As Will has noted, because of the dictionary structure HashSet is probably a bit slower at insertion than an ArrayList (unless you want to insert "between" existing elements). It also takes a bit more memory. I'm not sure that's a significant difference, though.
If you're certain your data will already be unique, you can use a List. If you need uniqueness enforced for you, use a Set.
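A small sketch of enforcing that rule with a Set: add() returns a boolean telling you whether the element was actually new (the element values here are my own example):

```java
import java.util.HashSet;
import java.util.Set;

public class Dedup {
    public static void main(String[] args) {
        Set<String> seen = new HashSet<>();
        System.out.println(seen.add("a")); // true  - first occurrence
        System.out.println(seen.add("b")); // true
        System.out.println(seen.add("a")); // false - duplicate rejected
        System.out.println(seen.size());   // 2
    }
}
```

With a List you'd have to call contains() before every add to get the same guarantee, which is an O(n) scan each time.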
Sets are said to be faster than Lists for large data sets, while the inverse is true for smaller data sets; I haven't personally tested this claim.
Which type of List?
Also, consider which List to use. LinkedLists are faster at adding and removing elements, while ArrayLists are faster at random access (for loops, etc.), though the latter can be worked around using the Iterator of a LinkedList. ArrayLists are also much faster at list.toArray().
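To make the LinkedList point concrete: once a ListIterator is positioned, the insertion itself is O(1), with no array shifting. A small sketch (insertSorted is a hypothetical helper of my own, and the numbers are my own example):

```java
import java.util.LinkedList;
import java.util.List;
import java.util.ListIterator;

public class MidInsert {
    // Insert a value before the first element greater than it,
    // walking with a ListIterator; the add() itself is O(1).
    static void insertSorted(LinkedList<Integer> list, int value) {
        ListIterator<Integer> it = list.listIterator();
        while (it.hasNext()) {
            if (it.next() > value) {
                it.previous(); // step back so we insert before it
                break;
            }
        }
        it.add(value);
    }

    public static void main(String[] args) {
        LinkedList<Integer> list = new LinkedList<>(List.of(1, 3, 5));
        insertSorted(list, 4);
        System.out.println(list); // [1, 3, 4, 5]
    }
}
```

Doing the same with ArrayList.add(index, value) would shift every element after the insertion point.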
You're right: set structures are inherently more complex in order to recognize and eliminate duplicates. Whether this overhead is significant for your case should be tested with a benchmark.
Another factor is memory usage. If your objects are very small, the memory overhead introduced by the set structure can be significant. In the most extreme case (TreeSet&lt;Integer&gt; vs. ArrayList&lt;Integer&gt;) the set structure can require more than 10 times as much memory.
If the goal is uniqueness of the elements, you should use an implementation of the java.util.Set interface. The classes java.util.HashSet and java.util.LinkedHashSet have expected O(1) complexity for insert, delete and contains checks, assuming a well-distributed hash function.
ArrayList has O(n) complexity for a contains check by object (not by index), because you may have to scan the whole list, and for insertion anywhere but the tail, because the underlying array has to be shifted.
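That contains() difference can be seen directly (the element count and lookup value below are my own choices; the printed timings are illustrative, not a rigorous benchmark):

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class ContainsCost {
    public static void main(String[] args) {
        int n = 1_000_000;
        List<Integer> list = new ArrayList<>();
        Set<Integer> set = new HashSet<>();
        for (int i = 0; i < n; i++) { list.add(i); set.add(i); }

        long t0 = System.nanoTime();
        list.contains(n - 1);  // O(n): scans the backing array
        long t1 = System.nanoTime();
        set.contains(n - 1);   // expected O(1): one hash lookup
        long t2 = System.nanoTime();
        System.out.printf("list: %d us, set: %d us%n",
                (t1 - t0) / 1_000, (t2 - t1) / 1_000);
    }
}
```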
You can use LinkedHashSet, which preserves the order of insertion and has the same performance characteristics as HashSet (it just takes up a bit more memory).
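A quick illustration of that ordering difference (the element values are my own example):

```java
import java.util.HashSet;
import java.util.LinkedHashSet;
import java.util.Set;

public class OrderDemo {
    public static void main(String[] args) {
        Set<String> linked = new LinkedHashSet<>();
        Set<String> plain = new HashSet<>();
        for (String s : new String[] {"banana", "apple", "cherry"}) {
            linked.add(s);
            plain.add(s);
        }
        // LinkedHashSet iterates in insertion order:
        System.out.println(linked); // [banana, apple, cherry]
        // HashSet's iteration order is unspecified:
        System.out.println(plain);
    }
}
```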