Should I use a `HashSet` or a `TreeSet` for a very large dataset?

一个人的身影 2021-02-09 00:44

I have a requirement to store 2 to 15 million accounts (each a String of length 15) in a data structure for lookups and uniqueness checks. Initially I

2 Answers
  •  臣服心动
    2021-02-09 01:11

    If you have 48 GB of dedicated memory for your 2 to 15 million records, your best bet is probably a HashMap, with an Integer or String key depending on your requirements.

    You will be fine as far as hash collisions go as long as you give enough memory to the Map and have an appropriate load factor.

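    I recommend using the following constructor: new HashMap<>(13_000_000); (13 million is 30% above a 10-million-record estimate; HashMap's implementation automatically rounds it up to 2^24 cells). With the default load factor of 0.75, a 2^24-cell table resizes once it holds more than 12,582,912 entries, so scale the argument up if you expect to reach the 15 million upper bound. Sizing the map up front tells your application that it will be very large from the get-go, so it doesn't need to grow repeatedly as you populate it.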
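The sizing arithmetic above can be sketched in Java. This is a minimal sketch, not HashMap's own code: `PresizedSets`, `initialCapacityFor`, `tableSizeFor` and `resizeThreshold` are hypothetical helper names, and `tableSizeFor` only imitates the power-of-two rounding described above, assuming the default load factor of 0.75.

```java
import java.util.HashSet;
import java.util.Set;

final class PresizedSets {
    // Default load factor documented for java.util.HashMap and HashSet.
    static final float DEFAULT_LOAD_FACTOR = 0.75f;

    // Smallest initial capacity that should hold expectedEntries without a resize.
    static int initialCapacityFor(int expectedEntries, float loadFactor) {
        return (int) Math.ceil(expectedEntries / (double) loadFactor);
    }

    // Assumption: models the rounding of a requested capacity up to a power of two,
    // capped at 2^30. It is an illustration, not the library's internal method.
    static int tableSizeFor(int requestedCapacity) {
        int n = Math.max(requestedCapacity, 1);
        if (n >= (1 << 30)) {
            return 1 << 30;
        }
        int highest = Integer.highestOneBit(n);
        return highest == n ? highest : highest << 1;
    }

    // Number of entries a table of this size holds before it has to grow.
    static int resizeThreshold(int tableSize, float loadFactor) {
        return (int) (tableSize * (double) loadFactor);
    }

    // Builds a HashSet sized up front for the expected number of accounts.
    static Set<String> newAccountSet(int expectedEntries) {
        return new HashSet<>(initialCapacityFor(expectedEntries, DEFAULT_LOAD_FACTOR));
    }
}
```

Under these assumptions, `tableSizeFor(13_000_000)` gives 16,777,216 (2^24), while covering 15 million records at load factor 0.75 calls for an initial capacity of 20,000,000, which rounds up to 2^25.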

    HashMap offers expected O(1) access to its members, whereas TreeMap has O(log n) lookup time, but TreeMap can be more efficient with memory and doesn't need a clever hash function. However, if you're using String or Integer keys, you don't need to worry about designing a hash function, and the constant-time lookups will be a huge improvement. The other advantage of TreeMap / TreeSet is sorted ordering, which you stated you don't care about, so use HashMap.
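To make the trade-off concrete, here is a small sketch of what each collection gives you. `SetOrderingDemo` and its method names are hypothetical; the point is only that both sets answer `contains`, while only the TreeSet yields its elements in sorted order.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

final class SetOrderingDemo {
    // TreeSet keeps elements sorted (natural String order here) and drops duplicates.
    static List<String> sortedAccounts(Collection<String> accounts) {
        return new ArrayList<>(new TreeSet<>(accounts));
    }

    // HashSet makes no ordering promise; it is used purely for membership checks.
    static boolean hashedLookup(Collection<String> accounts, String account) {
        Set<String> lookup = new HashSet<>(accounts);
        return lookup.contains(account);
    }
}
```

If sorted iteration is never needed, as in the question, the ordering work a TreeSet does on every insert buys nothing.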

    If the only purpose of the collection is to check for unique account numbers, then everything I've said above still holds, but as you stated in your question, you should use a HashSet, not a HashMap. The performance recommendations and the constructor argument still apply.
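A uniqueness check with a HashSet might look like the sketch below. `AccountRegistry` is a hypothetical wrapper introduced for illustration; it relies on the fact that `Set.add` returns false when the element is already present, so a separate `contains` call before inserting is unnecessary.

```java
import java.util.HashSet;
import java.util.Set;

final class AccountRegistry {
    private final Set<String> accounts;

    // initialCapacity is passed straight to HashSet; pick it from the expected record count.
    AccountRegistry(int initialCapacity) {
        this.accounts = new HashSet<>(initialCapacity);
    }

    // Returns true if the account was new, false if it was a duplicate.
    boolean register(String account) {
        return accounts.add(account);
    }

    // Constant-time (expected) membership lookup.
    boolean isKnown(String account) {
        return accounts.contains(account);
    }

    int size() {
        return accounts.size();
    }
}
```

For the real dataset the constructor argument would be in the millions, for example `new AccountRegistry(13_000_000)`; a small value is enough for experimenting.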

    Further reading: HashSet and TreeSet performance test
