.NET: How to efficiently check for uniqueness in a List of 50,000 items?

南旧 2021-02-01 15:28

In some library code, I have a List that can contain 50,000 items or more.

Callers of the library can invoke methods that result in strings being added to the list. Ho

6 Answers
  • 2021-02-01 15:47

    You should use the HashSet<T> class, which is specifically designed for what you're doing.

  • 2021-02-01 15:49

    Does the Contains(T) function not work for you?

  • 2021-02-01 15:55

    Use HashSet<string> instead of List<string>, then it should scale very well.
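
    If the library still needs the list itself (for ordering or indexing), one possible sketch is to keep a HashSet<string> next to it purely for the duplicate check. The class name UniqueStringList and its members are hypothetical, and the ordinal, case-sensitive comparer is an assumption about what "unique" means here.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical wrapper (not from the original post): keeps the List<string>
// for ordered access and a HashSet<string> for fast duplicate checks.
public sealed class UniqueStringList
{
    private readonly List<string> _items = new List<string>();

    // Assumption: uniqueness is ordinal and case-sensitive.
    private readonly HashSet<string> _seen = new HashSet<string>(StringComparer.Ordinal);

    public IReadOnlyList<string> Items => _items;

    // Returns true if the value was added, false if it was already present.
    public bool TryAdd(string value)
    {
        // HashSet<T>.Add returns false when the element already exists,
        // so no separate Contains call is needed.
        if (!_seen.Add(value))
        {
            return false;
        }

        _items.Add(value);
        return true;
    }
}
```

    The trade-off is extra memory for the set in exchange for not scanning the whole list on every add.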

  • 2021-02-01 16:02

    Possibly off-topic, but if you want to scale very large unique sets of strings (millions+) in a language-independent way, you might check out Bloom Filters.
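
    For illustration only, here is a minimal Bloom filter sketch. SimpleBloomFilter is a hypothetical class, not a .NET library type, and the hash functions (FNV-1a plus a djb2-style hash combined by double hashing) are choices made for this example. A Bloom filter can return false positives but never false negatives, so it cannot by itself guarantee uniqueness.

```csharp
using System;
using System.Collections;

// Minimal Bloom filter sketch (hypothetical class, for illustration only).
public sealed class SimpleBloomFilter
{
    private readonly BitArray _bits;
    private readonly int _hashCount;

    public SimpleBloomFilter(int bitCount, int hashCount)
    {
        if (bitCount <= 0) throw new ArgumentOutOfRangeException(nameof(bitCount));
        if (hashCount <= 0) throw new ArgumentOutOfRangeException(nameof(hashCount));
        _bits = new BitArray(bitCount);
        _hashCount = hashCount;
    }

    // Sets the hashCount bit positions derived from the value.
    public void Add(string value)
    {
        Hash(value, out uint h1, out uint h2);
        for (int i = 0; i < _hashCount; i++)
        {
            _bits[IndexFor(h1, h2, i)] = true;
        }
    }

    // false: the value was definitely never added.
    // true:  the value was probably added (false positives are possible).
    public bool MightContain(string value)
    {
        Hash(value, out uint h1, out uint h2);
        for (int i = 0; i < _hashCount; i++)
        {
            if (!_bits[IndexFor(h1, h2, i)]) return false;
        }
        return true;
    }

    // Double hashing: index_i = (h1 + i * h2) mod bitCount.
    private int IndexFor(uint h1, uint h2, int i)
    {
        ulong combined = h1 + (ulong)i * h2;
        return (int)(combined % (ulong)_bits.Length);
    }

    // Two deterministic hashes over the UTF-16 code units of the string.
    private static void Hash(string value, out uint h1, out uint h2)
    {
        h1 = 2166136261; // FNV-1a offset basis
        h2 = 5381;       // djb2 starting value
        foreach (char c in value)
        {
            h1 = unchecked((h1 ^ c) * 16777619);
            h2 = unchecked(h2 * 33 + c);
        }
        h2 |= 1; // keep the step non-zero
    }
}
```

    When MightContain returns true, an exact structure (such as a HashSet<string> or an on-disk index) would still have to confirm the match.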

  • 2021-02-01 16:04

    From my tests, HashSet<string> takes no time compared to List<string> :)
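
    The original test code was not posted; a rough way to reproduce such a comparison might look like the sketch below. UniquenessTiming and its method names are hypothetical, and Stopwatch gives only a coarse measurement, not a rigorous benchmark.

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics;

// Hypothetical helper for comparing the two approaches on the same input.
public static class UniquenessTiming
{
    // List approach: Contains scans the list, so each check is O(n).
    public static int AddUniqueWithList(IEnumerable<string> candidates)
    {
        var list = new List<string>();
        foreach (string s in candidates)
        {
            if (!list.Contains(s)) list.Add(s);
        }
        return list.Count;
    }

    // HashSet approach: Add ignores duplicates in expected O(1) per item.
    public static int AddUniqueWithHashSet(IEnumerable<string> candidates)
    {
        var set = new HashSet<string>();
        foreach (string s in candidates)
        {
            set.Add(s);
        }
        return set.Count;
    }

    // Runs the action once and returns the elapsed wall-clock time.
    public static TimeSpan Time(Func<int> action)
    {
        Stopwatch stopwatch = Stopwatch.StartNew();
        action();
        stopwatch.Stop();
        return stopwatch.Elapsed;
    }
}
```

    Calling Time(() => AddUniqueWithList(data)) and Time(() => AddUniqueWithHashSet(data)) with the same 50,000-string data set should make the difference visible; the actual numbers depend on the machine and the data.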

  • 2021-02-01 16:06

    I have read that Dictionary<> is implemented as an associative array. In some languages (not necessarily anything related to .NET), string keys are stored in a tree structure that forks at each node based on the character at that node (a trie). Please see http://en.wikipedia.org/wiki/Associative_arrays.

    A similar data structure was devised by Aho and Corasick in 1975. If you store 50,000 strings in such a structure, the number of strings matters little; what matters more is the length of the strings. If they are about the same length, you will likely never see a slowdown in lookups, because the search is linear in run-time with respect to the length of the string you are searching for. Even for a red-black tree or AVL tree, the search run-time depends more on the length of the string you are searching for than on the number of elements in the index. However, if you choose to implement your index keys with a hash function, you incur the cost of hashing the string (O(m), m = string length) plus the lookup of the string in the index, which is O(1) on average for a hash table (O(log(n)), n = number of elements in the index, would apply to a balanced tree).
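
    A tree that forks on each character is usually called a trie. The sketch below is a minimal version to show why lookup cost follows the length of the string rather than the number of stored strings; StringTrie is a hypothetical class, and it is not the Aho-Corasick automaton itself, which adds failure links for multi-pattern matching.

```csharp
using System.Collections.Generic;

// Minimal trie sketch (hypothetical class, for illustration only).
public sealed class StringTrie
{
    private sealed class Node
    {
        public readonly Dictionary<char, Node> Children = new Dictionary<char, Node>();
        public bool IsEndOfWord;
    }

    private readonly Node _root = new Node();

    // Returns true if the word was newly inserted, false if already present.
    // Walks one node per character, so the cost is O(m) for a word of length m.
    public bool Add(string word)
    {
        Node node = _root;
        foreach (char c in word)
        {
            if (!node.Children.TryGetValue(c, out Node next))
            {
                next = new Node();
                node.Children.Add(c, next);
            }
            node = next;
        }

        if (node.IsEndOfWord) return false;
        node.IsEndOfWord = true;
        return true;
    }

    // Returns true only if the exact word was added (a mere prefix does not count).
    public bool Contains(string word)
    {
        Node node = _root;
        foreach (char c in word)
        {
            if (!node.Children.TryGetValue(c, out node)) return false;
        }
        return node.IsEndOfWord;
    }
}
```

    In practice a trie allocates a node per character, so for plain uniqueness checks in .NET the HashSet<string> suggested in the other answers is likely the simpler choice.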

    edit: I'm not a .NET guru. Other more experienced people suggest another structure. I would take their word over mine.

    edit2: your analysis is a little off for comparing uniqueness. If you use a hashing structure or dictionary, then it will not be an O(n^2) operation because of the reasoning I posted above. If you continue to use a list, then you are correct that it is O(n^2) * (max length of a string in your set) because you must examine each element in the list each time.
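
    As a worked count of the list case, assume the worst case where all n strings are distinct, so the check before each add scans every item already in the list. The add that finds i items already present makes i comparisons, each costing up to m character comparisons (m = max string length):

```latex
\[
\sum_{i=0}^{n-1} i = \frac{n(n-1)}{2},
\qquad
n = 50{,}000:\quad \frac{50{,}000 \times 49{,}999}{2} = 1{,}249{,}975{,}000 \ \text{string comparisons}
\]
```

    With a hashing structure the same workload is n insertions at expected O(m) each, that is, about 50,000 hash-and-probe operations.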
