I am exploring the HashSet type, but I don't understand where it stands among the collections. Can one use it to replace a List?
Here's a real example of where I use a HashSet<string>:
Part of my syntax highlighter for UnrealScript files is a new feature that highlights Doxygen-style comments. I need to be able to tell if a @ or \ command is valid to determine whether to show it in gray (valid) or red (invalid). I have a HashSet<string> of all the valid commands, so whenever I hit a @xxx token in the lexer, I use validCommands.Contains(tokenText) as my O(1) validity check. I really don't care about anything except the existence of the command in the set of valid commands. Let's look at the alternatives I faced:
Dictionary<string, ?>: What type do I use for the value? The value is meaningless since I'm just going to use ContainsKey. Note: Before .NET 3.5 this was the only choice for O(1) lookups; HashSet<T> was added in 3.5 and extended to implement ISet<T> in 4.0.
List<string>: If I keep the list sorted, I can use BinarySearch, which is O(log n) (I didn't see this fact mentioned above). However, since my list of valid commands is a fixed list that never changes, this will never be more appropriate than simply...
string[]: Again, Array.BinarySearch gives O(log n) performance. If the list is short, this could be the best-performing option. It always has less space overhead than HashSet, Dictionary, or List. Even with BinarySearch, it's not faster for large sets, but for small sets it'd be worth experimenting. Mine has several hundred items though, so I passed on this.
HashSet is a set implemented by hashing. A set is a collection of values containing no duplicate elements. The values in a set are also typically unordered. So no, a set cannot be used to replace a list (unless you should have used a set in the first place).
If you're wondering what a set might be good for: anywhere you want to get rid of duplicates, obviously. As a slightly contrived example, let's say you have a list of 10,000 revisions of a software project, and you want to find out how many people contributed to that project. You could use a Set<string>, iterate over the list of revisions, and add each revision's author to the set. Once you're done iterating, the size of the set is the answer you were looking for.
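The revision-counting idea above can be sketched in C# roughly as follows. The Revision class, its properties, and the sample author names are hypothetical and exist only for illustration; the sketch also assumes the default case-sensitive string comparison is what you want.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical revision record; only the author matters for this example.
public sealed class Revision
{
    public int Number { get; }
    public string Author { get; }

    public Revision(int number, string author)
    {
        Number = number;
        Author = author;
    }
}

public static class AuthorCounter
{
    // Adds every author to a set; duplicates are silently dropped,
    // so the final Count is the number of distinct contributors.
    public static int CountDistinctAuthors(IEnumerable<Revision> revisions)
    {
        var authors = new HashSet<string>();
        foreach (var revision in revisions)
        {
            authors.Add(revision.Author); // returns false if already present
        }
        return authors.Count;
    }

    public static void Main()
    {
        var revisions = new List<Revision>
        {
            new Revision(1, "alice"),
            new Revision(2, "bob"),
            new Revision(3, "alice"),
            new Revision(4, "carol"),
        };

        Console.WriteLine(CountDistinctAuthors(revisions)); // 3
    }
}
```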
In the basic intended scenario, HashSet<T> should be used when you want more specific set operations on two collections than LINQ provides. LINQ methods like Distinct, Union, Intersect and Except are enough in most situations, but sometimes you may need more fine-grained operations, and HashSet<T> provides:
UnionWith
IntersectWith
ExceptWith
SymmetricExceptWith
Overlaps
IsSubsetOf
IsProperSubsetOf
IsSupersetOf
IsProperSupersetOf
SetEquals
Another difference between the LINQ and HashSet<T> "overlapping" methods is that LINQ always returns a new IEnumerable<T>, while the HashSet<T> methods modify the source collection.
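As a small sketch of that difference, the following compares LINQ's Intersect with HashSet<T>.IntersectWith and shows a few of the query methods. The helper names LinqIntersect and IntersectInPlace and the sample values are made up for this example.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class SetOperationsDemo
{
    // LINQ style: builds a new sequence and leaves both inputs untouched.
    public static List<int> LinqIntersect(HashSet<int> first, HashSet<int> second)
    {
        return first.Intersect(second).ToList();
    }

    // HashSet style: mutates 'target' in place and returns nothing.
    public static void IntersectInPlace(HashSet<int> target, HashSet<int> other)
    {
        target.IntersectWith(other);
    }

    public static void Main()
    {
        var a = new HashSet<int> { 1, 2, 3, 4 };
        var b = new HashSet<int> { 3, 4, 5 };

        List<int> common = LinqIntersect(a, b);
        Console.WriteLine(common.Count); // 2
        Console.WriteLine(a.Count);      // 4, a is unchanged

        // Query methods accept any IEnumerable<T>, not only other sets.
        Console.WriteLine(a.Overlaps(b));                        // True
        Console.WriteLine(a.IsProperSupersetOf(new[] { 1, 2 })); // True
        Console.WriteLine(a.SetEquals(new[] { 4, 3, 2, 1, 1 })); // True

        IntersectInPlace(a, b);
        Console.WriteLine(a.Count);      // 2, a now holds only 3 and 4
    }
}
```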
In short: anytime you are tempted to use a Dictionary (or a Dictionary<S, T> where S is a property of T), you should consider a HashSet<T> (or a HashSet<T> plus an IEquatable<T> implementation on T that equates on S).
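One way to read that suggestion, as a hedged sketch: give the element type an equality that looks only at the "key" property, then store the objects directly in a HashSet<T>. The Command type and its Name and Description properties are hypothetical, and the sketch assumes Name is never null and never changes after construction.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical element type: equality is defined by Name only,
// so Name plays the role a dictionary key would otherwise play.
public sealed class Command : IEquatable<Command>
{
    public string Name { get; }
    public string Description { get; }

    public Command(string name, string description)
    {
        Name = name ?? throw new ArgumentNullException(nameof(name));
        Description = description;
    }

    public bool Equals(Command other)
    {
        return other != null && Name == other.Name;
    }

    public override bool Equals(object obj)
    {
        return Equals(obj as Command);
    }

    // Must agree with Equals: equal names give equal hash codes.
    public override int GetHashCode()
    {
        return Name.GetHashCode();
    }
}

public static class CommandSetDemo
{
    public static void Main()
    {
        var commands = new HashSet<Command>();

        Console.WriteLine(commands.Add(new Command("param", "first")));  // True
        Console.WriteLine(commands.Add(new Command("param", "second"))); // False, same Name
        Console.WriteLine(commands.Count);                                // 1
        Console.WriteLine(commands.Contains(new Command("param", "")));  // True
    }
}
```

If you cannot or do not want to change the type itself, passing a custom IEqualityComparer<T> to the HashSet<T> constructor is an alternative way to get the same effect.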
Probably the most common use for hash sets is to check whether they contain a certain element, which is close to an O(1) operation for them (assuming a sufficiently strong hashing function), as opposed to lists, for which a check for inclusion is O(n) (and sorted sets, for which it is O(log n)). So if you do a lot of checks for whether an item is contained in some list, hash sets might be a performance improvement. If you only ever iterate over them, there won't be much difference (iterating over the whole set is O(n), same as with lists, and hash sets have somewhat more overhead when adding items).
And no, you can't index a set, which would not make sense anyway, because sets aren't ordered. If you add some items, the set won't remember which one was first, and which second etc.
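A minimal sketch of the membership check, next to the same check on a list. The command names below are an illustrative sample and IsValidCommand is a made-up helper; both collections give the same answer, and the difference is only in how the lookup is performed.

```csharp
using System;
using System.Collections.Generic;

public static class MembershipDemo
{
    // Illustrative sample of command names, not a complete or authoritative list.
    private static readonly string[] SampleCommands = { "brief", "param", "return", "see" };

    // Hash lookup: close to O(1) regardless of how many commands are stored.
    public static bool IsValidCommand(HashSet<string> validCommands, string tokenText)
    {
        return validCommands.Contains(tokenText);
    }

    public static void Main()
    {
        var asSet = new HashSet<string>(SampleCommands);
        var asList = new List<string>(SampleCommands);

        Console.WriteLine(IsValidCommand(asSet, "param")); // True
        Console.WriteLine(IsValidCommand(asSet, "bogus")); // False

        // Same answer from a list, but found by a linear O(n) scan.
        Console.WriteLine(asList.Contains("param"));       // True

        // Lists are positional; sets are not.
        Console.WriteLine(asList[0]);                      // brief
        // asSet[0] would not compile: HashSet<T> has no indexer.
    }
}
```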
A HashSet<T> implements the ICollection<T> interface:
public interface ICollection<T> : IEnumerable<T>, IEnumerable
{
    // Methods
    void Add(T item);
    void Clear();
    bool Contains(T item);
    void CopyTo(T[] array, int arrayIndex);
    bool Remove(T item);

    // Properties
    int Count { get; }
    bool IsReadOnly { get; }
}
A List<T> implements IList<T>, which extends ICollection<T>:
public interface IList<T> : ICollection<T>
{
    // Methods
    int IndexOf(T item);
    void Insert(int index, T item);
    void RemoveAt(int index);

    // Properties
    T this[int index] { get; set; }
}
A HashSet has set semantics, implemented via a hashtable internally:
A set is a collection that contains no duplicate elements, and whose elements are in no particular order.
What does the HashSet gain, if it loses index/position/list behavior?
Adding and retrieving items from the HashSet is always by the object itself, not via an indexer, and close to an O(1) operation (List is O(1) add, O(1) retrieve by index, O(n) find/remove).
A HashSet's behavior could be compared to using a Dictionary<TKey,TValue> by only adding/removing keys as values, and ignoring the dictionary values themselves. You would expect keys in a dictionary not to have duplicate values, and that's the point of the "Set" part.