LINQ Performance for Large Collections

梦毁少年i 2021-02-02 11:48

I have a large collection of strings (up to 1M) alphabetically sorted. I have experimented with LINQ queries against this collection using HashSet, SortedDictionary, and Dictio

6 Answers
  • 2021-02-02 11:57

    If you are trying to optimize looking up a list of strings with a given prefix you might want to take a look at implementing a Trie (not to be mistaken with a regular tree) data structure in C#.

    Tries offer very fast prefix lookups and have a very small memory overhead compared to other data structures for this sort of operation.
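
    To make the idea concrete, here is a minimal trie sketch. The names (PrefixTrie, Node, StartsWith) are made up for illustration, and it keeps a SortedDictionary per node purely for brevity, so it is not tuned for memory; treat it as a starting point rather than a finished implementation.

    ```csharp
    using System;
    using System.Collections.Generic;
    using System.Linq;

    // Hypothetical minimal trie; all names here are illustrative only.
    public sealed class PrefixTrie
    {
        private sealed class Node
        {
            // Children keyed by the next character, kept in ascending char order.
            public readonly SortedDictionary<char, Node> Children =
                new SortedDictionary<char, Node>();
            public bool IsWord; // true when a stored string ends at this node
        }

        private readonly Node _root = new Node();

        public void Add(string value)
        {
            Node node = _root;
            foreach (char c in value)
            {
                Node next;
                if (!node.Children.TryGetValue(c, out next))
                {
                    next = new Node();
                    node.Children.Add(c, next);
                }
                node = next;
            }
            node.IsWord = true;
        }

        // Lazily yields every stored string that begins with the prefix
        // (including the prefix itself, if it was added), in ascending order.
        public IEnumerable<string> StartsWith(string prefix)
        {
            Node node = _root;
            foreach (char c in prefix)
            {
                Node next;
                if (!node.Children.TryGetValue(c, out next))
                    yield break; // no stored string has this prefix
                node = next;
            }

            var pending = new Stack<KeyValuePair<string, Node>>();
            pending.Push(new KeyValuePair<string, Node>(prefix, node));
            while (pending.Count > 0)
            {
                KeyValuePair<string, Node> current = pending.Pop();
                if (current.Value.IsWord)
                    yield return current.Key;
                // Push children in reverse so they are popped in ascending order.
                foreach (KeyValuePair<char, Node> child in current.Value.Children.Reverse())
                    pending.Push(new KeyValuePair<string, Node>(
                        current.Key + child.Key, child.Value));
            }
        }
    }
    ```

    Because StartsWith is lazy, the length filter and the limit from the question can be layered on top, for example trie.StartsWith(query).Where(s => s.Length > query.Length).Take(limit).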

    As for LINQ to Objects in general, it's not unusual to see it run slower than the equivalent SQL query. The net is littered with articles analyzing its performance.

  • 2021-02-02 12:00

    I bet you have an index on the column, so SQL Server can do the comparison in O(log(n)) operations rather than O(n). To imitate the SQL Server behavior, use a sorted collection, find the first string s such that s >= query, then scan forward until you reach a value that does not start with query, applying any additional filter to the values along the way. This is what is called a range scan (Oracle) or an index seek (SQL Server).

    This is some example code which is very likely to go into infinite loops or have one-off errors because I didn't test it, but you should get the idea.

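    // Requires: using System; using System.Collections.Generic;
    // Note: the list must be sorted (in ordinal order) before being passed to this function.
    IEnumerable<string> FindStringsThatStartWith(List<string> list, string query) {
        // Binary search for the first element that is >= query.
        int low = 0, high = list.Count;
        while (low < high) {
            int mid = low + (high - low) / 2;
            if (string.CompareOrdinal(list[mid], query) < 0)
                low = mid + 1;
            else
                high = mid;
        }

        // Scan forward while the elements still start with the query.
        while (low < list.Count && list[low].StartsWith(query, StringComparison.Ordinal)) {
            // Skip an exact match; only strings longer than the query are wanted.
            if (list[low].Length > query.Length)
                yield return list[low];
            low++;
        }
    }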
    
  • 2021-02-02 12:03

    Just looking at your code, I would say that you should reorder the comparison to take advantage of short-circuiting when using boolean operators:

    foreach (var stringitem in MyCollection.Where(
        x => x.Length > query.Length && x.StartsWith(query)).Take(limit))
    

    The comparison of length is always going to be an O(1) operation (as the length is being stored as part of the string, it doesn't count each character every time), whereas the call to StartsWith is going to be an O(N) operation, where N is the length of query (or the length of the string, whichever is smaller).

    By placing the comparison of length before the call to StartsWith, if that comparison fails, you save yourself some extra cycles which could add up when processing large numbers of items.

    I don't think that a lookup table is going to help you here, as lookup tables are good when you are comparing the entire key, not parts of the key, like you are doing with the call to StartsWith.

    Rather, you might be better off using a tree structure which is split based on the letters in the words in the list.

    However, at that point, you are really just recreating what SQL Server is doing (in the case of indexes) and that would just be a duplication of effort on your part.

  • 2021-02-02 12:06

    I think the problem is that LINQ has no way to use the fact that your sequence is already sorted. In particular, it cannot know that filtering with StartsWith preserves the order.

    I would suggest using the List.BinarySearch method together with an IComparer<string> that compares only the first query.Length characters (this might be tricky, since it's not clear whether the query string will always be the first or the second parameter to Compare()).

    You could even use the standard string comparison: when the query itself is not in the list, BinarySearch returns a negative number whose bitwise complement (~) is the index of the first element that is larger than your query.

    You then have to scan outward from the returned index (in both directions!) to find all elements matching your query string.
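
    A rough sketch of the standard-comparison variant follows. It assumes the list was sorted with StringComparer.Ordinal, and the class and method names are hypothetical. With the ordinal comparer, every longer match sorts after the query itself, so this version only walks forward; the walk in both directions is needed when a prefix-only comparer can land anywhere inside the matching run.

    ```csharp
    using System;
    using System.Collections.Generic;

    public static class PrefixSearch
    {
        // Assumes sortedList is sorted with StringComparer.Ordinal.
        public static IEnumerable<string> FindWithBinarySearch(
            List<string> sortedList, string query)
        {
            int index = sortedList.BinarySearch(query, StringComparer.Ordinal);
            if (index < 0)
                index = ~index; // not found: complement is the first larger element (or Count)

            // Everything before index is <= query, so it cannot be a longer match.
            for (; index < sortedList.Count
                   && sortedList[index].StartsWith(query, StringComparison.Ordinal); index++)
            {
                // Skip exact matches; only strings longer than the query are wanted.
                if (sortedList[index].Length > query.Length)
                    yield return sortedList[index];
            }
        }
    }
    ```

    The result is lazy, so adding .Take(limit) from System.Linq stops the scan early.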

  • 2021-02-02 12:09

    In your current code you don't make use of any of the special features of the Dictionary / SortedDictionary / HashSet collections; you are using them the same way that you would use a List. That is why you don't see any difference in performance.

    If you use a dictionary as an index, where the first few characters of each string are the key and a list of strings is the value, you can use the search string to pick out the small part of the entire collection that contains the possible matches.

    I wrote the class below to test this. If I populate it with a million strings and search with an eight-character string, it rips through all possible matches in about 3 ms. Searching with a one-character string is the worst case, but it finds the first 1000 matches in about 4 ms. Finding all matches for a one-character string takes about 25 ms.

    The class creates indexes for 1, 2, 4 and 8 character keys. If you look at your specific data and what you search for, you should be able to select what indexes to create to optimise it for your conditions.

    public class IndexedList {
    
        private class Index : Dictionary<string, List<string>> {
    
            private int _indexLength;
    
            public Index(int indexLength) {
                _indexLength = indexLength;
            }
    
            public void Add(string value) {
                if (value.Length >= _indexLength) {
                    string key = value.Substring(0, _indexLength);
                    List<string> list;
                    if (!this.TryGetValue(key, out list)) {
                        Add(key, list = new List<string>());
                    }
                    list.Add(value);
                }
            }
    
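            public IEnumerable<string> Find(string query, int limit) {
                // Return no results instead of throwing when the key is absent.
                List<string> list;
                if (!this.TryGetValue(query.Substring(0, _indexLength), out list)) {
                    return Enumerable.Empty<string>();
                }
                return list
                    .Where(s => s.Length > query.Length && s.StartsWith(query))
                    .Take(limit);
            }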
    
        }
    
        private Index _index1;
        private Index _index2;
        private Index _index4;
        private Index _index8;
    
        public IndexedList(IEnumerable<string> values) {
            _index1 = new Index(1);
            _index2 = new Index(2);
            _index4 = new Index(4);
            _index8 = new Index(8);
            foreach (string value in values) {
                _index1.Add(value);
                _index2.Add(value);
                _index4.Add(value);
                _index8.Add(value);
            }
        }
    
        public IEnumerable<string> Find(string query, int limit) {
            if (query.Length >= 8) return _index8.Find(query, limit);
            if (query.Length >= 4) return _index4.Find(query, limit);
            if (query.Length >= 2) return _index2.Find(query, limit);
            return _index1.Find(query, limit);
        }
    
    }
    
  • 2021-02-02 12:13

    If you're doing a "starts with", you only care about ordinal comparisons, and you can have the collection sorted (again in ordinal order) then I would suggest you have the values in a list. You can then binary search to find the first value which starts with the right prefix, then go down the list linearly yielding results until the first value which doesn't start with the right prefix.

    In fact, you could probably do another binary search for the first value which doesn't start with the prefix, so you'd have a start and an end point. Then you just need to apply the length criterion to that matching portion. (I'd hope that if it's sensible data, the prefix matching is going to get rid of most candidate values.) The way to find the first value which doesn't start with the prefix is to search for the lexicographically-first value which doesn't - e.g. with a prefix of "ABC", search for "ABD".

    None of this uses LINQ, and it's all very specific to your particular case, but it should work. Let me know if any of this doesn't make sense.
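
    For what it's worth, the two binary searches described above could look roughly like this. It is only a sketch: it assumes the list is sorted in ordinal order, and PrefixRange, LowerBound and EndOfRange are names invented for the example.

    ```csharp
    using System;
    using System.Collections.Generic;

    public static class PrefixRange
    {
        // First index whose element is >= key in ordinal order (Count if none).
        private static int LowerBound(List<string> sorted, string key)
        {
            int low = 0, high = sorted.Count;
            while (low < high)
            {
                int mid = low + (high - low) / 2;
                if (string.CompareOrdinal(sorted[mid], key) < 0) low = mid + 1;
                else high = mid;
            }
            return low;
        }

        // Index just past the last element starting with prefix: search for the
        // lexicographically-first string that does not start with it ("ABC" -> "ABD").
        private static int EndOfRange(List<string> sorted, string prefix)
        {
            int n = prefix.Length;
            // A trailing char.MaxValue cannot be incremented, so drop it first.
            while (n > 0 && prefix[n - 1] == char.MaxValue) n--;
            if (n == 0) return sorted.Count;
            string upper = prefix.Substring(0, n - 1) + (char)(prefix[n - 1] + 1);
            return LowerBound(sorted, upper);
        }

        // Assumes sorted is in ordinal order. Yields up to limit strings that
        // start with prefix and are longer than it.
        public static IEnumerable<string> Find(List<string> sorted, string prefix, int limit)
        {
            int start = LowerBound(sorted, prefix);
            int end = EndOfRange(sorted, prefix);
            int found = 0;
            for (int i = start; i < end && found < limit; i++)
            {
                // Everything in [start, end) starts with prefix; apply the length criterion.
                if (sorted[i].Length > prefix.Length)
                {
                    found++;
                    yield return sorted[i];
                }
            }
        }
    }
    ```

    With the list AB, ABC, ABCD, ABCE, ABD, B and the prefix "ABC", the range covers ABC, ABCD and ABCE, and the length criterion then drops ABC.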
