OrderBy and Top in LINQ with good performance

梦想的初衷 提交于 2019-12-20 12:11:13

问题


What is a good way to get the top 10 records from a very large collection and use a custom OrderBy? If I use the LINQ to Objects OrderBy method it is slow and takes a lot of memory because it creates an entire new collection with the new order. I would like a new method with the signature below that does not re-order the entire collection and is very fast:

public static IEnumerable<TSource> OrderByTop<TSource, TKey>(
    IEnumerable<TSource> source,
    Func<TSource, TKey> keySelector,
    IComparer<TKey> comparer,
    int topCount)

I tried to write it but it got very complicated and I thought there might be any easier way using Aggregate or something. Any help would be appreciated.

Answer

Thanks for the help. I ended up with the code below:

public static List<TSource> OrderByTop<TSource, TKey>(
    this IEnumerable<TSource> source,
    Func<TSource, TKey> keySelector,
    IComparer<TKey> comparer,
    int topCount)
{
    var itemComparer = keySelector.ToIComparer(comparer);
    return source.Aggregate(
        new List<TSource>(topCount),
        (List<TSource> list, TSource item) =>
            list.SortedInsert(item, itemComparer, topCount));
}

The List Extension method SortedInsert follows:

public static List<T> SortedInsert<T>(
    this List<T> list,
    T item,
    IComparer<T> comparer,
    int maxLength)
{
    if (list.Count == maxLength)
        if (comparer.Compare(item, list[maxLength - 1]) >= 0)
            return list;
        else
            list.RemoveAt(maxLength - 1);
    int insertIndex = list.BinarySearch(item, comparer);
    if (insertIndex < 0)
        insertIndex = ~insertIndex;
    list.Insert(insertIndex, item);
    return list;
}

For those interested I also had keySelector Extension method to convert to IComparer.

public static IComparer<TSource> ToIComparer<TSource, TKey>(
    this Func<TSource, TKey> keySelector,
    IComparer<TKey> comparer)
{
    return new KeySelectorToIComparerConverter<TSource, TKey>(
        keySelector,
        comparer);
}
private class KeySelectorToIComparerConverter<TSource, TKey>
    : IComparer<TSource>
{
    private readonly IComparer<TKey> comparer;
    private readonly Func<TSource, TKey> keySelector;
    public KeySelectorToIComparerConverter(
        Func<TSource, TKey> keySelector,
        IComparer<TKey> comparer)
    {
        this.comparer = comparer;
        this.keySelector = keySelector;
    }
    public int Compare(TSource x, TSource y)
    {
        return comparer.Compare(keySelector(x), keySelector(y));
    }
}

回答1:


Aggregate is a good place to start with:

SortedList<TKey, TSource> resultlist = new SortedList<TKey, TSource>();
MyBigList.Aggregate(resultlist, (aktlist,entry) => {
   aktlist.Add(entry.Key, entry);
   if (aktlist.Count > 10) aktlist.RemoveAt(10);
   return aktlist;
});

If you want a different comparer, you can specify one in the constructor of the SortedList.

EDIT As mentioned by nikie, SortedListcannot contain double values. You can use a standard list together with BinarySearch to achieve the same effect:

List<TSource> resultlist = new List<TSource>();
MyBigList.Aggregate(resultlist, (aktlist, entry) => {
   int index = aktlist.BinarySearch(entry);
   if (index < 0) index = ~index;
   if (index < 10) aktlist.Insert(index, entry);
   if (aktlist.Count > 10) aktlist.RemoveAt(10);
   return aktlist;
});

Again a custom comparer (together with a custom key selection) can be used as parameter to BinarySearch.




回答2:


I think what you want is really a selection algorithm. I don't know that LINQ is the best way to implement one since I think it basically ends up as selection by sorting. You ought to be able to do this in O(kN), where k is the "top" number of items by iterating through the collection, keeping track of the minimum "top" element seen so far and if the current element is bigger than that, replacing that element with the current element (and updating the new minimum element). This is space efficient as well.

When you are done you can return the "top" elements as an ordered collection.

Note: I'm assuming LINQ to Objects here. If you are using LINQ to SQL, then I'd defer simply defer the ordering/selection to the SQL server and simply chain the methods appropriately to get a select top N ... from ... order by ... query.

Completely untested, not even compiled. Uses a generic Fibonacci Heap implementation. I'll post the code on my blog (http://farm-fresh-code.blogspot.com) sometime soon. I've got one hanging around (not sure if it's generic) as a result of some experiments with priority queues that I was doing. See wikipedia for info and pseudocode until then.

public static IEnumerable<TSource> OrderByTop<TSource, TKey>(
    IEnumerable<TSource> source,
    Func<TSource, TKey> keySelector,
    IComparer<TKey> comparer,
    int topCount)
{
    // allocate enough space to hold the number of elements (+1 as a new candidate is added)
    FibonacciHeap<TKey,TSource> top = new FibonacciHeap<TKey,TSource>( comparer );
    foreach (var candidate in source) // O(n)
    {
         TKey key = keySelector(candidate);
         TKey minimum = top.AccessMinimum();
         if (minimum == null || comparer.Compare( key, minimum.Key ) > 0) // O(1)
         {
             top.Insert( key, candidate ); // O(1)
             if (top.Count >= topCount)
             {
                 top.DeleteMinimum(); // O(logk)
             }
         }
    }
    return top.ToList().Reverse().Select( t.Value ); // O(k)   
}



回答3:


I do not know an other solution than writing this method. However this method should not be that complicated.

You need to maintain a sorted list with the top 10 elements, and iterate through the orinigal collection once.

If the current record during the iteration is smaller, than the last one from the top 10 list, or when you do not have your first 10 records yet, then you have to add the item to this list. (And of course, remove the last item from the top 10 list, when appropriate.)




回答4:


You could also implement a divide-and-conquer sorting algorithm like quicksort and break as soon as you have the first k sorted elements. But tvanfosson's suggestion is probably faster if k << N



来源:https://stackoverflow.com/questions/2078048/orderby-and-top-in-linq-with-good-performance

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!