How to check an IEnumerable for multiple conditions with a single enumeration without buffering?

Submitted by 让人想犯罪 __ on 2020-07-08 20:31:23

Question


I have a very long sequence of data in the form of an IEnumerable, and I would like to check it against a number of conditions. Each condition returns true or false, and I want to know whether all conditions are true. My problem is that I cannot afford to materialize the IEnumerable by calling ToList, because it is simply too long (> 10,000,000,000 elements). Nor can I afford to enumerate the sequence multiple times, once for each condition, because each enumeration yields a different sequence. I am looking for an efficient way to perform this check, using the existing LINQ functionality if possible.


Clarification: I am asking for a general solution, not for a solution to the specific example problem presented below.


Here is a dummy version of my sequence:

static IEnumerable<int> GetLongSequence()
{
    var random = new Random();
    for (long i = 0; i < 10_000_000_000; i++) yield return random.Next(0, 100_000_000);
}

And here is an example of the conditions that the sequence must satisfy:

var source = GetLongSequence();
var result = source.Any(n => n % 28_413_803 == 0)
    && source.All(n => n < 99_999_999)
    && source.Average(n => n) > 50_000_001;

Unfortunately, this approach enumerates the sequence returned by GetLongSequence three times, so it doesn't satisfy the requirements of the problem.
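To see the repeated enumeration concretely, here is a minimal sketch with a short stand-in sequence. The names EnumerationCounter, Numbers, Starts and ThreeQueries are made up for this illustration and do not come from the question; the counter is incremented each time a new enumeration of the iterator begins.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper, for illustration only.
public static class EnumerationCounter
{
    // Number of times an enumeration of Numbers() has been started.
    public static int Starts;

    public static IEnumerable<int> Numbers()
    {
        Starts++; // the iterator body runs from the top for every new enumeration
        for (int i = 1; i <= 5; i++) yield return i;
    }

    // Same shape as the query in the question, on a tiny sequence.
    public static bool ThreeQueries()
    {
        var source = Numbers();
        return source.Any(n => n % 5 == 0)
            && source.All(n => n < 10)
            && source.Average(n => n) > 2;
    }
}
```

Each of Any, All and Average calls GetEnumerator on the source separately, so when all three operators run, Starts grows by three for one call to ThreeQueries. With a Random-based iterator, every one of those passes also sees different values.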

I tried to write a Linqy extension method of the above, hoping that this could give me some ideas:

public static bool AllConditions<TSource>(this IEnumerable<TSource> source,
    params Func<IEnumerable<TSource>, bool>[] conditions)
{
    foreach (var condition in conditions)
    {
        if (!condition(source)) return false;
    }
    return true;
}

This is how I intend to use it:

var result = source.AllConditions
(
    s => s.Any(n => n % 28_413_803 == 0),
    s => s.All(n => n < 99_999_999),
    s => s.Average(n => n) > 50_000_001,
    // more conditions...
);

Unfortunately, this offers no improvement. The sequence is again enumerated three times.

After hitting my head against the wall for an hour, without making any progress, I figured out a possible solution. I could run each condition in a separate thread, and synchronize their access to a single shared enumerator of the sequence. So I ended up with this monstrosity:

public static bool AllConditions<TSource>(this IEnumerable<TSource> source,
    params Func<IEnumerable<TSource>, bool>[] conditions)
{
    var locker = new object();
    var enumerator = source.GetEnumerator();
    var barrier = new Barrier(conditions.Length);
    long index = -1;
    bool finished = false;

    IEnumerable<TSource> OneByOne()
    {
        try
        {
            while (true)
            {
                TSource current;
                lock (locker)
                {
                    if (finished) break;
                    if (barrier.CurrentPhaseNumber > index)
                    {
                        index = barrier.CurrentPhaseNumber;
                        finished = !enumerator.MoveNext();
                        if (finished)
                        {
                            enumerator.Dispose(); break;
                        }
                    }
                    current = enumerator.Current;
                }
                yield return current;
                barrier.SignalAndWait();
            }
        }
        finally
        {
            barrier.RemoveParticipant();
        }
    }

    var results = new ConcurrentQueue<bool>();
    var threads = conditions.Select(condition => new Thread(() =>
    {
        var result = condition(OneByOne());
        results.Enqueue(result);
    })
    { IsBackground = true }).ToArray();
    foreach (var thread in threads) thread.Start();
    foreach (var thread in threads) thread.Join();
    return results.All(r => r);
}

For the synchronization I used a Barrier. This solution actually works way better than I thought. It can process almost 1,000,000 elements per second on my machine. It is not fast enough though, since it needs almost 3 hours to process the full sequence of 10,000,000,000 elements, and I can't wait longer than 5 minutes for the result. Any ideas about how I could run these conditions efficiently in a single thread?


Answer 1:


If you need to ensure that the sequence is enumerated only once, conditions operating on the whole sequence are not useful. One possibility that comes to my mind is to have an interface which is called for each element of the sequence and implement this interface in different ways for your specific conditions:

bool Example()
{
    var source = GetLongSequence();

    var conditions = new List<IEvaluate<int>>
    {
        new Any<int>(n => n % 28_413_803 == 0),
        new All<int>(n => n < 99_999_999),
        new Average(d => d > 50_000_001)
    };

    foreach (var item in source)
    {
        foreach (var condition in conditions)
        {
            condition.Evaluate(item);
        }
    }

    return conditions.All(c => c.Result);   
}

static IEnumerable<int> GetLongSequence()
{
    var random = new Random();
    for (long i = 0; i < 10_000_000_000; i++) yield return random.Next(0, 100_000_000);
}

interface IEvaluate<T>
{
    void Evaluate(T item);
    bool Result { get; }
}

class Any<T> : IEvaluate<T>
{
    private bool _result;
    private readonly Func<T, bool> _predicate;

    public Any(Func<T, bool> predicate)
    {
        _predicate = predicate;
        _result = false;
    }

    public void Evaluate(T item)
    {
        if (_predicate(item))
        {
            _result = true;
        }
    }

    public bool Result => _result;
}


class All<T> : IEvaluate<T>
{
    private bool _result;
    private readonly Func<T, bool> _predicate;

    public All(Func<T, bool> predicate)
    {
        _predicate = predicate;
        _result = true;
    }

    public void Evaluate(T item)
    {
        if (!_predicate(item))
        {
            _result = false;
        }
    }

    public bool Result => _result;
}

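class Average : IEvaluate<int>
{
    private long _sum;
    private long _count; // long, because the sequence has more than int.MaxValue elements
    private readonly Func<double, bool> _evaluate;

    public Average(Func<double, bool> evaluate)
    {
        _evaluate = evaluate; // store the check that is applied to the final average
    }

    public void Evaluate(int item)
    {
        _sum += item;
        _count++;
    }

    public bool Result => _evaluate((double)_sum / _count);
}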
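The loop in Example always reads the sequence to the end, even after a condition such as All has already failed. One possible refinement, sketched here under assumptions of my own, is to let each condition report whether its result can still change. The names IEarlyEvaluate, IsDecided, AnyCheck, AllCheck and EarlyExit.CheckAll are hypothetical and not part of the answer above.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical variant of IEvaluate<T> that also exposes "already decided".
public interface IEarlyEvaluate<T>
{
    void Evaluate(T item);
    bool Result { get; }
    bool IsDecided { get; } // true once further items cannot change Result
}

public sealed class AnyCheck<T> : IEarlyEvaluate<T>
{
    private readonly Func<T, bool> _predicate;
    public AnyCheck(Func<T, bool> predicate) => _predicate = predicate;
    public bool Result { get; private set; }
    public bool IsDecided => Result; // a single match settles Any as true
    public void Evaluate(T item) { if (_predicate(item)) Result = true; }
}

public sealed class AllCheck<T> : IEarlyEvaluate<T>
{
    private readonly Func<T, bool> _predicate;
    public AllCheck(Func<T, bool> predicate) => _predicate = predicate;
    public bool Result { get; private set; } = true;
    public bool IsDecided => !Result; // a single failure settles All as false
    public void Evaluate(T item) { if (!_predicate(item)) Result = false; }
}

public static class EarlyExit
{
    // Single pass over source; stops as soon as one condition is decided false.
    public static bool CheckAll<T>(IEnumerable<T> source, params IEarlyEvaluate<T>[] conditions)
    {
        foreach (var item in source)
        {
            foreach (var condition in conditions)
            {
                if (condition.IsDecided) continue; // skip work for settled conditions
                condition.Evaluate(item);
                if (condition.IsDecided && !condition.Result) return false;
            }
        }
        return conditions.All(c => c.Result);
    }
}
```

A condition like the average can never be decided before the end, so it would simply return false from IsDecided. The early exit therefore only helps when some condition fails partway through the sequence.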



Answer 2:


If all you want is to check these three conditions on a single thread in only one enumeration, I wouldn't use LINQ; I would aggregate the checks manually:

bool anyVerified = false;
bool allVerified = true;
double averageSoFar = 0;

foreach (int n in GetLongSequence()) {
    anyVerified = anyVerified || n % 28_413_803 == 0;
    allVerified = allVerified && n < 99_999_999;
    averageSoFar += n / 10_000_000_000.0; // floating-point division; integer division would always add 0
    // Early out conditions here...
}
return anyVerified && allVerified && averageSoFar > 50_000_001;

This could be made more generic if you plan to do these checks often but it looks like it satisfies all your requirements.
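Hand-rolled averages over int values are easy to get subtly wrong, because dividing two integer values truncates before any conversion to double takes place. The following sketch contrasts the two behaviours on a tiny input. AverageSketch, Truncated and Exact are names invented for this illustration, and both methods assume a non-empty sequence.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper, for illustration only. Both methods assume a non-empty input.
public static class AverageSketch
{
    // Integer division: the fractional part is discarded before the
    // result is converted to double.
    public static double Truncated(IEnumerable<int> items)
    {
        long sum = 0, count = 0;
        foreach (int n in items) { sum += n; count++; }
        return sum / count;
    }

    // Casting one operand to double first keeps the fraction.
    public static double Exact(IEnumerable<int> items)
    {
        long sum = 0, count = 0;
        foreach (int n in items) { sum += n; count++; }
        return (double)sum / count;
    }
}
```

For the input { 1, 2 }, Truncated returns 1 while Exact returns 1.5. The accumulators are long on purpose: 10,000,000,000 values below 100,000,000 sum to less than 10^18, which still fits in a long (maximum about 9.22 * 10^18), whereas an int sum or an int counter would overflow long before the end of the sequence.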




Answer 3:


I can also suggest another approach, based on the Enumerable.Aggregate LINQ extension method.

public static class Parsing {
    public static bool ParseOnceAndCheck(this IEnumerable<int> collection, Func<int, bool> all, Func<int, bool> any, Func<double, bool> average) {
        // Aggregate the two boolean results, the sum of all values and the count of values.
        // The sum and count are long, since int would overflow on a sequence this long.
        (bool allVerified, bool anyVerified, long sum, long count) = collection.Aggregate(
            ValueTuple.Create(true, false, 0L, 0L),
            (tuple, item) => ValueTuple.Create(tuple.Item1 && all(item), tuple.Item2 || any(item), tuple.Item3 + item, tuple.Item4 + 1)
        );
        // ... and summarize the result, using floating-point division for the average
        return allVerified && anyVerified && average((double)sum / count);
    }
}

You can call this extension method much like the usual LINQ methods, but the sequence is enumerated only once:

IEnumerable<int> sequence = GetLongSequence();
bool result = sequence.ParseOnceAndCheck(
    all: n => n < 99_999_999,
    any: n => n % 28_413_803 == 0,
    average: a => a > 50_000_001
);



Answer 4:


I found a single-threaded solution that uses the Reactive Extensions library. On the one hand it's an excellent solution regarding features and ease of use, since all methods that are available in LINQ for IEnumerable are also available in RX for IObservable. On the other hand it is a bit disappointing regarding performance, as it is as slow as my wacky multi-threaded solution that is presented inside my question.


Update: I replaced the previous two implementations (one using the method Replay, the other using the method Publish) with a new one that uses the class Subject. This class is a low-level combination of an IObservable and an IObserver. I post the items of the source IEnumerable to it, and they are then propagated to all the IObservable<bool>'s provided by the caller. The performance is now decent, only 40% slower than Klaus Gütter's excellent solution. I can also now break out of the loop early if a condition (like All) can be determined to be false before the end of the enumeration.

public static bool AllConditions<TSource>(this IEnumerable<TSource> source,
    params Func<IObservable<TSource>, IObservable<bool>>[] conditions)
{
    var subject = new Subject<TSource>();
    var result = true;
    foreach (var condition in conditions)
    {
        condition(subject).SingleAsync().Subscribe(onNext: value =>
        {
            if (value) return;
            result = false;
        });
    }
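    foreach (var item in source)
    {
        if (!result) break;
        subject.OnNext(item);
    }
    // Signal completion so that operators which only emit at the end
    // (Average, an Any with no match, an All with no failure) produce their value.
    subject.OnCompleted();
    return result;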
}

Usage example:

var result = source.AllConditions
(
    o => o.Any(n => n % 28_413_803 == 0),
    o => o.All(n => n < 99_999_999),
    o => o.Average(n => n).Select(v => v > 50_000_001)
);

Each condition should return an IObservable containing a single boolean value. This is not enforceable by the RX API, so I used the System.Reactive.Linq.SingleAsync method to enforce it at runtime (by throwing an exception if a result doesn't comply with this contract).



Source: https://stackoverflow.com/questions/58578480/how-to-check-an-ienumerable-for-multiple-conditions-with-a-single-enumeration-wi
