Fastest way to check if a value exists in a list

猫巷女王i 2020-11-22 00:18

What is the fastest way to know if a value exists in a list (a list with millions of values in it) and what its index is?

I know that all values in the list are unique.

12 Answers
  • 2020-11-22 00:46

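    Code to check whether an array contains two elements whose product equals k:

    arr1 = [2, 3, 4, 6]  # example input; values assumed unique and non-zero
    k = 12
    seen = set(arr1)  # a set gives average O(1) membership tests
    for i in arr1:
        # i is one factor; the pair exists only if k // i is also in the array
        if k % i == 0 and k // i != i and k // i in seen:
            print(i, k // i)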
    
  • 2020-11-22 00:47

    Or call __contains__ directly (this is the method the in operator invokes on the container):

    sequence.__contains__(value)
    

    Demo:

    >>> l=[1,2,3]
    >>> l.__contains__(3)
    True
    >>> 
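
To illustrate that the in operator and an explicit __contains__ call agree, here is a minimal sketch. The EvenNumbers class is a hypothetical container invented for this example; it is not part of the answer above.

```python
class EvenNumbers:
    """Hypothetical container that 'contains' every even integer."""

    def __contains__(self, value):
        # Called by the expression `value in EvenNumbers()`
        return isinstance(value, int) and value % 2 == 0


evens = EvenNumbers()
l = [1, 2, 3]

# The operator form and the explicit method call give the same answer.
same_for_list = (3 in l) == l.__contains__(3)
in_evens = 4 in evens
not_in_evens = 7 in evens
print(same_for_list, in_evens, not_in_evens)
```

In everyday code the operator form (value in sequence) is the usual spelling; calling the dunder method directly mostly matters when you want to see what the operator does.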
    
  • 2020-11-22 00:48

    The original question was:

    What is the fastest way to know if a value exists in a list (a list with millions of values in it) and what its index is?

    Thus there are two things to find:

    1. is an item in the list, and
    2. what is the index (if in the list).

    To that end, I modified @xslittlegrass's code to compute indexes in all cases, and added an additional method.

    Results

    Methods are:

    1. in: basically if x in b: return b.index(x)
    2. try: try/except around b.index(x) (skips the separate check for x in b)
    3. set: basically if x in set(b): return b.index(x)
    4. bisect: sort b together with its indexes, then binary search for x in the sorted list. (This is a modification of @xslittlegrass's code, which returns the index in the sorted b rather than in the original b.)
    5. reverse: build a reverse lookup dictionary d for b; then d[x] gives the index of x.

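    The results show that method 5 (the reverse lookup dictionary) is the fastest.

    Interestingly, the try and set methods take about the same time. That is plausible, since both still call b.index(x), a linear scan, for every item that is found.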
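
As a compact illustration of method 5, here is a minimal sketch of a reverse lookup dictionary. The helper name build_index and the sample data are assumptions made for this example, and the approach relies on the values being unique and hashable.

```python
def build_index(values):
    """Map each value to its position; assumes values are unique and hashable."""
    return {value: position for position, value in enumerate(values)}


b = [40, 10, 30, 20]
index_of = build_index(b)        # one O(n) pass over the list

found = index_of.get(30, -1)     # average O(1) per query afterwards
missing = index_of.get(99, -1)   # -1 signals "not in the list"
print(found, missing)
```

Building the dictionary costs one pass and extra memory, so it tends to pay off when the same list is queried many times.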


    Test Code

    import random
    import bisect
    import matplotlib.pyplot as plt
    import math
    import timeit
    import itertools
    
    def wrapper(func, *args, **kwargs):
        " Wrap func and its arguments in a zero-argument callable for timeit "
        # Reference https://www.pythoncentral.io/time-a-python-function/
        def wrapped():
            return func(*args, **kwargs)
        return wrapped
    
    def method_in(a,b,c):
        for i,x in enumerate(a):
            if x in b:
                c[i] = b.index(x)
            else:
                c[i] = -1
        return c
    
    def method_try(a,b,c):
        for i, x in enumerate(a):
            try:
                c[i] = b.index(x)
            except ValueError:
                c[i] = -1
        return c  # return c like the other methods do
    
    def method_set_in(a,b,c):
        s = set(b)
        for i,x in enumerate(a):
            if x in s:
                c[i] = b.index(x)
            else:
                c[i] = -1
        return c
    
    def method_bisect(a,b,c):
        " Finds indexes using bisection "
    
        # Create a sorted copy of b, pairing each value with its original index
        bsorted = sorted([(x, i) for i, x in enumerate(b)], key = lambda t: t[0])
    
        for i,x in enumerate(a):
            # (x, ) sorts before any (x, i), so this finds the first entry for x
            index = bisect.bisect_left(bsorted,(x, ))
            c[i] = -1
            # the bounds check must use the list being searched, not a
            if index < len(bsorted):
                if x == bsorted[index][0]:
                    c[i] = bsorted[index][1]  # index in the b array
    
        return c
    
    def method_reverse_lookup(a, b, c):
        reverse_lookup = {x:i for i, x in enumerate(b)}
        for i, x in enumerate(a):
            c[i] = reverse_lookup.get(x, -1)
        return c
    
    def profile():
        Nls = [x for x in range(1000,20000,1000)]
        number_iterations = 10
        methods = [method_in, method_try, method_set_in, method_bisect, method_reverse_lookup]
        time_methods = [[] for _ in range(len(methods))]
    
        for N in Nls:
            a = [x for x in range(0,N)]
            random.shuffle(a)
            b = [x for x in range(0,N)]
            random.shuffle(b)
            c = [0 for x in range(0,N)]
    
            for i, func in enumerate(methods):
                wrapped = wrapper(func, a, b, c)
                time_methods[i].append(math.log(timeit.timeit(wrapped, number=number_iterations)))
    
        markers = itertools.cycle(('o', '+', '.', '>', '2'))
        colors = itertools.cycle(('r', 'b', 'g', 'y', 'c'))
        labels = itertools.cycle(('in', 'try', 'set', 'bisect', 'reverse'))
    
        for i in range(len(time_methods)):
            plt.plot(Nls,time_methods[i],marker = next(markers),color=next(colors),linestyle='-',label=next(labels))
    
        plt.xlabel('list size', fontsize=18)
        plt.ylabel('log(time)', fontsize=18)
        plt.legend(loc = 'upper left')
        plt.show()
    
    profile()
    
  • 2020-11-22 00:49

    As stated by others, in can be very slow for large lists. Here are some comparisons of the performance of in, set and bisect. Note that the time (in seconds) is plotted on a log scale.

    Code for testing:

    import random
    import bisect
    import matplotlib.pyplot as plt
    import math
    import time
    
    
    def method_in(a, b, c):
        start_time = time.time()
        for i, x in enumerate(a):
            if x in b:
                c[i] = 1
        return time.time() - start_time
    
    
    def method_set_in(a, b, c):
        start_time = time.time()
        s = set(b)
        for i, x in enumerate(a):
            if x in s:
                c[i] = 1
        return time.time() - start_time
    
    
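    def method_bisect(a, b, c):
        start_time = time.time()
        b.sort()  # note: sorts b in place, and the sort is included in the timing
        for i, x in enumerate(a):
            index = bisect.bisect_left(b, x)
            # the bounds check must use the list being searched (b), not a
            if index < len(b):
                if x == b[index]:
                    c[i] = 1
        return time.time() - start_time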
    
    
    def profile():
        time_method_in = []
        time_method_set_in = []
        time_method_bisect = []
    
        # adjust range down if runtime is too great, or up if there are too many zero entries in any of the time_method lists
        Nls = [x for x in range(10000, 30000, 1000)]
        for N in Nls:
            a = [x for x in range(0, N)]
            random.shuffle(a)
            b = [x for x in range(0, N)]
            random.shuffle(b)
            c = [0 for x in range(0, N)]
    
            time_method_in.append(method_in(a, b, c))
            time_method_set_in.append(method_set_in(a, b, c))
            time_method_bisect.append(method_bisect(a, b, c))
    
        plt.plot(Nls, time_method_in, marker='o', color='r', linestyle='-', label='in')
        plt.plot(Nls, time_method_set_in, marker='o', color='b', linestyle='-', label='set')
        plt.plot(Nls, time_method_bisect, marker='o', color='g', linestyle='-', label='bisect')
        plt.xlabel('list size', fontsize=18)
        plt.ylabel('log(time)', fontsize=18)
        plt.legend(loc='upper left')
        plt.yscale('log')
        plt.show()
    
    
    profile()
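
The bisect approach can be reduced to a small membership helper. This is a minimal sketch that assumes the list is already sorted; the name sorted_contains and the sample data are made up for this example.

```python
import bisect


def sorted_contains(sorted_values, x):
    """Binary-search membership test; sorted_values must already be sorted."""
    i = bisect.bisect_left(sorted_values, x)
    # i can equal len(sorted_values) when x is larger than every element,
    # so check the bound before indexing.
    return i < len(sorted_values) and sorted_values[i] == x


data = sorted([7, 3, 9, 1])
print(sorted_contains(data, 9), sorted_contains(data, 4))
```

Each lookup is O(log n), but the initial sort is O(n log n), so this mainly helps when the data is already sorted or is searched repeatedly.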
    
  • 2020-11-22 00:54
    7 in a
    

    Clearest and fastest way to do it.

    You can also consider using a set, but constructing that set from your list may take more time than the faster membership testing will save. The only way to be certain is to benchmark well. It also depends on which operations you need.
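
Since the question also asks for the index, a one-off lookup can be done in a single pass over the list without building any extra structure. This is a minimal sketch; find_index is a hypothetical helper name, and returning None for a missing value is a choice made here.

```python
def find_index(values, target):
    """Return the position of target in values, or None if it is absent."""
    try:
        # list.index scans once and raises ValueError when nothing matches
        return values.index(target)
    except ValueError:
        return None


a = [5, 7, 9]
print(find_index(a, 7), find_index(a, 4))
```

Returning None rather than -1 avoids confusion with Python's negative indexing, where a[-1] is a valid position.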

  • 2020-11-22 00:55

    Be aware that the in operator tests not only equality (==) but also identity (is). The in logic for lists is roughly equivalent to the following (in CPython it is actually implemented in C, not Python):

    def list_contains(s, target):  # illustrative name, not a real built-in
        for element in s:
            if element is target:
                # fast check: identity implies equality
                return True
            if element == target:
                # slower check for actual equality
                return True
        return False
    

    In most circumstances this detail is irrelevant, but in some it might surprise a Python novice. For example, numpy.NAN has the unusual property of not being equal to itself:

    >>> import numpy
    >>> numpy.NAN == numpy.NAN
    False
    >>> numpy.NAN is numpy.NAN
    True
    >>> numpy.NAN in [numpy.NAN]
    True
    

    To distinguish between these unusual cases, you could use any():

    >>> lst = [numpy.NAN, 1 , 2]
    >>> any(element == numpy.NAN for element in lst)
    False
    >>> any(element is numpy.NAN for element in lst)
    True 
    

    Note that the in logic for lists, expressed with any(), would be:

    any(element is target or element == target for element in lst)
    

    However, I should emphasize that this is an edge case. In the vast majority of cases the in operator is highly optimised and exactly what you want (with either a list or a set).
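
The same identity-versus-equality behaviour can be reproduced without NumPy. The sketch below uses plain float NaN values from the standard library; the variable names are chosen for this example only.

```python
import math

nan = float("nan")
other_nan = float("nan")   # a distinct object that is also NaN
lst = [nan, 1, 2]

# Found through the identity check, even though nan != nan
same_object_found = nan in lst
# A different NaN object is neither identical nor equal to the stored one
different_object_found = other_nan in lst
# A pure equality test never matches NaN
equality_only = any(element == nan for element in lst)
# An explicit NaN test works regardless of object identity
has_nan = any(isinstance(e, float) and math.isnan(e) for e in lst)

print(same_object_found, different_object_found, equality_only, has_nan)
```

If a list may contain NaN values from different sources, an explicit math.isnan check like the last line is a more predictable test than relying on in.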
