Find all the numbers in one file that are not in another file in python

灰色年华 2021-01-04 02:08

There are two files, FileA and FileB, and we need to find all the numbers that are in FileA but not in FileB. All the numbers in FileA are sorted and all t

5 answers
  • 2021-01-04 02:32

    If memory is limited and you need a linear-time solution, you can read both files line by line with iter, provided the files hold one number per line.

    First in your terminal you can do this to generate some test files:

    seq 0 3 100 > 3k.txt
    seq 0 2 100 > 2k.txt
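
    If seq is not available (for example on Windows), a rough Python equivalent is sketched below. The helper name write_seq is made up for this illustration and is not part of the original answer; like seq, it treats the last value as inclusive.

```python
def write_seq(path, first, step, last):
    """Write first, first+step, ... up to and including last, one number per line."""
    with open(path, "w") as f:
        f.writelines(f"{n}\n" for n in range(first, last + 1, step))


# Same files as the two seq commands above.
write_seq("3k.txt", 0, 3, 100)
write_seq("2k.txt", 0, 2, 100)
```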
    

    Then you run this code:

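    aNotB = []
    # bNotA = []
    with open("3k.txt") as f1, open("2k.txt") as f2:
        i1 = map(int, f1)  # lazily convert each line to an int
        i2 = map(int, f2)
        a = next(i1, None)
        b = next(i2, None)
        # Walk both sorted streams until one of them runs out.
        while a is not None and b is not None:
            if a < b:
                aNotB.append(a)  # a is smaller than everything left in B
                a = next(i1, None)
            elif a > b:
                # bNotA.append(b)
                b = next(i2, None)
            else:
                a = next(i1, None)
                b = next(i2, None)
        # If B ran out first, everything left in A is missing from B.
        if a is not None:
            aNotB.append(a)
            aNotB.extend(i1)
        # if b is not None:
        #     bNotA.append(b)
        #     bNotA.extend(i2)
    print(aNotB)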
    

    Output:

    [3, 9, 15, 21, 27, 33, 39, 45, 51, 57, 63, 69, 75, 81, 87, 93, 99]

    If you want both results, aNotB and bNotA, uncomment the bNotA lines.

    Timing comparison with Andrej Kesely's answer:

    $ seq 0 3 1000000 > 3k.txt
    $ seq 0 2 1000000 > 2k.txt
    $ time python manual_iter.py        
    python manual_iter.py  0.38s user 0.00s system 99% cpu 0.387 total
    $ time python heapq_groupby.py        
    python heapq_groupby.py  1.11s user 0.00s system 99% cpu 1.116 total
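
    The same two-iterator walk can also be packaged as a generator, so results are produced one at a time instead of being collected in a list. This is only a sketch: the name sorted_difference is invented here, and it assumes both inputs are sorted in ascending order.

```python
def sorted_difference(a, b):
    """Yield the items of sorted iterable `a` that do not occur in sorted iterable `b`."""
    b = iter(b)
    done = object()  # unique marker meaning "b is exhausted"
    y = next(b, done)
    for x in a:
        # Skip everything in b that is smaller than the current x.
        while y is not done and y < x:
            y = next(b, done)
        # x is missing from b if b ran out or its next value is already larger.
        if y is done or x < y:
            yield x


threes = range(0, 101, 3)
evens = range(0, 101, 2)
print(list(sorted_difference(threes, evens)))
# [3, 9, 15, 21, 27, 33, 39, 45, 51, 57, 63, 69, 75, 81, 87, 93, 99]
```

    With files, the inputs could be passed as map(int, open("3k.txt")) and map(int, open("2k.txt")), so that neither file is loaded into memory as a whole.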
    
  • 2021-01-04 02:32

    Because the files are sorted, you can walk through both one line at a time. If the current line of file A is less than the current line of file B, that value of A cannot be in B, so record it and advance A only. If the line in A is greater than the line in B, that value of B is not in A, so advance B only. If the two are equal, the value is in both, so advance both files. Your question asks only for entries that are in A but not in B; this answer also collects the entries in B that are not in A. That adds flexibility while still letting you print just the entries in A that are not in B.

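    def strip_read(file):
        # Returns "" at end of file, which is falsy.
        return file.readline().rstrip()
    
    in_a_not_b = []
    in_b_not_a = []
    with open("fileA") as A:
        with open("fileB") as B:
            Aline = strip_read(A)
            Bline = strip_read(B)
            while Aline or Bline:
                # Compare as integers, not strings; an exhausted file
                # counts as larger than anything left in the other one.
                if Aline and (not Bline or int(Aline) < int(Bline)):
                    in_a_not_b.append(Aline)
                    Aline = strip_read(A)
                elif Bline and (not Aline or int(Bline) < int(Aline)):
                    in_b_not_a.append(Bline)
                    Bline = strip_read(B)
                else:
                    # Equal values: present in both files.
                    Aline = strip_read(A)
                    Bline = strip_read(B)
    
    print("in A not in B", in_a_not_b, "\nin B not in A", in_b_not_a)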
    

    Output for my sample files:

    in A not in B ['2', '5', '7'] 
    in B not in A ['6']
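
    One detail worth checking in any line-based comparison: lines read from a file are strings, and Python compares strings character by character rather than by numeric value. The short illustration below (the sample values are made up) shows how the two orderings differ once the numbers have different lengths.

```python
lines = ["9", "10", "100"]

# Lexicographic order: "1..." sorts before "9".
print(sorted(lines))           # ['10', '100', '9']

# Numeric order: convert with int before comparing.
print(sorted(lines, key=int))  # ['9', '10', '100']

print("10" < "9", int("10") < int("9"))  # True False
```

    So a merge over numeric files should convert each line with int() before comparing, unless every line is guaranteed to have the same number of digits.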
    
  • 2021-01-04 02:33

    You can combine itertools.groupby and heapq.merge to iterate through FileA and FileB lazily (this works as long as the files are sorted!):

    FileA = [1, 1, 2, 3, 4, 5]
    FileB = [1, 3, 4, 6]
    
    from itertools import groupby
    from heapq import merge
    
    gen_a = ((v, 'FileA') for v in FileA)
    gen_b = ((v, 'FileB') for v in FileB)
    
    for v, g in groupby(merge(gen_a, gen_b, key=lambda k: int(k[0])), lambda k: int(k[0])):
        if any(v[1] == 'FileB' for v in g):
            continue
        print(v)
    

    Prints:

    2
    5
    

    EDIT (Reading from files):

    from itertools import groupby
    from heapq import merge
    
    gen_a = ((int(v.strip()), 1) for v in open('3k.txt'))
    gen_b = ((int(v.strip()), 2) for v in open('2k.txt'))
    
    for v, g in groupby(merge(gen_a, gen_b, key=lambda k: k[0]), lambda k: k[0]):
        if any(v[1] == 2 for v in g):
            continue
        print(v)
    

    Benchmark:

    Generating files with numbers up to 10_000_000:

    seq 0 3 10000000 > 3k.txt
    seq 0 2 10000000 > 2k.txt
    

    The script takes ~10sec to complete:

    real    0m10,656s
    user    0m10,557s
    sys 0m0,076s
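
    For reuse, the merge-and-group idea can be wrapped in a generator function. The following is a sketch, not the code that was benchmarked above: the names tagged and only_in_first are invented for illustration, and both inputs are assumed to be sorted in ascending order.

```python
from heapq import merge
from itertools import groupby
from operator import itemgetter


def tagged(numbers, tag):
    """Pair each number with a tag that records which input it came from."""
    return ((int(n), tag) for n in numbers)


def only_in_first(first, second):
    """Yield each distinct value of sorted `first` that is absent from sorted `second`."""
    merged = merge(tagged(first, 0), tagged(second, 1))
    for value, group in groupby(merged, key=itemgetter(0)):
        # Keep the value only if every occurrence came from the first input.
        if all(tag == 0 for _, tag in group):
            yield value


print(list(only_in_first([1, 1, 2, 3, 4, 5], [1, 3, 4, 6])))  # [2, 5]
```

    Because int() accepts a trailing newline, open file objects can be passed directly, for example only_in_first(open('3k.txt'), open('2k.txt')). Note that grouping collapses duplicates, so a value repeated in the first input is reported once.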
    
  • This is similar to the classic problems in Knuth's Sorting and Searching. You may wish to read the related Stack question, the online lecture notes (PDF), and the Wikipedia article. The Stack question mentions something I agree with: using the Unix sort command. Always test with your own data to make sure the method you choose is the most efficient for it, because some of these algorithms are data-dependent.
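
    To test with your own data, a small timing harness is usually enough. The sketch below is an assumption-laden example: time_call is a made-up helper, and set_difference is only a simple in-memory baseline (not one of the streaming answers here) standing in for whichever function you want to measure.

```python
import time


def time_call(func, *args):
    """Run func(*args) once and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = func(*args)
    return result, time.perf_counter() - start


def set_difference(a, b):
    """Baseline: loads all of b into a set, so it needs memory proportional to len(b)."""
    b_set = set(b)
    return [x for x in a if x not in b_set]


# Synthetic data; replace with numbers read from your own files.
a = list(range(0, 300_000, 3))
b = list(range(0, 300_000, 2))

result, seconds = time_call(set_difference, a, b)
print(len(result), f"{seconds:.3f}s")
```

    Timings vary with machine and data, so compare candidates on the same input and repeat each measurement a few times.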

  • 2021-01-04 02:50

    A simple solution based on file reading (assuming that each line holds a number):

    results = []
    with open('file1.csv') as file1, open('file2.csv') as file2:
        var1 = file1.readline()
        var2 = file2.readline()
        while var1:
            # Advance through both files while each still has lines.
            while var1 and var2:
                if int(var1) < int(var2):
                    results.append(int(var1))
                    var1 = file1.readline()
                elif int(var1) > int(var2):
                    var2 = file2.readline()
                else:
                    var1 = file1.readline()
                    var2 = file2.readline()
            # file2 is exhausted: the rest of file1 is missing from it.
            if var1:
                results.append(int(var1))
                var1 = file1.readline()
    print(results)
    # output: [2, 5, 7, 9]
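
    The assumption that every line holds a number breaks on blank lines, such as an extra newline at the end of a file, because int() raises ValueError on an empty string. A possible guard, sketched with a hypothetical helper named read_ints and an in-memory sample instead of a real file:

```python
import io


def read_ints(lines):
    """Yield one integer per non-blank line; blank lines are skipped."""
    for line in lines:
        line = line.strip()
        if line:
            yield int(line)


# io.StringIO stands in for an open file containing stray blank lines.
sample = io.StringIO("2\n5\n\n7\n9\n\n")
print(list(read_ints(sample)))  # [2, 5, 7, 9]
```

    The generator reads lazily, so it can wrap an open file without loading the whole file into memory.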
    