Find all the numbers in one file that are not in another file in python

前端未结

关注

 5  2051

There are two files, say FileA and FileB and we need to find all the numbers that are in FileA which is not there in FileB. All the numbers in the FileA are sorted and all t

相关标签:

5条回答

旧时难觅i

2021-01-04 02:32

If you want to read the files line by line since you don't have so much memory and you need a linear solution you can do this with iter if your files are line based, otherwise see this:

First in your terminal you can do this to generate some test files:

seq 0 3 100 > 3k.txt
seq 0 2 100 > 2k.txt

Then you run this code:

i1 = iter(open("3k.txt"))
i2 = iter(open("2k.txt"))
a = int(next(i1))
b = int(next(i2))
aNotB = []
# bNotA = []
while True:
    try:
        if a < b:
            aNotB += [a]
            a = int(next(i1, None))
        elif a > b:
            # bNotA += [a]
            b = int(next(i2, None))
        elif a == b:
            a = int(next(i1, None))
            b = int(next(i2, None))
    except TypeError:
        if not b:
            aNotB += list(i1)
            break
        else:
            # bNotA += list(i1)
            break
print(aNotB)

Output:

[3, 9, 15, 21, 27, 33, 39, 45, 51, 57, 63, 69, 75, 81, 87, 93, 99] If you want both the result for aNotB and bNotA you can uncomment those two lines.

Timing comparison with Andrej Kesely's answer:

$ seq 0 3 1000000 > 3k.txt
$ seq 0 2 1000000 > 2k.txt
$ time python manual_iter.py        
python manual_iter.py  0.38s user 0.00s system 99% cpu 0.387 total
$ time python heapq_groupby.py        
python heapq_groupby.py  1.11s user 0.00s system 99% cpu 1.116 total

0 讨论(0)

伪装坚强ぢ

2021-01-04 02:32
As files are sorted you can just iterate through each line at a time, if the line of file A is less than the line of file B then you know that A is not in B so you then increment file A only and then check again. If the line in A is greater than the line in B then you know that B is not in A so you increment file B only. If A and B are equal then you know line is in both so increment both files. while in your original question you stated you were interested in entries which are in A but not B, this answer will extend that and also give entries in B not A. This extends the flexability but still allows you so print just those in A not B.
```
def strip_read(file):
    return file.readline().rstrip()

in_a_not_b = []
in_b_not_a = []
with open("fileA") as A:
    with open("fileB") as B:
        Aline = strip_read(A)
        Bline = strip_read(B)
        while Aline or Bline:
            if Aline < Bline and Aline:
                in_a_not_b.append(Aline)
                Aline = strip_read(A)
            elif Aline > Bline and Bline:
                in_b_not_a.append(Bline)
                Bline = strip_read(B)
            else:
                Aline = strip_read(A)
                Bline = strip_read(B)

print("in A not in B", in_a_not_b, "\nin B not in A", in_b_not_a)
```
OUTPUT for my sample Files
```
in A not in B ['2', '5', '7'] 
in B not in A ['6']
```
0 讨论(0)
发布评论:

提交评论
- 加载中...

隐瞒了意图╮

2021-01-04 02:33

You can combine itertools.groupby (doc) and heapq.merge (doc) to iterate through FileA and FileB lazily (it works as long the files are sorted!)

FileA = [1, 1, 2, 3, 4, 5]
FileB = [1, 3, 4, 6]

from itertools import groupby
from heapq import merge

gen_a = ((v, 'FileA') for v in FileA)
gen_b = ((v, 'FileB') for v in FileB)

for v, g in groupby(merge(gen_a, gen_b, key=lambda k: int(k[0])), lambda k: int(k[0])):
    if any(v[1] == 'FileB' for v in g):
        continue
    print(v)

Prints:

2
5

EDIT (Reading from files):

from itertools import groupby
from heapq import merge

gen_a = ((int(v.strip()), 1) for v in open('3k.txt'))
gen_b = ((int(v.strip()), 2) for v in open('2k.txt'))

for v, g in groupby(merge(gen_a, gen_b, key=lambda k: k[0]), lambda k: k[0]):
    if any(v[1] == 2 for v in g):
        continue
    print(v)

Benchmark:

Generating files with 10_000_000 items:

seq 0 3 10000000 > 3k.txt
seq 0 2 10000000 > 2k.txt

The script takes ~10sec to complete:

real    0m10,656s
user    0m10,557s
sys 0m0,076s

0 讨论(0)

不要未来只要你来

2021-01-04 02:47

This is similar to the classic Knuth Sorting and Searching. You may wish to consider reading stack question, on-line lecture notes pdf, and Wikipedia. The stack question mentions something that I agree with, which is using unix sort command. Always, always test with your own data to ensure the method chosen is the most efficient for your data because some of these algorithms are data dependant.

0 讨论(0)
发布评论:

提交评论
- 加载中...

萌比男神i

2021-01-04 02:50

A simple solution based on file reading (asuming that each line hold a number):

results = []
with open('file1.csv') as file1, open('file2.csv') as file2:
        var1 = file1.readline()
        var2 = file2.readline()
        while var1:
            while var1 and var2:
                if int(var1) < int(var2):
                    results.append(int(var1))
                    var1 = file1.readline()
                elif int(var1) > int(var2):
                    var2 = file2.readline()
                elif int(var1) == int(var2):
                    var1 = file1.readline()
                    var2 = file2.readline()
            if var1:
                results.append(int(var1))
                var1 = file1.readline()
print(results)
output = [2, 5, 7, 9]

0 讨论(0)