I have two files with tens of thousands of lines each, output1.txt and output2.txt. I want to iterate through both files and return the line number (and content) of each line that differs.
You can do something like this:
import difflib, sys

tl = 100000  # large number of lines

# create two test files (Unix directories...)
with open('/tmp/f1.txt', 'w') as f:
    for x in range(tl):
        f.write('line {}\n'.format(x))

with open('/tmp/f2.txt', 'w') as f:
    for x in range(tl + 10):  # add 10 lines
        if x in (500, 505, 1000, tl - 2):
            continue  # skip these lines
        f.write('line {}\n'.format(x))

with open('/tmp/f1.txt', 'r') as f1, open('/tmp/f2.txt', 'r') as f2:
    diff = difflib.ndiff(f1.readlines(), f2.readlines())
    for line in diff:
        if line.startswith('-'):
            # line is only in the first file
            sys.stdout.write(line)
        elif line.startswith('+'):
            # line is only in the second file; indent it to tell the two apart
            sys.stdout.write('\t\t' + line)
Prints (in 400 ms):
- line 500
- line 505
- line 1000
- line 99998
		+ line 100000
		+ line 100001
		+ line 100002
		+ line 100003
		+ line 100004
		+ line 100005
		+ line 100006
		+ line 100007
		+ line 100008
		+ line 100009
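If you want a number for each difference, use enumerate. Note that the count is the 0-based index into the diff output, so it drifts away from the real line numbers in either file as insertions and deletions accumulate: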
with open('/tmp/f1.txt', 'r') as f1, open('/tmp/f2.txt', 'r') as f2:
    diff = difflib.ndiff(f1.readlines(), f2.readlines())
    for i, line in enumerate(diff):
        if line.startswith(' '):
            continue  # unchanged line
        sys.stdout.write('My count: {}, text: {}'.format(i, line))
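If you need the actual line number within each file rather than the position in the diff output, one option is to keep a separate counter per file and advance it according to the two-character prefix that ndiff puts on each line. The following is a minimal sketch under that assumption; changed_lines is a hypothetical helper name, not part of difflib, and the sample lists are made up for illustration.

```python
import difflib

def changed_lines(a_lines, b_lines):
    """Yield (tag, line_number, text) for every line that differs.

    tag is '-' (only in the first sequence) or '+' (only in the second);
    line_number is 1-based within the sequence the line comes from.
    """
    a_no = b_no = 0  # lines consumed so far from each sequence
    for entry in difflib.ndiff(a_lines, b_lines):
        tag, text = entry[:2], entry[2:]
        if tag == '? ':
            continue  # intraline hint, not a line of either input
        if tag == '  ':
            a_no += 1  # unchanged line advances both counters
            b_no += 1
        elif tag == '- ':
            a_no += 1
            yield '-', a_no, text
        elif tag == '+ ':
            b_no += 1
            yield '+', b_no, text

# Illustrative data only; replace with f1.readlines() and f2.readlines().
a = ['line 0\n', 'line 1\n', 'line 2\n']
b = ['line 0\n', 'line 2\n', 'line 3\n']
for tag, no, text in changed_lines(a, b):
    print('{} {:>6}: {}'.format(tag, no, text), end='')
```

Because the helper is a generator, it can be fed the lists from readlines() on the two files and consumed one difference at a time.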