Returning lines that differ between two files (Python)

我寻月下人不归 2021-02-06 17:38

I have two files with tens of thousands of lines each, output1.txt and output2.txt. I want to iterate through both files and return the line number (and content) of the lines that differ.

3 Answers
  • 2021-02-06 17:55

    7.4. difflib — Helpers for computing deltas

    New in version 2.1.

    This module provides classes and functions for comparing sequences. It can be used for example, for comparing files, and can produce difference information in various formats, including HTML and context and unified diffs. For comparing directories and files, see also, the filecmp module.
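
As a rough illustration of the "unified diffs" mentioned in the quote above, here is a minimal sketch using `difflib.unified_diff`. The helper name `changed_hunks` and the sample data are made up for this example; the default file names are taken from the question. Passing `n=0` drops the context lines, so only the differing lines and their hunk headers remain.

```python
import difflib


def changed_hunks(lines1, lines2, name1="output1.txt", name2="output2.txt"):
    """Return a unified diff with no context lines (n=0) as a list of strings."""
    return list(difflib.unified_diff(lines1, lines2,
                                     fromfile=name1, tofile=name2, n=0))


if __name__ == "__main__":
    old = ["a\n", "b\n", "c\n"]
    new = ["a\n", "x\n", "c\n"]
    # Each "@@ ... @@" hunk header carries 1-based line numbers for both files.
    print("".join(changed_hunks(old, new)), end="")
```

The hunk headers give the position of every change in both files, which may already be enough if all you need is "which lines differ and where".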

  • 2021-02-06 17:58

    You can do something like this:

    import difflib, sys

    tl = 100000    # number of lines in the first test file

    # Create two test files (the /tmp paths assume a Unix-like system).

    with open('/tmp/f1.txt', 'w') as f:
        for x in range(tl):
            f.write('line {}\n'.format(x))

    with open('/tmp/f2.txt', 'w') as f:
        for x in range(tl + 10):   # add 10 extra lines at the end
            if x in (500, 505, 1000, tl - 2):
                continue           # skip these lines
            f.write('line {}\n'.format(x))

    # ndiff prefixes lines found only in the first file with '-'
    # and lines found only in the second file with '+'.
    with open('/tmp/f1.txt', 'r') as f1, open('/tmp/f2.txt', 'r') as f2:
        diff = difflib.ndiff(f1.readlines(), f2.readlines())
        for line in diff:
            if line.startswith('-'):
                sys.stdout.write(line)
            elif line.startswith('+'):
                sys.stdout.write('\t\t' + line)
    

    Prints (in 400 ms):

    - line 500
    - line 505
    - line 1000
    - line 99998
            + line 100000
            + line 100001
            + line 100002
            + line 100003
            + line 100004
            + line 100005
            + line 100006
            + line 100007
            + line 100008
            + line 100009
    

    If you want a running count, use enumerate (note that the index counts entries in the diff output, not line numbers in either file):

    with open('/tmp/f1.txt','r') as f1, open('/tmp/f2.txt','r') as f2:
        diff = difflib.ndiff(f1.readlines(),f2.readlines())    
        for i,line in enumerate(diff):
            if line.startswith(' '):
                continue
            sys.stdout.write('My count: {}, text: {}'.format(i,line))  
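
If the actual line numbers within each file are needed, one option is to keep a separate counter per file while walking the `ndiff` output. The following is only a sketch of that idea; the function name `diff_with_line_numbers` is hypothetical and not part of the answer above.

```python
import difflib


def diff_with_line_numbers(lines1, lines2):
    """Yield (tag, line_number, text) for lines unique to either sequence.

    tag is '-' for lines only in lines1 and '+' for lines only in lines2;
    line_number is 1-based and refers to the sequence the line came from.
    """
    n1 = n2 = 0
    for entry in difflib.ndiff(lines1, lines2):
        code, text = entry[:2], entry[2:]
        if code == '  ':      # line present in both sequences
            n1 += 1
            n2 += 1
        elif code == '- ':    # line only in the first sequence
            n1 += 1
            yield ('-', n1, text)
        elif code == '+ ':    # line only in the second sequence
            n2 += 1
            yield ('+', n2, text)
        # '? ' hint lines belong to neither sequence and are skipped


if __name__ == "__main__":
    a = ["a\n", "b\n", "c\n"]
    b = ["a\n", "c\n", "d\n"]
    for tag, number, text in diff_with_line_numbers(a, b):
        print("{} line {}: {}".format(tag, number, text), end="")
```

With real files, `lines1` and `lines2` would be the results of `readlines()` as in the snippets above.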
    
  • 2021-02-06 18:21

    As long as you don't care about order you could use:

    with open('file1') as f:
        t1 = f.read().splitlines()
        t1s = set(t1)

    with open('file2') as f:
        t2 = f.read().splitlines()
        t2s = set(t2)

    # in file1 but not file2 (index is zero-based, first occurrence only)
    print("Only in file1")
    for diff in t1s - t2s:
        print(t1.index(diff), diff)

    # in file2 but not file1
    print("Only in file2")
    for diff in t2s - t1s:
        print(t2.index(diff), diff)
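
A variation on the set idea, offered as a sketch rather than as the answer's own method: `list.index` rescans the list for every differing line, which can get slow with tens of thousands of lines, and it only reports the first occurrence. Scanning each file once with `enumerate` and testing membership against a set of the other file's lines avoids both. The names `read_lines` and `only_in_first` are hypothetical.

```python
def read_lines(path):
    """Read a text file into a list of lines without trailing newlines."""
    with open(path) as f:
        return f.read().splitlines()


def only_in_first(lines1, lines2):
    """Return (line_number, text) pairs for lines of lines1 absent from lines2.

    Line numbers are 1-based and results are in file order; repeated lines
    are reported once per occurrence.
    """
    other = set(lines2)
    return [(number, line)
            for number, line in enumerate(lines1, start=1)
            if line not in other]


if __name__ == "__main__":
    first = ["a", "b", "c", "b"]
    second = ["a", "c", "d"]
    print("Only in first:", only_in_first(first, second))
    print("Only in second:", only_in_first(second, first))
```

As with the original, this ignores ordering: a line that merely moved to a different position is not reported.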
    

    Edit: If you do care about order and they're mostly the same then why not just use the command diff?
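
If you want to stay in Python but still rely on the external tool, the `diff` command can be called through `subprocess`. This is a minimal sketch assuming Python 3.7+ and a `diff` executable on the PATH; the wrapper name `run_diff` is made up here, and the exact output format depends on which `diff` implementation is installed.

```python
import shutil
import subprocess


def run_diff(path1, path2):
    """Run the external diff command and return its standard output."""
    if shutil.which("diff") is None:
        raise RuntimeError("no diff executable found on PATH")
    # diff exits with 0 (identical), 1 (different) or 2 (trouble),
    # so a non-zero exit status is not automatically an error.
    result = subprocess.run(["diff", path1, path2],
                            capture_output=True, text=True)
    if result.returncode > 1:
        raise RuntimeError(result.stderr.strip())
    return result.stdout


if __name__ == "__main__":
    print(run_diff("output1.txt", "output2.txt"), end="")
```

An empty string means the two files are identical.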
