Quickly find differences between two large text files

野性不改 2021-01-02 06:41

I have two 3 GB text files, each with around 80 million lines, and they share 99.9% of their lines (file A has 60,000 unique lines, file B has 80,000 unique lines).

5 Answers
  • 2021-01-02 07:19

    Python's standard library has difflib, which claims to be quite competitive with other diff utilities; see: http://docs.python.org/library/difflib.html
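
    For a quick sketch of what that could look like (note that difflib holds both files in memory, so with 3 GB inputs it may be slow or exhaust RAM; fileA and fileB are placeholder names):

    import difflib
    
    # read both files and print a unified diff of their lines
    with open('fileA') as a, open('fileB') as b:
        for line in difflib.unified_diff(a.readlines(), b.readlines(),
                                         fromfile='fileA', tofile='fileB'):
            print(line, end='')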

  • 2021-01-02 07:25

    I think this is the fastest method (whether it's in Python or another language shouldn't matter too much IMO).

    Notes:

    1. I only store each line's hash, to save space (and time, if paging might occur).

    2. Because of the above, I only print out line numbers; if you need the actual lines, you'd just need to read the files in again.

    3. I assume that the hash function produces no collisions. This is nearly, but not perfectly, certain.

    4. I use hashlib because the built-in hash() function returns too few bits to reliably avoid collisions.

    import sys
    import hashlib
    
    # for each input file, map the SHA-512 of every line to "filename: line number"
    lines = []
    for i in range(2):
        lines.append({})
        # open the files named on the command line
        with open(sys.argv[1 + i], 'r') as f:
            # counting lines starting with 1
            for counter, line in enumerate(f, start=1):
                # assuming the default encoding is sufficient to handle the input file
                hashcode = hashlib.sha512(line.encode()).hexdigest()
                lines[i][hashcode] = sys.argv[1 + i] + ': ' + str(counter)
    
    # hashes that occur in exactly one of the two files
    unique0 = lines[0].keys() - lines[1].keys()
    unique1 = lines[1].keys() - lines[0].keys()
    result = [lines[0][x] for x in unique0] + [lines[1][x] for x in unique1]
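
    The report could then simply be printed, one location per line (a trivial addition; the script name below is just a placeholder):

    # e.g. after running "python hashdiff.py fileA fileB",
    # each entry looks like "fileA: 12345" (file name and line number)
    for entry in result:
        print(entry)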
    
  • 2021-01-02 07:28

    If I understand correctly, you want the lines of these files minus the ones they have in common. Reading each file into a set of its lines does the job:

    uniqA = set(open('fileA', 'r'))
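
    A second set for the other file plus two set differences would then finish the job (a small sketch along the same lines; note that all distinct lines of both files end up in memory):

    uniqB = set(open('fileB', 'r'))
    
    onlyA = uniqA - uniqB   # lines that appear only in fileA
    onlyB = uniqB - uniqA   # lines that appear only in fileB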
    
  • 2021-01-02 07:30

    If order matters, try the comm utility. If order doesn't matter, sort file1 file2 | uniq -u.

  • 2021-01-02 07:40

    With only 60,000 or 80,000 unique lines you could just create a dictionary mapping each distinct line to a number: mydict["hello world"] => 1, etc. If the number of distinct lines stays small and your average line is around 40-80 characters, this will be in the neighborhood of 10 MB of memory.

    Then read each file, converting it to an array of numbers via the dictionary. Those arrays still fit in memory (2 files of roughly 80 million lines at 8 bytes per number is about 1.3 GB). Then diff the lists. You could invert the dictionary and use it to print out the text of the lines that differ.

    EDIT:

    In response to your comment, here's a sample script that assigns numbers to unique lines as it reads from a file.

    #!/usr/bin/env python3
    
    class Reader:
    
        def __init__(self, file):
            self.count = 0
            self.dict = {}
            self.file = file
    
        def readline(self):
            # return the number assigned to the next line, or None at end of file
            line = self.file.readline()
            if not line:
                return None
            if line in self.dict:
                return self.dict[line]
            else:
                self.count += 1
                self.dict[line] = self.count
                return self.count
    
    if __name__ == '__main__':
        print("Type Ctrl-D to quit.")
        import sys
        r = Reader(sys.stdin)
        while True:
            result = r.readline()
            if result is None:
                break
            print(result)
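
    And a rough, untested sketch of applying the same idea to your two files (fileA and fileB are placeholder names): both files have to share one dictionary so that identical lines get identical numbers. This version compares sets of numbers rather than diffing ordered lists, so it only tells you which lines differ, not where:

    # build one shared mapping from line text to a small integer,
    # turn each file into a set of those integers, then compare the sets
    mapping = {}
    
    def to_numbers(path):
        numbers = set()
        with open(path, 'r') as f:
            for line in f:
                if line not in mapping:
                    mapping[line] = len(mapping) + 1
                numbers.add(mapping[line])
        return numbers
    
    numsA = to_numbers('fileA')
    numsB = to_numbers('fileB')
    # numbers that occur in exactly one of the two files
    only_in_one = numsA ^ numsB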
    