merge sort in python

栀梦 2021-02-06 19:26

Basically, I have a bunch of files containing domains. I've sorted each individual file based on its TLD using .sort(key=func_that_returns_tld)

now that I've done that

3 Answers
  • 2021-02-06 20:05

    Another option (again, only if all your data won't fit into memory) is to create a SQLite3 database and do the sorting there and write it to file after.
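
    A minimal sketch of that idea, using Python's standard sqlite3 module. The names tld_of and sort_with_sqlite, the table layout, and the "text after the last dot" key are assumptions made for illustration; substitute your own TLD function. Pass a real file path as db_path when the data is too large for memory.

```python
import sqlite3

def tld_of(domain):
    # Hypothetical key: the text after the last dot
    return domain.rsplit('.', 1)[-1]

def sort_with_sqlite(filenames, out_path, db_path=':memory:'):
    # db_path=':memory:' keeps the database in RAM; use a file path for big data
    conn = sqlite3.connect(db_path)
    try:
        conn.execute('CREATE TABLE domains (tld TEXT, domain TEXT)')
        for name in filenames:
            with open(name, 'r') as source:
                stripped = (line.strip() for line in source)
                rows = ((tld_of(d), d) for d in stripped if d)
                conn.executemany('INSERT INTO domains VALUES (?, ?)', rows)
        conn.commit()
        # Let SQLite do the sorting, then stream the rows out to the file
        with open(out_path, 'w') as out:
            query = 'SELECT domain FROM domains ORDER BY tld, domain'
            for (domain,) in conn.execute(query):
                out.write(domain + '\n')
    finally:
        conn.close()
```

    Calling sort_with_sqlite(['data1.txt', 'data2.txt'], 'sorted.txt') would write every domain to sorted.txt, ordered by TLD and then by the full domain.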

  • 2021-02-06 20:11

    Unless your file is incomprehensibly huge, it will fit into memory.

    Your pseudo-code is hard to read. Please indent your pseudo-code correctly. The final "loop by reading next line" makes no sense.

    Basically, it's this.

    all_data = []
    for f in list_of_files:
        with open(f, 'r') as source:
            all_data.extend(source.readlines())
    # Use the same key you sorted the individual files with
    all_data.sort(key=func_that_returns_tld)
    

    You're done. You can write all_data to a file, or process it further or whatever you want to do with it.
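
    For completeness, here is one way the same approach might look with the key function and the output step filled in. The helper names tld_of and sort_files_in_memory, and the "text after the last dot" key, are assumptions for illustration only.

```python
def tld_of(domain):
    # Hypothetical key: the text after the last dot
    return domain.rsplit('.', 1)[-1]

def sort_files_in_memory(list_of_files, out_path):
    all_data = []
    for f in list_of_files:
        with open(f, 'r') as source:
            # Strip newlines so a missing final newline cannot glue two lines together
            all_data.extend(line.strip() for line in source if line.strip())
    # list.sort is stable, so lines with the same TLD keep their relative order
    all_data.sort(key=tld_of)
    with open(out_path, 'w') as out:
        for domain in all_data:
            out.write(domain + '\n')
    return all_data
```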

  • 2021-02-06 20:22

    If your files are not very large, then simply read them all into memory (as S. Lott suggests). That would definitely be simplest.

    However, you mention collation creates one "massive" file. If it's too massive to fit in memory, then perhaps use heapq.merge, which lazily merges inputs that are already sorted (as your individual files are). It may be a little harder to set up, but it has the advantage of not requiring that all the iterables be pulled into memory at once.

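    import heapq
    import contextlib

    class Domain(object):
        def __init__(self, domain):
            self.domain = domain
        @property
        def tld(self):
            # Put your function for calculating TLD here.
            # This placeholder returns the first label, not the real TLD.
            return self.domain.split('.', 1)[0]
        def __lt__(self, other):
            # Strict comparison: heapq.merge orders items with <
            return self.tld < other.tld
        def __str__(self):
            return self.domain

    def domains(fh):
        # Lazily wrap each line of an open file in a Domain
        for line in fh:
            yield Domain(line.strip())

    filenames = ('data1.txt', 'data2.txt')
    # ExitStack closes every opened file when the block exits
    with contextlib.ExitStack() as stack:
        fhs = [stack.enter_context(open(filename, 'r')) for filename in filenames]
        for elt in heapq.merge(*(domains(fh) for fh in fhs)):
            print(elt)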
    

    with data1.txt:

    google.com
    stackoverflow.com
    yahoo.com
    

    and data2.txt:

    standards.freedesktop.org
    www.imagemagick.org
    

    yields:

    google.com
    stackoverflow.com
    standards.freedesktop.org
    www.imagemagick.org
    yahoo.com
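
    On Python 3.5 and later, heapq.merge also accepts a key argument, so the wrapper class is optional. The sketch below assumes each input file is already sorted by the same key. The names tld_of and merge_sorted_files are hypothetical, and the key here is the text after the last dot, so its ordering differs from the placeholder output shown above.

```python
import contextlib
import heapq

def tld_of(domain):
    # Hypothetical key: the text after the last dot
    return domain.rsplit('.', 1)[-1]

def merge_sorted_files(filenames, out_path, key=tld_of):
    # Each input file must already be sorted by `key`
    with contextlib.ExitStack() as stack:
        sources = [stack.enter_context(open(name, 'r')) for name in filenames]
        # One lazy generator per file; nothing is read until merge asks for it
        stripped = [(line.strip() for line in fh) for fh in sources]
        with open(out_path, 'w') as out:
            for domain in heapq.merge(*stripped, key=key):
                out.write(domain + '\n')
```

    Only one pending line per input file is held in memory at a time, which is what makes this suitable for very large inputs.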
    