Python: Choose random line from file, then delete that line

礼貌的吻别 2020-11-30 13:16

I'm new to Python (in that I learned it through a Codecademy course) and could use some help with figuring this out.

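I have a file, 'TestingDeleteLines.txt', that's about 300 lines of text. Right now, I'm trying to get it to print me 10 random lines from that file, then delete those lines.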

6 Answers
  • 2020-11-30 13:44

    To choose a random line from a file, you could use a space-efficient, single-pass reservoir-sampling algorithm. To delete that line, you could rewrite the file in place, printing every line except the chosen one:

    #!/usr/bin/env python3
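    import fileinput
    
    filename = 'TestingDeleteLines.txt'  # the file from the question
    with open(filename) as file:
        # pick one (index, line) pair; the default index -1 marks an empty file
        k = select_random_it(enumerate(file), default=[-1])[0]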
    
    if k >= 0: # file is not empty
        with fileinput.FileInput(filename, inplace=True, backup='.bak') as file:
            for i, line in enumerate(file):
                if i != k: # keep line
                    print(line, end='') # stdout is redirected to filename
    

    where select_random_it() implements the reservoir-sampling algorithm:

    import random
    
    def select_random_it(iterator, default=None, randrange=random.randrange):
        """Return a random element from iterator.
    
        Return default if iterator is empty.
        iterator is exhausted.
        O(n)-time, O(1)-space algorithm.
        """
        # from https://stackoverflow.com/a/1456750/4279
        # select 1st item with probability 100% (if input is one item, return it)
        # select 2nd item with probability 50% (or 50% the selection stays the 1st)
        # select 3rd item with probability 33.(3)%
        # select nth item with probability 1/n
        selection = default
        for i, item in enumerate(iterator, start=1):
            if randrange(i) == 0: # random [0..i)
                selection = item
        return selection
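
    Why keeping the i-th item with probability 1/i is fair: item i is selected with probability 1/i and then survives every later step j with probability (j-1)/j, and that product telescopes to 1/n for all items. The snippet below is only a rough empirical check, assuming a fixed seed and a made-up five-line input; the function body repeats the one above so the snippet runs on its own.

```python
import random
from collections import Counter


def select_random_it(iterator, default=None, randrange=random.randrange):
    """Return a random element from iterator (same logic as above)."""
    selection = default
    for i, item in enumerate(iterator, start=1):
        if randrange(i) == 0:  # keep the i-th item with probability 1/i
            selection = item
    return selection


# Hypothetical check: draw from five "lines" many times and count the picks.
rng = random.Random(0)  # fixed seed so the run is repeatable
lines = ["a\n", "b\n", "c\n", "d\n", "e\n"]
trials = 50_000
counts = Counter(
    select_random_it(iter(lines), randrange=rng.randrange)
    for _ in range(trials)
)
for line in lines:
    print(repr(line), counts[line] / trials)  # each share should be near 0.2
```

    If the shares drift far from 0.2, the selection is biased; with this many trials they should agree to about two decimal places.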
    

    To print k random lines from a file and delete them:

    #!/usr/bin/env python3
    import random
    import sys
    
    k = 10
    filename = 'TestingDeleteLines.txt'
    with open(filename) as file:
        random_lines = reservoir_sample(file, k) # get k random lines
    
    if not random_lines: # file is empty
        sys.exit() # do nothing, exit immediately
    
    print("\n".join(map(str.strip, random_lines))) # print random lines
    delete_lines(filename, random_lines) # delete them from the file
    

    where reservoir_sample() uses the same algorithm as select_random_it() but allows choosing k items instead of one:

    import random
    from itertools import islice  # needed for islice() below
    
    def reservoir_sample(iterable, k,
                         randrange=random.randrange, shuffle=random.shuffle):
        """Select *k* random elements from *iterable*.
    
        Use O(n) Algorithm R https://en.wikipedia.org/wiki/Reservoir_sampling
    
        If the number of items is less than *k*, return all items in random order.
        """
        it = iter(iterable)
        if not (k > 0):
            raise ValueError("sample size must be positive")
    
        sample = list(islice(it, k)) # fill the reservoir
        shuffle(sample)
        for i, item in enumerate(it, start=k+1):
            j = randrange(i) # random [0..i)
            if j < k:
                sample[j] = item # replace item with gradually decreasing probability
        return sample
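
    A short usage sketch, assuming made-up inputs, of the two behaviours the docstring describes: sampling from a one-pass iterable that has no len(), and getting every item back when the input is shorter than k. The function is repeated so the snippet is self-contained.

```python
import random
from itertools import islice


def reservoir_sample(iterable, k,
                     randrange=random.randrange, shuffle=random.shuffle):
    """Select k random elements from iterable (Algorithm R, as above)."""
    it = iter(iterable)
    if not (k > 0):
        raise ValueError("sample size must be positive")
    sample = list(islice(it, k))  # fill the reservoir
    shuffle(sample)
    for i, item in enumerate(it, start=k + 1):
        j = randrange(i)  # random [0..i)
        if j < k:
            sample[j] = item
    return sample


numbers = (n for n in range(300))  # a generator: single pass, no len()
picked = reservoir_sample(numbers, 10)
print(sorted(picked))  # 10 distinct values taken from 0..299

short = reservoir_sample(iter("abc"), 10)  # fewer items than k
print(sorted(short))  # ['a', 'b', 'c']
```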
    

    and the delete_lines() utility function deletes the chosen random lines from the file:

    import fileinput
    import os
    
    def delete_lines(filename, lines):
        """Delete *lines* from *filename*."""
        lines = set(lines) # for amortized O(1) lookup
        with fileinput.FileInput(filename, inplace=True, backup='.bak') as file:
            for line in file:
                if line not in lines:
                    print(line, end='')
        os.unlink(filename + '.bak') # remove backup if there is no exception
    

    The reservoir_sample() and delete_lines() functions do not load the whole file into memory, so they can work on arbitrarily large files.
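
    One caveat: delete_lines() matches lines by their text, so if the file contains duplicate lines, every copy of a sampled line is removed. A possible variant, sketched here with a hypothetical helper named delete_line_numbers(), samples (line number, text) pairs from enumerate(file) and deletes by position instead. The demo file is a throwaway temp file, not the file from the question.

```python
import fileinput
import os
import random
import tempfile
from itertools import islice


def reservoir_sample(iterable, k,
                     randrange=random.randrange, shuffle=random.shuffle):
    """Select k random elements from iterable (Algorithm R, as above)."""
    it = iter(iterable)
    if not (k > 0):
        raise ValueError("sample size must be positive")
    sample = list(islice(it, k))
    shuffle(sample)
    for i, item in enumerate(it, start=k + 1):
        j = randrange(i)
        if j < k:
            sample[j] = item
    return sample


def delete_line_numbers(filename, indices):
    """Rewrite filename in place, skipping the 0-based line numbers in indices."""
    indices = set(indices)  # O(1) membership tests
    with fileinput.FileInput(filename, inplace=True) as file:
        for i, line in enumerate(file):
            if i not in indices:
                print(line, end='')  # stdout is redirected to filename


# Demo on a throwaway file made of six identical lines.
with tempfile.NamedTemporaryFile('w', suffix='.txt', delete=False) as tmp:
    tmp.write("same\n" * 6)
path = tmp.name

with open(path) as file:
    picked = reservoir_sample(enumerate(file), 2)  # (line number, text) pairs
delete_line_numbers(path, [i for i, _ in picked])

with open(path) as file:
    remaining = file.readlines()
print(len(remaining))  # 4: only the two sampled positions were removed
os.unlink(path)
```

    Deleting by text would have emptied this file, since every line equals a sampled one.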

  • 2020-11-30 13:44

    Let's assume you have a list of lines from your file stored in items:

    >>> import random
    >>> items = ['a', 'b', 'c', 'd', 'e', 'f']
    >>> choices = random.sample(items, 2)  # select 2 items
    >>> choices  # here are the two
    ['b', 'c']
    >>> for i in choices:
    ...   items.remove(i)
    ...
    >>> items  # ta-da, no more b or c
    ['a', 'd', 'e', 'f']
    

    From here, overwrite the original text file with the contents of items, joined with your preferred line ending (\r\n or \n). readlines() does not strip line endings, so if you read the file with that method you do not need to add your own.
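
    The read, sample, remove, and write-back steps could fit together roughly as follows. This is a sketch that assumes a throwaway temp file in place of the real one, with made-up contents; because readlines() keeps each trailing newline, writelines() can write the list back unchanged.

```python
import os
import random
import tempfile

# Throwaway file standing in for 'TestingDeleteLines.txt'.
with tempfile.NamedTemporaryFile('w', suffix='.txt', delete=False) as tmp:
    tmp.write("".join(f"line {n}\n" for n in range(1, 7)))
path = tmp.name

with open(path) as f:
    items = f.readlines()  # each entry keeps its trailing '\n'

choices = random.sample(items, 2)  # select 2 items
for choice in choices:
    items.remove(choice)  # removes the first entry equal to choice

with open(path, 'w') as f:
    f.writelines(items)  # no separator added: the endings are already there

with open(path) as f:
    result = f.read()
print(result, end='')  # the four surviving lines
os.unlink(path)
```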

  • 2020-11-30 13:45

    You could generate 10 distinct random line indices (0 to 299 for a 300-line file) using:

    deleteLineNums = random.sample(range(len(lines)), 10)  # use xrange on Python 2
    

    and then delete those entries from the lines list by building a copy with a list comprehension:

    linesCopy = [line for idx, line in enumerate(lines) if idx not in deleteLineNums]
    lines[:] = linesCopy
    

    And then writing lines back to 'TestingDeleteLines.txt'.

    To see why the copy code above works, this post might be helpful:

    Remove items from a list while iterating

    EDIT: To get the lines at the randomly generated indices, simply do:

    actualLines = []
    for n in deleteLineNums:
        actualLines.append(lines[n])
    

    Then actualLines contains the text of the lines at the randomly generated indices.

    EDIT: Or even better, use a list comprehension:

    actualLines = [lines[n] for n in deleteLineNums]
    
  • 2020-11-30 13:46

    I have a file, 'TestingDeleteLines.txt', that's about 300 lines of text. Right now, I'm trying to get it to print me 10 random lines from that file, then delete those lines.

    #!/usr/bin/env python
    import random
    
    k = 10
    filename = 'TestingDeleteLines.txt'
    with open(filename) as file:
        lines = file.read().splitlines()
    
    if len(lines) > k:
        random_lines = random.sample(lines, k)
        print("\n".join(random_lines)) # print random lines
    
        with open(filename, 'w') as output_file:
            output_file.writelines(line + "\n"
                                   for line in lines if line not in random_lines)
    elif lines: # file is too small
        print("\n".join(lines)) # print all lines
        with open(filename, 'wb', 0): # empty the file
            pass
    

    This is an O(n**2) algorithm in the worst case, because every line is checked against the list of sampled lines. It can be improved if necessary, though you don't need to for a tiny file such as your input.
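
    One way to remove the quadratic cost, sketched here as an assumption rather than as the answer's own code: sample line positions instead of line texts and keep them in a set, so each membership test is O(1). The helper name split_random_lines() and the sample data are made up for illustration. Sampling positions also means duplicate lines are no longer deleted together.

```python
import random


def split_random_lines(lines, k, rng=random):
    """Return (picked, kept): k lines chosen by position, and the rest in order."""
    k = min(k, len(lines))  # a short file simply yields all of its lines
    picked_idx = set(rng.sample(range(len(lines)), k))  # O(1) lookups
    picked = [line for i, line in enumerate(lines) if i in picked_idx]
    kept = [line for i, line in enumerate(lines) if i not in picked_idx]
    return picked, kept


lines = ["alpha", "beta", "alpha", "gamma", "delta"]  # note the duplicate
picked, kept = split_random_lines(lines, 2)
print(picked)  # two lines, listed in file order rather than random order
print(kept)    # the other three, still in their original order
```

    The kept list can then be written back with output_file.writelines(line + "\n" for line in kept), as in the code above.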

  • 2020-11-30 13:49

    The point is that you don't "delete" from a file; you rewrite the whole file (or another one) with new content. The canonical way is to read the original file line by line, write the lines you want to keep to a temporary file, then replace the old file with the new one.

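    import os
    
    with open("/path/to/source.txt") as src, open("/path/to/temp.txt", "w") as dest:
        for line in src:
            if should_we_keep_this_line(line):  # your own predicate
                dest.write(line)
    # os.replace overwrites an existing target on every platform;
    # os.rename raises an error on Windows if the target already exists
    os.replace("/path/to/temp.txt", "/path/to/source.txt")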
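
    A slightly more defensive sketch of the same pattern, assuming you want the temporary file created next to the target (so the final rename stays on one filesystem) and cleaned up if anything fails. The helper name rewrite_file() and the demo predicate are made up for illustration. Note that the replacement file gets the temp file's permissions, not the original's.

```python
import os
import tempfile


def rewrite_file(path, keep):
    """Rewrite path, keeping only the lines for which keep(line) is true."""
    directory = os.path.dirname(os.path.abspath(path))
    # Create the temp file in the target's directory.
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, 'w') as dest, open(path) as src:
            for line in src:
                if keep(line):
                    dest.write(line)
        os.replace(tmp_path, path)  # swap the new file into place
    except BaseException:
        os.unlink(tmp_path)  # do not leave the temp file behind
        raise


# Demo with a throwaway file and a hypothetical predicate.
with tempfile.NamedTemporaryFile('w', suffix='.txt', delete=False) as tmp:
    tmp.write("keep 1\ndrop 2\nkeep 3\n")
rewrite_file(tmp.name, lambda line: not line.startswith("drop"))

with open(tmp.name) as f:
    content = f.read()
print(content, end='')  # the two "keep" lines
os.unlink(tmp.name)
```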
    
  • 2020-11-30 13:54

    What about list.pop? It gives you the item and updates the list in one step.

    import random
    
    with open('TestingDeleteLines.txt') as f:
        lines = f.readlines()
    deleted = []
    
    indices_to_delete = random.sample(range(len(lines)), 10)
    
    # sort so the biggest index is deleted first
    indices_to_delete.sort(reverse=True)
    
    for i in indices_to_delete:
        # lines.pop(i) deletes the item at index i and returns it;
        # the original index is kept too, in case you need it later
        deleted.append((i, lines.pop(i)))
    
    # write the updated *lines* back to the file (or to a new file)
    # and you have everything in deleted if you need it again
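
    The reverse sort matters because every pop shifts the items after it one position to the left. A tiny illustration with made-up data, popping the same two indices in ascending and then in descending order:

```python
lines = ["l0", "l1", "l2", "l3", "l4"]
indices = [1, 3]  # positions in the original list

ascending = list(lines)
removed_wrong = []
for i in sorted(indices):
    removed_wrong.append(ascending.pop(i))
print(removed_wrong)  # ['l1', 'l4']: after the first pop, index 3 points at 'l4'

descending = list(lines)
removed_right = []
for i in sorted(indices, reverse=True):
    removed_right.append(descending.pop(i))
print(removed_right)  # ['l3', 'l1']
print(descending)     # ['l0', 'l2', 'l4']
```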
    