Python random lines from subfolders

Submitted by 给你一囗甜甜゛ on 2019-11-28 09:28:23
Martijn Pieters

To get a proper random distribution across all these files, you'd need to view them as one big set of lines and pick 10 at random. In other words, you'll have to read every file at least once, if only to figure out how many lines you have.

You do not need to hold all the lines in memory, however. You can do this in two phases: index your files to count the number of lines in each, then pick 10 random lines to read from these files.

First indexing:

import os

root_path = r'C:\Tasks'
total_lines = 0
file_indices = dict()

# Based on https://stackoverflow.com/q/845058, bufcount function.
# Counts newline characters, so a final line without a trailing newline is not counted.
def linecount(filename, buf_size=1024*1024):
    with open(filename) as f:
        return sum(buf.count('\n') for buf in iter(lambda: f.read(buf_size), ''))

for dirpath, dirnames, filenames in os.walk(root_path):
    for filename in filenames:
        if not filename.endswith('.txt'):
            continue
        path = os.path.join(dirpath, filename)
        file_indices[total_lines] = path
        total_lines += linecount(path)

offsets = sorted(file_indices)

We now have a mapping from line offsets to filenames, plus a total line count. Next, pick ten random indices and read those lines from your files:

import random
import bisect

tasks = list(range(total_lines))
task_indices = random.sample(tasks, 10)

for index in task_indices:
    # find the closest file index
    file_index = offsets[bisect.bisect(offsets, index) - 1]
    path = file_indices[file_index]
    curr_line = file_index
    with open(path) as f:
        while curr_line <= index:
            task = f.readline()
            curr_line += 1
    print(task)
    tasks.remove(index)
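
To see how the bisect lookup maps a global line index back to a file, here is a small sketch with made-up offsets. The file names and the `locate` helper are hypothetical and only illustrate the lookup used in the loop above.

```python
import bisect

# Hypothetical index: three files starting at global line numbers 0, 3 and 7.
file_indices = {0: "a.txt", 3: "b.txt", 7: "c.txt"}
offsets = sorted(file_indices)

def locate(index):
    """Return (path, line number within that file) for a global line index."""
    # bisect.bisect() returns the insertion point to the right of any equal
    # entry, so subtracting 1 gives the last offset that is <= index.
    start = offsets[bisect.bisect(offsets, index) - 1]
    return file_indices[start], index - start

print(locate(0))  # ('a.txt', 0)
print(locate(3))  # ('b.txt', 0)
print(locate(6))  # ('b.txt', 3)
```

Global line 6 falls in the file that starts at offset 3, so it is the fourth line (index 3) of that file.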

Note that you only need the indexing once; you can store the result somewhere and only update it when your files update.
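
One possible way to store the index is a small JSON file. This is only a sketch: the function names `save_index` and `load_index`, and the file layout, are assumptions and not part of the original answer.

```python
import json

def save_index(index_path, file_indices, total_lines):
    # JSON object keys must be strings, so store the offsets as a list of
    # [offset, path] pairs instead of as a dict with integer keys.
    data = {"total_lines": total_lines, "files": sorted(file_indices.items())}
    with open(index_path, "w", encoding="utf-8") as f:
        json.dump(data, f)

def load_index(index_path):
    with open(index_path, encoding="utf-8") as f:
        data = json.load(f)
    file_indices = {offset: path for offset, path in data["files"]}
    # Return the same three values the indexing phase produces.
    return file_indices, sorted(file_indices), data["total_lines"]
```

After loading, `file_indices`, `offsets` and `total_lines` can be used by the sampling loop without walking the directory tree again.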

Also note that your tasks are now 'stored' in the tasks list; these are indices to lines in your files, and I remove each index from that list when printing the selected task. The next time you call random.sample(), the tasks already picked will no longer be available. This structure will need updating if your files ever change, as the indices have to be recalculated. The file_indices mapping will help you with that, but that is outside the scope of this answer. :-)

If you need only one 10-item sample, use Blckknght's solution instead, as it goes through the files only once, while mine requires 10 extra file openings. If you need multiple samples, this solution requires only 10 extra file openings each time you need a sample; it won't scan through all the files again. If you have fewer than 10 files, still use Blckknght's answer. :-)

Blckknght

Here's a simple solution that makes just one pass through the files per sample. If you know exactly how many items you will be sampling from the files, it is probably optimal.

First off is the sample function. This uses the same algorithm that @NedBatchelder linked to in a comment on an earlier answer (though the Perl code shown there only selected a single line, rather than several). It selects values from an iterable of lines, and only requires the currently selected lines to be kept in memory at any given time (plus the next candidate line). It raises a ValueError if the iterable has fewer values than the requested sample size.

import random

def random_sample(n, items):
    results = []

    for i, v in enumerate(items):
        r = random.randint(0, i)
        if r < n:
            if i < n:
                results.insert(r, v) # add first n items in random order
            else:
                results[r] = v # at a decreasing rate, replace random items

    if len(results) < n:
        raise ValueError("Sample larger than population.")

    return results

edit: In another question, user @DzinX noticed that the use of insert in this code makes the performance bad (O(N^2)) if you're sampling a very large number of values. His improved version which avoids that issue is here. /edit
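
For illustration, here is one common way to avoid the insert call. This is a sketch and not necessarily the version @DzinX posted; the name `random_sample_no_insert` is made up here. It fills the reservoir with append, replaces entries in place, and shuffles once at the end so the result order is random, as in the original function.

```python
import random

def random_sample_no_insert(n, items):
    """Reservoir sampling that appends first, then replaces entries in place."""
    results = []
    for i, v in enumerate(items):
        if i < n:
            results.append(v)          # fill the reservoir with the first n items
        else:
            r = random.randint(0, i)   # randint is inclusive on both ends
            if r < n:
                results[r] = v         # keep v with probability n / (i + 1)

    if len(results) < n:
        raise ValueError("Sample larger than population.")

    random.shuffle(results)            # early items would otherwise keep their positions
    return results
```

The final shuffle matters if you later slice the result into several smaller samples, since without it the first items of the input tend to stay at the front.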

Now we just need to make a suitable iterable of items for our function to sample from. Here's how I'd do it using a generator. This code will only keep one file open at a time, and it does not need more than one line in memory at a time. The optional exclude parameter, if present, should be a set containing lines that have been selected on a previous run (and so should not be yielded again).

import os

def lines_generator(base_folder, exclude = None):
    for dirpath, dirs, files in os.walk(base_folder):
        for filename in files:
            if filename.endswith(".txt"):
                fullPath = os.path.join(dirpath, filename)
                with open(fullPath) as f:
                    for line in f:
                        cleanLine = line.strip()
                        if exclude is None or cleanLine not in exclude:
                            yield cleanLine

Now, we just need a wrapper function to tie those two pieces together (and manage a set of seen lines). It can return a single sample of size n or a list of count samples, taking advantage of the fact that a slice from a random sample is also a random sample.

_seen = set()

def get_sample(n, count = None):
    base_folder = r"C:\Tasks"
    if count is None:
        sample = random_sample(n, lines_generator(base_folder, _seen))
        _seen.update(sample)
        return sample
    else:
        sample = random_sample(count * n, lines_generator(base_folder, _seen))
        _seen.update(sample)
        return [sample[i * n:(i + 1) * n] for i in range(count)]

Here's how it can be used:

def main():
    s1 = get_sample(10)
    print("Sample1:", *s1, sep="\n")

    s2, s3 = get_sample(10,2) # get two samples with only one read of the files
    print("\nSample2:", *s2, sep="\n")
    print("\nSample3:", *s3, sep="\n")

    s4 = get_sample(5000) # this will probably raise a ValueError!

EDIT: On closer scrutiny this answer does not fit the bill. Reworking it led me to the reservoir sampling algorithm, which @Blckknght used in his answer. So ignore this answer.

There are a few ways of doing it. Here's one...

  1. Get a list of all task files
  2. Select one at random
  3. Select a single line from that file at random
  4. Repeat until we have the desired number of lines

The code...

import os
import random

def file_iterator(top_dir):
    """Gather all task files"""
    files = []
    for dirpath, dirnames, filenames in os.walk(top_dir):
        for filename in filenames:
            if not filename.endswith('.txt'):
                continue
            path = os.path.join(dirpath, filename)
            files.append(path)
    return files


def random_lines(files, number=10):
    """Select a random file, select a random line until we have enough
    """
    selected_tasks = []

    while len(selected_tasks) < number:
        f = random.choice(files)
        with open(f) as tasks:
            lines = tasks.readlines()
            l = random.choice(lines)
            selected_tasks.append(l)
    return selected_tasks


## Usage
files = file_iterator(r'C:\Tasks')
random_tasks = random_lines(files)
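
The EDIT note above does not spell out what was wrong. One issue that can be stated with certainty is that picking a file uniformly and then a line uniformly within it does not weight all lines equally: lines in short files are favored. A small sketch with hypothetical file names and line counts (the helper `line_pick_probability` is not part of the original answer):

```python
from fractions import Fraction

def line_pick_probability(line_counts):
    """Chance of picking one particular line of each file, when a file is
    chosen uniformly first and then a line uniformly within that file."""
    n_files = len(line_counts)
    return {name: Fraction(1, n_files * count)
            for name, count in line_counts.items()}

# Hypothetical folder: one short file and one long file, 100 lines in total.
probs = line_pick_probability({"short.txt": 2, "long.txt": 98})
print(probs["short.txt"])  # 1/4
print(probs["long.txt"])   # 1/196
```

With a uniform draw over all 100 lines, every line would have probability 1/100. The loop in random_lines can also return the same line more than once.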