I have many tasks in .txt files in multiple subfolders. I am trying to pick a total of 10 tasks at random from these folders, their contained files and finally a text line w
EDIT: On closer scrutiny this answer does not fit the bill. Reworking it led me to the reservoir sampling algorithm, which @Blckknght used in his answer. So ignore this answer.
There are a few ways of doing it. Here's one...
The code...
import os
import random

def file_iterator(top_dir):
    """Gather all task files"""
    files = []
    for dirpath, dirnames, filenames in os.walk(top_dir):
        for filename in filenames:
            if not filename.endswith('.txt'):
                continue
            path = os.path.join(dirpath, filename)
            files.append(path)
    return files

def random_lines(files, number=10):
    """Select a random file, select a random line until we have enough
    """
    selected_tasks = []
    while len(selected_tasks) < number:
        f = random.choice(files)
        with open(f) as tasks:
            lines = tasks.readlines()
        l = random.choice(lines)
        selected_tasks.append(l)
    return selected_tasks

## Usage
files = file_iterator(r'C:\Tasks')
random_tasks = random_lines(files)
To get a proper random distribution across all these files, you'd need to view them as one big set of lines and pick 10 at random. In other words, you'll have to read all these files at least once to at least figure out how many lines you have.
You do not need to hold all the lines in memory however. You'd have to do this in two phases: index your files to count the number of lines in each, then pick 10 random lines to be read from these files.
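To see why the lines have to be treated as one pool, here is a small worked example comparing the two strategies. The file names and line counts (1 line and 99 lines) are made up for illustration.

```python
from fractions import Fraction

# Hypothetical line counts for two task files (illustration only).
line_counts = {"short.txt": 1, "long.txt": 99}
total = sum(line_counts.values())

# Strategy A: pick a file uniformly, then a line uniformly within it.
# Chance that one particular line of each file gets picked.
per_line_file_first = {
    name: Fraction(1, len(line_counts)) * Fraction(1, count)
    for name, count in line_counts.items()
}

# Strategy B: treat all lines as one pool and pick uniformly.
per_line_pooled = {name: Fraction(1, total) for name in line_counts}

for name in line_counts:
    print(name, per_line_file_first[name], per_line_pooled[name])
```

With file-first selection, the single line in the short file is picked half the time, while each line in the long file has a 1 in 198 chance. Pooling gives every line the same 1 in 100 chance.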
First indexing:
import os

root_path = r'C:\Tasks'
total_lines = 0
file_indices = dict()

# Based on https://stackoverflow.com/q/845058, bufcount function
# Counts newline characters, so a final line without a trailing newline is not counted
def linecount(filename, buf_size=1024*1024):
    with open(filename) as f:
        return sum(buf.count('\n') for buf in iter(lambda: f.read(buf_size), ''))

for dirpath, dirnames, filenames in os.walk(root_path):
    for filename in filenames:
        if not filename.endswith('.txt'):
            continue
        path = os.path.join(dirpath, filename)
        # key is the global index of this file's first line
        file_indices[total_lines] = path
        total_lines += linecount(path)

offsets = list(file_indices.keys())
offsets.sort()
We now have a mapping from offsets to filenames, plus a total line count. Next, we pick ten random indices and read those lines from your files:
import random
import bisect

tasks = list(range(total_lines))
task_indices = random.sample(tasks, 10)
for index in task_indices:
    # find the closest file index
    file_index = offsets[bisect.bisect(offsets, index) - 1]
    path = file_indices[file_index]
    curr_line = file_index
    with open(path) as f:
        while curr_line <= index:
            task = f.readline()
            curr_line += 1
    print(task)
    tasks.remove(index)
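The offset lookup is the least obvious step, so here is a minimal standalone sketch of it. The three file names and their starting offsets are hypothetical, and locate is a helper name introduced only for this illustration.

```python
import bisect

# Hypothetical index: a.txt starts at global line 0, b.txt at 3, c.txt at 10.
file_indices = {0: "a.txt", 3: "b.txt", 10: "c.txt"}
offsets = sorted(file_indices)

def locate(index):
    """Return (path, line number within that file) for a global line index."""
    # bisect.bisect returns the insertion point to the right of any equal
    # entry, so subtracting 1 gives the last offset that is <= index.
    start = offsets[bisect.bisect(offsets, index) - 1]
    return file_indices[start], index - start

print(locate(0))   # ('a.txt', 0)
print(locate(5))   # ('b.txt', 2)
print(locate(10))  # ('c.txt', 0)
```

Because the lookup is a binary search over the sorted offsets, it stays cheap even with a large number of files.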
Note that you only need the indexing once; you can store the result somewhere and only update it when your files update.
Also note that your tasks are now 'stored' in the tasks list; these are indices to lines in your files, and I remove each index from that list when printing the selected task. The next time you call random.sample(), the tasks previously picked will no longer be available for picking. This structure will need updating if your files ever do change, as the indexes have to be re-calculated. The file_indices mapping will help you with that task, but that is outside the scope of this answer. :-)
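As a rough sketch of storing the index between runs, the mapping and the line count could be written to a JSON file. The function names save_index and load_index and the file layout are assumptions for illustration, not part of the code above. The offsets are saved as a list of pairs because JSON object keys are always strings, which would turn the integer offsets into text.

```python
import json

def save_index(index_path, file_indices, total_lines):
    """Write the offset-to-path mapping and the line count to a JSON file."""
    data = {
        "total_lines": total_lines,
        # list of [offset, path] pairs keeps the offsets as integers
        "files": [[offset, path] for offset, path in sorted(file_indices.items())],
    }
    with open(index_path, "w") as f:
        json.dump(data, f)

def load_index(index_path):
    """Read the index back; returns (file_indices, total_lines)."""
    with open(index_path) as f:
        data = json.load(f)
    file_indices = {offset: path for offset, path in data["files"]}
    return file_indices, data["total_lines"]
```

The list of remaining task indices could be saved the same way if picked tasks need to stay excluded across runs.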
If you need only one 10-item sample, use Blckknght's solution instead, as it goes through the files only once, while mine requires 10 extra file openings. If you need multiple samples, this solution requires only those 10 extra file openings each time you need a sample; it won't scan through all the files again. If you have fewer than 10 files, still use Blckknght's answer. :-)
Here's a simple solution that makes just one pass through the files per sample. If you know exactly how many items you will be sampling from the files, it is probably optimal.
First off is the sample function. This uses the same algorithm that @NedBatchelder linked to in a comment on an earlier answer (though the Perl code shown there only selected a single line, rather than several). It selects values from an iterable of lines, and only requires the currently selected lines to be kept in memory at any given time (plus the next candidate line). It raises a ValueError if the iterable has fewer values than the requested sample size.
import random

def random_sample(n, items):
    results = []

    for i, v in enumerate(items):
        r = random.randint(0, i)
        if r < n:
            if i < n:
                results.insert(r, v) # add first n items in random order
            else:
                results[r] = v # at a decreasing rate, replace random items

    if len(results) < n:
        raise ValueError("Sample larger than population.")

    return results
edit: In another question, user @DzinX noticed that the use of insert in this code makes the performance bad (O(N^2)) if you're sampling a very large number of values. His improved version, which avoids that issue, is here. /edit
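The linked version is not reproduced here. As a sketch of one way to avoid insert (not necessarily the same as the linked code), the first n items can be appended in arrival order and the result shuffled once at the end. The name random_sample_append is made up for this example.

```python
import random

def random_sample_append(n, items):
    """Reservoir sampling that appends instead of inserting (hypothetical name)."""
    results = []
    for i, v in enumerate(items):
        if i < n:
            results.append(v)          # fill the reservoir in arrival order
        else:
            r = random.randint(0, i)   # randint is inclusive on both ends
            if r < n:
                results[r] = v         # replace a slot with probability n/(i+1)
    if len(results) < n:
        raise ValueError("Sample larger than population.")
    random.shuffle(results)            # one O(n) pass to randomize the order
    return results
```

The final shuffle matters if slices of the result are later used as separate samples, since the appended items would otherwise keep their original order.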
Now we just need to make a suitable iterable of items for our function to sample from. Here's how I'd do it using a generator. This code will only keep one file open at a time, and it does not need more than one line in memory at a time. The optional exclude parameter, if present, should be a set containing lines that have been selected on a previous run (and so should not be yielded again).
import os

def lines_generator(base_folder, exclude = None):
    for dirpath, dirs, files in os.walk(base_folder):
        for filename in files:
            if filename.endswith(".txt"):
                fullPath = os.path.join(dirpath, filename)
                with open(fullPath) as f:
                    for line in f:
                        cleanLine = line.strip()
                        if exclude is None or cleanLine not in exclude:
                            yield cleanLine
Now, we just need a wrapper function to tie those two pieces together (and manage a set of seen lines). It can return a single sample of size n or a list of count samples, taking advantage of the fact that a slice from a random sample is also a random sample.
_seen = set()

def get_sample(n, count = None):
    base_folder = r"C:\Tasks"
    if count is None:
        sample = random_sample(n, lines_generator(base_folder, _seen))
        _seen.update(sample)
        return sample
    else:
        sample = random_sample(count * n, lines_generator(base_folder, _seen))
        _seen.update(sample)
        return [sample[i * n:(i + 1) * n] for i in range(count)]
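The slicing trick can be seen in isolation with the standard library's random.sample, which returns its result in selection order so that sub-slices are themselves valid random samples. This is only an illustration: the population of integers is a stand-in for the task lines.

```python
import random

population = list(range(1000))   # stand-in for the task lines
n, count = 10, 2

# One draw of n * count distinct items, then cut into count pieces of size n.
combined = random.sample(population, n * count)
samples = [combined[i * n:(i + 1) * n] for i in range(count)]

# Each slice has n items, and no item appears in more than one slice.
print([len(s) for s in samples])
print(set(samples[0]).isdisjoint(samples[1]))
```

This only works when the combined sample is in random order, which is why random_sample places the first n items at random positions rather than in the order they were read.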
Here's how get_sample can be used:
def main():
    s1 = get_sample(10)
    print("Sample1:", *s1, sep="\n")

    s2, s3 = get_sample(10,2) # get two samples with only one read of the files
    print("\nSample2:", *s2, sep="\n")
    print("\nSample3:", *s3, sep="\n")

    s4 = get_sample(5000) # this will probably raise a ValueError!