问题
I need to take exactly 10 files (images) in a dataset randomly, but this dataset is hierarchically structured.
So I need that for each subdirectory that contains images hold just 10 of them randomly. Is there an easy way to do that or I should do it manually?
def getListOfFiles(dirName):
### create a list of file and sub directories
### names in the given directory
listOfFile = os.listdir(dirName)
allFiles = list()
### Iterate over all the entries
for entry in listOfFile:
### Create full path
fullPath = os.path.join(dirName, entry)
### If entry is a directory then get the list of files in this directory
if os.path.isdir(fullPath):
allFiles = allFiles + getListOfFiles(fullPath)
else:
allFiles.append(random.sample(fullPath, 10))
return allFiles
dirName = 'C:/Users/bla/bla'
### Get the list of all files in directory tree at given path
listOfFiles = getListOfFiles(dirName)
with open("elements.txt", mode='x') as f:
for elem in listOfFiles:
f.write(elem + '\n')
回答1:
Good approach to sample from unknown size directory listing is to use Reservoir Sampling. With this approach, you don't have to run upfront and list all files in the directory. Read it one-by-one and sample. It even works when you have to sample fixed number of files across multiple directories.
It would be good to use generator-based directory scanning code, which picks one file at a time, thus you don't use gobs of memory upfront to hold all file names.
Along the lines (NB! undested code!)
import numpy as np
import os
def ResSampleFiles(dirname, N):
"""pick N files from directory"""
sampled_files = list()
k = 0
for item in scandir(dirname):
if item.is_dir():
continue
full_path = os.path.join(dirname, item.name)
if k < N:
sampled_files.append(full_path)
else:
idx = np.random.randint(0, k+1)
if (idx < N):
sampled_files[idx] = full_path
k += 1
return sampled_files
来源:https://stackoverflow.com/questions/60298439/select-randomly-x-files-in-subdirectories