Question
Here's my current problem:
I have a directory structure stored in cloud storage. Under the Root folder there are 1000+ subdirectories, each of which has a single subdirectory under it, and each of those subdirectories contains a single file. A simplified diagram looks something like this:
                    Root
     ________________|_________________
     |          |           |         |
  FolderA    FolderB ... FolderY   FolderZ
     |          |           |         |
  Folder1    Folder2     Folder3   Folder4
     |          |           |         |
   FileA      FileB       FileC     FileD
Each node has the properties type ("directory" or "file") and path (e.g. "/Root/FolderB"). The only way to retrieve these nodes is to call a method named listDirectory(path), which goes to the cloud and gets all the objects within that path. I need to find all the files and process them.
The problem is that, with the way it's structured, if I want to look for FileA I need to call listDirectory() three times (Root -> FolderA -> Folder1), which, as you can imagine, slows the whole thing down significantly.
I want to process this in parallel, but I can't seem to get it to work. I've tried doing it recursively using GParsPool.withPool with eachParallel(), but I found out that combining parallel programming with recursion can be a dangerous (and expensive) slope. I've also tried doing it linearly by creating a synchronized list that holds the paths of all directories that each thread has visited. Neither approach works or provides an efficient solution to this problem.
FYI, I can't change the listDirectory() method. Each call will retrieve all the objects in that path.
TL;DR: I need to find a parallel way to process a cloud-storage file structure where the only way to get the folders/files is through a listDirectory(path) method.
Answer 1:
This assumes that caching the directory structure in memory with a daemon is not an option, and that building a one-time in-memory mapping of the storage structure, then hooking into every add, remove, and update operation on the storage to keep that mapping up to date, is not an option either.
Assuming the storage structure is a tree (it usually is), and given the way listDirectory() works, I think you are better off using breadth-first search to traverse the storage tree. That way you can search one level at a time, listing all the directories of that level in parallel.
Your code could look something like this:
SearchElement.java - represents either a directory or a file
public class SearchElement {
    private final String path; // full path of the element, e.g. "/Root/FolderB"
    private final String name; // last path segment, e.g. "FolderB"

    public SearchElement(String path, String name) {
        this.path = path;
        this.name = name;
    }

    public String getPath() {
        return path;
    }

    public String getName() {
        return name;
    }
}
ElementFinder.java - a class that searches the storage. You need to replace the listDirectory function with your own implementation.
import java.util.ArrayList;
import java.util.Collection;
import java.util.Optional;
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.atomic.AtomicReference;

public class ElementFinder {
    private static final SearchElement ROOT_DIRECTORY_PATH = new SearchElement("/", "");

    public Optional<SearchElement> find(String elementName) {
        Queue<SearchElement> currentLevelElements = new ConcurrentLinkedQueue<>();
        currentLevelElements.add(ROOT_DIRECTORY_PATH);
        // Holds the match once any worker thread finds it.
        AtomicReference<Optional<SearchElement>> wantedElement = new AtomicReference<>(Optional.empty());

        // Breadth-first search: one tree level per iteration, until a match is found.
        // (!isPresent() is used because Optional.isEmpty() requires Java 11+.)
        while (!currentLevelElements.isEmpty() && !wantedElement.get().isPresent()) {
            Queue<SearchElement> nextLevelElements = new ConcurrentLinkedQueue<>();
            // List every element of the current level in parallel.
            currentLevelElements.parallelStream().forEach(currentSearchElement -> {
                Collection<SearchElement> subDirectoriesAndFiles = listDirectory(currentSearchElement.getPath());
                subDirectoriesAndFiles.stream()
                        .filter(searchElement -> searchElement.getName().equals(elementName))
                        .findAny()
                        .ifPresent(element -> wantedElement.set(Optional.of(element)));
                // Files are queued as well, because SearchElement does not record its type.
                nextLevelElements.addAll(subDirectoriesAndFiles);
            });
            currentLevelElements = nextLevelElements;
        }
        return wantedElement.get();
    }

    private Collection<SearchElement> listDirectory(String path) {
        return new ArrayList<>(); // replace me!
    }
}
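The class above stops at the first element whose name matches, whereas the question asks for every file. A minimal sketch of the same level-by-level idea, adapted to collect all file paths, might look like the following. The Node and FileCollector classes are hypothetical names made up for illustration, and the real listDirectory(path) call is assumed to be passed in as a function that returns nodes carrying the type and path properties described in the question.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.function.Function;
import java.util.stream.Collectors;

// Hypothetical node type mirroring the "type" and "path" properties from the question.
final class Node {
    final String type; // "directory" or "file"
    final String path;

    Node(String type, String path) {
        this.type = type;
        this.path = path;
    }

    boolean isDirectory() {
        return "directory".equals(type);
    }
}

final class FileCollector {
    // Stand-in for the real listDirectory(path) call (an assumption, injected by the caller).
    private final Function<String, List<Node>> listDirectory;

    FileCollector(Function<String, List<Node>> listDirectory) {
        this.listDirectory = listDirectory;
    }

    // Walks the tree one level at a time and returns the paths of all files found.
    List<String> collectFilePaths(String rootPath) {
        List<String> filePaths = new ArrayList<>();
        List<String> currentLevel = Collections.singletonList(rootPath);
        while (!currentLevel.isEmpty()) {
            // All directories of the current level are listed in parallel.
            List<Node> children = currentLevel.parallelStream()
                    .flatMap(directory -> listDirectory.apply(directory).stream())
                    .collect(Collectors.toList());
            // Sorting the results happens on the calling thread, so plain lists suffice.
            List<String> nextLevel = new ArrayList<>();
            for (Node child : children) {
                if (child.isDirectory()) {
                    nextLevel.add(child.path);
                } else {
                    filePaths.add(child.path);
                }
            }
            currentLevel = nextLevel;
        }
        return filePaths;
    }
}
```

With the layout from the question this would issue one listDirectory() call for Root, then one parallel batch for the 1000+ first-level folders, then one parallel batch for the second-level folders. Note that parallelStream() runs on the common ForkJoinPool by default, which is sized for CPU-bound work; since listDirectory() blocks on the network, a dedicated ExecutorService with more threads may be a better fit.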
Source: https://stackoverflow.com/questions/58226567/groovy-java-parallel-processing-of-directory-structure-where-each-node-is-a-lis