Groovy/Java: Parallel processing of directory structure where each node is a list of subdirectories/files

六月ゝ 毕业季﹏ submitted on 2019-12-11 15:34:25

Question


Here's my current problem:

I have a directory structure stored in cloud storage. Under the Root folder, I have 1000+ subdirectories, and each of those has a single subdirectory under it. Within each of those subdirectories, a single file exists. So a simplified diagram looks something like this:

                      Root
       ________________|________________
      |         |             |         |
   FolderA   FolderB  ...  FolderY   FolderZ
      |         |             |         |
   Folder1   Folder2       Folder3   Folder4
      |         |             |         |
    FileA     FileB         FileC     FileD

Each node has two properties: type ("directory" or "file") and path (for example "/Root/FolderB"). The only way to retrieve these nodes is to call a method named listDirectory(path), which goes to the cloud and gets all the objects within that path. I need to find all the files and process them.

The problem is that, with the way it's structured, if I want to look for FileA I need to call listDirectory() three times (Root -> FolderA -> Folder1), which, as you can imagine, slows the whole thing down significantly.

I want to process this in parallel, but I can't seem to get it to work. I've tried doing it recursively by using GParsPool.withPool with eachParallel(), but I found out that parallel programming with recursion can be a dangerous (and expensive) slope. I've also tried doing it linearly by creating a synchronized list that holds the paths of all the directories that each thread has visited. But neither approach seems to work or provide an efficient solution to this problem.

FYI, I can't change the listDirectory() method. Each call will retrieve all the objects in that path.

TL;DR: I need to find a parallel way to process a cloud-storage file structure where the only way to get the folders/files is through a listDirectory(path) method.


Answer 1:


This answer assumes that neither of the following is an option: caching the directory structure in memory by using a daemon, or building a one-time in-memory mapping of the storage structure and keeping it up to date by hooking into every add, remove, and update operation on the storage.

Assuming the storage structure is a tree (it usually is), and given the way listDirectory() works, I think you are better off using breadth-first search to walk the storage tree. That way you can search one level at a time and list all the directories of that level in parallel.

Your code could look something like this:

SearchElement.java - represents either a directory or a file

public class SearchElement {

    private final String path;
    private final String name;

    public SearchElement(String path, String name) {
        this.path = path;
        this.name = name;
    }

    public String getPath() {
        return path;
    }

    public String getName() {
        return name;
    }

}

ElementFinder.java - a class that searches the storage level by level. You need to replace the placeholder listDirectory function with your own implementation.

import java.util.ArrayList;
import java.util.Collection;
import java.util.Optional;
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.atomic.AtomicReference;

public class ElementFinder {
    private static final SearchElement ROOT_DIRECTORY = new SearchElement("/", "");

    public Optional<SearchElement> find(String elementName) {
        // Elements of the tree level that is currently being searched.
        Queue<SearchElement> currentLevelElements = new ConcurrentLinkedQueue<>();
        currentLevelElements.add(ROOT_DIRECTORY);

        // Set by whichever worker thread finds a match first.
        AtomicReference<Optional<SearchElement>> wantedElement = new AtomicReference<>(Optional.empty());

        // Optional.isEmpty() requires Java 11 or later.
        while (!currentLevelElements.isEmpty() && wantedElement.get().isEmpty()) {
            // Thread-safe, because the parallel stream below adds to it from several threads.
            Queue<SearchElement> nextLevelElements = new ConcurrentLinkedQueue<>();
            currentLevelElements.parallelStream().forEach(currentSearchElement -> {
                Collection<SearchElement> subDirectoriesAndFiles = listDirectory(currentSearchElement.getPath());

                subDirectoriesAndFiles.stream()
                        .filter(searchElement -> searchElement.getName().equals(elementName))
                        .findAny()
                        .ifPresent(element -> wantedElement.set(Optional.of(element)));

                // SearchElement does not record whether it is a file or a directory, so files are
                // queued as well; listDirectory() is assumed to return nothing for a file path.
                nextLevelElements.addAll(subDirectoriesAndFiles);
            });

            currentLevelElements = nextLevelElements;
        }

        return wantedElement.get();
    }

    private Collection<SearchElement> listDirectory(String path) {
        return new ArrayList<>(); // replace me!
    }
}
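
The finder above stops at the first element whose name matches, whereas the question asks for every file to be found and processed. A minimal sketch of that variant follows, using the same level-by-level idea. The names Node, Storage, and ParallelFileCollector are hypothetical and are not part of the original answer; Storage stands in for the real listDirectory(path) call, and the thread count is an assumption you would tune yourself. The sketch uses a dedicated fixed thread pool rather than a parallel stream because listDirectory() is a blocking network call, and parallel streams run on the common ForkJoinPool, which is sized for CPU-bound work.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical node model: the question only says a node has a type and a path.
final class Node {
    final String type; // "directory" or "file"
    final String path;

    Node(String type, String path) {
        this.type = type;
        this.path = path;
    }

    boolean isDirectory() {
        return "directory".equals(type);
    }
}

// Hypothetical stand-in for the cloud call that cannot be changed.
interface Storage {
    List<Node> listDirectory(String path);
}

final class ParallelFileCollector {

    // Walks the tree one level at a time; all listings of one level run concurrently.
    static List<Node> collectFiles(Storage storage, String rootPath, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Node> files = new ArrayList<>();
            List<String> currentLevel = List.of(rootPath);
            while (!currentLevel.isEmpty()) {
                List<Callable<List<Node>>> tasks = new ArrayList<>();
                for (String directory : currentLevel) {
                    tasks.add(() -> storage.listDirectory(directory));
                }
                List<String> nextLevel = new ArrayList<>();
                // invokeAll blocks until every listing of this level has finished.
                for (Future<List<Node>> listing : pool.invokeAll(tasks)) {
                    for (Node node : listing.get()) {
                        if (node.isDirectory()) {
                            nextLevel.add(node.path); // list it in the next round
                        } else {
                            files.add(node);          // leaf: keep for processing
                        }
                    }
                }
                currentLevel = nextLevel;
            }
            return files;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("Interrupted while listing directories", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("A listDirectory call failed", e.getCause());
        } finally {
            pool.shutdown();
        }
    }
}
```

For the structure in the question, the total number of listDirectory() calls stays the same (one for Root, plus one per folder on each of the two levels below it), but they complete in three rounds instead of one after another. Only the calling thread touches the result lists, so no synchronized collections are needed.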


Source: https://stackoverflow.com/questions/58226567/groovy-java-parallel-processing-of-directory-structure-where-each-node-is-a-lis
