Question
During a training script executed on a compute target, we're trying to download a registered Dataset from an ADLS2 Datastore. The problem is that it takes hours to download ~1.5 GB (split into ~8500 files) to the compute target with the following method:
from azureml.core import Datastore, Dataset, Run, Workspace
# Retrieve the run context to get Workspace
RUN = Run.get_context(allow_offline=True)
# Retrieve the workspace
ws = RUN.experiment.workspace
# Creating the Dataset object based on a registered Dataset
dataset = Dataset.get_by_name(ws, name='my_dataset_registered')
# Download the Dataset locally
dataset.download(target_path='/tmp/data', overwrite=False)
Important note: the Dataset is registered to a path in the Datalake that contains a lot of subfolders (as well as sub-subfolders and so on), each containing small files of around 170 KB.
Note: I'm able to download the complete dataset to my local computer within a few minutes using AzCopy or Storage Explorer.
Also, the Dataset is defined at the folder level with the ** wildcard for scanning subfolders: datalake/relative/path/to/folder/**
Is this a known issue? How can I improve the transfer speed?
Thanks !
Answer 1:
Edited to be more answer-like:
It'd be helpful to include: what versions of azureml-core and azureml-dataprep SDK you are using, what type of VM you are running as the compute instance, and what types of files (e.g. jpg? txt?) your dataset is using. Also, what are you trying to achieve by downloading the complete dataset to your compute?
Currently, the compute instance image comes with azureml-core 1.0.83 and azureml-dataprep 1.1.35 pre-installed, which are 1-2 months old. You might be using even older versions. You can try upgrading by running this in your notebook:
%pip install -U azureml-sdk
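To see which versions are actually installed before and after the upgrade, something like the following could be run in the same notebook. This is only a minimal sketch using the standard library's importlib.metadata (Python 3.8+); the helper name installed_versions is made up for illustration and is not part of the Azure ML SDK.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_versions(packages):
    """Return a dict mapping each package name to its installed version.

    Packages that are not installed map to None. The function name is
    hypothetical; it only wraps importlib.metadata.version.
    """
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


if __name__ == "__main__":
    # The two packages mentioned in this answer.
    for name, ver in installed_versions(["azureml-core", "azureml-dataprep"]).items():
        print(f"{name}: {ver or 'not installed'}")
```

Including that output in the question or in a filed issue makes it easier for others to tell whether an outdated SDK is involved.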
If you don't see any improvement in your scenario, you can file an issue on the relevant official docs page, such as the reference page for FileDataset, so that someone can help debug it.
(edited on June 9, 2020 to remove mention of experimental release because that is not happening anymore)
Answer 2:
DataTransferStep creates an Azure ML Pipeline step that transfers data between storage options.
See the reference documentation for the DataTransferStep class: https://docs.microsoft.com/en-us/python/api/azureml-pipeline-steps/azureml.pipeline.steps.data_transfer_step.datatransferstep?view=azure-ml-py
Source: https://stackoverflow.com/questions/60562966/transfer-from-adls2-to-compute-target-very-slow-azure-machine-learning