How to handle frequent changes to a dataset in Azure Machine Learning Studio?

Question

How do I handle frequent changes to a dataset in Azure Machine Learning Studio? My dataset may change over time as I add more rows to it. How do I refresh the dataset I currently use to train the model so that it picks up the newly updated data? I need to do this programmatically (in C# or Python) rather than manually in the studio.

Answer 1:

When you register an AzureML Dataset, no data is moved. Only some metadata is stored, such as where the data lives and how it should be loaded. The purpose is to make accessing the data as simple as calling dataset = Dataset.get_by_name(workspace, name="my dataset")

In the snippet below (full example), if I register the dataset, I could technically overwrite weather/2018/11.csv with a new version afterwards. My Dataset definition would stay the same, but the new data is what you would get if you use the dataset in training after the overwrite.

from azureml.core import Dataset

# `datastore` is assumed to be a Datastore object retrieved earlier (not shown here)
# create a TabularDataset from 3 paths in datastore
datastore_paths = [(datastore, 'weather/2018/11.csv'),
                   (datastore, 'weather/2018/12.csv'),
                   (datastore, 'weather/2019/*.csv')]
weather_ds = Dataset.Tabular.from_delimited_files(path=datastore_paths)
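To illustrate why overwritten or appended files show up at training time, here is a minimal sketch of loading a registered dataset by name inside a training script. It assumes the azureml-core (SDK v1) package; the helper name `load_training_frame` and its parameters are hypothetical, not part of the SDK.

```python
def load_training_frame(workspace, dataset_name, version="latest"):
    """Fetch a registered TabularDataset by name and return it as a pandas DataFrame.

    `load_training_frame` is a hypothetical helper, not an SDK function.
    """
    # Imported lazily so this module can be loaded where azureml-core is absent.
    from azureml.core import Dataset

    # The registration only stores where the files are and how to parse them.
    dataset = Dataset.get_by_name(workspace, name=dataset_name, version=version)

    # The files are read here, so the frame reflects whatever is in storage now.
    return dataset.to_pandas_dataframe()


# Example usage (assumes a local config.json describing the workspace):
#   from azureml.core import Workspace
#   ws = Workspace.from_config()
#   df = load_training_frame(ws, "my dataset")
```

Because the data is resolved when the dataset is loaded rather than when it is registered, rerunning training after the files change should pick up the new rows without touching the registration.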

However, there are two more recommended approaches (my team does both):

1. Isolate your data and register a new version of the Dataset, so that you can always roll back to a previous version. See "Dataset Versioning Best Practice".
2. Use a wildcard/glob data path to refer to a folder that has new data loaded into it on a regular basis. This way you can have a Dataset that grows over time without having to re-register it.
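The two approaches above can be combined in a short sketch: build the dataset from a glob path and register it as a new version under the same name. This assumes the azureml-core (SDK v1) package; `register_new_version` and its parameter names are hypothetical, chosen for illustration only.

```python
def register_new_version(workspace, datastore, path_glob, dataset_name):
    """Register the files matching `path_glob` as a new version of `dataset_name`.

    `register_new_version` is a hypothetical helper, not an SDK function.
    """
    # Imported lazily so this module can be loaded where azureml-core is absent.
    from azureml.core import Dataset

    # A glob such as 'weather/2019/*.csv' also matches files added later.
    dataset = Dataset.Tabular.from_delimited_files(path=[(datastore, path_glob)])

    # create_new_version=True keeps earlier versions available for roll-back.
    return dataset.register(
        workspace=workspace,
        name=dataset_name,
        create_new_version=True,
    )


# Example usage (assumes a configured workspace and its default datastore):
#   from azureml.core import Workspace
#   ws = Workspace.from_config()
#   register_new_version(ws, ws.get_default_datastore(), "weather/2019/*.csv", "weather")
```

An earlier version can then be fetched explicitly with `Dataset.get_by_name(workspace, name, version=1)`, while omitting `version` returns the latest one.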

Answer 2:

Does this work for you? https://stackoverflow.com/a/60639631/12925558

You can manipulate the dataset object directly.

Source: https://stackoverflow.com/questions/60652742/how-to-handle-the-frequent-changes-in-dataset-in-azure-machine-learning-studio

Tags

python

azure

azure-machine-learning-studio

azure-machine-learning-service