K-fold cross-validation using Keras

遥遥无期 2021-02-14 08:01

It seems that k-fold cross-validation is rarely taken seriously for convolutional networks because training a neural network takes so long. I have a small dataset and I am interested in doing k-fold cross-validation…

1 answer
  • 2021-02-14 08:41

    If you are using images with data generators, here's one way to do 10-fold cross-validation with Keras and scikit-learn. The strategy is to copy the files to training, validation, and test subfolders according to each fold.

    import numpy as np
    import os
    import pandas as pd
    import shutil
    from sklearn.model_selection import StratifiedKFold
    from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
    
    # used to copy files according to each fold
    def copy_images(df, directory):
        destination_directory = "{path to your data directory}/" + directory
        print("copying {} files to {}...".format(directory, destination_directory))
    
        # remove all files from the previous fold
        if os.path.exists(destination_directory):
            shutil.rmtree(destination_directory)
    
        # create the folder for files from this fold
        os.makedirs(destination_directory, exist_ok=True)
    
        # create a subfolder for each class
        for c in df['class'].unique():
            os.makedirs(os.path.join(destination_directory, c), exist_ok=True)
    
        # copy files for this fold from a directory holding all the files
        for _, row in df.iterrows():
            try:
                # this is the path to all of your images kept together in a separate folder
                path_from = "{path to all of your images}"
                path_from = path_from + "{}.jpg"
                path_to = "{}/{}".format(destination_directory, row['class'])
    
                # copy from the folder keeping all files to the training, test, or validation folder (the "directory" argument)
                shutil.copy(path_from.format(row['filename']), path_to)
            except Exception as e:
                print("Error when copying {}: {}".format(row['filename'], str(e)))
    
    # dataframe containing the filenames of the images (e.g., GUID filenames) and the classes
    df = pd.read_csv('{path to your data}.csv')
    df_y = df['class']
    # drop() returns a copy, so the original dataframe keeps its 'class' column
    df_x = df.drop(columns=['class'])
    
    skf = StratifiedKFold(n_splits = 10)
    total_actual = []
    total_predicted = []
    total_val_accuracy = []
    total_val_loss = []
    total_test_accuracy = []
    
    for i, (train_index, test_index) in enumerate(skf.split(df_x, df_y)):
        x_train, x_test = df_x.iloc[train_index], df_x.iloc[test_index]
        y_train, y_test = df_y.iloc[train_index], df_y.iloc[test_index]
    
        train = pd.concat([x_train, y_train], axis=1)
        test = pd.concat([x_test, y_test], axis = 1)
    
        # take 20% of the training data from this fold for validation during training
        validation = train.sample(frac = 0.2)
    
        # make sure validation data does not include training data
        train = train[~train['filename'].isin(list(validation['filename']))]
    
        # copy the images according to the fold
        copy_images(train, 'training')
        copy_images(validation, 'validation')
        copy_images(test, 'test')
    
        print('**** Running fold '+ str(i))
    
        # here you call a function to create and train your model, returning validation accuracy and validation loss
        val_accuracy, val_loss = create_train_model()
    
        # append validation accuracy and loss for average calculation later on
        total_val_accuracy.append(val_accuracy)
        total_val_loss.append(val_loss)
    
        # here you will call a predict() method that will predict the images on the "test" subfolder 
        # this function returns the actual classes and the predicted classes in the same order
        actual, predicted = predict()
    
        # append accuracy from the predictions on the test data
        total_test_accuracy.append(accuracy_score(actual, predicted))
    
        # append all of the actual and predicted classes for your final evaluation
        # (extend() also works if predict() returns arrays rather than lists)
        total_actual.extend(actual)
        total_predicted.extend(predicted)
    
        # this is optional, but you can also watch the cumulative performance over the folds processed so far
        print(classification_report(total_actual, total_predicted))
        print(confusion_matrix(total_actual, total_predicted))
    
    print(classification_report(total_actual, total_predicted))
    print(confusion_matrix(total_actual, total_predicted))
    print("Validation accuracy on each fold:")
    print(total_val_accuracy)
    print("Mean validation accuracy: {}%".format(np.mean(total_val_accuracy) * 100))
    
    print("Validation loss on each fold:")
    print(total_val_loss)
    print("Mean validation loss: {}".format(np.mean(total_val_loss)))
    
    print("Test accuracy on each fold:")
    print(total_test_accuracy)
    print("Mean test accuracy: {}%".format(np.mean(total_test_accuracy) * 100))
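
With a small dataset, the spread of the scores across folds can be as informative as their mean. Here is a minimal sketch of reporting both, assuming the per-fold scores are plain numbers like those collected in the lists above. `summarize_folds` is a hypothetical helper, not part of Keras or scikit-learn, and the example scores are made up for illustration.

```python
import statistics


def summarize_folds(scores):
    """Return (mean, sample standard deviation) of per-fold scores."""
    scores = [float(s) for s in scores]
    if not scores:
        raise ValueError("scores must not be empty")
    mean = statistics.mean(scores)
    # the sample standard deviation needs at least two folds
    std = statistics.stdev(scores) if len(scores) > 1 else 0.0
    return mean, std


# made-up accuracies for illustration only, not results from this answer
example_scores = [0.80, 0.90, 0.85]
mean, std = summarize_folds(example_scores)
print("Mean test accuracy: {:.1f}% (+/- {:.1f}%)".format(mean * 100, std * 100))
```

In the loop above you would pass `total_test_accuracy` (or `total_val_accuracy`) instead of `example_scores`.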
    

    In your predict() function, if you are using a data generator, the only way I could find to keep the predictions in the same order when testing was to use a batch_size of 1:

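    # with tf.keras, import from tensorflow.keras.preprocessing.image instead
    from keras.preprocessing.image import ImageDataGenerator
    
    generator = ImageDataGenerator().flow_from_directory(
            '{path to your data directory}/test',
            # Keras expects target_size as (height, width)
            target_size = (img_height, img_width),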
            batch_size = 1,
            color_mode = 'rgb',
            # categorical for a multiclass problem
            class_mode = 'categorical',
            # this will also ensure the same order
            shuffle = False)
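
The main loop calls a `predict()` function that is not shown. The following is only a sketch of what it might look like, not the answerer's actual code. It assumes a trained Keras model whose `predict` method accepts a generator (older Keras versions use `predict_generator` instead) and a directory iterator created with `shuffle = False`, whose `classes` attribute holds the integer labels in file order. Unlike the call in the loop, this hypothetical version takes `model` and `generator` as arguments.

```python
import numpy as np


def predict(model, generator):
    """Return (actual, predicted) class indices in the same order.

    Assumes `generator` was created with shuffle=False, so that
    generator.classes lines up with the order of the predictions.
    """
    # start from the first file in case the generator was used before
    generator.reset()

    # one pass over the data: len(generator) is the number of batches
    probabilities = model.predict(generator, steps=len(generator))

    # pick the most likely class for each image
    predicted = np.argmax(probabilities, axis=1).tolist()
    actual = np.asarray(generator.classes).tolist()
    return actual, predicted
```

Both return values are plain lists of integer class indices, so they can be accumulated across folds and passed to `accuracy_score`, `classification_report`, and `confusion_matrix`.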
    

    With this code, I was able to do 10-fold cross-validation using data generators (so I did not have to keep all files in memory). This can be a lot of work if you have millions of images, and batch_size = 1 could become a bottleneck if your test set is large, but for my project it worked well.
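
If batch_size = 1 turns out to be too slow, one possible workaround is to pick a batch size that divides the number of test images exactly, so that a whole number of batches covers every image once. This is a sketch under that assumption. `largest_dividing_batch_size` and its `max_batch_size` parameter are hypothetical names, and whether a larger batch preserves the order in your setup is worth verifying against `generator.filenames`.

```python
def largest_dividing_batch_size(n_samples, max_batch_size=64):
    """Return the largest batch size <= max_batch_size that divides n_samples."""
    if n_samples < 1 or max_batch_size < 1:
        raise ValueError("n_samples and max_batch_size must be positive")
    # search downwards; 1 always divides, so the loop always returns
    for candidate in range(min(n_samples, max_batch_size), 0, -1):
        if n_samples % candidate == 0:
            return candidate


# illustrative test-set size, not taken from the answer
batch_size = largest_dividing_batch_size(1000, max_batch_size=64)
steps = 1000 // batch_size
print(batch_size, steps)
```

When the number of test images is prime, the function falls back to 1, which is the setting used in the answer.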
