How to put my dataset in a .pkl file in the exact format and data structure used in “mnist.pkl.gz”?

泄露秘密, submitted on 2019-12-02 21:07:28

A .pkl file is not necessary to adapt code from the Theano tutorial to your own data. You only need to mimic their data structure.

Quick fix

Look for the following lines; in DBN.py they are at line 303.

datasets = load_data(dataset)
train_set_x, train_set_y = datasets[0]

Replace with your own train_set_x and train_set_y.

import numpy
import theano

my_x = []
my_y = []
with open('path_to_file', 'r') as f:
    for line in f:
        my_list = line.split(' ') # replace ' ' with your own separator
        my_x.append(my_list[1:-1]) # omitting identifier in [0] and target in [-1]
        my_y.append(my_list[-1])
train_set_x = theano.shared(numpy.array(my_x, dtype='float64'))
train_set_y = theano.shared(numpy.array(my_y, dtype='float64'))
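For a quick sanity check, the parsing loop above can be run on an in-memory sample before touching Theano. The column layout here (identifier first, label last) is an assumption matching the comments in the snippet, and the sample lines are synthetic:

```python
import numpy as np

# Self-contained sketch of the parsing pattern, with synthetic lines
# standing in for the real file. Assumed layout: id, features..., label.
sample_lines = ["id1 0.5 0.3 0.9 1\n", "id2 0.1 0.7 0.2 0\n"]
my_x, my_y = [], []
for line in sample_lines:
    my_list = line.split()           # split() also strips the trailing newline
    my_x.append(my_list[1:-1])       # features only
    my_y.append(my_list[-1])         # target label
x = np.array(my_x, dtype='float64')  # numpy converts the strings to floats
y = np.array(my_y, dtype='float64')
```

Once `x` and `y` look right, wrapping them with theano.shared() as shown above is the only remaining step.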

Adapt this to your input data and the code you're using.

The same thing works for cA.py, dA.py and SdA.py, but those only use train_set_x.

Look for places such as n_ins=28 * 28 where the MNIST image size is hardcoded. Replace 28 * 28 with your own number of columns.

Explanation

This is where you put your data in a format that Theano can work with.

train_set_x = theano.shared(numpy.array(my_x, dtype='float64'))
train_set_y = theano.shared(numpy.array(my_y, dtype='float64'))

shared() wraps a numpy array in a Theano shared variable, a format designed for efficiency on GPUs.

dtype='float64' matches Theano's default floatX; note that running on a GPU requires float32 instead.

More details on basic tensor functionality.

.pkl file

The .pkl file is a way to save your data structure.

You can create your own.

import cPickle
with open('my_data.pkl', 'wb') as f:
    cPickle.dump((train_set_x, train_set_y), f, protocol=cPickle.HIGHEST_PROTOCOL)
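Reading the file back is the mirror image. A hedged round-trip sketch, in Python 3 spelling (where cPickle became pickle), with tiny stand-in arrays and a temporary file:

```python
import os
import pickle
import tempfile

import numpy as np

# Tiny stand-in arrays, just so the round trip can run anywhere.
train_set_x = np.arange(6, dtype='float64').reshape(2, 3)
train_set_y = np.array([0.0, 1.0])

path = os.path.join(tempfile.mkdtemp(), 'my_data.pkl')
with open(path, 'wb') as f:
    pickle.dump((train_set_x, train_set_y), f, protocol=pickle.HIGHEST_PROTOCOL)
with open(path, 'rb') as f:
    loaded_x, loaded_y = pickle.load(f)   # same tuple back out
```

Whatever structure you dump is exactly what pickle.load() returns, which is why mimicking the mnist.pkl.gz structure is enough.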

More details on loading and saving.

The pickled file represents a tuple of three lists: the training set, the validation set and the test set (train, val, test).

  • Each of the three lists is a pair formed from a list of images and a list of class labels for each of the images.
  • An image is represented as a one-dimensional numpy array of 784 (28 × 28) float values between 0 and 1 (0 stands for black, 1 for white).
  • The labels are numbers between 0 and 9 indicating which digit the image represents.
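If your images come in as 8-bit grayscale (values 0 to 255), they need rescaling into that 0-to-1 float range. A minimal sketch, using a synthetic array in place of real pixel data:

```python
import numpy as np

# Synthetic 8-bit pixel values standing in for real image data.
raw_pixels = np.array([[0, 128, 255],
                       [64, 32, 16]], dtype=np.uint8)

# Rescale into [0, 1] floats, matching the mnist.pkl.gz convention.
normalized = raw_pixels.astype('float64') / 255.0
```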

This can help:

from PIL import Image
import gzip, cPickle
from glob import glob
import numpy as np
import pandas as pd

# dir_to_dataset is defined further below
Data, y = dir_to_dataset("trainMNISTForm\\*.BMP", "trainLabels.csv")
# Data and labels are read

train_set_x = Data[:2093]
val_set_x = Data[2093:4187]
test_set_x = Data[4187:6281]
train_set_y = y[:2093]
val_set_y = y[2093:4187]
test_set_y = y[4187:6281]
# Divided dataset into 3 contiguous parts. I had 6281 images.

train_set = train_set_x, train_set_y
val_set = val_set_x, val_set_y
test_set = test_set_x, test_set_y

dataset = [train_set, val_set, test_set]

f = gzip.open('file.pkl.gz','wb')
cPickle.dump(dataset, f, protocol=2)
f.close()
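To confirm the resulting file unpacks the way the tutorial's load_data expects, here is a round-trip sketch with tiny placeholder arrays (Python 3's pickle standing in for cPickle, and a temporary file instead of file.pkl.gz):

```python
import gzip
import os
import pickle
import tempfile

import numpy as np

# Tiny stand-in (images, labels) pairs for the three splits.
train_set = (np.zeros((4, 9)), np.array([0, 1, 2, 3]))
val_set   = (np.zeros((2, 9)), np.array([4, 5]))
test_set  = (np.zeros((2, 9)), np.array([6, 7]))
dataset = [train_set, val_set, test_set]

path = os.path.join(tempfile.mkdtemp(), 'file.pkl.gz')
with gzip.open(path, 'wb') as f:
    pickle.dump(dataset, f, protocol=2)

# Reading it back mirrors what load_data does with mnist.pkl.gz.
with gzip.open(path, 'rb') as f:
    train, val, test = pickle.load(f)
```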

This is the function I used. It may need changes depending on your file details.

def dir_to_dataset(glob_files, loc_train_labels=""):
    print("Gonna process:\n\t %s" % glob_files)
    dataset = []
    for file_count, file_name in enumerate(sorted(glob(glob_files), key=len)):
        img = Image.open(file_name).convert('LA')  # convert to grayscale (luminance + alpha)
        pixels = [f[0] for f in list(img.getdata())]  # keep only the luminance channel
        dataset.append(pixels)
        if file_count % 1000 == 0:
            print("\t %s files processed" % file_count)
    if len(loc_train_labels) > 0:
        df = pd.read_csv(loc_train_labels)
        return np.array(dataset), np.array(df["Class"])
    else:
        return np.array(dataset)