I'm currently working on a case study that uses the MNIST database.
The files in this site are said to be in IDX file format. I tried to take a look at th
I tried the above, using:
data <- readBin(to.read, integer(), size = 1, n = 784, endian="big")
but ended up with both positive and negative integers in the image. Consequently, when plotted, using:
plot(as.cimg(data))
I get a grey background with the character in pixels that are darker or lighter than the background.
I then used the following (see https://tensorflow.rstudio.com/tfestimators/articles/examples/mnist.html):
data <- readBin(to.read, what = "raw", n = 784, endian="big")
conv <- as.integer(data)
mm <- matrix(conv, 28, 28)
Now I have only non-negative values (0 to 255), and the plot gives a proper white character on a black background, which is what I wanted.
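The negative values appear because readBin treats each byte as a signed integer by default when size = 1, so bytes above 127 wrap around. A minimal sketch of the difference, using a hand-made raw vector as a stand-in for pixel data (the four byte values are made up for illustration):

```r
# Four example bytes standing in for pixel data (illustrative values only)
bytes <- as.raw(c(0, 127, 128, 255))

# Default behaviour: one-byte integers are read as signed
signed_vals <- readBin(bytes, integer(), size = 1, n = 4)

# signed = FALSE reads the same bytes as 0..255
unsigned_vals <- readBin(bytes, integer(), size = 1, n = 4, signed = FALSE)

# Reading as raw and converting gives the same unsigned result
raw_vals <- as.integer(bytes)

print(signed_vals)    # 0 127 -128 -1
print(unsigned_vals)  # 0 127 128 255
print(raw_vals)       # 0 127 128 255
```

So passing signed = FALSE to the original readBin call should be an alternative to reading raw bytes and converting them.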
Here's how you can do it using the Darch package:
Run
readMNIST('C:/Users/pj_/Dir/')
which will store test.RData and train.RData in the directory you passed in.
When you load these two files into your workspace, you will see 'testData', 'testLabels', 'trainData' and 'trainLabels' in your Global Environment.
The MNIST dataset is also available in the keras package.
library(keras)
mnist <- dataset_mnist()
x_train <- mnist$train$x
y_train <- mnist$train$y
x_test <- mnist$test$x
y_test <- mnist$test$y
Following up on the darch (not Darch) package mentioned above:
The package is called darch. It has been moved to MRAN (Microsoft R Application Network) but is available on CRAN as well.
It provides two functions for the MNIST data:
readMNIST, which reads the ubyte files stored on your hard drive and saves them as test.Rdata and train.Rdata archives.
provideMNIST, which downloads the files and calls readMNIST on them.
When calling these functions you need to give the directory path with single forward slashes, e.g. readMNIST("../MNIST/") (the trailing slash is required).
If you download the files yourself you will need to change the file names: the gz archives contain files with an extension, like t10k-labels.idx1-ubyte, but readMNIST looks for files without an extension, like t10k-labels-idx1-ubyte, so you have to change the dot to a dash (as of darch version 0.12.0; maybe they'll fix this).
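The dot-to-dash rename can be scripted rather than done by hand. A rough sketch, assuming the extracted files sit in one directory; here an empty placeholder file is created in a temporary folder so the snippet runs on its own (the folder name mnist_demo is made up):

```r
# Demo directory with an empty placeholder file named like the extracted archive
demo_dir <- file.path(tempdir(), "mnist_demo")
dir.create(demo_dir, showWarnings = FALSE)
file.create(file.path(demo_dir, "t10k-labels.idx1-ubyte"))

# Find files of the form *.idxN-ubyte
old_files <- list.files(demo_dir, pattern = "\\.idx[0-9]-ubyte$", full.names = TRUE)

# Replace the literal ".idx" with "-idx" in the file name only
new_files <- file.path(dirname(old_files),
                       sub(".idx", "-idx", basename(old_files), fixed = TRUE))

# file.rename returns TRUE for each file renamed successfully
renamed <- file.rename(old_files, new_files)
print(basename(new_files))
```

To use it on real data, point demo_dir at the folder holding the extracted files and drop the file.create line.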
To load the files in R you need to use the load function (e.g. load("..\\MNIST\\test.Rdata")). Loading both files will create the matrices trainData and testData in the environment.
For some reason I did not get any dimnames for the matrices.
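As a small illustration of how load behaves, here is a sketch that saves stand-in objects to a temporary file and loads them back into a separate environment. The object names follow the ones mentioned in this thread; the file name demo_test.RData and the matrix shapes are made up, and no real MNIST data is involved:

```r
# Stand-in objects (shapes and values are placeholders, not real MNIST data)
testData <- matrix(0L, nrow = 2, ncol = 784)
testLabels <- matrix(0L, nrow = 2, ncol = 10)

demo_file <- file.path(tempdir(), "demo_test.RData")
save(testData, testLabels, file = demo_file)

# Load into a dedicated environment to avoid cluttering the workspace;
# load() returns the names of the objects it restored
mnist_env <- new.env()
loaded_names <- load(demo_file, envir = mnist_env)
print(loaded_names)

# Inspect what came back
print(dim(mnist_env$testData))
print(is.null(dimnames(mnist_env$testData)))
```

Checking the return value of load is a quick way to see which objects a given .RData file actually contains.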
Use endian="big", not "high":
> to.read = file("~/Downloads/t10k-images-idx3-ubyte", "rb")
magic number:
> readBin(to.read, integer(), n=1, endian="big")
[1] 2051
number of images:
> readBin(to.read, integer(), n=1, endian="big")
[1] 10000
number of rows:
> readBin(to.read, integer(), n=1, endian="big")
[1] 28
number of columns:
> readBin(to.read, integer(), n=1, endian="big")
[1] 28
here comes the data:
> readBin(to.read, integer(), n=1, endian="big")
[1] 0
> readBin(to.read, integer(), n=1, endian="big")
[1] 0
as per the training set image data description on the web site.
Now you just need to loop and read 28*28 byte chunks into matrices.
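One way to wrap the header read and the pixel read into a single function is sketched below. read_idx_images is a hypothetical helper written for this example, not part of any package. It reads the pixels as raw bytes so the values stay in 0 to 255, and it is exercised on a tiny synthetic file (2 images of 2 rows by 3 columns) instead of the real download:

```r
# Hypothetical helper: read an IDX image file into a rows x cols x images array
read_idx_images <- function(path) {
  con <- file(path, "rb")
  on.exit(close(con))
  # Header: magic number, image count, rows, columns (4-byte big-endian integers)
  header <- readBin(con, integer(), n = 4, size = 4, endian = "big")
  stopifnot(header[1] == 2051)
  n_images <- header[2]
  n_rows <- header[3]
  n_cols <- header[4]
  # Pixels are unsigned bytes, so read them as raw and convert
  pixels <- as.integer(readBin(con, raw(), n = n_images * n_rows * n_cols))
  # The file stores each image row by row, while R fills arrays column by column,
  # so fill as cols x rows x images and then swap the first two dimensions
  arr <- array(pixels, dim = c(n_cols, n_rows, n_images))
  aperm(arr, c(2, 1, 3))
}

# Build a tiny synthetic IDX file: 2 images, 2 rows, 3 columns
tmp <- tempfile()
out <- file(tmp, "wb")
writeBin(c(2051L, 2L, 2L, 3L), out, size = 4, endian = "big")
writeBin(as.raw(c(0, 10, 20, 30, 40, 255, 1, 2, 3, 4, 5, 6)), out)
close(out)

imgs <- read_idx_images(tmp)
print(dim(imgs))    # 2 3 2
print(imgs[, , 1])  # first image: rows (0 10 20) and (30 40 255)
```

For the real file, the same function should return a 28 x 28 x 10000 array, assuming the file has been decompressed first.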
Start again:
> to.read = file("~/Downloads/t10k-images-idx3-ubyte", "rb")
skip header:
> readBin(to.read, integer(), n=4, endian="big")
[1] 2051 10000 28 28
should really get the 28,28 from the header read but hard-coded here:
> m = matrix(readBin(to.read,integer(), size=1, n=28*28, endian="big"),28,28)
> image(m)
Might need to transpose or flip the matrix; I think it's an upside-down "7".
par(mfrow=c(5,5))
par(mar=c(0,0,0,0))
for (i in 1:25) {
  m = matrix(readBin(to.read, integer(), size=1, n=28*28, endian="big"), 28, 28)
  image(m[, 28:1])
}
gets you a 5-by-5 grid showing the next 25 digits.
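The flip is needed because image() draws the first matrix index along the x axis and the second along the y axis, with y increasing upward. For a matrix stored the usual way (rows top to bottom, columns left to right), a small helper can do the reorientation. This is a sketch, with orient_for_image a made-up name, checked on a 2 x 3 toy matrix instead of a digit:

```r
# Hypothetical helper: reorient a rows x cols picture so image() shows it upright
orient_for_image <- function(img) {
  # Transpose so columns run along x, then reverse so the top row ends up at the top
  t(img)[, nrow(img):1, drop = FALSE]
}

# Toy picture: top row 1 2 3, bottom row 4 5 6
img <- matrix(1:6, nrow = 2, ncol = 3, byrow = TRUE)
z <- orient_for_image(img)
print(z)
# z[1, 2] is the top-left pixel (value 1): x = 1, y = 2 is the upper-left cell
```

A call such as image(orient_for_image(img), col = gray.colors(256)) would then draw it the right way up. In the loop above, m is filled column by column from row-ordered data, so it is already transposed and only the column reversal m[, 28:1] is needed.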
Oh, and google leads me to: http://www.inside-r.org/packages/cran/darch/docs/readMNIST which might be useful.