Splitting an XDF file / dataset for training and testing

Posted by 风格不统一 on 2019-12-11 15:22:09

Question


Is it possible to split an .xdf file (in the Microsoft RevoScaleR context) into, say, a 75% training set and a 25% test set? I know there is a function called rxSplit(), but the documentation doesn't seem to cover this case. Most of the examples online add a column of random numbers to the dataset and split on that column.

Thanks. Thomas


Answer 1:


You can certainly use rxSplit for this. Create a variable that defines your training and test samples, and then split on it.

For example, using the mtcars toy dataset:

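# Write the in-memory data frame out as an xdf file
xdf <- rxDataStep(mtcars, "mtcars.xdf")
# Flag each row as test with probability 0.25, then split on that factor
xdfList <- rxSplit(xdf, splitByFactor="test",
    transforms=list(test=factor(runif(.rxNumRows) < 0.25, levels=c("FALSE", "TRUE"))))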

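xdfList is now a list containing two xdf data sources: one with approximately 75% of the data and the other with approximately 25%. The proportions are approximate because each row is assigned at random, independently of the others.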
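The same mechanism can be tried without RevoScaleR on an ordinary data frame. The following is a minimal base-R sketch, not the answerer's code: the names is_test, parts, train and test are illustrative, and split() stands in for rxSplit only to show how a two-level factor drives the partition.

```r
# Base-R illustration of the random-flag split on an in-memory data frame.
# No RevoScaleR is needed; all variable names here are illustrative.
set.seed(42)  # fix the seed so the split is reproducible

# Flag each row as test with probability 0.25, independently of the others.
# Declaring both levels guarantees two groups even if one ends up empty.
is_test <- factor(runif(nrow(mtcars)) < 0.25, levels = c("FALSE", "TRUE"))

# split() returns a named list with one data frame per factor level
parts <- split(mtcars, is_test)
train <- parts[["FALSE"]]
test  <- parts[["TRUE"]]

# Row counts of the two pieces; only approximately 75/25
print(c(train = nrow(train), test = nrow(test)))
```

If an exact 75/25 split matters, one option is to draw the test row indices with sample() instead of flagging rows independently.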




Answer 2:


You can use rxDataStep with rowSelection to create the training and testing data sets from the original xdf. Check out this example, which splits by year rather than at random: https://docs.microsoft.com/en-us/r-server/r/how-to-revoscaler-linear-model

# Location of the source xdf and names of the two output files
bigDataDir <- "C:/MRS/Data"
sampleAirData <- file.path(bigDataDir, "AirOnTime7Pct.xdf")
trainingDataFile <- "AirlineData06to07.xdf"
targetInfile <- "AirlineData08.xdf"

# Training set: rows from 1999 through 2007
rxDataStep(sampleAirData, trainingDataFile, rowSelection = Year == 1999 |
    Year == 2000 | Year == 2001 | Year == 2002 | Year == 2003 |
    Year == 2004 | Year == 2005 | Year == 2006 | Year == 2007)
# Test set: rows from 2008 only
rxDataStep(sampleAirData, targetInfile, rowSelection = Year == 2008)
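To see what a year-based holdout like the one above does, here is a small base-R sketch on made-up data. The flights data frame and its values are hypothetical stand-ins for the airline file, and subset() plays the role that rowSelection plays in rxDataStep.

```r
# Hypothetical toy data standing in for the airline file (not real data)
flights <- data.frame(
  Year     = rep(2005:2008, each = 3),
  ArrDelay = c(5, -2, 10, 0, 7, 3, 12, -4, 1, 8, 6, -1)
)

# Train on the earlier years, hold out the final year for testing
train <- subset(flights, Year %in% 2005:2007)
test  <- subset(flights, Year == 2008)

# Every row lands in exactly one of the two sets
stopifnot(nrow(train) + nrow(test) == nrow(flights))
```

A split like this tests the model on a period it has never seen, which suits time-ordered data, but the train/test proportions then depend on how many rows each year contains rather than on a chosen percentage.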


Source: https://stackoverflow.com/questions/44751473/splitting-a-xdf-file-dataset-for-training-and-testing
