Question
I have a dataset with 9558 rows from three different projects. I want to randomly split this dataset into three equal groups and assign a unique ID to each group, so that Project1_Project2_Project3 becomes Project1, Project2 and Project3.
I have tried many things and googled code from people with similar problems. I have used sample_n() and sample_frac(), but unfortunately I can't solve this issue myself :/
I have made an example of my dataset that looks like this:
ProjectName <- c("Project1_Project2_Project3")
data <- data.frame(replicate(10,sample(0:1,9558,rep=TRUE)))
data <- data.frame(ProjectName, data)
The output should be randomly split into three equal groups of nrow = 3186 each, with each group assigned one of these values:
ProjectName    Count of rows
Project1       3186
Project2       3186
Project3       3186
Answer 1:
IMO it should be sufficient to assign just random project names.
dat$ProjectName <- sample(factor(rep(1:3, length.out = nrow(dat)),
                                 labels = paste0("Project", 1:3)))
Result
head(dat)
#   X1 X2 X3 X4 X5 X6 X7 X8 X9 X10 ProjectName
# 1  1  1  0  1  1  1  1  0  1   0    Project1
# 2  1  1  1  1  1  1  0  0  1   0    Project1
# 3  0  0  1  1  0  0  0  1  1   1    Project1
# 4  1  1  1  0  1  0  1  1  0   1    Project3
# 5  1  0  0  1  1  1  1  0  0   1    Project1
# 6  1  0  0  0  0  1  0  1  1   1    Project3
table(dat$ProjectName)
# Project1 Project2 Project3
#     3186     3186     3186
Data
set.seed(42)
dat <- data.frame(replicate(10, sample(0:1, 9558, rep=TRUE)))
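A small aside on row counts that are not a multiple of three: the sketch below assumes a made-up count of 10 rows (not from the question) to show that rep(..., length.out = n) still keeps the groups within one row of each other, because the labels are simply recycled until n is reached and then shuffled.

```r
set.seed(42)   # arbitrary seed, only so the sketch is reproducible
n <- 10        # hypothetical row count that is not divisible by 3

# Recycle the labels 1, 2, 3 up to length n, then shuffle their order
labels <- sample(factor(rep(1:3, length.out = n),
                        labels = paste0("Project", 1:3)))

# Shuffling does not change the counts: group sizes differ by at most one
counts <- table(labels)
counts
```

With 9558 rows, as in the question, the same call yields exactly 3186 rows per project.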
Answer 2:
Add an id column to data:
data$id <- 1:nrow(data)
Take the first sample:
project1 <- dplyr::sample_frac(data, 0.33333)
Remove the used rows from data and save the remainder into project2:
project2 <- data[!(data$id %in% project1$id), ]
Sample half of the remainder:
project3 <- dplyr::sample_frac(project2, 0.5)
Finally, remove the rows in the project3 sample from project2:
project2 <- project2[!(project2$id %in% project3$id), ]
Check that no id appears in more than one group:
# should all be FALSE
any(project1$id %in% project2$id)
any(project1$id %in% project3$id)
any(project2$id %in% project3$id)
And double-check the data frames have the right number of cases:
nrow(project1)
nrow(project2)
nrow(project3)
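The same sequential idea can also be sketched in base R on row indices, which avoids the id column and the dplyr dependency. This is only one possible variant, assuming the example data from the question; the names idx1, idx2, idx3, rest and size are illustrative.

```r
set.seed(42)  # arbitrary seed for reproducibility
data <- data.frame(replicate(10, sample(0:1, 9558, replace = TRUE)))

n    <- nrow(data)
size <- n %/% 3                      # 3186 rows per group for n = 9558

idx1 <- sample(n, size)              # first group: random row numbers
rest <- setdiff(seq_len(n), idx1)    # rows not yet used
idx2 <- sample(rest, size)           # second group, drawn from the remainder
idx3 <- setdiff(rest, idx2)          # third group: whatever is left

project1 <- data[idx1, ]
project2 <- data[idx2, ]
project3 <- data[idx3, ]
```

Because each group is drawn only from rows that are still unused, the three groups cannot overlap.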
Answer 3:
I had this same problem once, and this is how I solved it. If you just use sample() to draw a group label for each row, the groups come out uneven; sampling without replacement from a vector in which the groups are already balanced worked for me.
sampleframe <- rep(1:3, ceiling( nrow( data)/3 ) )
data$grp <- 0
data[ , "grp" ] <- sample( sampleframe , size=nrow( data) , replace=FALSE )
project1 <- data[data$grp %in% 1 ,]
project2 <- data[data$grp %in% 2 ,]
project3 <- data[data$grp %in% 3 ,]
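To illustrate the difference between the two approaches, here is a minimal sketch that compares independent label draws with a shuffled balanced vector. The variable names independent and balanced are illustrative, and the seed is arbitrary.

```r
set.seed(1)   # arbitrary seed, only so the sketch is reproducible
n <- 9558

# Independent draws: every row gets its label on its own,
# so the group sizes are only equal on average
independent <- sample(1:3, n, replace = TRUE)

# Balanced draws: shuffle a vector that already holds each label n/3 times
balanced <- sample(rep(1:3, length.out = n))

table(independent)   # sizes generally differ from one another
table(balanced)      # exactly 3186 per label, since 9558 is divisible by 3
```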
Answer 4:
I like the solution in this comment on a GitHub gist.
You could generate the indices as suggested:
folds <- split(sample(nrow(data), nrow(data), replace = FALSE), as.factor(1:3))
Then get a list of 3 equal size data frames using:
datalist <- lapply(folds, function(x) data[x, ])
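If a single data frame with a ProjectName column is wanted instead of a list, the folds can be written back onto the original rows. This is a sketch under the assumption that the labels Project1 to Project3 from the question are the desired names; it uses only base R.

```r
set.seed(42)  # arbitrary seed for reproducibility
data <- data.frame(replicate(10, sample(0:1, 9558, replace = TRUE)))

# Shuffle the row indices and deal them into three folds, as above
folds <- split(sample(nrow(data)), as.factor(1:3))
names(folds) <- paste0("Project", 1:3)   # assumed labels, taken from the question

# Write each fold's name onto the rows it contains
data$ProjectName <- NA_character_
for (nm in names(folds)) {
  data$ProjectName[folds[[nm]]] <- nm
}

table(data$ProjectName)
```

Note that split() recycles the three-level factor across the shuffled indices, so the folds are equal in size only when the row count is a multiple of three; otherwise R issues a warning and the sizes differ by one.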
Source: https://stackoverflow.com/questions/55375807/how-to-randomly-split-data-into-three-equal-sizes