I have a large data set and would like to fit a separate logistic regression for each City, which is one of the columns in my data. The following 70/30 split works, but it does not take the City group into account.
indexes <- sample(1:nrow(data), size = 0.7*nrow(data))
train <- data[indexes,]
test <- data[-indexes,]
But this does not guarantee the 70/30 split for each city.
Let's say that I have City A and City B, where City A has 100 rows and City B has 900 rows, for a total of 1000 rows. Splitting the data with the code above gives me 700 rows for train and 300 for test, but it does not guarantee that the train data will contain 70 rows for City A and 630 rows for City B. How do I do that?
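To make the problem concrete, here is a minimal sketch on made-up data. The column x, the seed, and the toy data frame are assumptions for illustration only, not my real data.

```r
# Toy data matching the example: 100 rows for City A, 900 for City B.
set.seed(1)
data <- data.frame(City = rep(c("A", "B"), times = c(100, 900)),
                   x = rnorm(1000))

# The simple split from above, which ignores City.
indexes <- sample(1:nrow(data), size = 0.7 * nrow(data))
train <- data[indexes, ]
test <- data[-indexes, ]

nrow(train)        # total size of the training set
table(train$City)  # per-city counts depend on the random draw
```

The total is controlled by the size argument, but nothing in this call constrains how the rows are divided between the two cities.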
Once I have the data split 70/30 within each city, I will run a logistic regression for each city (I know how to do this once I have the train data).
Try createDataPartition from the caret package. Its documentation states: "By default, createDataPartition does a stratified random split of the data."
library(caret)
train.index <- createDataPartition(Data$Class, p = .7, list = FALSE)
train <- Data[ train.index,]
test <- Data[-train.index,]
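Applied to the question, the stratifying variable would be the City column rather than Class. A minimal sketch, assuming caret is installed and using a hypothetical toy data frame with a made-up predictor x:

```r
library(caret)

set.seed(42)
# Hypothetical toy data: 100 rows for City A, 900 for City B.
Data <- data.frame(City = factor(rep(c("A", "B"), times = c(100, 900))),
                   x = rnorm(1000))

# Stratify on City so each city contributes about 70% of its rows.
train.index <- createDataPartition(Data$City, p = 0.7, list = FALSE)
train <- Data[ train.index, ]
test  <- Data[-train.index, ]

# Share of each city's rows that landed in the training set.
table(train$City) / table(Data$City)
```

The last line is only a sanity check: both shares should be close to 0.7.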
caret can also be used for stratified k-fold cross-validation, like:
ctrl <- trainControl(method = "repeatedcv",
                     repeats = 3)
# when calling train, pass this train control:
# train(..., trControl = ctrl, ...)
Check the caret documentation for more details.
The package splitstackshape has a nice function, stratified, which can do this as well. It is a bit better than createDataPartition because it can stratify on multiple columns at once. It can be used with one column like:
library(splitstackshape)
set.seed(42) # good idea to set the random seed for reproducibility
stratified(data, c('City'), 0.7)
Or with multiple columns:
stratified(data, c('City', 'column2'), 0.7)
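The calls above return only the sampled 70%. To get the remaining 30% as a test set too, stratified has a bothSets argument. A minimal sketch, assuming splitstackshape is installed and using a hypothetical toy data frame:

```r
library(splitstackshape)

set.seed(42)
# Hypothetical toy data: 100 rows for City A, 900 for City B.
data <- data.frame(City = rep(c("A", "B"), times = c(100, 900)),
                   x = rnorm(1000))

# bothSets = TRUE returns a list: sampled rows first, remaining rows second.
sets  <- stratified(data, "City", 0.7, bothSets = TRUE)
train <- sets[[1]]
test  <- sets[[2]]

table(train$City)
table(test$City)
```

The list elements are accessed by position here so the sketch does not depend on their names.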
The typical way is with split:
lapply(split(dfrm, dfrm$City), function(dd) {
  indexes <- sample(1:nrow(dd), size = 0.7 * nrow(dd))
  train <- dd[indexes, ]  # Notice that you may want all columns
  test <- dd[-indexes, ]
  # analysis goes here
})
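If you would rather end up with a single train and a single test data frame, the per-city pieces can be collected and bound back together. A minimal sketch in base R, using hypothetical toy data and assuming every city has at least two rows (with a single row, the negative index would be empty and the row would be lost from both sets):

```r
set.seed(42)
# Hypothetical toy data: 100 rows for City A, 900 for City B.
dfrm <- data.frame(City = rep(c("A", "B"), times = c(100, 900)),
                   x = rnorm(1000))

# Split by city, sample 70% of each city's rows, and keep both parts.
parts <- lapply(split(dfrm, dfrm$City), function(dd) {
  indexes <- sample(seq_len(nrow(dd)), size = floor(0.7 * nrow(dd)))
  list(train = dd[indexes, ], test = dd[-indexes, ])
})

# Recombine the per-city pieces into one train and one test data frame.
train <- do.call(rbind, lapply(parts, `[[`, "train"))
test  <- do.call(rbind, lapply(parts, `[[`, "test"))

table(train$City)
table(test$City)
```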
If you were to do it in steps as you attempted above it would be like this:
cities <- split(data, data$City)
idxs <- lapply(cities, function(d) {
  sample(1:nrow(d), size = 0.7 * nrow(d))
})
# idxs holds row positions within each city, so index into that city's data
train <- cities[[1]][ idxs[[1]], ]  # for the first city
test <- cities[[1]][-idxs[[1]], ]
I happen to think this is the clumsy way to do it, but perhaps breaking it down into small steps will let you examine the intermediate values.
Your code works just fine as is. If City is a column, simply refer to it in the training data by position, e.g. train[,2]. You can then handle each city easily with a small function:
logReg <- function(ind) {
  return(val)
}
Then run sapply over the vector of city indexes.
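The body of logReg is not spelled out above, so here is one possible way to fill it in. This is a sketch under assumptions: the toy data, the outcome y, the predictor x, the 0.5 cutoff, and the choice of test accuracy as the returned value are all hypothetical.

```r
set.seed(42)
# Hypothetical toy data: binary outcome y, one predictor x, two cities.
data <- data.frame(City = rep(c("A", "B"), times = c(100, 900)),
                   x = rnorm(1000))
data$y <- rbinom(1000, size = 1, prob = plogis(0.5 + data$x))

# Fit one logistic regression for a city on 70% of its rows and
# return the accuracy on the remaining 30%.
logReg <- function(city) {
  d <- data[data$City == city, ]
  idx <- sample(seq_len(nrow(d)), size = floor(0.7 * nrow(d)))
  fit <- glm(y ~ x, data = d[idx, ], family = binomial)
  pred <- predict(fit, newdata = d[-idx, ], type = "response") > 0.5
  val <- mean(pred == (d$y[-idx] == 1))
  return(val)
}

# One value per city, named by city.
acc <- sapply(unique(data$City), logReg)
acc
```

Because the sampling happens inside the function, each city gets its own 70/30 split regardless of how many rows it has.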