Factoring for linear models - Create lm with one factor

Submitted by 别来无恙 on 2019-12-11 22:56:41

Question


This question is a more specific and simplified version of this one.

The dataset I'm using is too large for a single lm or speedlm calculation.
I want to split my data set into smaller pieces, but when I do, one (or more) of the factor columns ends up with only one level in a piece.
The code below is the minimum needed to reproduce the problem. At the bottom of the question I have put my test script for those interested.

library(speedglm)

iris$Species <- factor(iris$Species)
i <- iris[1:20,]
summary(i)
speedlm(Sepal.Length ~ Sepal.Width + Species , i)

This gets me the following error:

Error in `contrasts<-`(`*tmp*`, value = contr.funs[1 + isOF[nn]]) : 
  contrasts can be applied only to factors with 2 or more levels

I have tried converting iris$Species to a factor again, but without success, and I don't know how to fix this.
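The cause can be seen without speedglm. The following is a minimal sketch in base R, assuming only the built-in iris data: the subset still declares three levels, but only one of them occurs in the rows, and base lm() fails the same way once unused levels are dropped while building the model frame. The names i and res are just illustrative.

```r
# First 20 rows of iris: every row is setosa
i <- iris[1:20, ]

# Subsetting keeps the unused levels on the factor itself ...
nlevels(i$Species)               # 3 declared levels
# ... but only one of them actually occurs in these rows
nlevels(droplevels(i$Species))   # 1 observed level

# Base lm() drops unused levels when it builds the model frame,
# so it stops with the same contrasts error; capture it instead of failing
res <- tryCatch(
  lm(Sepal.Length ~ Sepal.Width + Species, data = i),
  error = function(e) e
)
inherits(res, "error")
conditionMessage(res)
```

This also suggests why calling factor() on the full iris$Species changes nothing: all three levels are already present there, and the problem only appears inside the single-species subset.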

How can I include Species into the model? (without increasing the sample size)

Edit:
I know this subset has only one level, "setosa", but I still need Species in the linear model because I will eventually update the model with data that contains more levels, as in the example script below.


For those interested, here is an example script of what I will use for my actual dataset:

library(speedglm)

testfunction <- function(start.i, end.i) {
  return(iris[start.i:end.i,])
}

lengthdata <- nrow(iris)
stepsize <- 20

## attempt to factor
iris$Species <- factor(iris$Species)

## Creates the iris dataset in split parts
start.i <- seq(0, lengthdata, stepsize)
end.i   <- pmin(start.i + stepsize, lengthdata)

dat <- Map(testfunction, start.i + 1, end.i)

## Loops through the split iris data
for (i in dat) {
  if (!exists("lmfit")) {
    lmfit  <- speedlm(Sepal.Length ~ Sepal.Width + Species , i)
  } else if (!exists("lmfit2")) {
    lmfit2 <- updateWithMoreData(lmfit, i)
  } else {
    lmfit2 <- updateWithMoreData(lmfit2, i)
  }
}
print(summary(lmfit2))

Answer 1:


There might be a better way, but if you reorder your rows, each split will contain more than one level and therefore will not cause the error. I used a random order here, but you may prefer a more systematic one.

library(speedglm)

testfunction <- function(start.i, end.i) {
    return(iris.r[start.i:end.i,])
}

lengthdata <- nrow(iris)
stepsize <- 20

## attempt to factor
iris$Species <- factor(iris$Species)

## Random order
set.seed(1)
iris.r <- iris[sample(nrow(iris)),]

## Creates the iris dataset in split parts
start.i <- seq(0, lengthdata, stepsize)
end.i   <- pmin(start.i + stepsize, lengthdata)

dat <- Map(testfunction, start.i + 1, end.i)

## Loops through the split iris data
for (i in dat) {
    if (!exists("lmfit")) {
        lmfit  <- speedlm(Sepal.Length ~ Sepal.Width + Species , i)
    } else if (!exists("lmfit2")) {
        lmfit2 <- updateWithMoreData(lmfit, i)
    } else {
        lmfit2 <- updateWithMoreData(lmfit2, i)
    }
}
print(summary(lmfit2))
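Before fitting, it may be worth checking how many Species levels each chunk actually contains. The sketch below uses base R only; levels_per_chunk is a hypothetical helper written for this illustration, not part of speedglm, and it assumes the same chunk size of 20 as the script above.

```r
# Hypothetical helper: number of observed Species levels in each chunk
levels_per_chunk <- function(data, stepsize = 20) {
  # Chunk 1 holds rows 1..stepsize, chunk 2 the next stepsize rows, and so on
  chunk.id <- ceiling(seq_len(nrow(data)) / stepsize)
  chunks <- split(data, chunk.id)
  vapply(chunks, function(d) nlevels(droplevels(d$Species)), integer(1))
}

# Original row order: most chunks hold a single species
# (rows 41-60 straddle setosa and versicolor, so chunk 3 has two)
levels_per_chunk(iris)

# Shuffled row order, as in the script above
set.seed(1)
iris.r <- iris[sample(nrow(iris)), ]
levels_per_chunk(iris.r)
```

A chunk that reports 1 would trigger the contrasts error if it were used for the initial speedlm fit. Because the shuffle is random, the result for iris.r is worth inspecting rather than assuming.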

Edit: Instead of a random order, you can use modulo division to generate a spread-out index vector in a systematic way:

spred.i <- seq(1, by = 7, length.out = 150) %% 150 + 1
iris.r <- iris[spred.i,]
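As a quick sanity check on this index vector, the sketch below (base R only, reusing the names spred.i and iris.r from above and assuming chunks of 20 rows) confirms that every row is used exactly once and looks at which species land in the first and last chunks.

```r
spred.i <- seq(1, by = 7, length.out = 150) %% 150 + 1

# 7 and 150 share no common factor, so the stride visits every row exactly once
stopifnot(!anyDuplicated(spred.i), all(sort(spred.i) == 1:150))

iris.r <- iris[spred.i, ]

# Species present in the first chunk (rows 1-20 of iris.r)
table(droplevels(iris.r$Species[1:20]))

# Species present in the final, shorter chunk (rows 141-150 of iris.r)
table(droplevels(iris.r$Species[141:150]))
```

With this stride the first chunk contains all three species, but the final ten rows contain only versicolor and virginica. Whether updateWithMoreData copes with a chunk that lacks a level is not verified here, so it may be safer to pick a chunk size that divides the number of rows evenly.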


Source: https://stackoverflow.com/questions/33143257/factoring-for-linear-models-create-lm-with-one-factor
