Modeling a very big data set (1.8 Million rows x 270 Columns) in R

借酒劲吻你 2021-02-10 01:40

I am working on Windows 8 with 8 GB of RAM. I have a data.frame of 1.8 million rows x 270 columns on which I need to fit a glm. (l

1 answer
  • 2021-02-10 02:00

    It looks like you are not using ffbase::bigglm.ffdf, although that is what you want. The following call loads all of your data into RAM and dispatches to biglm::bigglm.function, which is not what you want:

    require(biglm)
    mymodel <- bigglm(OPEN_FLG ~ new_list_id+NUM_OF_ADULTS_IN_HHLD+OCCUP_MIX_PCT, data = x)
    

    You need ffbase::bigglm.ffdf, which works chunkwise on an ffdf. Load the ffbase package, which exports bigglm.ffdf, and then use the following:

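    require(ffbase)
    # Keep only the columns that appear in the model formula
    mymodeldataset <- xex[c("OPEN_FLG","new_list_id","NUM_OF_ADULTS_IN_HHLD","OCCUP_MIX_PCT")]
    # Recode the "Y"/other response to a logical for the binomial family
    mymodeldataset$OPEN_FLG <- with(mymodeldataset["OPEN_FLG"], ifelse(OPEN_FLG == "Y", TRUE, FALSE))
    # Logistic regression; with an ffdf as data this is fitted chunkwise
    mymodel <- bigglm(OPEN_FLG ~ new_list_id+NUM_OF_ADULTS_IN_HHLD+OCCUP_MIX_PCT, data = mymodeldataset, family=binomial())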
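
To make "chunkwise" more concrete, here is a minimal base R sketch of the general idea: each pass of iteratively reweighted least squares visits the data one chunk at a time and only accumulates small p x p summaries. This is an illustration under assumptions, not the code that biglm or ffbase actually run (their implementation differs in detail). The data are simulated, and `fit_chunked_logit` and `chunk_rows` are hypothetical names. In a real setting each chunk would be read from disk rather than subset from an in-memory data frame, and factor levels would have to be kept consistent across chunks.

```r
# Illustration only: chunkwise IRLS for a logistic regression in base R.
set.seed(1)
n <- 5000
dat <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
dat$y <- rbinom(n, 1, plogis(-0.5 + 1.0 * dat$x1 - 0.7 * dat$x2))

# Row indices split into chunks of 1000 rows (stand-in for chunks on disk)
chunk_rows <- split(seq_len(n), ceiling(seq_len(n) / 1000))

fit_chunked_logit <- function(formula, data, chunks, max_iter = 25, tol = 1e-10) {
  # Number of coefficients, taken from the design matrix of the first chunk
  p <- ncol(model.matrix(formula, data[chunks[[1]], , drop = FALSE]))
  beta <- rep(0, p)
  for (iter in seq_len(max_iter)) {
    xtwx <- matrix(0, p, p)  # accumulates X'WX over all chunks
    xtwz <- rep(0, p)        # accumulates X'Wz over all chunks
    for (rows in chunks) {
      mf <- model.frame(formula, data[rows, , drop = FALSE])
      X <- model.matrix(formula, mf)
      y <- model.response(mf)
      eta <- drop(X %*% beta)     # linear predictor for this chunk
      mu <- plogis(eta)           # fitted probabilities
      w <- mu * (1 - mu)          # IRLS weights for the logit link
      z <- eta + (y - mu) / w     # working response
      xtwx <- xtwx + crossprod(X, X * w)
      xtwz <- xtwz + crossprod(X, w * z)
    }
    beta_new <- drop(solve(xtwx, xtwz))
    converged <- max(abs(beta_new - beta)) < tol
    beta <- beta_new
    if (converged) break
  }
  names(beta) <- colnames(X)
  beta
}

beta_chunked <- fit_chunked_logit(y ~ x1 + x2, dat, chunk_rows)
beta_glm <- coef(glm(y ~ x1 + x2, data = dat, family = binomial()))
print(cbind(chunked = beta_chunked, glm = beta_glm))
```

Only one chunk plus the small accumulators is needed at any time, which is why restricting the ffdf to the model columns and letting the fit run chunkwise keeps memory use low.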
    

    Explanation: Because you did not restrict the data to the columns used in the model, every column of your xex ffdf gets pulled into RAM, which is unnecessary. You were also fitting a gaussian model to a factor response, which is odd. I believe you were trying to fit a logistic regression, so pass the appropriate family argument. With an ffdf as data, the call dispatches to ffbase::bigglm.ffdf and not to biglm::bigglm.function.
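
The effect of the family argument can be seen on a small simulated example with base R's `glm`. This is only a sketch: the data frame `d` and its columns are made up, and the "Y"/"N" coding is an assumption taken from the recode step in the answer.

```r
# Illustration only: binomial vs gaussian family on a yes/no response.
set.seed(42)
n <- 2000
d <- data.frame(x = rnorm(n))
# Simulated "Y"/"N" flag whose probability of "Y" depends on x
d$flag <- factor(ifelse(runif(n) < plogis(0.3 + 1.2 * d$x), "Y", "N"))

# Recode the factor to a logical response, as in the answer
d$flag_lgl <- d$flag == "Y"

# Logistic regression: models the probability of TRUE on the logit scale
fit_logit <- glm(flag_lgl ~ x, data = d, family = binomial())

# Gaussian family on the same response: an ordinary linear fit to 0/1 values
fit_gauss <- glm(flag_lgl ~ x, data = d, family = gaussian())

p_logit <- predict(fit_logit, type = "response")
p_gauss <- predict(fit_gauss, type = "response")

print(range(p_logit))  # always strictly between 0 and 1
print(range(p_gauss))  # not constrained, so it can leave the [0, 1] interval
```

The binomial fit always returns fitted values that are valid probabilities, whereas the gaussian fit puts no such constraint on them.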

    If that does not work, which I doubt, it is because you have other things in RAM that you are not aware of. In that case, do the following:

    require(ffbase)
    mymodeldataset <- xex[c("OPEN_FLG","new_list_id","NUM_OF_ADULTS_IN_HHLD","OCCUP_MIX_PCT")]
    mymodeldataset$OPEN_FLG <- with(mymodeldataset["OPEN_FLG"], ifelse(OPEN_FLG == "Y", TRUE, FALSE))
    ffsave(mymodeldataset, file = "mymodeldataset")
    
    ## Restart R so that nothing else is held in RAM, then:
    require(ffbase)
    require(biglm)
    ffload("mymodeldataset")
    mymodel <- bigglm(OPEN_FLG ~ new_list_id+NUM_OF_ADULTS_IN_HHLD+OCCUP_MIX_PCT, data = mymodeldataset, family=binomial())
    

    And off you go.
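
If you would like to check what else is sitting in RAM before restarting R, a small base R helper along these lines can list the objects in the workspace by size. `object_sizes_mb` is a hypothetical name, and the sizes are the estimates reported by `object.size`, so treat them as approximate.

```r
# Illustration only: list objects in an environment, largest first (in MB).
object_sizes_mb <- function(env = globalenv()) {
  nms <- ls(envir = env, all.names = TRUE)
  if (length(nms) == 0L) return(numeric(0))
  sizes <- vapply(
    nms,
    function(nm) as.numeric(object.size(get(nm, envir = env))),
    numeric(1)
  )
  sort(sizes / 1024^2, decreasing = TRUE)
}

# Show the ten largest objects in the global environment
print(head(object_sizes_mb(), 10))

# After identifying a large object you no longer need, remove it and
# trigger garbage collection, e.g.:
# rm(some_large_object); gc()
```

Anything large that shows up here and is not needed for the model is a candidate for `rm()` before fitting.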
