Big Data Process and Analysis in R

隐瞒了意图╮ 2021-02-01 08:36

I know this is not a new concept by any stretch in R, and I have browsed the High Performance and Parallel Computing Task View. With that said, I am asking this question from a

4 answers
  •  时光说笑
    2021-02-01 09:13

    There's a brand-new package called colbycol that lets you read in only the variables you want from enormous text files:

    http://colbycol.r-forge.r-project.org/

    The read.table function remains the main data import function in R. It is memory inefficient: according to some estimates, it needs three times as much memory as the size of the dataset in order to read it into R.

    The reason for such inefficiency is that R stores data.frames in memory as columns (a data.frame is no more than a list of equal length vectors) whereas text files consist of rows of records. Therefore, R's read.table needs to read whole lines, process them individually breaking into tokens and transposing these tokens into column oriented data structures.
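    To make the column-oriented layout concrete, here is a minimal sketch in base R. The sample file and its column names are made up for illustration. It also shows that read.table itself can drop unwanted columns through `colClasses = "NULL"`, although it still has to tokenize every line to do so.

    ```r
    # Hypothetical three-column sample file, standing in for a large dataset.
    src <- tempfile(fileext = ".csv")
    writeLines(c("id,city,value",
                 "1,Paris,3.5",
                 "2,Lyon,1.25"), src)

    # A data.frame is a list whose elements are equal-length column vectors.
    full <- read.table(src, header = TRUE, sep = ",")
    is.list(full)
    lengths(full)

    # "NULL" in colClasses tells read.table to skip that column entirely,
    # so only id and value are kept in memory.
    slim <- read.table(src, header = TRUE, sep = ",",
                       colClasses = c("integer", "NULL", "numeric"))
    names(slim)
    ```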

    The ColByCol approach is memory efficient. Using Java code, it reads the input text file and writes it out as several text files, each holding a single column of the original dataset. These files are then read into R one at a time, which avoids R's memory bottleneck.
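    The idea can be sketched in plain R. This is not the colbycol package's code (which does the splitting in Java) and it does not use its API. The file contents, column names, and chunk size below are assumptions chosen only to keep the example small.

    ```r
    # Hypothetical tab-separated input, standing in for a multi-GB file.
    src <- tempfile(fileext = ".txt")
    writeLines(c("id\tcity\tvalue",
                 "1\tParis\t3.5",
                 "2\tParis\t4.0",
                 "3\tLyon\t1.25"), src)

    # Pass 1: stream the file in small chunks and write one file per column.
    con <- file(src, open = "r")
    header <- strsplit(readLines(con, n = 1), "\t", fixed = TRUE)[[1]]
    out_dir <- tempfile("cols_")
    dir.create(out_dir)
    col_files <- setNames(file.path(out_dir, paste0(header, ".txt")), header)
    outs <- lapply(col_files, file, open = "w")
    repeat {
      lines <- readLines(con, n = 2)  # tiny chunk size, for demonstration only
      if (length(lines) == 0) break
      fields <- strsplit(lines, "\t", fixed = TRUE)
      for (j in seq_along(header)) {
        writeLines(vapply(fields, `[`, character(1), j), outs[[j]])
      }
    }
    close(con)
    invisible(lapply(outs, close))

    # Pass 2: load only the wanted columns, one at a time, letting
    # type.convert pick a compact type (numeric, or factor for text).
    wanted <- c("city", "value")
    cols <- lapply(col_files[wanted],
                   function(f) type.convert(readLines(f), as.is = FALSE))
    df <- as.data.frame(cols)
    str(df)
    ```

    Only one chunk of raw lines and one column of text are held in memory at any moment, which is the property the approach relies on.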

    The approach works best for big files with many columns, especially when those columns can be converted into memory-efficient types and data structures: R's representation of numbers (in some cases) and of character vectors with repeated levels (as factors) occupies much less space than the corresponding character representation.
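    A rough way to see the effect of type conversion is to compare object sizes in base R. The vectors below are synthetic and the exact byte counts depend on the platform and R version, so treat this as an illustration rather than a benchmark.

    ```r
    set.seed(1)
    n <- 1e5

    # A text column with only four distinct values, as character and as factor.
    chr <- sample(c("north", "south", "east", "west"), n, replace = TRUE)
    fac <- factor(chr)

    # A numeric column, as the text it would occupy in the file and as doubles.
    num_txt <- sprintf("%.6f", runif(n))
    num <- as.numeric(num_txt)

    # Compare in-memory footprints of each representation.
    print(object.size(chr), units = "Kb")
    print(object.size(fac), units = "Kb")
    print(object.size(num_txt), units = "Kb")
    print(object.size(num), units = "Kb")
    ```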

    Package ColByCol has been successfully used to read multi-GB datasets on a 2GB laptop.
