Big Data Process and Analysis in R

后端 未结 4 762
隐瞒了意图╮
隐瞒了意图╮ 2021-02-01 08:36

I know this is not a new concept by any stretch in R, and I have browsed the High Performance and Parallel Computing Task View. With that said, I am asking this question from a

相关标签:
4条回答
  • 2021-02-01 08:59

    To read chunks of the JSON file in, you can use the scan() function. Take a look at the skip and nlines arguments. I'm not sure how much performance you'll get versus using a database.

    0 讨论(0)
  • 2021-02-01 09:11

    If you need to operate on the entire 10GB file at once, then I second @Chase's point about getting a larger, possibly cloud-based computer.

    (The Twitter streaming API returns a pretty rich object: a single 140-character tweet could weigh a couple kb of data. You might reduce memory overhead if you preprocess the data outside of R to extract only the content you need, such as author name and tweet text.)

    On the other hand, if your analysis is amenable to segmenting the data -- for example, you want to first group the tweets by author, date/time, etc -- you could consider using Hadoop to drive R.

    Granted, Hadoop will incur some overhead (both cluster setup and learning about the underlying MapReduce model); but if you plan to do a lot of big-data work, you probably want Hadoop in your toolbox anyway.

    A couple of pointers:

    • an example in chapter 7 of Parallel R shows how to setup R and Hadoop for large-scale tweet analysis. The example uses the RHIPE package, but the concepts apply to any Hadoop/MapReduce work.

    • you can also get a Hadoop cluster via AWS/EC2. Check out Elastic MapReduce for an on-demand cluster, or use Whirr if you need more control over your Hadoop deployment.

    0 讨论(0)
  • 2021-02-01 09:11

    10GB of JSON is rather inefficient for storage and analytical purposes. You can use RJSONIO to read it in efficiently. Then, I'd create a memory mapped file. You can use bigmemory (my favorite) to create different types of matrices (character, numeric, etc.), or store everything in one location, e.g. using HDF5 or SQL-esque versions (e.g. see RSQlite).

    What will be more interesting is the number of rows of data and the number of columns.

    As for other infrastructure, e.g. EC2, that's useful, but preparing a 10GB memory mapped file doesn't really require much infrastructure. I suspect you're working with just a few 10s of millions of rows and a few columns (beyond the actual text of the Tweet). This is easily handled on a laptop with efficient use of memory mapped files. Doing complex statistics will require either more hardware, cleverer use of familiar packages, and/or experimenting with some unfamiliar packages. I'd recommend following up with a more specific question when you reach that stage. The first stage of such work is simply data normalization, storage and retrieval. My answer for that is simple: memory mapped files.

    0 讨论(0)
  • 2021-02-01 09:13

    There's a brand-new package called colbycol that lets you read in only the variables you want from enormous text files:

    http://colbycol.r-forge.r-project.org/

    read.table function remains the main data import function in R. This function is memory inefficient and, according to some estimates, it requires three times as much memory as the size of a dataset in order to read it into R.

    The reason for such inefficiency is that R stores data.frames in memory as columns (a data.frame is no more than a list of equal length vectors) whereas text files consist of rows of records. Therefore, R's read.table needs to read whole lines, process them individually breaking into tokens and transposing these tokens into column oriented data structures.

    ColByCol approach is memory efficient. Using Java code, tt reads the input text file and outputs it into several text files, each holding an individual column of the original dataset. Then, these files are read individually into R thus avoiding R's memory bottleneck.

    The approach works best for big files divided into many columns, specially when these columns can be transformed into memory efficient types and data structures: R representation of numbers (in some cases), and character vectors with repeated levels via factors occupy much less space than their character representation.

    Package ColByCol has been successfully used to read multi-GB datasets on a 2GB laptop.

    0 讨论(0)
提交回复
热议问题