Reading a huge JSON file in R, issues

北荒 2020-12-30 11:17

I am trying to read a very huge JSON file in R using the RJSON library, with this command: json_data <- fromJSON(paste(readLines("myfile.json"), collapse=""))

3 Answers
  • 2020-12-30 11:43

    I ran into the same problem while working with huge datasets in R. I used the jsonlite package to read the JSON, with the following code:

    library(jsonlite)
    get_tweets <- stream_in(file("tweets.json"), pagesize = 10000)
    

    Here tweets.json is my file name (including its location), and pagesize sets how many lines are read in each iteration. Hope it helps.
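
    If even the final data frame does not fit in memory, stream_in also accepts a handler callback that processes each page as it arrives instead of accumulating everything. A minimal sketch under that assumption (the counter logic is illustrative, not from the original answer):

    library(jsonlite)

    # Process the file page by page instead of building one big data frame.
    # Each `df` passed to the handler holds up to `pagesize` records.
    n_seen <- 0
    stream_in(file("tweets.json"),
              handler = function(df) {
                # e.g. count rows, aggregate, or append to a database here
                n_seen <<- n_seen + nrow(df)
              },
              pagesize = 10000)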

  • 2020-12-30 11:44

    For some reason the above solutions all caused R to terminate or worse.

    This solution worked for me, with the same data set:

    library(jsonlite)
    file_name <- 'C:/Users/Downloads/yelp_dataset/yelp_dataset~/dataset/business.JSON'
    business <- jsonlite::stream_in(textConnection(readLines(file_name, n = 100000)), verbose = FALSE)
    

    It took about 15 minutes.
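
    Note that readLines(file_name, n = 100000) only reads the first 100,000 lines. If you need the whole file, one way (my own sketch, not part of the original answer) is to loop over the connection in batches and bind the resulting pages:

    library(jsonlite)

    # Read the whole NDJSON file in 100,000-line batches and bind the pages.
    con <- file(file_name, open = "r")
    chunks <- list()
    repeat {
      lines <- readLines(con, n = 100000)   # next batch of lines
      if (length(lines) == 0) break         # end of file reached
      chunks[[length(chunks) + 1]] <- stream_in(textConnection(lines), verbose = FALSE)
    }
    close(con)
    business <- rbind_pages(chunks)         # jsonlite helper: rbind a list of data frames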

  • 2020-12-30 11:49

    Well, just sharing my experience of reading JSON files. Trying to read JSON files of 52.8 MB, 19.7 MB, 1.3 GB, 93.9 MB and 158.5 MB took me 30 minutes and finally auto-restarted the R session; after that I tried to apply parallel computing so I could watch the progress, but it failed.

    https://github.com/hadley/plyr/issues/265

    Then I added the parameter pagesize = 10000; it worked and was more efficient than ever. We only need to read the files once and can later save them in RData/Rda/Rds format with saveRDS (see the sketch after the transcript below).

    > suppressPackageStartupMessages(library('BBmisc'))
    > suppressAll(library('jsonlite'))
    > suppressAll(library('plyr'))
    > suppressAll(library('dplyr'))
    > suppressAll(library('stringr'))
    > suppressAll(library('doParallel'))
    > 
    > registerDoParallel(cores=16)
    > 
    > ## https://www.kaggle.com/c/yelp-recsys-2013/forums/t/4465/reading-json-files-with-r-how-to
    > ## https://class.coursera.org/dsscapstone-005/forum/thread?thread_id=12
    > fnames <- c('business','checkin','review','tip','user')
    > jfile <- paste0(getwd(),'/yelp_dataset_challenge_academic_dataset/yelp_academic_dataset_',fnames,'.json')
    > dat <- llply(as.list(jfile), function(x) stream_in(file(x),pagesize = 10000),.parallel=TRUE)
    > dat
    list()
    > jfile
    [1] "/home/ryoeng/Coursera-Data-Science-Capstone/yelp_dataset_challenge_academic_dataset/yelp_academic_dataset_business.json"
    [2] "/home/ryoeng/Coursera-Data-Science-Capstone/yelp_dataset_challenge_academic_dataset/yelp_academic_dataset_checkin.json" 
    [3] "/home/ryoeng/Coursera-Data-Science-Capstone/yelp_dataset_challenge_academic_dataset/yelp_academic_dataset_review.json"  
    [4] "/home/ryoeng/Coursera-Data-Science-Capstone/yelp_dataset_challenge_academic_dataset/yelp_academic_dataset_tip.json"     
    [5] "/home/ryoeng/Coursera-Data-Science-Capstone/yelp_dataset_challenge_academic_dataset/yelp_academic_dataset_user.json"    
    > dat <- llply(as.list(jfile), function(x) stream_in(file(x),pagesize = 10000),.progress='=')
    opening file input connection.
     Imported 61184 records. Simplifying into dataframe...
    closing file input connection.
    opening file input connection.
     Imported 45166 records. Simplifying into dataframe...
    closing file input connection.
    opening file input connection.
     Found 470000 records...
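
    For completeness, a minimal sketch of the read-once-then-cache pattern mentioned above, assuming jfile and fnames as defined in the transcript:

    library(jsonlite)

    # Read each file once, then cache as .rds so later sessions just readRDS().
    dat <- lapply(jfile, function(x) stream_in(file(x), pagesize = 10000))
    names(dat) <- fnames
    for (nm in fnames) saveRDS(dat[[nm]], paste0(nm, '.rds'))

    # Later, e.g.: business <- readRDS('business.rds')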
    