Reading big data with fixed width

[愿得一人] 2020-12-02 23:21

How can I read big data formatted with fixed width? I read this question and tried some tips, but all the answers are for delimited data (such as .csv), and that's not my case. The d

3 answers
  • 2020-12-03 00:10

    Here is a pure R solution using the new package readr, created by Hadley Wickham and the RStudio team, released in April 2015. More info here. The code is as simple as this:

    library(readr)
    
    my.data.frame <- read_fwf('TS_MATRICULA_RS.txt',
                          fwf_widths(c(5, 13, 14, 3, 3, 5, 4, 6, 6, 6, 1, 1, 1, 4, 3, 2, 9, 3, 2, 9, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 3, 4, 11, 9, 2, 3, 9, 3, 2, 9, 9, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1)),
                          progress = interactive())
    

    Advantages of read_fwf{readr}

    • readr is based on LaF but surprisingly faster. It has been shown to be the fastest method to read fixed-width files in R
    • It's simpler than the alternatives, e.g. you don't need to worry about col_types because they will be imputed from the first 30 rows of the input.
    • It comes with a progress bar ;)
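
    As a small illustration of the point about column types, here is a minimal sketch on a made-up two-line file (the widths, the column names a, b, c and the data are assumptions for the example, not the file from the question). It reads the file once with guessed types and once with an explicit compact col_types string:

```r
library(readr)

# Write a tiny fixed-width file: three fields of widths 1, 2 and 3.
ff <- tempfile(fileext = ".txt")
writeLines(c("123456", "987654"), ff)

# Hypothetical layout, used only for this example.
layout <- fwf_widths(c(1, 2, 3), col_names = c("a", "b", "c"))

# Let readr guess the column types from the data...
guessed <- read_fwf(ff, layout)

# ...or state them explicitly: "i" = integer, "c" = character.
typed <- read_fwf(ff, layout, col_types = "icc")
```

    For a file as large as the one in the question, stating the types up front may be safer when the first rows are not representative of the rest.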
  • 2020-12-03 00:16

    The LaF package is pretty good at reading fixed-width files very fast. I use it daily to load files of roughly 100 million records with 30 columns (not as many character columns as you have; mainly numeric data and some factors). And it is pretty fast. So this is what I would do.

    library(LaF)
    library(ffbase)
    my.data.laf <- laf_open_fwf('TS_MATRICULA_RS.txt', 
                      column_widths=c(5, 13, 14, 3, 3, 5, 4, 6, 6, 6, 1, 1, 1, 4, 3, 2, 9, 3, 2, 9, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 3, 4, 11, 9, 2, 3, 9, 3, 2, 9, 9, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1),
                      column_types=c('integer', 'integer', 'integer', 'integer', 'integer', 'integer', 'integer', 'integer', 'integer', 'integer', 'categorical', 'categorical', 'categorical',
                                   'integer', 'integer', 'categorical', 'integer', 'integer', 'categorical', 'integer', 'categorical', 'categorical', 'categorical', 'categorical', 'categorical', 'categorical',
                                   'categorical', 'categorical', 'categorical', 'categorical', 'categorical', 'categorical', 'categorical', 'categorical', 'categorical', 'categorical', 'categorical', 'categorical',
                                   'categorical', 'categorical', 'categorical', 'categorical', 'categorical', 'categorical', 'categorical', 'categorical', 'categorical', 'categorical', 'categorical', 'integer',
                                   'integer', 'integer', 'integer', 'integer', 'integer', 'integer', 'integer', 'categorical', 'integer', 'integer', 'categorical', 'categorical', 'categorical',
                                   'categorical', 'integer', 'categorical', 'categorical', 'categorical', 'categorical', 'categorical', 'categorical', 'categorical', 'categorical'))
    my.data <- laf_to_ffdf(my.data.laf, nrows=1000000)
    my.data.in.ram <- as.data.frame(my.data)
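
    To try the same calls on something small first, the following sketch uses a made-up two-line file (the widths, types and column names a, b, c are assumptions for illustration). It opens the file with laf_open_fwf and pulls rows into memory by indexing, without going through ffdf:

```r
library(LaF)

# Tiny fixed-width file: three fields of widths 1, 2 and 3.
ff <- tempfile()
writeLines(c("123456", "987654"), ff)

# The laf object describes the file layout; rows are read when requested.
laf <- laf_open_fwf(ff,
                    column_types  = c("integer", "integer", "integer"),
                    column_widths = c(1, 2, 3),
                    column_names  = c("a", "b", "c"))

# Indexing reads just the requested rows into an ordinary data.frame.
first_rows <- laf[1:2, ]
```

    Checking a handful of rows this way is a cheap test of the widths and types before converting the whole file.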
    

    PS. I started using the LaF package because I was annoyed by the slowness of read.fwf and because the PL/SQL PostgreSQL code which I was working with initially to parse the data was becoming a hassle to maintain.

  • 2020-12-03 00:19

    Without enough details about your data, it's hard to give a concrete answer, but here are some ideas to get you started:

    First, if you're on a Unix system, you can get some information about your file by using the wc command. For example wc -l TS_MATRICULA_RS.txt will tell you how many lines there are in your file and wc -L TS_MATRICULA_RS.txt will report the length of the longest line in your file. This might be useful to know. Similarly, head and tail would let you inspect the first and last 10 lines of your text file.
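
    If the command-line tools are not available, a similar sanity check can be sketched in R. This is a minimal example on a made-up two-line file (the file and the widths are assumptions): it reads only the first few lines and checks that each one is exactly as long as the sum of the field widths:

```r
# Hypothetical widths and a tiny file standing in for the real data.
widths <- c(1, 2, 3)
ff <- tempfile()
writeLines(c("123456", "987654"), ff)

# Read only the first lines (like head), so a huge file is not loaded.
first_lines <- readLines(ff, n = 10)

# Every line of a fixed-width file should be sum(widths) characters long.
line_lengths <- nchar(first_lines)
all_match <- all(line_lengths == sum(widths))
```

    If all_match is FALSE, the widths or the file (trailing spaces, a header line, a different encoding) need a closer look before any reader is pointed at it.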

    Second, some suggestions: Since it appears that you know the widths of each field, I would recommend one of two approaches.

    Option 1: csvkit + your favorite method to quickly read large data

    csvkit is a set of Python tools for working with CSV files. One of the tools is in2csv, which takes a fixed-width-format file combined with a "schema" file to create a proper CSV that can be used with other programs.

    The schema file is, itself, a CSV file with three columns: (1) variable name, (2) start position, and (3) width. An example (from the in2csv man page) is:

        column,start,length
        name,0,30 
        birthday,30,10 
        age,40,3
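
    If you already have the widths in R, the schema can be written from them rather than typed by hand. A minimal sketch, assuming the three columns of the example above and 0-based start positions as shown there (the output path is a temporary file chosen for illustration):

```r
# Widths and names taken from the schema example above.
widths <- c(30, 10, 3)
schema <- data.frame(
  column = c("name", "birthday", "age"),
  start  = cumsum(widths) - widths,  # 0-based start of each field
  length = widths
)

# Write the schema as a plain CSV with a header row and no row names.
schema_file <- tempfile(fileext = ".csv")
write.csv(schema, schema_file, row.names = FALSE, quote = FALSE)
```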
    

    Once you have created that file, you should be able to use something like:

    in2csv -f fixed -s path/to/schemafile.csv path/to/TS_MATRICULA_RS.txt > TS_MATRICULA_RS.csv
    

    From there, I would suggest looking into reading the data with fread from "data.table" or using sqldf.
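
    Reading the converted file with fread is then a one-liner. A minimal sketch, using a small made-up CSV in place of the real in2csv output (the rows below are invented for illustration):

```r
library(data.table)

# Stand-in for the CSV produced by in2csv; the contents are made up.
csv_file <- tempfile(fileext = ".csv")
writeLines(c("name,birthday,age",
             "Ann,1990-01-01,30",
             "Bob,1985-06-15,35"), csv_file)

# fread detects the separator and the column types on its own.
dt <- fread(csv_file)
```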

    Option 2: sqldf using substr

    Using sqldf on a large-ish data file like yours should actually be pretty quick, and you get the benefit of being able to specify exactly what you want to read in using substr.

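    Again, this expects that you have a schema file available, like the one described above. Note that SQLite's substr counts characters from 1, so a schema with 0-based start positions (as in the in2csv example) needs 1 added to each start. Once you have your schema file, you can do the following:

    temp <- read.csv("mySchemaFile.csv")
    
    ## Construct your "substr" command
    ## (start + 1 converts the schema's 0-based positions to SQLite's 1-based ones)
    GetMe <- paste("select", 
                   paste("substr(V1, ", temp$start + 1, ", ",
                         temp$length, ") `", temp$column, "`", 
                         sep = "", collapse = ", "), 
                   "from fixed", sep = " ")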
    
    ## Load "sqldf"
    library(sqldf)
    
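    ## Connect to your file
    ## (sep = "_" is a character assumed not to occur in the data,
    ##  so each line is read as a single column, V1)
    fixed <- file("TS_MATRICULA_RS.txt")
    myDF <- sqldf(GetMe, file.format = list(sep = "_"))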
    

    Since you know the widths, you might be able to skip the generation of the schema file. From the widths, it's just a little bit of work with cumsum. Here's a basic example, building on the first example from read.fwf:

    ff <- tempfile()
    cat(file = ff, "123456", "987654", sep = "\n")
    read.fwf(ff, widths = c(1, 2, 3))
    
    widths <- c(1, 2, 3)
    end <- cumsum(widths)            ## last character of each field
    start <- end - widths + 1        ## first character of each field (1-based)
    column <- paste("V", seq_along(widths), sep = "")
    
    GetMe <- paste("select", 
                   paste("substr(V1, ", start, ", ",
                         widths, ") `", column, "`", 
                         sep = "", collapse = ", "), 
                   "from fixed", sep = " ")
    
    library(sqldf)
    
    ## Connect to your file
    fixed <- file(ff)
    myDF <- sqldf(GetMe, file.format = list(sep = "_"))
    myDF
    unlink(ff)
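
    The start and end positions computed with cumsum can also be checked directly in base R with substring, which is a quick way to confirm the offsets before running the query on the full file. A minimal sketch on the same two records (the names end and fields are illustrative):

```r
widths <- c(1, 2, 3)
end    <- cumsum(widths)        # last character of each field
start  <- end - widths + 1      # first character of each field (1-based)

records <- c("123456", "987654")

# substring() is vectorised over first/last, so each record yields one
# piece per field; transpose to get one row per record.
fields <- t(sapply(records, substring, first = start, last = end,
                   USE.NAMES = FALSE))
```

    Each row of fields should hold the pieces of one record, cut at the same positions the substr query uses.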
    