Massive CSV file into Matlab

隐瞒了意图╮ 2021-01-15 00:18

I have a 1.6 GB CSV file that I need to feed into Matlab. I will have to do this frequently, and I need it to run quickly. The file is of the form:

201

3 Answers
  • 2021-01-15 00:27

    The recommended function is textscan (http://www.mathworks.com/help/matlab/ref/textscan.html).

    Your code would look like this:

    fid = fopen('C:\Program Files\MATLAB\R2013a\EDU13.csv','r');
    c = textscan(fid, '%d,%d:%d.%d,%f,%d,%c');
    fclose(fid);
    

    You end up with a cell array... whether it's worth converting that to another shape really depends on how you want to access the data afterwards.
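    For instance, here is a minimal sketch of flattening the result (assuming the format string above matches your file; the variable names are mine) into one numeric matrix plus a char column:

    numericCols = cellfun(@double, c(1:6), 'UniformOutput', false);
    M  = [numericCols{:}];   % one row per line of the file, six numeric columns
    ch = c{7};               % the trailing %c field, as a char column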

    It is quite likely that this would be faster if you add a loop that lets you use a smaller, fixed amount of memory for most of the operation. One problem with reading large files is that you don't know their size ahead of time, which very likely means Matlab guesses the amount of memory it needs and repeatedly has to reallocate and copy as the array grows. That is a very slow operation: if it happens every 1 MB, say, it copies 1 MB, then 2 MB, then 3 MB, and so on, so the total work is quadratic in the size of the array.

    If instead you allocate a fixed amount of memory for the final result, and process in smaller batches, you avoid all that overhead. I'm pretty sure it will be much faster - but you would have to experiment a bit with the block size. That would look something like this:

    block = 1000;            % lines per textscan call; experiment with this
    Nlines = 35E6;           % upper bound on the number of lines in the file
    fid = fopen('C:\Program Files\MATLAB\R2013a\EDU13.csv','r');
    c = zeros(Nlines, 7);    % preallocate storage for the result (adapt to your processing)
    c_offset = 0;
    while ~feof(fid)
        temp = textscan(fid, '%d,%d:%d.%d,%f,%d,%c', block);
        bt = size(temp{1}, 1); % lines actually read; equals block except on the last pass
        % ... extract, process, store in c(c_offset + (1:bt), :) ...
        c_offset = c_offset + bt;
    end
    fclose(fid);
    
  • 2021-01-15 00:39

    Inspired by @Axon's answer, I implemented a "fast" C program to convert the file to binary, then read it in using Matlab's fread function. Spoiler alert: reading is then 20x faster... although the initial conversion takes a little bit of time.

    To make the job in Matlab easier, and the file size smaller, I am converting each of the number fields into an int16 (short integer). For the first field - which looks like a yyyymmdd field - that involves splitting into two smaller numbers; similarly the decimal numbers are converted to two short integers (given the apparent range I think that is valid). All this is recognizing that "to really optimize, you must really know your problem" - so if assumptions are invalid, the results will be too.

    Here is the C code:

    #include <stdio.h>
    int main(void){
      FILE *fp, *fo;
      long int ld1;
      int d2, d3, d4, d5, d6, d7;
      short int buf[9];
      char c8;
      int n;
      short int year, monthday;
      fp = fopen("bigdata.txt", "r");
      fo = fopen("bigdata.bin", "wb");
      if (fp == NULL || fo == NULL) {
        printf("unable to open file\n");
        return 1;
      }
      /* stop when a line does not yield all 8 fields; this avoids the
         feof() pitfall of processing the last line twice */
      while ((n = fscanf(fp, "%ld %d:%d.%d %d.%d %d %c\n",
          &ld1, &d2, &d3, &d4, &d5, &d6, &d7, &c8)) == 8) {
        year = ld1 / 10000;            /* split yyyymmdd into yyyy ... */
        monthday = ld1 - 10000 * year; /* ... and mmdd, so both fit in an int16 */
        // move everything into buffer for single call to fwrite:
        buf[0] = year;
        buf[1] = monthday;
        buf[2] = d2;
        buf[3] = d3;
        buf[4] = d4;
        buf[5] = d5;
        buf[6] = d6;
        buf[7] = d7;
        buf[8] = c8;
        fwrite(buf, sizeof(short int), 9, fo);
      }
      fclose(fp);
      fclose(fo);
      return 0;
    }
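    To use it, compile with any C compiler (for example gcc -O2 converter.c -o converter, assuming you saved the code as converter.c) and run the resulting executable once per input file.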
    

    The resulting file is about half the size of the original - which is encouraging and will speed up access. Note that it would be a good idea if the output file could be written to a different disk than the input file - it really helps keep data streaming without a lot of time wasted in seek operations.

    Benchmark: using a file of 2 M lines as input, this ran in about 2 seconds (same disk). The resulting binary file is read in Matlab with the following:

    tic
    fid = fopen('bigdata.bin');
    d = fread(fid, 'int16');
    fclose(fid);
    d = reshape(d, 9, []);
    toc
    

    Of course, if you now want to recover the numbers as floating point values, you will have to do a little bit of work; but I think it's worth it. One problem you will have to solve is the situation where the value after the decimal point has a varying number of digits: converting (a, b) into a float isn't as simple as "a + b/100" when the fractional part doesn't always have exactly two digits... "exercise for the student"?
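    As a minimal sketch, assuming every fractional field does have exactly two digits (if not, the C converter would also have to store the digit count), the reconstruction could look like this:

    yyyymmdd = d(1,:) * 10000 + d(2,:);  % undo the year/monthday split
    value    = d(6,:) + d(7,:) / 100;    % the d5.d6 field; assumes two fractional digits
    ch       = char(d(9,:));             % recover the trailing character field

    Note that fread returns doubles by default, so this arithmetic works directly on d.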

    A little benchmarking: the above code took about 0.4 seconds. By comparison, my first suggestion with textscan took about 9 seconds on the same file, and your original code took a little over 11 seconds. The difference may get bigger as the file gets bigger.

    If you do this a lot (as you said), it clearly is worth converting your files once to binary format, and using them that way. Especially if the file needs to be converted only once, and read many times, the savings will be considerable.

    Update:

    I repeated the benchmark with a 13M line file. The conversion took 13 seconds and the binary read took under 3 seconds. By contrast, each of the other two methods took over a minute (textscan: 61 s; fscanf: 77 s). Things appear to scale linearly (file size: 470 MB text, 240 MB binary).

  • 2021-01-15 00:41

    Consider using a binary file format. Binary files are much smaller and don't need to be parsed from text by MATLAB, so they are much faster to read and write. They can also be more accurate, since values are stored at full precision rather than as rounded decimal text.

    http://www.mathworks.com.au/help/matlab/ref/fread.html
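    As a small illustration (the file name, shape, and type here are arbitrary), a round trip through Matlab's flat binary I/O looks like this:

    x = rand(1000, 7);                   % example data
    fid = fopen('data.bin', 'w');
    fwrite(fid, x, 'double');            % streamed column-major, no text conversion
    fclose(fid);

    fid = fopen('data.bin', 'r');
    y = fread(fid, [1000, 7], 'double'); % the reader must know the shape and type
    fclose(fid);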
