Excessively large overhead in MATLAB .mat file

醉酒成梦 2021-01-12 19:52

I am parsing a large text file full of data and then saving it to disk as a *.mat file so that I can easily load in only parts of it (see here for more information on readin

1 Answer
  •  伪装坚强ぢ
    2021-01-12 20:22

    This seems like a bug to me. A workaround is to write in chunks to pre-allocated arrays.

    Start off by pre-allocating (matObj here is the writable matfile object from your code):

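    % Count the lines up front so every variable can be allocated at full size.
    fid = fopen('01_hit12.par', 'r');
    data = fread(fid, inf, 'uint8=>uint8');  % keep the bytes as uint8 instead of double
    nlines = nnz(data == 10) + 1;            % number of LF characters, plus one for an unterminated last line
    fclose(fid);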
    
    matObj.moleculeNumber = zeros(1,nlines,'uint8');
    matObj.isotopeologueNumber = zeros(1,nlines,'uint8');
    matObj.vacuumWavenumber = zeros(1,nlines,'double');
    matObj.lineIntensity = zeros(1,nlines,'double');
    matObj.airWidth = zeros(1,nlines,'single');
    matObj.selfWidth = zeros(1,nlines,'single');
    matObj.lowStateE = zeros(1,nlines,'single');
    matObj.tempDependWidth = zeros(1,nlines,'single');
    matObj.pressureShift = zeros(1,nlines,'single');
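
    To see the pre-allocate-then-index pattern in isolation, here is a minimal sketch that runs on its own. The file name, sizes, and the handle name m are made up for the demo and are not taken from your code; only the field name airWidth is borrowed from above.

    ```matlab
    % Hypothetical names and sizes, chosen only for this demo.
    fname  = [tempname '.mat'];
    nlines = 25000;
    bs     = 10000;

    % A new file created through matfile is saved in the version 7.3 format.
    m = matfile(fname, 'Writable', true);

    % Pre-allocate once, at full size and in the final class.
    m.airWidth = zeros(1, nlines, 'single');

    % Write one chunk by index into the existing variable.
    k = 1;
    m.airWidth(1, k:k+bs-1) = rand(1, bs, 'single');

    % Check size and class without loading the whole variable.
    sz  = size(m, 'airWidth');
    cls = class(m.airWidth(1, 1));
    fprintf('%d x %d %s\n', sz(1), sz(2), cls);

    delete(fname);
    ```

    The point of allocating first is that later writes only fill an existing variable by index, so the variable never has to grow inside the file.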
    

    Then to write in chunks of 10000, I modified your code as follows:

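    ... % your code plus pre-alloc first
    bs = 10000;  % chunk size: lines parsed per write to the MAT-file
    while ischar(hitranTemp)
        for ii = 1:bs
            % replace a leading space with '0' (as in your code), for every line
            if ~isempty(hitranTemp) && hitranTemp(1) == ' '
                hitranTemp(1) = '0';
            end
            hitran{ii} = textscan(hitranTemp,'%2u%1u%12f%10f%10f%5f%5f%10f%4f%8f%15c%15c%15c%15c%6u%2u%2u%2u%2u%2u%2u%1c%7f%7f','delimiter','','whitespace','');
            hitranTemp = fgetl(fidr);
            % fgetl returns -1 at end of file: shrink the last chunk and stop
            if ~ischar(hitranTemp), bs = ii; break; end
        end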
    
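        % This part is really ugly, sorry! Trying to keep it compact:
        % builtin('_paren',x,1:bs) is an undocumented inline form of x(1:bs),
        % which drops stale entries when the last chunk is short.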
        matObj.moleculeNumber(1,k:k+bs-1)      = uint8(builtin('_paren',cellfun(@(c)c{1},hitran),1:bs));
        matObj.isotopeologueNumber(1,k:k+bs-1) = uint8(builtin('_paren',cellfun(@(c)c{2},hitran),1:bs));
        matObj.vacuumWavenumber(1,k:k+bs-1)    = builtin('_paren',cellfun(@(c)c{3},hitran),1:bs);
        matObj.lineIntensity(1,k:k+bs-1)       = builtin('_paren',cellfun(@(c)c{4},hitran),1:bs);
        matObj.airWidth(1,k:k+bs-1)            = single(builtin('_paren',cellfun(@(c)c{5},hitran),1:bs));
        matObj.selfWidth(1,k:k+bs-1)           = single(builtin('_paren',cellfun(@(c)c{6},hitran),1:bs));
        matObj.lowStateE(1,k:k+bs-1)           = single(builtin('_paren',cellfun(@(c)c{7},hitran),1:bs));
        matObj.tempDependWidth(1,k:k+bs-1)     = single(builtin('_paren',cellfun(@(c)c{8},hitran),1:bs));
        matObj.pressureShift(1,k:k+bs-1)       = single(builtin('_paren',cellfun(@(c)c{9},hitran),1:bs));
    
        k = k + bs;
        fprintf('.');
    end
    fclose(fidr);
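
    If you would rather avoid the undocumented builtin('_paren', ...) call, one option is to slice the cell array before handing it to cellfun, which should give the same values. The sketch below uses a tiny made-up stand-in for hitran (three lines, two fields each) rather than real textscan output, so treat it as an illustration of the indexing only.

    ```matlab
    % Stand-in for the per-line textscan results: one cell per line, each
    % holding the parsed fields. The values are made up for this demo.
    hitran = { {uint32(1), 0.5}, {uint32(2), 1.5}, {uint32(3), 2.5} };
    bs = 2;   % pretend the last chunk only filled the first two entries

    % Slice the input to 1:bs first, then extract one field per line.
    molecule = uint8(cellfun(@(c) c{1}, hitran(1:bs)));
    width    = single(cellfun(@(c) c{2}, hitran(1:bs)));

    disp(molecule);
    disp(width);
    ```

    Slicing first also means cellfun never touches leftover entries from the previous chunk.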
    

    The final size on disk is 21,393,408 bytes. The usage breaks down as follows:

    >> S = whos('-file','01_hit12.mat');
    >> fileBytes = sum([S.bytes]);
    >> T = dir(which('01_hit12.mat'));
    >> diskBytes = T.bytes; ratio = diskBytes/fileBytes;
    >> fprintf('%10d whos\n%10d disk\n%10.6f\n',fileBytes,diskBytes,ratio)
       8531608 whos
      21389582 disk
      2.507099
    

    Still fairly inefficient (about 2.5 times the size reported by whos), but not out of control.
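
    As a rough sanity check on those numbers, the nine pre-allocated variables add up to 38 bytes per element (two uint8, two double, five single). The sketch below redoes the arithmetic from the whos and dir figures printed above, assuming those nine variables are the only contents of the file.

    ```matlab
    % Payload per array element: 2 uint8 + 2 double + 5 single.
    bytesPerElement = 2*1 + 2*8 + 5*4;          % 38 bytes

    % Figures copied from the output above.
    whosBytes = 8531608;
    diskBytes = 21389582;

    nElements      = whosBytes / bytesPerElement;   % 224516 elements per variable
    diskPerElement = diskBytes / nElements;         % bytes on disk per element
    ratio          = diskBytes / whosBytes;

    fprintf('%d elements, %.2f bytes on disk per element, ratio %.4f\n', ...
            nElements, diskPerElement, ratio);
    ```

    So the whos total corresponds to exactly 224,516 elements per variable, matching the pre-allocated length nlines.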
