Import big files/arrays with Mathematica


I work with Mathematica 8.0.1.0 on a 32-bit Windows 7 platform. I try to import data with

Import[file, "Table"]

which works fine as long as th

General memory-efficient solution

Here is a much more memory-efficient function:

Clear[readTable];
readTable[file_String?FileExistsQ, chunkSize_: 100] :=
   Module[{stream, dataChunk, result, linkedList, add},
      (* HoldAllComplete keeps the evaluator from re-traversing accumulated chunks *)
      SetAttributes[linkedList, HoldAllComplete];
      add[ll_, value_] := linkedList[ll, value];
      (* the raw text is read once; only the "Table" parsing is done in chunks *)
      stream  = StringToStream[Import[file, "String"]];
      Internal`WithLocalSettings[
         Null,
         (* main code: parse chunkSize lines at a time until nothing is left *)
         result = linkedList[];
         While[dataChunk =!= {},
           dataChunk = 
              ImportString[
                 StringJoin[Riffle[ReadList[stream, "String", chunkSize], "\n"]], 
                 "Table"];
           result = add[result, dataChunk];
         ];
         (* unwrap the nested linked list into a flat sequence of chunks *)
         result = Flatten[result, Infinity, linkedList],
         (* clean-up *)
         Close[stream]
      ];
      Join @@ result]
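
The linked-list accumulation used inside readTable can be tried in isolation. The following is only a minimal sketch: the names ll, acc and chunks are hypothetical, and the three small lists stand in for parsed chunks of a table.

```mathematica
(* Hypothetical stand-ins for three parsed chunks of a table *)
chunks = {{{1, 2}, {3, 4}}, {{5, 6}}, {{7, 8}, {9, 10}}};

ClearAll[ll, add];
SetAttributes[ll, HoldAllComplete];
add[list_, value_] := ll[list, value];

(* Each step wraps the previous list instead of copying it *)
acc = ll[];
Do[acc = add[acc, c], {c, chunks}];

(* Flatten only the ll wrappers; the chunks themselves stay intact *)
flat = Flatten[acc, Infinity, ll];

(* Join the chunks into one table *)
table = Join @@ flat
(* {{1, 2}, {3, 4}, {5, 6}, {7, 8}, {9, 10}} *)
```

Compared with repeated AppendTo, this avoids copying the growing result on every iteration, which is presumably why the function accumulates chunks this way.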

Here I compare it with the standard Import on your file:

In[3]:= used = MaxMemoryUsed[]
Out[3]= 18009752

In[4]:= 
tt = readTable["C:\\Users\\Archie\\Downloads\\ExampleFile\\ExampleFile.txt"];//Timing
Out[4]= {34.367,Null}

In[5]:= used = MaxMemoryUsed[]-used
Out[5]= 228975672

In[6]:= 
t = Import["C:\\Users\\Archie\\Downloads\\ExampleFile\\ExampleFile.txt","Table"];//Timing
Out[6]= {25.615,Null}

In[7]:= used = MaxMemoryUsed[]-used
Out[7]= 2187743192

In[8]:= tt===t
Out[8]= True

You can see that my code is about 10 times more memory-efficient than Import (roughly 229 MB versus 2.19 GB in the measurements above), while being not much slower (about 34 s versus 26 s). You can control the memory consumption by adjusting the chunkSize parameter. Your resulting table occupies about 150-200 MB of RAM.

EDIT

Getting yet more efficient for sparse tables

I want to illustrate how one can make this function another 2-3 times more memory-efficient during the import, and another order of magnitude more memory-efficient in terms of the final memory occupied by your table, by using SparseArray objects. The size of the gain depends strongly on how sparse your table is. In your example, the table is very sparse.

The anatomy of sparse arrays

We start with a generally useful API for construction and deconstruction of SparseArray objects:

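ClearAll[spart, getIC, getJR, getSparseData, getDefaultElement, makeSparseArray];
(* p-th argument of the internal SparseArray[...] expression *)
HoldPattern[spart[SparseArray[s___], p_]] := {s}[[p]];
(* cumulative nonzero counts per row *)
getIC[s_SparseArray] := spart[s, 4][[2, 1]];
(* column positions of the nonzero elements, flattened *)
getJR[s_SparseArray] := Flatten@spart[s, 4][[2, 2]];
(* nonzero values, row by row *)
getSparseData[s_SparseArray] := spart[s, 4][[3]];
getDefaultElement[s_SparseArray] := spart[s, 3];
(* inverse operation: assemble a matrix SparseArray from its parts *)
makeSparseArray[dims : {_, _}, ic : {__Integer}, jr : {__Integer}, 
     data_List, defElem_: 0] :=
 SparseArray @@ {Automatic, dims, defElem, {1, {ic, List /@ jr}, data}};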

Some brief comments are in order. Here is a sample sparse array:

In[15]:= 
ToHeldExpression@ToString@FullForm[sp  = SparseArray[{{0,0,1,0,2},{3,0,0,0,4},{0,5,0,6,7}}]]

Out[15]= 
Hold[SparseArray[Automatic,{3,5},0,{1,{{0,2,4,7},{{3},{5},{1},{5},{2},{4},{5}}},
{1,2,3,4,5,6,7}}]]

(I used a ToString - ToHeldExpression cycle to convert List[...] etc. in the FullForm back to {...} for ease of reading.) Here, {3,5} are the dimensions. Next is 0, the default element. Next is a nested list, which we can denote as {1,{ic,jr},sparseData}.

Here, ic gives the cumulative number of nonzero elements as we add rows: it starts at 0, is 2 after the first row, the second row adds 2 more (giving 4), and the last adds 3 more (giving 7). The next list, jr, gives the column positions of the nonzero elements in all rows: 3 and 5 for the first row, 1 and 5 for the second, and 2, 4 and 5 for the last one. There is no ambiguity about where each row starts and ends, since this is determined by the ic list. Finally, sparseData is the list of nonzero elements read row by row from left to right (in the same order as the jr list). This is the internal format in which SparseArray objects store their elements, and it hopefully clarifies the role of the functions above.
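
The same three lists can be inspected without taking the expression apart by hand. The sketch below assumes that the SparseArray property accessors "RowPointers", "ColumnIndices" and "NonzeroValues" are available in your version; they are not what the functions in this answer use, and the names rows and rebuilt are made up for the illustration.

```mathematica
sp = SparseArray[{{0, 0, 1, 0, 2}, {3, 0, 0, 0, 4}, {0, 5, 0, 6, 7}}];

(* Assumption: these property accessors exist in your version *)
ic = sp["RowPointers"];              (* {0, 2, 4, 7} *)
jr = Flatten[sp["ColumnIndices"]];   (* {3, 5, 1, 5, 2, 4, 5} *)
data = sp["NonzeroValues"];          (* {1, 2, 3, 4, 5, 6, 7} *)

(* The number of nonzero elements in each row follows from ic *)
Differences[ic]                      (* {2, 2, 3} *)

(* Recover the row index of every nonzero element from ic alone *)
rows = Flatten[MapIndexed[ConstantArray[First[#2], #1] &, Differences[ic]]];

(* Rebuild the array from explicit {row, column} -> value rules *)
rebuilt = SparseArray[Thread[Transpose[{rows, jr}] -> data], Dimensions[sp]];
Normal[rebuilt] === Normal[sp]
```

The last line checks that ic, jr and the data list together carry all the information needed to reconstruct the matrix.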

The code

Clear[readSparseTable];
readSparseTable[file_String?FileExistsQ, chunkSize_: 100] :=
   Module[{stream, dataChunk, start, ic = {}, jr = {}, sparseData = {}, 
        getDataChunkCode, dims},
     stream  = StringToStream[Import[file, "String"]];
     getDataChunkCode := 
       If[# === {}, {}, SparseArray[#]] &@
         ImportString[
             StringJoin[Riffle[ReadList[stream, "String", chunkSize], "\n"]], 
             "Table"];
     Internal`WithLocalSettings[
        Null,
        (* main code *)
        start = getDataChunkCode;
        ic = getIC[start];
        jr = getJR[start];
        sparseData = getSparseData[start];
        dims = Dimensions[start];
        While[True,
           dataChunk = getDataChunkCode;
           If[dataChunk === {}, Break[]];
           ic = Join[ic, Rest@getIC[dataChunk] + Last@ic];
           jr = Join[jr, getJR[dataChunk]];
           sparseData = Join[sparseData, getSparseData[dataChunk]];
           dims[[1]] += First[Dimensions[dataChunk]];
        ],
        (* clean-up *)
        Close[stream]
     ];
     makeSparseArray[dims, ic, jr, sparseData]]
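
The merging step in the loop above (shifting the row pointers of each new chunk by the last pointer accumulated so far) can be checked on a small matrix split into two row blocks. This is only a sketch: it assumes the SparseArray property accessors "RowPointers", "ColumnIndices" and "NonzeroValues" are available in your version, and the names top, bottom and merged* are hypothetical.

```mathematica
m = {{0, 0, 1, 0, 2}, {3, 0, 0, 0, 4}, {0, 5, 0, 6, 7}};

(* Two "chunks": the first two rows and the last row *)
top = SparseArray[m[[1 ;; 2]]];
bottom = SparseArray[m[[3 ;; 3]]];

(* Row pointers of the second chunk are offset by the last pointer of the first *)
mergedIC = Join[top["RowPointers"],
   Rest[bottom["RowPointers"]] + Last[top["RowPointers"]]];

(* Column indices and values are simply concatenated *)
mergedJR = Join[top["ColumnIndices"], bottom["ColumnIndices"]];
mergedData = Join[top["NonzeroValues"], bottom["NonzeroValues"]];

(* Compare with the parts of the matrix converted in one go *)
whole = SparseArray[m];
{mergedIC === whole["RowPointers"],
 mergedJR === whole["ColumnIndices"],
 mergedData === whole["NonzeroValues"]}
```

Note that the merge in readSparseTable takes dims from the first chunk and only updates the row count, so it relies on every chunk having the same number of columns.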

Benchmarks and comparisons

Here is the starting amount of used memory (fresh kernel):

In[10]:= used = MemoryInUse[]
Out[10]= 17910208

We call our function:

In[11]:= 
(tsparse= readSparseTable["C:\\Users\\Archie\\Downloads\\ExampleFile\\ExampleFile.txt"]);//Timing
Out[11]= {39.874,Null}

So, it runs at about the same speed as readTable. How about the memory usage?

In[12]:= used = MaxMemoryUsed[]-used
Out[12]= 80863296

I think this is quite remarkable: at peak we used only about twice as much memory as the file occupies on disk. Even more remarkably, the final memory usage (after the computation finished) has been dramatically reduced:

In[13]:= MemoryInUse[]
Out[13]= 26924456

This is because we use the SparseArray:

In[15]:= {tsparse,ByteCount[tsparse]}
Out[15]= {SparseArray[<326766>,{9429,2052}],12103816}

So, our table takes only 12 MB of RAM. We can compare it to our more general function:

In[18]:= 
(t = readTable["C:\\Users\\Archie\\Downloads\\ExampleFile\\ExampleFile.txt"]);//Timing
Out[18]= {38.516,Null}

The results are the same once we convert our sparse table back to normal:

In[20]:= Normal@tsparse==t
Out[20]= True

while the normal table occupies vastly more space (it appears that ByteCount overcounts the occupied memory by a factor of about 3-4, but the real difference is still at least an order of magnitude):

In[21]:= ByteCount[t]
Out[21]= 619900248
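
How much the sparse representation saves depends on the data, so it can be worth trying on a synthetic table first. A minimal sketch, assuming a hypothetical 1000 x 500 integer table with roughly 2% nonzero entries (the size and density are made up and are not those of the example file):

```mathematica
SeedRandom[1];

(* Hypothetical test table: about 2% of the entries are nonzero *)
dense = Table[
   If[RandomReal[] < 0.02, RandomInteger[{1, 9}], 0], {1000}, {500}];

sparse = SparseArray[dense];

(* Storage reported for the two representations *)
{ByteCount[dense], ByteCount[sparse]}

(* The conversion is lossless *)
Normal[sparse] === dense
```

The exact byte counts will vary with version and platform, so none are quoted here.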
