Question
I have 3 rather large files (67 GB, 36 GB, and 30 GB) that I need to train models on. However, the features are rows and the samples are columns. Since Dask hasn't implemented transpose and stores DataFrames split by row, I need to write something to do this myself. Is there a way I can efficiently transpose the files without loading them into memory?
I've got 16 GB of RAM at my disposal and am using a Jupyter notebook. I have written some rather slow code, but would really appreciate a faster solution. At its current speed, the code below will take a month to finish all the files. The slowest step, by a few orders of magnitude, is the awk call.
import os
import subprocess
import pandas as pd
import dask.dataframe as dd
from IPython.display import display, clear_output

df = dd.read_csv('~/VeryLarge.tsv')

with open('output.csv', 'wb') as fout:
    for i in range(1, len(df.columns)+1):
        print('AWKing')
        # read a column from the original data and store it elsewhere
        x = "awk '{print $" + str(i) + "}' ~/VeryLarge.tsv > ~/file.temp"
        subprocess.check_call(x, shell=True)

        print('Reading')
        # load and transpose the column
        col = pd.read_csv('~/file.temp')
        row = col.T
        display(row)

        print('Deleting')
        # remove the temporary file created
        !rm ~/file.temp

        print('Storing')
        # store the row in its own csv just to be safe. not entirely necessary
        col_path = os.path.expanduser('~/columns/col_{:09d}'.format(i))
        row.to_csv(col_path, header=False)

        print('Appending')
        # append the row (transposed column) to the new file
        with open(col_path, 'rb') as fin:
            for line in fin:
                fout.write(line)

        clear_output()
        # just a measure of progress
        print(i/len(df.columns))
The data itself is 10 million rows (features) and 2000 columns (samples). It just needs to be transposed. Currently, it looks like this:
Answer 1:
I've modified my original script to deploy on any number of CPUs. It ran much faster since I could use multiple processes, and I deployed it on AWS. I used a 96-core machine that completed the task in about 8 hours; I was quite surprised, since that is nearly linear scaling! The idea is to make some repetitive task distributable so the tasks can be assigned to CPUs. Here the parallelization is done with pool.map().
Using this script from the command line is quite simple:
python3 transposer.py -i largeFile.tsv
You can specify other arguments as well if required.
import argparse
import subprocess
from contextlib import closing
from os import cpu_count
from multiprocessing import Pool

import pandas as pd
from IPython.display import display

parser = argparse.ArgumentParser(description='Transpose csv')
parser.add_argument('-i', '--infile', help='Path to input file',
                    default=None)
parser.add_argument('-s', '--sep', help='input separator',
                    default='\t')

args = parser.parse_args()
infile = args.infile
sep = args.sep

# only a few rows are needed here, just to learn how many columns there are
df = pd.read_csv(infile, sep=sep, nrows=3)

def READ_COL(item):
    """Extract column `item` with awk, transpose it, and write it out as one row."""
    print(item)
    outfile = 'outfile{}.temp'.format(item)
    if item != 0:
        x = "awk '{print $" + str(item) + "}' " + infile + " > " + outfile
        subprocess.check_call(x, shell=True)
        col = pd.read_csv(outfile)
        row = col.T
        display(row)
        row.to_csv('col_{:09d}.csv'.format(item), header=False)
        subprocess.check_call('rm ' + outfile, shell=True)
        print(item/len(df.columns))

# one worker per CPU; each call handles a single column
with closing(Pool(processes=cpu_count())) as pool:
    pool.map(READ_COL, list(range(1, len(df.columns)+1)))
After this, you should have a number of files, each containing one transposed column. You just need to join them together with cat or some other command-line tool; the zero-padded filenames keep the shell glob in the original column order. I just ran cat col_* > full_file_transposed.csv.
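If you'd rather do the final join from Python instead of the shell, here is a small sketch equivalent to the cat command above, assuming the col_*.csv files produced by the script and a hypothetical output name:

import glob
import shutil

# Concatenate the per-column files; the zero-padded names make the sorted
# (lexicographic) order match the original column order.
with open('full_file_transposed.csv', 'wb') as fout:
    for path in sorted(glob.glob('col_*.csv')):
        with open(path, 'rb') as fin:
            shutil.copyfileobj(fin, fout)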
Answer 2:
I would create an intermediate file and use fp.seek to write the values in binary format in their new order, before converting it back into a new CSV. Given that (row, column) becomes (column, row): sys.float_info describes the platform float (a C double, 8 bytes), which gives you the size of each element, and the position of each element in the intermediate file is (column * old_row_count + row) * element_size.
You then recombine them into a CSV by converting the values back into text, reading old_row_count values per line.
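A minimal sketch of that idea, assuming a purely numeric, headerless tab-separated input, hypothetical file names, and hand-written dimensions taken from the question (it uses struct to get the 8-byte double size rather than sys.float_info, and is illustrative rather than optimized):

import struct

# Hypothetical paths; the question's data is 10 million rows (features)
# by 2000 columns (samples).
IN_PATH, TMP_PATH, OUT_PATH = 'VeryLarge.tsv', 'transpose.bin', 'transposed.csv'
N_ROWS, N_COLS = 10_000_000, 2000
ELEM = struct.calcsize('d')  # 8 bytes per double

# Pass 1: stream the original file row by row and scatter each value to its
# transposed position in a fixed-size binary intermediate file.
with open(IN_PATH) as fin, open(TMP_PATH, 'wb') as tmp:
    tmp.truncate(N_ROWS * N_COLS * ELEM)  # pre-size the file so seeks land in range
    for row, line in enumerate(fin):
        for col, value in enumerate(line.rstrip('\n').split('\t')):
            tmp.seek((col * N_ROWS + row) * ELEM)
            tmp.write(struct.pack('d', float(value)))

# Pass 2: read the binary file back sequentially; every N_ROWS doubles form
# one line (one original column) of the transposed CSV.
with open(TMP_PATH, 'rb') as tmp, open(OUT_PATH, 'w') as fout:
    for _ in range(N_COLS):
        values = struct.unpack('{}d'.format(N_ROWS), tmp.read(N_ROWS * ELEM))
        fout.write(','.join(map(repr, values)) + '\n')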
Source: https://stackoverflow.com/questions/54208323/how-can-i-efficiently-transpose-a-67-gb-file-dask-dataframe-without-loading-it-e