Copy large data file using parallel I/O


Question


I have a fairly big data set, about 141 million lines in CSV format. I want to use MPI with C++ to copy and manipulate a few columns, but I'm a newbie at both C++ and MPI.

So far my code looks like this:

#include <stdio.h>
#include "mpi.h"

using namespace std;

int main(int argc, char **argv)
{
    int i, rank, nprocs, N = 4;
    MPI_File fp, fpwrite; // file handles for reading and writing
    MPI_Status status;
    MPI_Offset filesize, offset;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    int buf[N];
    for (i = 0; i < N; i++)
        buf[i] = i;
    offset = rank * (N / nprocs) * sizeof(int); // each rank writes at its own offset

    // open the input file before querying its size
    MPI_File_open(MPI_COMM_WORLD, "new.csv", MPI_MODE_RDONLY, MPI_INFO_NULL, &fp);
    MPI_File_get_size(fp, &filesize);

    MPI_File_open(MPI_COMM_WORLD, "Ntest.csv", MPI_MODE_CREATE|MPI_MODE_WRONLY, MPI_INFO_NULL, &fpwrite);

    MPI_File_read(fp, buf, N, MPI_INT, &status);

    // printf("\nrank: %d, buf[%d]: %d\n", rank, rank*N, buf[0]);
    printf("My rank is: %d\n", rank);
    MPI_File_write_at(fpwrite, offset, buf, N / nprocs, MPI_INT, &status);

    /* // repeat the process again
    MPI_Barrier(MPI_COMM_WORLD);
    printf("2/ My rank is: %d\n", rank); */

    MPI_File_close(&fp);
    MPI_File_close(&fpwrite);
    MPI_Finalize();
}

I'm not sure where to start, and I've seen a few examples with Lustre stripes. I would like to go in that direction if possible. Additional options include HDF5 and T3PIO.


Answer 1:


You are way too early to worry about Lustre stripes, aside from the fact that Lustre stripes are by default ridiculously small for a "parallel file system". Increase the stripe size of the directory where you will write and read these files with lfs setstripe.
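For example, something along these lines (a sketch only: the exact flag spelling for the stripe size has varied across Lustre releases, and the directory path is made up):

lfs setstripe -c 8 -S 4m /path/to/csv_dir   # stripe files created here across 8 OSTs with 4 MB stripes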

Your first challenge will be how to decompose this CSV file. What does a typical row look like? If the rows are of variable length, you're going to have a bit of a headache. Here's why:

Consider a CSV file with 3 rows and 3 MPI processes:

  1. The first row is aa,b,c (8 bytes).
  2. The second row is aaaaaaa,bbbbbbb,ccccccc (24 bytes).
  3. The third row is ,,c (4 bytes).

(darnit, markdown, how do I make this list start at zero?)

Rank 0 can read from the beginning of the file, but where will ranks 1 and 2 start? If you simply divide the total size (8+24+4=36) by 3, then the decomposition is:

  1. Rank 0 ends up reading aa,b,c\naaaaaa,
  2. Rank 1 reads a,bbbbbbb,ccc, and
  3. Rank 2 reads cccc\n,,c\n

There are two approaches to unstructured text input. One option is to index your file, either after the fact or as the file is being generated. The index stores the beginning offset of every row; rank 0 reads the index, then broadcasts it to everyone else (see the sketch below).
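A minimal sketch of that indexed approach, assuming the index is a plain binary file of 8-byte row-start offsets called new.csv.idx, with the total file size appended as its last entry (the index file name and layout are assumptions, not part of the question):

#include <stdio.h>
#include <stdlib.h>
#include "mpi.h"

int main(int argc, char **argv)
{
    int rank, nprocs;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    /* hypothetical index: rowstart[i] = byte offset of row i,
       rowstart[nrows] = total size of new.csv */
    long long nrows = 0;
    long long *rowstart = NULL;
    if (rank == 0) {
        FILE *idx = fopen("new.csv.idx", "rb");
        fseek(idx, 0, SEEK_END);
        nrows = ftell(idx) / (long long)sizeof(long long) - 1;
        rewind(idx);
        rowstart = (long long *)malloc((nrows + 1) * sizeof(long long));
        fread(rowstart, sizeof(long long), (size_t)(nrows + 1), idx);
        fclose(idx);
    }
    /* everyone learns the row count, then the offsets themselves */
    MPI_Bcast(&nrows, 1, MPI_LONG_LONG, 0, MPI_COMM_WORLD);
    if (rank != 0)
        rowstart = (long long *)malloc((nrows + 1) * sizeof(long long));
    MPI_Bcast(rowstart, (int)(nrows + 1), MPI_LONG_LONG, 0, MPI_COMM_WORLD);

    /* decompose by rows, not bytes: every rank gets a contiguous block of whole rows */
    long long first = rank * nrows / nprocs;
    long long last  = (rank + 1) * nrows / nprocs;   /* exclusive */
    int len = (int)(rowstart[last] - rowstart[first]); /* real code should guard against >2 GB chunks */

    char *chunk = (char *)malloc(len + 1);
    MPI_File fh;
    MPI_File_open(MPI_COMM_WORLD, "new.csv", MPI_MODE_RDONLY, MPI_INFO_NULL, &fh);
    MPI_File_read_at_all(fh, rowstart[first], chunk, len, MPI_CHAR, MPI_STATUS_IGNORE);
    chunk[len] = '\0';
    /* chunk now holds only complete rows; parse the wanted columns from it */

    MPI_File_close(&fh);
    free(chunk);
    free(rowstart);
    MPI_Finalize();
    return 0;
}

Because every rank's byte range is bounded by row starts, each process ends up with only complete rows, and the collective read keeps the I/O large and contiguous.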

The second option is to do an initial decomposition by file size and then fix up the splits. In the simple example above, rank 0 would send everything after its newline to rank 1. Rank 1 would receive the new data, glue it to the beginning of its chunk, and send everything after its own newline to rank 2. This is extremely fiddly, and I would not suggest it to someone just starting out with MPI-IO.
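If you go that route anyway, the send/receive shuffle is commonly sidestepped by over-reading a little past each split instead of exchanging messages. Here is a rough sketch of that variant, assuming no row is longer than an OVERLAP of 4096 bytes (that bound, like the file name new.csv, is an assumption):

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include "mpi.h"

#define OVERLAP 4096   /* assumed upper bound on the length of a single row */

int main(int argc, char **argv)
{
    int rank, nprocs;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    MPI_File fh;
    MPI_File_open(MPI_COMM_WORLD, "new.csv", MPI_MODE_RDONLY, MPI_INFO_NULL, &fh);
    MPI_Offset filesize;
    MPI_File_get_size(fh, &filesize);

    /* naive byte decomposition; each rank owns the rows that START in [begin, end) */
    MPI_Offset begin = rank * filesize / nprocs;
    MPI_Offset end   = (rank + 1) * filesize / nprocs;
    /* read one byte early (to detect a row starting exactly at begin) and
       OVERLAP bytes late (so the last owned row is complete) */
    MPI_Offset rdbeg = (rank == 0) ? 0 : begin - 1;
    MPI_Offset rdend = (end + OVERLAP < filesize) ? end + OVERLAP : filesize;
    size_t len = (size_t)(rdend - rdbeg);

    char *chunk = (char *)malloc(len + 1);
    MPI_File_read_at_all(fh, rdbeg, chunk, (int)len, MPI_CHAR, MPI_STATUS_IGNORE);
    chunk[len] = '\0';

    /* every rank except 0 skips ahead to the first row that begins in its own range */
    char *p = chunk;
    if (rank != 0) {
        char *nl = strchr(chunk, '\n');
        p = nl ? nl + 1 : chunk + len;
    }

    long long owned = 0;
    while (p < chunk + len && rdbeg + (MPI_Offset)(p - chunk) < end) {
        char *nl = strchr(p, '\n');
        if (!nl) {                       /* file's last row has no trailing newline */
            if (rdend == filesize) owned++;
            break;
        }
        owned++;                         /* one complete row owned by this rank; parse its columns here */
        p = nl + 1;
    }
    printf("rank %d owns %lld rows\n", rank, owned);

    free(chunk);
    MPI_File_close(&fh);
    MPI_Finalize();
    return 0;
}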

HDF5 is a good option here! Instead of trying to write your own parallel CSV parser, have your CSV creator generate an HDF5 dataset. HDF5, among other features, will keep that index I mentioned for you, so you can set up hyperslabs and do parallel reading and writing.
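A hedged sketch of what the HDF5 route might look like, assuming the CSV creator wrote each column as a one-dimensional dataset of doubles, e.g. /col3 in a file new.h5 (the file name, dataset name, and element type are all assumptions), built against parallel HDF5 (typically compiled with the h5pcc wrapper):

#include <stdio.h>
#include <stdlib.h>
#include "mpi.h"
#include "hdf5.h"

int main(int argc, char **argv)
{
    int rank, nprocs;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    /* open the file collectively through the MPI-IO driver */
    hid_t fapl = H5Pcreate(H5P_FILE_ACCESS);
    H5Pset_fapl_mpio(fapl, MPI_COMM_WORLD, MPI_INFO_NULL);
    hid_t file = H5Fopen("new.h5", H5F_ACC_RDONLY, fapl);

    hid_t dset   = H5Dopen2(file, "/col3", H5P_DEFAULT);
    hid_t fspace = H5Dget_space(dset);
    hsize_t nrows;
    H5Sget_simple_extent_dims(fspace, &nrows, NULL);

    /* each rank selects a contiguous hyperslab of whole rows */
    hsize_t start = rank * nrows / nprocs;
    hsize_t count = (rank + 1) * nrows / nprocs - start;
    H5Sselect_hyperslab(fspace, H5S_SELECT_SET, &start, NULL, &count, NULL);
    hid_t mspace = H5Screate_simple(1, &count, NULL);

    /* collective read of just this rank's slice */
    hid_t dxpl = H5Pcreate(H5P_DATASET_XFER);
    H5Pset_dxpl_mpio(dxpl, H5FD_MPIO_COLLECTIVE);
    double *col = (double *)malloc(count * sizeof(double));
    H5Dread(dset, H5T_NATIVE_DOUBLE, mspace, fspace, dxpl, col);

    printf("rank %d read %llu values, first = %g\n",
           rank, (unsigned long long)count, count ? col[0] : 0.0);

    free(col);
    H5Pclose(dxpl); H5Sclose(mspace); H5Sclose(fspace);
    H5Dclose(dset); H5Fclose(file); H5Pclose(fapl);
    MPI_Finalize();
    return 0;
}

Each rank selects its own hyperslab, and the collective transfer property lets the MPI-IO layer aggregate the requests, so the row-index bookkeeping from the first approach is handled for you by the library.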



Source: https://stackoverflow.com/questions/31705877/copy-large-data-file-using-parallel-i-o
