MPI Reading from a text file

前端 未结 1 978
南旧
南旧 2020-12-01 11:28

Hi guys I am learning to program in MPI and I came across this question. Lets say I have a .txt file with 100,000 rows/lines, how do I chunk them for processing by 4 process

相关标签:
1条回答
  • 2020-12-01 11:59

    Text isn't a great format for parallel processing exactly because you don't know ahead of time where (say) line 25001 begins. So these sorts of problems are often dealt with ahead of time through some preprocessing step, either building an index or partitioning the file into the appropriate number of chunks for each process to read.

    If you really want to do it through MPI, I'd suggest using MPI-IO to read in overlapping chunks of the text file onto the various processors, where the overlap is much longer than you expect your longest line to be, and then have each processor agree on where to start; eg, you could say that the first (or last) new line in the overlap region shared by processes N and N+1 is where process N leaves off and N+1 starts.

    To follow this up with some code,

    #include <stdio.h>
    #include <mpi.h>
    #include <stdlib.h>
    #include <ctype.h>
    #include <string.h>
    
    void parprocess(MPI_File *in, MPI_File *out, const int rank, const int size, const int overlap) {
        MPI_Offset globalstart;
        int mysize;
        char *chunk;
    
        /* read in relevant chunk of file into "chunk",
         * which starts at location in the file globalstart
         * and has size mysize 
         */
        {
            MPI_Offset globalend;
            MPI_Offset filesize;
    
            /* figure out who reads what */
            MPI_File_get_size(*in, &filesize);
            filesize--;  /* get rid of text file eof */
            mysize = filesize/size;
            globalstart = rank * mysize;
            globalend   = globalstart + mysize - 1;
            if (rank == size-1) globalend = filesize-1;
    
            /* add overlap to the end of everyone's chunk except last proc... */
            if (rank != size-1)
                globalend += overlap;
    
            mysize =  globalend - globalstart + 1;
    
            /* allocate memory */
            chunk = malloc( (mysize + 1)*sizeof(char));
    
            /* everyone reads in their part */
            MPI_File_read_at_all(*in, globalstart, chunk, mysize, MPI_CHAR, MPI_STATUS_IGNORE);
            chunk[mysize] = '\0';
        }
    
    
        /*
         * everyone calculate what their start and end *really* are by going 
         * from the first newline after start to the first newline after the
         * overlap region starts (eg, after end - overlap + 1)
         */
    
        int locstart=0, locend=mysize-1;
        if (rank != 0) {
            while(chunk[locstart] != '\n') locstart++;
            locstart++;
        }
        if (rank != size-1) {
            locend-=overlap;
            while(chunk[locend] != '\n') locend++;
        }
        mysize = locend-locstart+1;
    
        /* "Process" our chunk by replacing non-space characters with '1' for
         * rank 1, '2' for rank 2, etc... 
         */
    
        for (int i=locstart; i<=locend; i++) {
            char c = chunk[i];
            chunk[i] = ( isspace(c) ? c : '1' + (char)rank );
        }
    
    
        /* output the processed file */
    
        MPI_File_write_at_all(*out, (MPI_Offset)(globalstart+(MPI_Offset)locstart), &(chunk[locstart]), mysize, MPI_CHAR, MPI_STATUS_IGNORE);
    
        return;
    }
    
    int main(int argc, char **argv) {
    
        MPI_File in, out;
        int rank, size;
        int ierr;
        const int overlap = 100;
    
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);
    
        if (argc != 3) {
            if (rank == 0) fprintf(stderr, "Usage: %s infilename outfilename\n", argv[0]);
            MPI_Finalize();
            exit(1);
        }
    
        ierr = MPI_File_open(MPI_COMM_WORLD, argv[1], MPI_MODE_RDONLY, MPI_INFO_NULL, &in);
        if (ierr) {
            if (rank == 0) fprintf(stderr, "%s: Couldn't open file %s\n", argv[0], argv[1]);
            MPI_Finalize();
            exit(2);
        }
    
        ierr = MPI_File_open(MPI_COMM_WORLD, argv[2], MPI_MODE_CREATE|MPI_MODE_WRONLY, MPI_INFO_NULL, &out);
        if (ierr) {
            if (rank == 0) fprintf(stderr, "%s: Couldn't open output file %s\n", argv[0], argv[2]);
            MPI_Finalize();
            exit(3);
        }
    
        parprocess(&in, &out, rank, size, overlap);
    
        MPI_File_close(&in);
        MPI_File_close(&out);
    
        MPI_Finalize();
        return 0;
    }
    

    Running this on a narrow version of the text of the question, we get

    $ mpirun -n 3 ./textio foo.in foo.out
    $ paste foo.in foo.out
    Hi guys I am learning to            11 1111 1 11 11111111 11
    program in MPI and I came           1111111 11 111 111 1 1111
    across this question. Lets          111111 1111 111111111 1111
    say I have a .txt file with         111 1 1111 1 1111 1111 1111
    100,000 rows/lines, how do          1111111 11111111111 111 11
    I chunk them for processing         1 11111 1111 111 1111111111
    by 4 processors? i.e. I want        22 2 22222222222 2222 2 2222
    to let processor 0 take care        22 222 222222222 2 2222 2222
    of the processing for lines         22 222 2222222222 222 22222
    0-25000, processor 1 to take        22222222 222222222 2 22 2222
    care of 25001-50000 and so          2222 22 22222222222 222 22
    on. I did some searching and        333 3 333 3333 333333333 333
    did came across MPI_File_seek       333 3333 333333 3333333333333
    but I am not sure can it work       333 3 33 333 3333 333 33 3333
    on .txt and supports fscanf         33 3333 333 33333333 333333
    afterwards.                         33333333333
    
    0 讨论(0)
提交回复
热议问题