Split File - Java/Linux

情歌与酒 2021-01-21 03:12

I have a large file that contains nearly 250 million characters. Now I want to split it into parts that each contain 30 million characters ( so the first 8 parts will contain 30 million

4 Answers
  • 2021-01-21 03:27

    One way is to use regular Unix commands to split the file and then prepend the last 1000 bytes of the previous part to each part.

    First split the file:

    split -b 30000000 inputfile part.
    

    Then, for each part (ignoring the first), make a new file starting with the last 1000 bytes of the previous part:

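    unset prev
    for i in part.*
    do
      # Skip the first part: it has no predecessor to take an overlap from.
      if [ -n "${prev}" ]
      then
        # Prepend the last 1000 bytes of the previous part to this one.
        tail -c 1000 "${prev}" > part.temp
        cat "${i}" >> part.temp
        mv part.temp "${i}"
      fi
      prev=${i}
    done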
    

    Before reassembling, we iterate over the files again, ignoring the first, and throw away the first 1000 bytes of each:

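    unset prev
    for i in part.*
    do
      # Skip the first part: it never received an overlap.
      if [ -n "${prev}" ]
      then
        # Keep everything from byte 1001 onward, dropping the 1000-byte overlap.
        tail -c +1001 "${i}" > part.temp
        mv part.temp "${i}"
      fi
      prev=${i}
    done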
    

    Last step is to reassemble the files:

    cat part.* >> newfile
    

    Since there was no explanation of why the overlap was needed, I just created it and then threw it away.
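
    If the job has to stay in Java, the same split-with-overlap idea can be sketched in a single pass, without a separate prepend step. This is only a rough sketch and not a one-to-one translation of the commands above: the class name OverlapSplitter, the method split, and the prefix0, prefix1, ... file naming are made up for illustration, and sizes are counted in bytes (as split -b does), not characters.

    ```java
    import java.io.IOException;
    import java.nio.channels.FileChannel;
    import java.nio.file.Path;
    import java.nio.file.Paths;
    import java.nio.file.StandardOpenOption;

    // Hypothetical helper, not part of any library.
    final class OverlapSplitter {
        /**
         * Writes files named prefix0, prefix1, ... Each part covers partSize bytes
         * of the input; every part after the first also starts with the overlap
         * bytes that precede its range. Returns the number of parts written.
         */
        static int split(Path input, String prefix, long partSize, long overlap) throws IOException {
            if (partSize <= 0 || overlap < 0) {
                throw new IllegalArgumentException("partSize must be > 0 and overlap >= 0");
            }
            int parts = 0;
            try (FileChannel in = FileChannel.open(input, StandardOpenOption.READ)) {
                long size = in.size();
                for (long start = 0; start < size; start += partSize, parts++) {
                    long from = Math.max(0, start - overlap);    // reach back into the previous range
                    long to = Math.min(size, start + partSize);  // the last part may be shorter
                    Path part = Paths.get(prefix + parts);
                    try (FileChannel out = FileChannel.open(part, StandardOpenOption.CREATE,
                            StandardOpenOption.WRITE, StandardOpenOption.TRUNCATE_EXISTING)) {
                        long pos = from;
                        while (pos < to) {
                            // transferTo may copy fewer bytes than requested, so loop.
                            pos += in.transferTo(pos, to - pos, out);
                        }
                    }
                }
            }
            return parts;
        }
    }
    ```

    With the sizes used in this answer, a call might look like OverlapSplitter.split(Paths.get("inputfile"), "part.", 30_000_000L, 1_000L).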

  • 2021-01-21 03:39

    You can do it using the BreakIterator class and its static method getCharacterInstance(), which returns a new BreakIterator instance for character breaks in the default locale.

    You can also use getWordInstance(), getLineInstance(), and so on to break on words, lines, etc.

    For example:

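    // Requires: import java.text.BreakIterator;
    BreakIterator boundary = BreakIterator.getCharacterInstance();
    boundary.setText("Your_Sentence");
    int start = boundary.first(); // offset of the first boundary
    int end = boundary.next();    // offset of the next boundary, or BreakIterator.DONE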
    

    Keep calling next() until it returns BreakIterator.DONE to get each character in turn.

    For more detail, see:

    http://docs.oracle.com/javase/6/docs/api/java/text/BreakIterator.html
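
    To make the iteration concrete, here is a minimal sketch of the full loop. The class name CharacterBreaks and the method characters are invented for illustration. Note that BreakIterator works on text already held in memory, so for a file of this size it would presumably have to be applied one chunk at a time.

    ```java
    import java.text.BreakIterator;
    import java.util.ArrayList;
    import java.util.List;

    // Hypothetical helper, not part of the BreakIterator API.
    final class CharacterBreaks {
        /** Returns the user-perceived characters of text, in order. */
        static List<String> characters(String text) {
            BreakIterator boundary = BreakIterator.getCharacterInstance();
            boundary.setText(text);
            List<String> result = new ArrayList<>();
            int start = boundary.first();
            // Walk from boundary to boundary until the iterator reports DONE.
            for (int end = boundary.next(); end != BreakIterator.DONE;
                    start = end, end = boundary.next()) {
                result.add(text.substring(start, end));
            }
            return result;
        }

        public static void main(String[] args) {
            System.out.println(characters("Your_Sentence"));
        }
    }
    ```

    Working with boundaries rather than raw char indexes means a split point should not land in the middle of a surrogate pair or a combining sequence.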

  • 2021-01-21 03:44

    Just use the split or csplit commands with appropriate options.

    You may want to drive these programs with a more complex shell script, or with some other scripting language, to give them appropriate arguments (in particular to deal with your overlapping requirement). You might also combine them with other utilities (grep, head, tail, sed, awk, etc.).
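
    As an illustration of working out such arguments, the sketch below only does the arithmetic: it lists the start offset and length of each part. The 250 million and 30 million figures come from the question; the 1000-unit overlap and the names PartRanges and ranges are assumptions made for this example.

    ```java
    import java.util.ArrayList;
    import java.util.List;

    // Hypothetical helper: computes part boundaries only, it does not touch any file.
    final class PartRanges {
        /** Returns {from, to} offset pairs (to is exclusive), one per part. */
        static List<long[]> ranges(long total, long partSize, long overlap) {
            if (partSize <= 0 || overlap < 0) {
                throw new IllegalArgumentException("partSize must be > 0 and overlap >= 0");
            }
            List<long[]> result = new ArrayList<>();
            for (long start = 0; start < total; start += partSize) {
                long from = Math.max(0, start - overlap);     // overlap reaches back into the previous part
                long to = Math.min(total, start + partSize);  // the last part may be shorter
                result.add(new long[] {from, to});
            }
            return result;
        }

        public static void main(String[] args) {
            // Sizes from the question; the overlap of 1000 is an assumed value.
            for (long[] r : ranges(250_000_000L, 30_000_000L, 1_000L)) {
                System.out.printf("skip=%d count=%d%n", r[0], r[1] - r[0]);
            }
        }
    }
    ```

    Each skip/count pair could then be handed to whichever utility does the actual cutting.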

  • 2021-01-21 03:44

    You can try this. I had to use read/write mode the first time, as the file didn't exist yet. You can use read-only mode, as this code does.

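    // Requires: java.io.RandomAccessFile, java.nio.MappedByteBuffer, java.nio.channels.FileChannel
    long start = System.nanoTime();
    long fileSize = 3200 * 1024 * 1024L;
    FileChannel raf = new RandomAccessFile("deleteme.txt", "r").getChannel();
    // Round the midpoint down to a multiple of 4096 (a typical page size).
    long midPoint = fileSize / 2 / 4096 * 4096;
    // One mapping cannot exceed Integer.MAX_VALUE bytes, so map the file as two halves.
    // The first half runs 4096 bytes past the midpoint, so the two buffers overlap.
    MappedByteBuffer buffer1 = raf.map(FileChannel.MapMode.READ_ONLY, 0, midPoint + 4096);
    MappedByteBuffer buffer2 = raf.map(FileChannel.MapMode.READ_ONLY, midPoint, fileSize - midPoint);
    long time = System.nanoTime() - start;
    System.out.printf("Took %.3f ms to map a file of %,d bytes long%n", time / 1e6, raf.size());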
    

    This was run on a Windows 7 x64 box with 4 GB of memory.

    Took 3.302 ms to map a file of 3,355,443,200 bytes long
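
    The two-buffer mapping above can be generalised to any number of windows. The following is a sketch under assumptions rather than a tested production routine: the class name WindowedMapping, the method mapAll, and the window and overlap parameters are all invented for illustration. As in the code above, each window extends a little past its end so that neighbouring buffers overlap.

    ```java
    import java.io.IOException;
    import java.nio.MappedByteBuffer;
    import java.nio.channels.FileChannel;
    import java.nio.file.Path;
    import java.nio.file.StandardOpenOption;
    import java.util.ArrayList;
    import java.util.List;

    // Hypothetical helper, not part of java.nio.
    final class WindowedMapping {
        /**
         * Maps the file read-only as consecutive windows. Each window starts
         * window bytes after the previous one and extends overlap bytes into
         * the next, so neighbouring buffers share overlap bytes.
         */
        static List<MappedByteBuffer> mapAll(Path file, long window, long overlap) throws IOException {
            if (window <= 0 || overlap < 0 || window + overlap > Integer.MAX_VALUE) {
                throw new IllegalArgumentException("window + overlap must fit in a single mapping");
            }
            List<MappedByteBuffer> buffers = new ArrayList<>();
            try (FileChannel channel = FileChannel.open(file, StandardOpenOption.READ)) {
                long size = channel.size();
                for (long start = 0; start < size; start += window) {
                    long length = Math.min(window + overlap, size - start); // clip at end of file
                    buffers.add(channel.map(FileChannel.MapMode.READ_ONLY, start, length));
                }
            }
            // The mappings stay valid after the channel has been closed.
            return buffers;
        }
    }
    ```

    Each returned buffer can then be processed as one part, which avoids copying the data into separate files at all.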
    