I have a large file containing nearly 250 million characters. Now, I want to split it into parts of 30 million characters each (so the first 8 parts will contain 30 million characters each, and the last the remainder).
One way is to use regular Unix commands to split the file and then prepend the last 1000 bytes from the previous file.
First split the file:
split -b 30000000 inputfile part.
Then, for each part (ignoring the first), make a new file starting with the last 1000 bytes of the previous part:
unset prev
for i in part.*
do  if [ -n "${prev}" ]
    then
        tail -c 1000 ${prev} > part.temp
        cat ${i} >> part.temp
        mv part.temp ${i}
    fi
    prev=${i}
done
Before reassembling, we iterate over the files again, ignoring the first, and throw away the first 1000 bytes of each:
unset prev
for i in part.*
do  if [ -n "${prev}" ]
    then
        tail -c +1001 ${i} > part.temp
        mv part.temp ${i}
    fi
    prev=${i}
done
The last step is to reassemble the files (using > rather than >> so a pre-existing newfile isn't appended to):
cat part.* > newfile
Since there was no explanation of why the overlap was needed I just created it and then threw it away.
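The whole sequence (split, add the overlap, strip it, reassemble) can be sanity-checked on a scaled-down throwaway file; the 100-byte input, 30-byte parts, and 10-byte overlap below are stand-ins for the real sizes:

```shell
#!/bin/sh
set -e
dir=$(mktemp -d)
cd "$dir"

# 100-byte stand-in for the real input
seq 1 60 | tr -d '\n' | head -c 100 > inputfile

split -b 30 inputfile part.

# prepend a 10-byte overlap to every part except the first
unset prev
for i in part.*
do  if [ -n "${prev}" ]
    then
        tail -c 10 ${prev} > part.temp
        cat ${i} >> part.temp
        mv part.temp ${i}
    fi
    prev=${i}
done

# strip the overlap again
unset prev
for i in part.*
do  if [ -n "${prev}" ]
    then
        tail -c +11 ${i} > part.temp
        mv part.temp ${i}
    fi
    prev=${i}
done

cat part.* > newfile
cmp inputfile newfile && echo "round trip OK"
```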
You can do it using the BreakIterator class and its static method getCharacterInstance(), which returns a new BreakIterator instance for character breaks for the default locale.
You can also use getWordInstance(), getLineInstance(), etc. to break on words, lines, and so on.
e.g.:
String text = "Your_Sentence";
BreakIterator boundary = BreakIterator.getCharacterInstance();
boundary.setText(text);
int start = boundary.first();
for (int end = boundary.next(); end != BreakIterator.DONE; start = end, end = boundary.next()) {
    // text.substring(start, end) is one character
}
For more detail look at this link:
http://docs.oracle.com/javase/6/docs/api/java/text/BreakIterator.html
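Applied to the original problem, chunking a string at character boundaries might be sketched as below (the ChunkSplitter class, its method name, and the tiny chunk size are made up for illustration; a combining accent stays in one chunk):

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;

public class ChunkSplitter {
    // Hypothetical helper: split text into chunks of at most maxChars
    // char values, cutting only at BreakIterator character boundaries.
    public static List<String> split(String text, int maxChars) {
        List<String> chunks = new ArrayList<>();
        BreakIterator boundary = BreakIterator.getCharacterInstance();
        boundary.setText(text);
        int chunkStart = 0;
        int prev = boundary.first();
        for (int b = boundary.next(); b != BreakIterator.DONE; b = boundary.next()) {
            if (b - chunkStart > maxChars) {
                // close the chunk at the last boundary that still fits
                chunks.add(text.substring(chunkStart, prev));
                chunkStart = prev;
            }
            prev = b;
        }
        if (chunkStart < text.length()) {
            chunks.add(text.substring(chunkStart));
        }
        return chunks;
    }

    public static void main(String[] args) {
        // "e" + combining acute accent is one character break
        for (String chunk : split("He\u0301llo world", 4)) {
            System.out.println(chunk);
        }
    }
}
```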
Just use the split or csplit command with the appropriate options.
You may want to drive these programs with a more complex shell script, or with some other scripting language, to give them appropriate arguments (in particular to deal with your overlapping requirement). Perhaps you might combine them with other utilities (like grep, head, tail, sed, awk, etc.).
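For instance, csplit can cut on a pattern rather than a byte count; a minimal sketch using GNU csplit (the CHAPTER marker, the part. prefix, and the throwaway input are just examples):

```shell
#!/bin/sh
set -e
dir=$(mktemp -d)
cd "$dir"

# throwaway input with three sections
printf 'CHAPTER one\nbody\nCHAPTER two\nbody\nCHAPTER three\nbody\n' > inputfile

# split at every line starting with CHAPTER; pieces are part.00, part.01, ...
# (-b '%02d' and the '{*}' repeat count are GNU extensions)
csplit -s -f part. -b '%02d' inputfile '/^CHAPTER/' '{*}'

ls part.*
```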
You can try this. I had to use read/write mode the first time, as the file didn't exist at first. You can use read-only, as this code suggests.
import java.io.RandomAccessFile;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;

long start = System.nanoTime();
long fileSize = 3200 * 1024 * 1024L;
FileChannel raf = new RandomAccessFile("deleteme.txt", "r").getChannel();
long midPoint = fileSize / 2 / 4096 * 4096; // align the split point to a 4 KB page
MappedByteBuffer buffer1 = raf.map(FileChannel.MapMode.READ_ONLY, 0, midPoint + 4096);
MappedByteBuffer buffer2 = raf.map(FileChannel.MapMode.READ_ONLY, midPoint, fileSize - midPoint);
long time = System.nanoTime() - start;
System.out.printf("Took %.3f ms to map a file of %,d bytes long%n", time / 1e6, raf.size());
This is running on a Windows 7 x64 box with 4 GB of memory.
Took 3.302 ms to map a file of 3,355,443,200 bytes long
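A single MappedByteBuffer can cover at most Integer.MAX_VALUE bytes, which is why the 3.2 GB file is mapped as two overlapping regions. The runnable sketch below scales the idea down to a small temp file (sizes, names, and the check() helper are invented) and shows how an absolute offset picks the right buffer:

```java
import java.io.RandomAccessFile;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;

public class MapDemo {
    public static boolean check() throws Exception {
        // small stand-in for the 3.2 GB file
        Path path = Files.createTempFile("mapdemo", ".txt");
        byte[] data = new byte[8192];
        for (int i = 0; i < data.length; i++) data[i] = (byte) (i % 251);
        Files.write(path, data);

        try (RandomAccessFile raf = new RandomAccessFile(path.toFile(), "r");
             FileChannel ch = raf.getChannel()) {
            long fileSize = ch.size();
            long midPoint = fileSize / 2 / 4096 * 4096; // page-aligned split point
            MappedByteBuffer buffer1 = ch.map(FileChannel.MapMode.READ_ONLY, 0, midPoint + 4096);
            MappedByteBuffer buffer2 = ch.map(FileChannel.MapMode.READ_ONLY, midPoint, fileSize - midPoint);

            // an absolute offset pos lives in buffer1 if pos < midPoint, else in buffer2
            long pos = midPoint + 10;
            byte b = pos < midPoint ? buffer1.get((int) pos)
                                    : buffer2.get((int) (pos - midPoint));
            return b == data[(int) pos];
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println(check()); // prints "true"
    }
}
```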