I have a large file containing nearly 250 million characters. Now, I want to split it into parts of 30 million characters each (so the first 8 parts will contain 30 million characters each, and the last the remainder).
One way is to use regular Unix commands to split the file and then prepend the last 1000 bytes from the previous file.
First split the file:
split -b 30000000 inputfile part.
Then, for each part (ignoring the first), make a new file starting with the last 1000 bytes of the previous part:
unset prev
for i in part.*
do  if [ -n "${prev}" ]
    then
        tail -c 1000 ${prev} > part.temp
        cat ${i} >> part.temp
        mv part.temp ${i}
    fi
    prev=${i}
done
Before reassembling, we iterate over the files again, ignoring the first, and throw away the first 1000 bytes of each:
unset prev
for i in part.*
do  if [ -n "${prev}" ]
    then
        tail -c +1001 ${i} > part.temp
        mv part.temp ${i}
    fi
    prev=${i}
done
The last step is to reassemble the files (using > rather than >> so a pre-existing newfile isn't appended to):
cat part.* > newfile
Since there was no explanation of why the overlap was needed I just created it and then threw it away.
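The whole sequence (split, add the overlap, strip it, reassemble) can be sanity-checked on a scaled-down throwaway file; the 100-byte input, 30-byte parts, and 10-byte overlap below are stand-ins for the real sizes:

```shell
#!/bin/sh
set -e
dir=$(mktemp -d)
cd "$dir"

# 100-byte stand-in for the real input
seq 1 60 | tr -d '\n' | head -c 100 > inputfile

split -b 30 inputfile part.

# prepend a 10-byte overlap to every part except the first
unset prev
for i in part.*
do  if [ -n "${prev}" ]
    then
        tail -c 10 ${prev} > part.temp
        cat ${i} >> part.temp
        mv part.temp ${i}
    fi
    prev=${i}
done

# strip the overlap again
unset prev
for i in part.*
do  if [ -n "${prev}" ]
    then
        tail -c +11 ${i} > part.temp
        mv part.temp ${i}
    fi
    prev=${i}
done

cat part.* > newfile
cmp inputfile newfile && echo "round trip OK"
```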
You can do it using the BreakIterator class and its static method getCharacterInstance(), which returns a new BreakIterator instance for character breaks for the default locale.
You can also use getWordInstance(), getLineInstance(), etc. to break on words, lines, and so on.
e.g.:
String text = "Your_Sentence";
BreakIterator boundary = BreakIterator.getCharacterInstance();
boundary.setText(text);
int start = boundary.first();
for (int end = boundary.next(); end != BreakIterator.DONE; start = end, end = boundary.next()) {
    // text.substring(start, end) is one character
}
For more detail look at this link:
http://docs.oracle.com/javase/6/docs/api/java/text/BreakIterator.html
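Applied to the original problem, chunking a string at character boundaries might be sketched as below (the ChunkSplitter class, its method name, and the tiny chunk size are made up for illustration; a combining accent stays in one chunk):

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;

public class ChunkSplitter {
    // Hypothetical helper: split text into chunks of at most maxChars
    // char values, cutting only at BreakIterator character boundaries.
    public static List<String> split(String text, int maxChars) {
        List<String> chunks = new ArrayList<>();
        BreakIterator boundary = BreakIterator.getCharacterInstance();
        boundary.setText(text);
        int chunkStart = 0;
        int prev = boundary.first();
        for (int b = boundary.next(); b != BreakIterator.DONE; b = boundary.next()) {
            if (b - chunkStart > maxChars) {
                // close the chunk at the last boundary that still fits
                chunks.add(text.substring(chunkStart, prev));
                chunkStart = prev;
            }
            prev = b;
        }
        if (chunkStart < text.length()) {
            chunks.add(text.substring(chunkStart));
        }
        return chunks;
    }

    public static void main(String[] args) {
        // "e" + combining acute accent is one character break
        for (String chunk : split("He\u0301llo world", 4)) {
            System.out.println(chunk);
        }
    }
}
```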
Just use the split or csplit command with the appropriate options.
You may want to drive these programs with a more complex shell script, or with some other scripting language, to give them appropriate arguments (in particular to deal with your overlapping requirement). Perhaps you might combine them with other utilities (like grep, head, tail, sed, awk, etc.).
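For instance, csplit can cut on a pattern rather than a byte count; a minimal sketch using GNU csplit (the CHAPTER marker, the part. prefix, and the throwaway input are just examples):

```shell
#!/bin/sh
set -e
dir=$(mktemp -d)
cd "$dir"

# throwaway input with three sections
printf 'CHAPTER one\nbody\nCHAPTER two\nbody\nCHAPTER three\nbody\n' > inputfile

# split at every line starting with CHAPTER; pieces are part.00, part.01, ...
# (-b '%02d' and the '{*}' repeat count are GNU extensions)
csplit -s -f part. -b '%02d' inputfile '/^CHAPTER/' '{*}'

ls part.*
```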
You can try this. I had to use read/write mode the first time, as the file didn't exist at first. You can use read-only, as this code suggests.
import java.io.RandomAccessFile;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;

long start = System.nanoTime();
long fileSize = 3200 * 1024 * 1024L;
FileChannel raf = new RandomAccessFile("deleteme.txt", "r").getChannel();
long midPoint = fileSize / 2 / 4096 * 4096; // align the split point to a 4 KB page
MappedByteBuffer buffer1 = raf.map(FileChannel.MapMode.READ_ONLY, 0, midPoint + 4096);
MappedByteBuffer buffer2 = raf.map(FileChannel.MapMode.READ_ONLY, midPoint, fileSize - midPoint);
long time = System.nanoTime() - start;
System.out.printf("Took %.3f ms to map a file of %,d bytes long%n", time / 1e6, raf.size());
This is running on a Windows 7 x64 box with 4 GB of memory.
Took 3.302 ms to map a file of 3,355,443,200 bytes long
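A single MappedByteBuffer can cover at most Integer.MAX_VALUE bytes, which is why the 3.2 GB file is mapped as two overlapping regions. The runnable sketch below scales the idea down to a small temp file (sizes, names, and the check() helper are invented) and shows how an absolute offset picks the right buffer:

```java
import java.io.RandomAccessFile;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;

public class MapDemo {
    public static boolean check() throws Exception {
        // small stand-in for the 3.2 GB file
        Path path = Files.createTempFile("mapdemo", ".txt");
        byte[] data = new byte[8192];
        for (int i = 0; i < data.length; i++) data[i] = (byte) (i % 251);
        Files.write(path, data);

        try (RandomAccessFile raf = new RandomAccessFile(path.toFile(), "r");
             FileChannel ch = raf.getChannel()) {
            long fileSize = ch.size();
            long midPoint = fileSize / 2 / 4096 * 4096; // page-aligned split point
            MappedByteBuffer buffer1 = ch.map(FileChannel.MapMode.READ_ONLY, 0, midPoint + 4096);
            MappedByteBuffer buffer2 = ch.map(FileChannel.MapMode.READ_ONLY, midPoint, fileSize - midPoint);

            // an absolute offset pos lives in buffer1 if pos < midPoint, else in buffer2
            long pos = midPoint + 10;
            byte b = pos < midPoint ? buffer1.get((int) pos)
                                    : buffer2.get((int) (pos - midPoint));
            return b == data[(int) pos];
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println(check()); // prints "true"
    }
}
```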