Read large amount of data from file in Java

一生所求 2020-12-03 01:36

I've got a text file that contains 1,000,002 numbers in the following format:

123 456
1 2 3 4 5 6 .... 999999 100000

Now I need

7 answers
  • 2020-12-03 02:02

    I would extend FilterReader and parse the string as it is read in the read() method. Have a getNextNumber method return the numbers. Code left as an exercise for the reader.
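
    One way the FilterReader idea could look is sketched below. The class name NumberReader and the convention of returning -1 at end of input are assumptions made for illustration, not the answerer's code. The sketch assumes the input holds only ASCII digits and whitespace, with non-negative values that fit in an int.

```java
import java.io.FilterReader;
import java.io.IOException;
import java.io.Reader;

// Hypothetical class name; a sketch of the FilterReader idea, not the answerer's code.
class NumberReader extends FilterReader {

    NumberReader(Reader in) {
        super(in);
    }

    /**
     * Returns the next non-negative integer in the stream, or -1 at end of input.
     * Assumes the input contains only ASCII digits and whitespace.
     */
    int getNextNumber() throws IOException {
        int c = read();
        // Skip separators in front of the number.
        while (c != -1 && Character.isWhitespace(c)) {
            c = read();
        }
        if (c == -1) {
            return -1; // no more numbers
        }
        int result = 0;
        // Accumulate digits until a separator or end of input.
        while (c >= '0' && c <= '9') {
            result = result * 10 + (c - '0');
            c = read();
        }
        return result;
    }
}
```

    When reading from a file, wrap the FileReader in a BufferedReader before passing it in, since getNextNumber calls read() once per character.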

  • 2020-12-03 02:06

    If it's possible to reformat the input so that each integer is on a separate line (instead of one long line with one million integers), you should see much better performance using Integer.parseInt(BufferedReader.readLine()), thanks to smarter line-by-line buffering and not having to split the long string into a separate array of Strings.

    Edit: I tested this and managed to read the output produced by seq 1 1000000 into an array of int in well under half a second, but of course this depends on the machine.
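
    A minimal sketch of the one-integer-per-line approach, assuming the file has already been reformatted that way. The class and method names are made up for illustration, and the caller is assumed to know how many integers to expect.

```java
import java.io.BufferedReader;
import java.io.EOFException;
import java.io.IOException;
import java.io.Reader;

// Hypothetical helper; assumes exactly one integer per line.
class LinePerIntReader {

    /** Reads exactly count integers, one per line, from the given reader. */
    static int[] readInts(Reader source, int count) throws IOException {
        int[] values = new int[count];
        BufferedReader reader = new BufferedReader(source);
        for (int i = 0; i < count; i++) {
            String line = reader.readLine();
            if (line == null) {
                throw new EOFException("expected " + count + " lines, found " + i);
            }
            // trim() tolerates stray spaces or a trailing '\r'
            values[i] = Integer.parseInt(line.trim());
        }
        return values;
    }
}
```

    For a file, the call would look something like readInts(new FileReader("numbers.txt"), 1000000), where numbers.txt is a placeholder name.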

  • 2020-12-03 02:08

    How much memory do you have in the computer? You could be running into GC issues.

    The best thing to do is to process the data one line at a time if possible. Don't load it into an array. Load what you need, process, write it out, and continue.

    This will reduce your memory footprint while still using the same amount of file I/O.
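
    As a rough illustration of processing values as they arrive instead of storing them, the sketch below keeps only a running sum. The sum is a stand-in for whatever processing is actually needed, and the class name is hypothetical. Because the sample input puts a million numbers on one line, the sketch streams token by token rather than line by line.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StreamTokenizer;

// Hypothetical example: consumes numbers one at a time without building an array.
class StreamingSum {

    /** Sums every number in the input; memory use does not grow with input size. */
    static long sum(Reader source) throws IOException {
        StreamTokenizer st = new StreamTokenizer(new BufferedReader(source));
        long total = 0;
        while (st.nextToken() != StreamTokenizer.TT_EOF) {
            if (st.ttype == StreamTokenizer.TT_NUMBER) {
                total += (long) st.nval; // process the value, then forget it
            }
        }
        return total;
    }
}
```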

  • 2020-12-03 02:10

    You can reduce the time for the StreamTokenizer result by using a BufferedReader:

    // try-with-resources closes the reader even if tokenizing throws
    try (Reader r = new BufferedReader(new FileReader(file))) {
        final StreamTokenizer st = new StreamTokenizer(r);
        ...
    }
    

    Also, don't forget to close your files, as I've shown here.

    You can also shave some more time off by using a custom tokenizer just for your purposes:

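    import java.io.EOFException;
    import java.io.IOException;
    import java.io.Reader;

    public class CustomTokenizer {

        private final Reader r;

        public CustomTokenizer(final Reader r) {
            this.r = r;
        }

        /**
         * Reads the next non-negative integer.
         * Throws EOFException if the input ends before a digit is found.
         */
        public int nextInt() throws IOException {
            int i = r.read();
            if (i == -1)
                throw new EOFException();

            char c = (char) i;

            // Skip any whitespace
            while (c == ' ' || c == '\n' || c == '\r') {
                i = r.read();
                if (i == -1)
                    throw new EOFException();
                c = (char) i;
            }

            // c is now the first digit; accumulate the rest until
            // whitespace or end of input
            int result = (c - '0');
            while ((i = r.read()) >= 0) {
                c = (char) i;
                if (c == ' ' || c == '\n' || c == '\r')
                    break;
                result = result * 10 + (c - '0');
            }

            return result;
        }

    }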
    

    Remember to use a BufferedReader for this. This custom tokenizer assumes the input data is always completely valid and contains only spaces, new lines, and digits.

    If you read this file often and its contents rarely change, you should probably keep the parsed array in memory along with the file's last-modified time. Then, if the file has not changed, return the cached copy of the array instead of parsing again, which speeds things up significantly. For example:

    public class ArrayRetriever {
    
        private File inputFile;
        private long lastModified;
        private int[] lastResult;
    
        public ArrayRetriever(File file) {
            this.inputFile = file;
        }
    
        public int[] getResult() {
            if (lastResult != null && inputFile.lastModified() == lastModified)
                return lastResult;
    
            lastModified = inputFile.lastModified();
    
            // do logic to actually read the file here
    
            lastResult = array; // the array variable from your examples
            return lastResult;
        }
    
    }
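
    A fuller sketch of the same caching idea, with the file-reading step filled in by a StreamTokenizer loop. The class name CachedIntFile, the loads counter, and the growable-buffer loader are assumptions added for illustration; any of the reading methods discussed here could be substituted in load().

```java
import java.io.BufferedReader;
import java.io.File;
import java.io.IOException;
import java.io.StreamTokenizer;
import java.nio.file.Files;
import java.util.Arrays;

// Hypothetical class; re-parses the file only when its timestamp changes.
class CachedIntFile {

    private final File inputFile;
    private long lastModified;
    private int[] lastResult;
    int loads; // number of real file reads, exposed only for demonstration

    CachedIntFile(File file) {
        this.inputFile = file;
    }

    synchronized int[] getResult() throws IOException {
        long modified = inputFile.lastModified();
        if (lastResult != null && modified == lastModified) {
            return lastResult; // file unchanged: reuse the cached array
        }
        lastResult = load();
        lastModified = modified;
        return lastResult;
    }

    // Reads every number in the file into an array sized to fit.
    private int[] load() throws IOException {
        loads++;
        int[] buf = new int[1024];
        int size = 0;
        try (BufferedReader r = Files.newBufferedReader(inputFile.toPath())) {
            StreamTokenizer st = new StreamTokenizer(r);
            while (st.nextToken() != StreamTokenizer.TT_EOF) {
                if (st.ttype != StreamTokenizer.TT_NUMBER) continue;
                if (size == buf.length) buf = Arrays.copyOf(buf, size * 2);
                buf[size++] = (int) st.nval;
            }
        }
        return Arrays.copyOf(buf, size);
    }
}
```

    Note that callers share the cached array, so anyone who modifies it changes what later callers see. Returning a copy would avoid that at the cost of an extra allocation per call.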
    
  • 2020-12-03 02:15

    Using a StreamTokenizer on a BufferedReader already gives you quite good performance. You shouldn't need to write your own readInt() function.

    Here is the code I used to do some local performance testing:

    import java.io.BufferedInputStream;
    import java.io.BufferedReader;
    import java.io.File;
    import java.io.FileInputStream;
    import java.io.FileReader;
    import java.io.IOException;
    import java.io.InputStream;
    import java.io.Reader;
    import java.io.StreamTokenizer;
    import java.util.Scanner;

    /**
     * Created by zhenhua.xu on 11/27/16.
     */
    public class MyReader {

        private static final String FILE_NAME = "./1m_numbers.txt";
        private static final int n = 1000000;

        public static void main(String[] args) {
            try {
                readByScanner();
                readByStreamTokenizer();
                readByStreamTokenizerOnBufferedReader();
                readByBufferedInputStream();
            } catch (Exception e) {
                e.printStackTrace();
            }
        }

        public static void readByScanner() throws Exception {
            long startTime = System.currentTimeMillis();

            try (Scanner stdin = new Scanner(new File(FILE_NAME))) {
                int[] array = new int[n];
                for (int i = 0; i < n; i++) {
                    array[i] = stdin.nextInt();
                }
            }

            long endTime = System.currentTimeMillis();
            System.out.println(String.format("Total time by Scanner: %d ms", endTime - startTime));
        }

        // Unbuffered FileReader: every tokenizer read goes to the underlying reader
        public static void readByStreamTokenizer() throws Exception {
            long startTime = System.currentTimeMillis();

            try (Reader r = new FileReader(FILE_NAME)) {
                StreamTokenizer st = new StreamTokenizer(r);
                int[] array = new int[n];

                for (int i = 0; st.nextToken() != StreamTokenizer.TT_EOF; i++) {
                    array[i] = (int) st.nval;
                }
            }

            long endTime = System.currentTimeMillis();
            System.out.println(String.format("Total time by StreamTokenizer: %d ms", endTime - startTime));
        }

        public static void readByStreamTokenizerOnBufferedReader() throws Exception {
            long startTime = System.currentTimeMillis();

            try (Reader r = new BufferedReader(new FileReader(FILE_NAME))) {
                StreamTokenizer st = new StreamTokenizer(r);
                int[] array = new int[n];

                for (int i = 0; st.nextToken() != StreamTokenizer.TT_EOF; i++) {
                    array[i] = (int) st.nval;
                }
            }

            long endTime = System.currentTimeMillis();
            System.out.println(String.format("Total time by StreamTokenizer with BufferedReader: %d ms", endTime - startTime));
        }

        public static void readByBufferedInputStream() throws Exception {
            long startTime = System.currentTimeMillis();

            try (BufferedInputStream bis = new BufferedInputStream(new FileInputStream(FILE_NAME))) {
                int[] array = new int[n];
                for (int i = 0; i < n; i++) {
                    array[i] = readInt(bis);
                }
            }

            long endTime = System.currentTimeMillis();
            System.out.println(String.format("Total time by BufferedInputStream: %d ms", endTime - startTime));
        }

        // Skips non-digits, then accumulates consecutive digits into an int.
        // Returns 0 if the stream ends before any digit is seen.
        private static int readInt(InputStream in) throws IOException {
            int ret = 0;
            boolean dig = false;

            for (int c = 0; (c = in.read()) != -1; ) {
                if (c >= '0' && c <= '9') {
                    dig = true;
                    ret = ret * 10 + c - '0';
                } else if (dig) break;
            }

            return ret;
        }
    }
    

    Results I got:

    • Total time by Scanner: 789 ms
    • Total time by StreamTokenizer: 226 ms
    • Total time by StreamTokenizer with BufferedReader: 80 ms
    • Total time by BufferedInputStream: 95 ms
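
    Single-run wall-clock numbers like these include JIT warm-up and disk caching effects, so the first method measured can look slower than it really is. A minimal sketch of a fairer comparison, assuming each read method is wrapped in a Callable and run several times; the Timing class and bestOfMillis name are made up for illustration.

```java
import java.util.concurrent.Callable;

// Hypothetical helper: repeats a task and reports its fastest run.
class Timing {

    /** Runs task the given number of times and returns the best elapsed time in ms. */
    static long bestOfMillis(Callable<?> task, int runs) throws Exception {
        if (runs < 1) {
            throw new IllegalArgumentException("runs must be at least 1");
        }
        long best = Long.MAX_VALUE;
        for (int i = 0; i < runs; i++) {
            long start = System.nanoTime(); // monotonic, unlike currentTimeMillis
            task.call();
            long elapsed = System.nanoTime() - start;
            best = Math.min(best, elapsed);
        }
        return best / 1_000_000; // nanoseconds to milliseconds
    }
}
```

    For serious measurements a dedicated benchmarking harness is a better fit than hand-rolled timing.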
  • 2020-12-03 02:19

    Thanks for all the answers, but I've already found a method that meets my criteria:

    BufferedInputStream bis = new BufferedInputStream(new FileInputStream("./path"));
    int n = readInt(bis);
    int t = readInt(bis);
    int array[] = new int[n];
    for (int i = 0; i < n; i++) {
        array[i] = readInt(bis);
    }
    
    private static int readInt(InputStream in) throws IOException {
        int ret = 0;
        boolean dig = false;
    
        for (int c = 0; (c = in.read()) != -1; ) {
            if (c >= '0' && c <= '9') {
                dig = true;
                ret = ret * 10 + c - '0';
            } else if (dig) break;
        }
    
        return ret;
    }
    

    It takes only about 300 ms to read 1 million integers!
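
    One caveat: the readInt helper above returns 0 when the stream runs out before any digit is seen, and it skips a leading minus sign, so it only suits non-negative input of known length. If that matters, a variant along these lines could be used instead. SignedIntReader is a hypothetical name, and the sketch still assumes values fit in an int.

```java
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;

// Hypothetical variant of readInt that handles '-' and signals end of input.
class SignedIntReader {

    static int readInt(InputStream in) throws IOException {
        int c = in.read();
        // Skip everything up to the first digit or minus sign.
        while (c != -1 && c != '-' && (c < '0' || c > '9')) {
            c = in.read();
        }
        if (c == -1) {
            throw new EOFException("no more integers");
        }
        boolean negative = (c == '-');
        if (negative) {
            c = in.read();
        }
        int ret = 0;
        boolean sawDigit = false;
        while (c >= '0' && c <= '9') {
            ret = ret * 10 + (c - '0');
            sawDigit = true;
            c = in.read();
        }
        if (!sawDigit) {
            throw new IOException("'-' not followed by a digit");
        }
        return negative ? -ret : ret;
    }
}
```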
