HashMap stored on disk is very slow to read back from disk


Question


I have a HashMap that stores external uids, mapping each one to a different id (internal to our app) that has been assigned for that uid.

e.g.:

  • 123.345.432=00001
  • 123.354.433=00002

The map is checked by uid to make sure the same internal id is reused if something is resent to the application (sketched below).

DICOMUID2StudyIdentiferMap is defined as follows:

private static Map DICOMUID2StudyIdentiferMap = Collections.synchronizedMap(new HashMap());
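
For illustration only (the actual lookup code isn't shown in the question, and generateNextInternalId() is a hypothetical helper), the check amounts to something like:

// Reuse the existing internal id for a uid we've seen before,
// otherwise assign and remember a new one.
private static synchronized String getInternalId(String uid) {
    String internalId = (String) DICOMUID2StudyIdentiferMap.get(uid);
    if (internalId == null) {
        internalId = generateNextInternalId(); // hypothetical helper
        DICOMUID2StudyIdentiferMap.put(uid, internalId);
    }
    return internalId;
}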

If the load succeeds, the loaded map overwrites this default; otherwise the default empty HashMap is used.

It's read back from disk by doing:

FileInputStream f = new FileInputStream( studyUIDFile );  
ObjectInputStream s = new ObjectInputStream( f );

Map loadedMap = ( Map )s.readObject();
DICOMUID2StudyIdentiferMap = Collections.synchronizedMap( loadedMap );

The HashMap is written to disk using:

FileOutputStream f = new FileOutputStream( studyUIDFile );
ObjectOutputStream s = new ObjectOutputStream( f );

s.writeObject(DICOMUID2StudyIdentiferMap);

The issue I have is that performance is fine when running locally in Eclipse, but when the application is running in normal use on a machine, the HashMap takes several minutes to load from disk. Once loaded, it also takes a long time to check for a previous value, e.g. by seeing whether DICOMUID2StudyIdentiferMap.put(..., ...) returns a value.

I load the same map object in both cases; it's a ~400 KB file. The HashMap it contains has about 3,000 key-value pairs.

Why is it so slow on one machine, but not in Eclipse?

The machine is a VM running XP. It has only recently started becoming slow to read the HashMap, so it must be related to its size; however, 400 KB isn't very big, I don't think.

Any advice welcome, TIA


Answer 1:


I'm not sure that serialising your Map is the best option. If the Map is disk-based for persistence, why not use a library that's designed for disk? Check out Kyoto Cabinet. It's actually written in C++, but there is a Java API. I've used it several times; it's very easy to use, very fast, and can scale to a huge size.

Here is an example (copy/pasted) for Tokyo Cabinet, the older version of Kyoto Cabinet, but it's basically the same:

import tokyocabinet.HDB;

....

String dir = "/path/to/my/dir/";
HDB hash = new HDB();

// open the hash for read/write, create if does not exist on disk
if (!hash.open(dir + "unigrams.tch", HDB.OWRITER | HDB.OCREAT)) {
    throw new IOException("Unable to open " + dir + "unigrams.tch: " + hash.errmsg());
}

// Add something to the hash
hash.put("blah", "my string");

// Close it
hash.close();



Answer 2:


As @biziclop comments, you should start by using a profiler to see where your application is spending all of its time.

If that doesn't give you any results, here are a couple of theories.

  • It could be that your application is getting close to running out of heap. As the JVM gets close to running out of heap, it can spend nearly all of its time garbage collecting in a vain attempt to keep going. This will show up if you enable GC logging (e.g. with the -verbose:gc flag).

  • It could be that the ObjectInputStream and ObjectOutputStream are doing huge numbers of small read syscalls. Try wrapping the file streams with buffered streams and see if it speeds things up noticeably; see the sketch after this list.
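
A minimal sketch of the buffered variant, reusing the stream setup from the question:

// Wrapping the file streams in buffered streams so the object streams
// issue large block reads/writes instead of many small syscalls.
ObjectInputStream in = new ObjectInputStream(
        new BufferedInputStream(new FileInputStream(studyUIDFile)));
Map loadedMap = (Map) in.readObject();
in.close();

ObjectOutputStream out = new ObjectOutputStream(
        new BufferedOutputStream(new FileOutputStream(studyUIDFile)));
out.writeObject(DICOMUID2StudyIdentiferMap);
out.close(); // close() flushes the buffer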

Why is it so slow on one machine, but not in Eclipse?

The "full heap" theory could explain that. The default heap size for Eclipse is a lot bigger than for an application launched using java ... with no heap size options.




Answer 3:


Maybe you should look for alternatives that work similarly to a Map, e.g. SimpleDB, BerkeleyDB, or Google BigTable.




Answer 4:


Voldemort is a popular open-source key-value store from LinkedIn. I advise you to have a look at the source code to see how they did things. Right now I am looking at the serialization part at https://github.com/voldemort/voldemort/blob/master/src/java/voldemort/serialization/ObjectSerializer.java. Looking at the code, they are using ByteArrayOutputStream, which I assume is a more efficient way to read/write to/from disk.
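
For illustration, serializing into an in-memory buffer first and then writing it to disk in one call looks roughly like this (a sketch using the map and file from the question, not Voldemort's actual code):

// Serialize the whole map into memory first...
ByteArrayOutputStream buffer = new ByteArrayOutputStream();
ObjectOutputStream oos = new ObjectOutputStream(buffer);
oos.writeObject(DICOMUID2StudyIdentiferMap);
oos.close();

// ...then write the bytes to disk in a single call.
FileOutputStream fos = new FileOutputStream(studyUIDFile);
fos.write(buffer.toByteArray());
fos.close();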

Why is it so slow on one machine, but not in Eclipse?

It's not really clear from your question, but is Eclipse running in a VM (VirtualBox?)? If so, it might be faster because the complete VM is stored in memory, which is a lot faster than accessing the disk.




Answer 5:


Here is a list of 122 NoSQL databases you could use as an alternative.

You have two expensive operations here: one is the serialization of objects and the second is disk access. You can speed up access by only reading/writing the data you need. You can speed up the serialization by using a custom format.

You could also change the structure of your data to make it more efficient. If you want to reload/rewrite the whole map each time, I would suggest the following approach.


import java.io.*;
import java.util.*;

public class Main {
    private Map<Integer, Integer> mapping = new LinkedHashMap<Integer, Integer>();

    public void saveTo(File file) throws IOException {
        // Custom format: the entry count followed by raw int pairs.
        DataOutputStream dos = new DataOutputStream(new BufferedOutputStream(new FileOutputStream(file)));
        dos.writeInt(mapping.size());
        for (Map.Entry<Integer, Integer> entry : mapping.entrySet()) {
            dos.writeInt(entry.getKey());
            dos.writeInt(entry.getValue());
        }
        dos.close();
    }

    public void loadFrom(File file) throws IOException {
        DataInputStream dis = new DataInputStream(new BufferedInputStream(new FileInputStream(file)));
        mapping.clear();
        int len = dis.readInt();
        for (int i = 0; i < len; i++)
            mapping.put(dis.readInt(), dis.readInt());
        dis.close();
    }

    public static void main(String[] args) throws IOException {
        Random rand = new Random();
        Main main = new Main();
        for (int i = 1; i <= 3000; i++) {
            // uids in the range 100,000,000 to 999,999,999
            int uid = 100000000 + rand.nextInt(900000000);
            main.mapping.put(uid, i);
        }
        final File file = File.createTempFile("deleteme", "data");
        file.deleteOnExit();
        for (int i = 0; i < 10; i++) {
            long start = System.nanoTime();
            main.saveTo(file);
            long mid = System.nanoTime();
            new Main().loadFrom(file);
            long end = System.nanoTime();
            System.out.printf("Took %.3f ms to save and %.3f ms to load %,d entries.%n",
                    (mid - start) / 1e6, (end - mid) / 1e6, main.mapping.size());
        }
    }
}

prints

Took 1.706 ms to save and 1.203 ms to load 3,000 entries.
Took 1.203 ms to save and 1.209 ms to load 3,000 entries.
Took 0.966 ms to save and 0.961 ms to load 3,000 entries.

Using Trove's TIntIntHashMap instead is about 10% faster.

Increasing the size of the Map to 1 million entries prints

Took 62.009 ms to save and 412.718 ms to load 1,000,000 entries.
Took 61.756 ms to save and 403.135 ms to load 1,000,000 entries.
Took 61.816 ms to save and 399.431 ms to load 1,000,000 entries.



Answer 6:


I think this may be a hashing problem. What is the type of the key you are using in the Map, and does it have an efficient hashCode() method that spreads out the keys well?
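
For example, if the key is a custom class rather than a String, it needs an equals()/hashCode() pair along these lines (a hypothetical key class, not code from the question):

// Hypothetical key class: hashCode() must spread keys across buckets,
// and equals() must agree with it, or every lookup scans one long chain.
public final class DicomUid {
    private final String value;

    public DicomUid(String value) {
        this.value = value;
    }

    @Override
    public boolean equals(Object o) {
        return o instanceof DicomUid && value.equals(((DicomUid) o).value);
    }

    @Override
    public int hashCode() {
        return value.hashCode(); // delegates to String's well-distributed hash
    }
}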



Source: https://stackoverflow.com/questions/6678202/hashmap-stored-on-disk-is-very-slow-to-read-back-from-disk
