Since Java\'s default string interning has got a lot of bad press, I am looking for an alternative.
Can you suggest an API which is a good alternative to Java string
String intern is implemented via a native method. And the C implementation uses a fixed size of some 1k entries and scales very poorly for large number of strings.
It scales poorly for many thousand Strings.
Java 6 stores interned strings in Perm gen. And therefore are not GC'd
It will be cleaned up when the perm gen is cleaned up which is not often but it can mean you reach the maximum of this space if you don't increase it.
My application is a server app with heap size of 10-20G for different deployments.
I suggest you consider using off heap memory. I have 500 GB in off heap memory and about 1 GB in heap in one application. It isn't useful in all cases but worth considering.
During profiling we have figured that hundrends of thousands of string are duplicates and we can significantly improve the memory usage by avoiding storing duplicate strings.
For this I have used a simple array of String. This is very light weight and you can control the upper bound of Strings stored easily.
Here is an example of generic interner.
class Interner<T> {
private final T[] cache;
@SuppressWarnings("unchecked")
public Interner(int primeSize) {
cache = (T[]) new Object[primeSize];
}
public T intern(T t) {
int hash = Math.abs(t.hashCode() % cache.length);
T t2 = cache[hash];
if (t2 != null && t.equals(t2))
return t2;
cache[hash] = t;
return t;
}
}
An interest property of this cache is it doesn't matter that its not thread safe.
For extra speed you can use a power of 2 size and a bit mask, but its more complicated and may not work very well depending on how your hashCodes are calculated.