List<String> list = new ArrayList<>();
for (int i = 0; i < 1000; i++) {
    StringBuilder sb = new StringBuilder();
    String string = sb.toString().intern();
    list.add(string);
}
You can open JMC and check GC activity under the Memory tab inside the MBean Server of the particular JVM, to see when it ran and how much memory it reclaimed. Still, there is no guarantee of when it will be called. You can also initiate a GC under Diagnostic Commands on a specific JVM.
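If you prefer to observe this programmatically rather than through JMC, the standard management beans expose the same counters. A minimal sketch (class name is just for illustration):

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

public class GcStats {
    // Prints, for each collector of the running JVM, how often it ran
    // and how much total time it spent collecting.
    public static void print() {
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            System.out.println(gc.getName() + ": "
                + gc.getCollectionCount() + " collections, "
                + gc.getCollectionTime() + " ms total");
        }
    }

    public static void main(String[] args) {
        print();
    }
}
```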
Hope it helps.
Your example is a bit odd, as it creates 1000 empty strings. If you want such a list while consuming minimal memory, you should use
List<String> list = Collections.nCopies(1000, "");
instead.
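To see why this is cheap: nCopies returns an immutable view that stores the element only once, so every position refers to the very same instance (class name is just for illustration):

```java
import java.util.Collections;
import java.util.List;

public class NCopiesDemo {
    public static void main(String[] args) {
        List<String> list = Collections.nCopies(1000, "");
        // The view holds one reference to "" regardless of its size;
        // every get(i) returns that same instance.
        System.out.println(list.size());                  // 1000
        System.out.println(list.get(0) == list.get(999)); // true
    }
}
```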
If we assume that there is something more sophisticated going on, not creating the same string in every iteration, well, then there is no benefit in calling intern(). What will happen is implementation-dependent. But when calling intern() on a string that is not in the pool, in the best case it will just be added to the pool, and in the worst case, another copy will be made and added to the pool.
At this point, we have no savings yet, but have potentially created additional garbage.
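The best case can be illustrated directly; a small sketch (the == checks are the point here, not a recommended style):

```java
public class InternDemo {
    public static void main(String[] args) {
        String literal = "hello";          // string literals are already in the pool
        String copy = new String(literal); // a distinct heap instance with the same content
        System.out.println(copy == literal);          // false: different objects
        System.out.println(copy.intern() == literal); // true: the pool returns the canonical instance
        // The duplicate 'copy' is now garbage waiting for collection;
        // intern() gave us back a reference, it did not reclaim memory.
    }
}
```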
Interning at this point can only save you some memory if there are duplicates somewhere. This implies that you construct duplicate strings first, to look up their canonical instance via intern() afterwards, so having the duplicate string in memory until it is garbage collected is unavoidable. But that’s not the real problem with interning:
Keep in mind that you pay the price of the disadvantages named above even in cases where there are no duplicates, i.e. where there is no space saving. Also, the acquired reference to the canonical string must have a much longer lifetime than the temporary object used to look it up, to have any positive effect on memory consumption.
The latter touches your literal question. The temporary instances are reclaimed when the garbage collector runs the next time, which will be when the memory is actually needed. There is no need to worry about when this will happen, but well, yes, up to that point, acquiring a canonical reference had no positive effect, not only because the memory hasn’t been reused up to that point, but also, because the memory was not actually needed until then.
This is the place to mention the new String Deduplication feature. This does not change string instances, i.e. the identity of these objects, as that would change the semantics of the program, but changes identical strings to use the same char[] array. Since these character arrays are the biggest payload, this may still achieve great memory savings, without the performance disadvantages of using intern(). Since this deduplication is done by the garbage collector, it will only be applied to strings that have survived long enough to make a difference. Also, this implies that it will not waste CPU cycles when there is still plenty of free memory.
However, there might be cases where manual canonicalization is justified. Imagine we’re parsing a source code file or XML file, or importing strings from an external source (a Reader or a database) where such canonicalization will not happen by default, but duplicates may occur with a certain likelihood. If we plan to keep the data for further processing for a longer time, we might want to get rid of duplicate string instances.
In this case, one of the best approaches is to use a local map, not subject to thread synchronization, dropping it after the process, to avoid keeping references longer than necessary, without having to use special interaction with the garbage collector. This implies that occurrences of the same strings within different data sources are not canonicalized (but are still subject to the JVM’s String Deduplication), but it’s a reasonable trade-off. By using an ordinary resizable HashMap, we also do not have the issues of the fixed intern table.
E.g.
static List<String> parse(CharSequence input) {
    List<String> result = new ArrayList<>();
    Matcher m = TOKEN_PATTERN.matcher(input);
    CharBuffer cb = CharBuffer.wrap(input);
    HashMap<CharSequence,String> cache = new HashMap<>();
    while(m.find()) {
        result.add(
            cache.computeIfAbsent(cb.subSequence(m.start(), m.end()), Object::toString));
    }
    return result;
}
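For completeness, a self-contained way to exercise the method; TOKEN_PATTERN is assumed here to be a simple word pattern, adapt it to the actual token syntax:

```java
import java.nio.CharBuffer;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ParseDemo {
    // Assumed token definition for this sketch; the real pattern
    // depends on what is being parsed.
    static final Pattern TOKEN_PATTERN = Pattern.compile("\\w+");

    static List<String> parse(CharSequence input) {
        List<String> result = new ArrayList<>();
        Matcher m = TOKEN_PATTERN.matcher(input);
        CharBuffer cb = CharBuffer.wrap(input);
        HashMap<CharSequence,String> cache = new HashMap<>();
        while (m.find()) {
            result.add(cache.computeIfAbsent(
                cb.subSequence(m.start(), m.end()), Object::toString));
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> tokens = parse("foo bar foo bar foo");
        System.out.println(tokens);                         // [foo, bar, foo, bar, foo]
        System.out.println(tokens.get(0) == tokens.get(2)); // true: duplicates are canonicalized
    }
}
```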
Note the use of the CharBuffer here: it wraps the input sequence and its subSequence method returns another wrapper with a different start and end index, implementing the right equals and hashCode methods for our HashMap, and computeIfAbsent will only invoke the toString method if the key was not present in the map before. So, unlike using intern(), no String instance will be created for already encountered strings, saving the most expensive aspect of it, the copying of the character arrays.
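The content-based equals and hashCode of these wrappers can be checked in isolation (class name is just for illustration):

```java
import java.nio.CharBuffer;

public class CharBufferKeyDemo {
    public static void main(String[] args) {
        // Two wrappers over different backing sequences, both viewing "bcd".
        CharBuffer a = CharBuffer.wrap("abcde").subSequence(1, 4);
        CharBuffer b = CharBuffer.wrap("xbcdy").subSequence(1, 4);
        System.out.println(a.equals(b));                  // true: compares the remaining chars
        System.out.println(a.hashCode() == b.hashCode()); // true: equal content, equal hash
        // Note: a CharBuffer is never equal to a plain String, which is why
        // the variant below wraps the cached String in a CharBuffer before storing it.
    }
}
```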
If we have a really high likelihood of duplicates, we may even save the creation of wrapper instances:
static List<String> parse(CharSequence input) {
    List<String> result = new ArrayList<>();
    Matcher m = TOKEN_PATTERN.matcher(input);
    CharBuffer cb = CharBuffer.wrap(input);
    HashMap<CharSequence,String> cache = new HashMap<>();
    while(m.find()) {
        cb.limit(m.end()).position(m.start());
        String s = cache.get(cb);
        if(s == null) {
            s = cb.toString();
            cache.put(CharBuffer.wrap(s), s);
        }
        result.add(s);
    }
    return result;
}
This creates only one wrapper per unique string, but also has to perform one additional hash lookup for each unique string when putting it into the map. Since the creation of a wrapper is quite cheap, you really need a significantly large number of duplicate strings, i.e. a small number of unique strings compared to the total number, to benefit from this trade-off.
As said, these approaches are very efficient, because they use a purely local cache that is simply dropped afterwards. With this, we don’t have to deal with thread safety, nor interact with the JVM or the garbage collector in any special way.