I have to keep thousands of strings in memory, to be accessed serially, in Java. Should I store them in an array or should I use some kind of List?
Since arrays keep
If you have thousands, consider using a trie. A trie is a tree-like structure that merges the common prefixes of the stored strings.
For example, if the strings were
intern
international
internationalize
internet
internets
The trie would store:
intern
  -> \0
  international
    -> \0
    -> ize\0
  net
    -> \0
    -> s\0
The strings require 57 characters (including the null terminator, '\0') for storage, plus the size of the String objects that hold them. (In truth, we should probably round all sizes up to multiples of 16, but...) Call it 57 + 5 = 62 bytes, roughly.
The trie requires 29 characters (including the null terminator, '\0') for storage, plus the size of the trie nodes, each of which holds a ref to an array and a list of child trie nodes.
For this example, that probably comes out about the same; for thousands, it probably comes out less as long as you do have common prefixes.
Now, when using the trie in other code, you'll have to convert to String, probably using a StringBuffer as an intermediary. If many of the strings are in use at once as Strings, outside the trie, it's a loss.
But if you're only using a few at a time -- say, to look up things in a dictionary -- the trie can save you a lot of space. Definitely less space than storing them in a HashSet.
You say you're accessing them "serially" -- if that means sequentially and alphabetically, the trie also gives you alphabetical order for free, if you iterate it depth-first and visit each node's children in character order.
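To make the prefix sharing and the depth-first ordering concrete, here is a minimal sketch. It assumes one node per character rather than the compressed layout drawn above, and SimpleTrie and its method names are made up for illustration. With a TreeMap in every node it will not show the memory savings; a space-conscious version would pack the children far more tightly.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration class; not part of any library.
class SimpleTrie {
    private static final class Node {
        // TreeMap keeps children sorted by character, so a depth-first
        // walk visits the stored words in alphabetical (char) order.
        final Map<Character, Node> children = new TreeMap<>();
        boolean endOfWord; // plays the role of the '\0' terminator above
    }

    private final Node root = new Node();

    void add(String word) {
        Node node = root;
        for (int i = 0; i < word.length(); i++) {
            // Reuse the existing child if the prefix is already stored.
            node = node.children.computeIfAbsent(word.charAt(i), c -> new Node());
        }
        node.endOfWord = true;
    }

    boolean contains(String word) {
        Node node = root;
        for (int i = 0; i < word.length(); i++) {
            node = node.children.get(word.charAt(i));
            if (node == null) {
                return false;
            }
        }
        return node.endOfWord;
    }

    // Rebuilds the Strings by walking the trie depth-first.
    List<String> sortedWords() {
        List<String> out = new ArrayList<>();
        collect(root, new StringBuilder(), out);
        return out;
    }

    private void collect(Node node, StringBuilder prefix, List<String> out) {
        if (node.endOfWord) {
            out.add(prefix.toString());
        }
        for (Map.Entry<Character, Node> e : node.children.entrySet()) {
            prefix.append(e.getKey());
            collect(e.getValue(), prefix, out);
            prefix.setLength(prefix.length() - 1); // backtrack one character
        }
    }

    public static void main(String[] args) {
        SimpleTrie trie = new SimpleTrie();
        String[] words = {"internets", "internet", "intern", "internationalize", "international"};
        for (String w : words) {
            trie.add(w);
        }
        // prints [intern, international, internationalize, internet, internets]
        System.out.println(trie.sortedWords());
    }
}
```

Note that sortedWords() has to build a new String for every entry, which is the conversion cost mentioned above.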
A lot of microbenchmarks given here have found numbers of a few nanoseconds for things like array/ArrayList reads. This is quite reasonable if everything is in your L1 cache.
A higher-level cache or main memory access can take on the order of 10 ns to 100 ns, versus more like 1 ns for L1 cache. Accessing an ArrayList has an extra memory indirection, and in a real application you could pay this cost anything from almost never to every time, depending on what your code is doing between accesses. And, of course, if you have a lot of small ArrayLists this might add to your memory use and make it more likely you'll have cache misses.
The original poster appears to be using just one and accessing a lot of contents in a short time, so it should be no great hardship. But it might be different for other people, and you should watch out when interpreting microbenchmarks.
Java Strings, however, are appallingly wasteful, especially if you store lots of small ones (just look at them with a memory analyzer, it seems to be > 60 bytes for a string of a few characters). An array of strings has an indirection to the String object, and another from the String object to a char[] which contains the string itself. If anything's going to blow your L1 cache it's this, combined with thousands or tens of thousands of Strings. So, if you're serious - really serious - about scraping out as much performance as possible then you could look at doing it differently. You could, say, hold two arrays, a char[] with all the strings in it, one after another, and an int[] with offsets to the starts. This will be a PITA to do anything with, and you almost certainly don't need it. And if you do, you've chosen the wrong language.
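For the curious, a rough sketch of that two-array layout might look like the following. PackedStrings and its methods are hypothetical names, and the extra end sentinel in the offsets array is my own choice so that lengths can be computed by subtraction; it is only meant to show the idea, not a tuned implementation.

```java
import java.util.List;

// Hypothetical illustration class; not part of any library.
class PackedStrings {
    private final char[] chars;  // all strings stored back to back
    private final int[] offsets; // offsets[i] = start of string i; last entry = chars.length

    PackedStrings(List<String> strings) {
        int total = 0;
        for (String s : strings) {
            total += s.length();
        }
        chars = new char[total];
        offsets = new int[strings.size() + 1];
        int pos = 0;
        for (int i = 0; i < strings.size(); i++) {
            String s = strings.get(i);
            offsets[i] = pos;
            s.getChars(0, s.length(), chars, pos); // copy the characters into the shared array
            pos += s.length();
        }
        offsets[strings.size()] = pos; // end sentinel (an assumption of this sketch)
    }

    int size() {
        return offsets.length - 1;
    }

    int length(int index) {
        return offsets[index + 1] - offsets[index];
    }

    // Reads one character without creating a String.
    // For brevity, k is not checked against length(index).
    char charAt(int index, int k) {
        return chars[offsets[index] + k];
    }

    // Materializes a String only when one is really needed.
    String get(int index) {
        return new String(chars, offsets[index], length(index));
    }

    public static void main(String[] args) {
        PackedStrings packed = new PackedStrings(List.of("intern", "internet", ""));
        for (int i = 0; i < packed.size(); i++) {
            System.out.println(i + ": \"" + packed.get(i) + "\" (length " + packed.length(i) + ")");
        }
    }
}
```

Every call to get() allocates a fresh String, so this only pays off if most of the work can be done through charAt() and length().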
No, because technically the array only stores references to the strings; the strings themselves are allocated in a different location. For a thousand items, I would say a list would be better: it is slower, but it offers more flexibility and is easier to use, especially if you are going to resize it.
Array vs. List choice is not so important (considering performance) in the case of storing string objects, because both the array and the list will store String object references, not the actual objects.
You should prefer generic types over arrays. As mentioned by others, arrays are inflexible and do not have the expressive power of generic types. (They do however support runtime typechecking, but that mixes badly with generic types.)
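A small sketch of what the runtime typecheck on arrays means in practice, next to the compile-time check you get from generics. ArrayVsGeneric and arrayStoreFails are made-up names for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration class; not part of any library.
class ArrayVsGeneric {
    // Returns true if storing an Integer into a String[] is rejected at runtime.
    static boolean arrayStoreFails() {
        Object[] objects = new String[1]; // compiles: arrays are covariant
        try {
            objects[0] = Integer.valueOf(42); // the array checks the element type at runtime
            return false;
        } catch (ArrayStoreException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        System.out.println(arrayStoreFails()); // prints true

        List<String> strings = new ArrayList<>();
        strings.add("ok");
        // List<Object> objects = strings; // does not compile: generic types are invariant
        // objects.add(42);                // so this mistake is caught by the compiler instead
        System.out.println(strings);
    }
}
```

The "mixes badly" part shows up when you try to combine the two: for example, new List<String>[10] is rejected by the compiler.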
But, as always, when optimizing you should always follow these steps:
I'm guessing the original poster is coming from a C++/STL background which is causing some confusion. In C++ std::list is a doubly linked list.
In Java [java.util.]List is an implementation-free interface (pure abstract class in C++ terms). List can be a doubly linked list - java.util.LinkedList is provided. However, 99 times out of 100 when you want to make a new List, you want to use java.util.ArrayList instead, which is the rough equivalent of C++ std::vector. There are other standard implementations, such as those returned by java.util.Collections.emptyList() and java.util.Arrays.asList().
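As a quick sketch of what "implementation-free interface" means in practice: the same method below works with any of the implementations mentioned, because it depends only on List. ListImplementations and joinAll are made-up names for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.LinkedList;
import java.util.List;

// Hypothetical illustration class; not part of any library.
class ListImplementations {
    // Depends only on the List interface, not on how the list is stored.
    static String joinAll(List<String> items) {
        StringBuilder sb = new StringBuilder();
        for (String s : items) {
            sb.append(s);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        List<String> arrayBacked = new ArrayList<>(Arrays.asList("a", "b")); // rough std::vector equivalent
        List<String> linked = new LinkedList<>(arrayBacked);                 // doubly linked, like std::list
        List<String> fixedSize = Arrays.asList("a", "b");                    // fixed-size view over an array
        List<String> empty = Collections.emptyList();                        // immutable empty list

        System.out.println(joinAll(arrayBacked)); // prints ab
        System.out.println(joinAll(linked));      // prints ab
        System.out.println(joinAll(fixedSize));   // prints ab
        System.out.println(joinAll(empty));       // prints an empty line
    }
}
```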
From a performance standpoint there is a very small hit from having to go through an interface and an extra object; however, runtime inlining means this rarely has any significance. Also remember that a String is typically an object plus an array, so for each entry you probably have two other objects. In a C++ std::vector<std::string>, although the strings are held by value without a pointer as such, the character arrays will still form a separate object for each string (and these will not usually be shared).
If this particular code is really performance-sensitive, you could create a single char[] array (or even byte[]) for all the characters of all the strings, and then an array of offsets. IIRC, this is how javac is implemented.