So, I have read multiple sources that try to explain what \'docValues\' are in Solr, but I don\'t seem to understand when I should use them, especially in relation to indexed vs
What are docValues in Solr ?
Doc values can be explained as Lucene's column-stride field value storage or simply its an uninverted index or forward index.
To illustrate with json:
row-oriented (stored fields)
{
'doc1': {'A':1, 'B':2, 'C':3},
'doc2': {'A':2, 'B':3, 'C':4},
'doc3': {'A':4, 'B':3, 'C':2}
}
column-oriented (docValues)
{
'A': {'doc1':1, 'doc2':2, 'doc3':4},
'B': {'doc1':2, 'doc2':3, 'doc3':3},
'C': {'doc1':3, 'doc2':4, 'doc3':2}
}
Purpose of DocValues ?
Stored fields store all field values for one document together in a row-stride fashion. In retrieval, all field values are returned at once per document, so that loading the relevant information about a document is very fast.
However, if you need to scan a field (for faceting/sorting/grouping/highlighting) it will be a slow process, as you will have to iterate through all the documents and load each document's fields per iteration resulting in disk seeks.
For example, sorting, when all the matched documents are found, Lucene need to get the value of a field of each of them. Similarly the faceting engine, for example, must look up each term that appears in each document that will make up the result set and pull the document IDs in order to build the facet list.
Now this problem can be approached in two ways:
Like inverted index docvalues are serialized to disk in that case we can rely on the OS’s file system cache to manage memory instead of retaining structures on the JVM heap.
When should I use them ?
For all the reasons discussed above. If you are in a low-memory environment, or you don’t need to index a field, DocValues are perfect for faceting/grouping/filtering/sorting/function queries. They also have the potential for increasing the number of fields you can facet/group/filter/sort on without increasing your memory requirements. I've been using docvalues in production Solr for sorting and faceting and have seen a huge improvement in performance of these queries.