I know Elasticsearch is built upon Apache Lucene, but I want to know the significant differences between the two.
I'll add another angle to the discussion.
An Elasticsearch index is a collection of documents, much as a database is a collection of tables in the relational world.
To scale, we spread an Elasticsearch index across multiple physical nodes / servers.
To do that, we break the index into smaller units called shards.
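To make the idea of splitting an index into shards concrete, here is a minimal sketch that assigns each document to a shard by hashing its ID. It is not Elasticsearch's actual routing code: `pick_shard`, `distribute`, and the use of `zlib.crc32` as the hash are assumptions chosen only for illustration.

```python
import zlib


def pick_shard(doc_id: str, num_shards: int) -> int:
    """Map a document ID to a shard number using a stable hash."""
    # crc32 is a stand-in hash (an assumption); it is used here only
    # because it gives the same result on every run.
    return zlib.crc32(doc_id.encode("utf-8")) % num_shards


def distribute(doc_ids, num_shards):
    """Group document IDs by the shard they are assigned to."""
    shards = {number: [] for number in range(num_shards)}
    for doc_id in doc_ids:
        shards[pick_shard(doc_id, num_shards)].append(doc_id)
    return shards


shards = distribute(["doc_id_1", "doc_id_6", "doc_id_8", "doc_id_12"], num_shards=3)
for number, ids in shards.items():
    print(number, ids)
```

Because the hash is stable, the same document ID always lands on the same shard, so a lookup by ID only needs to visit one shard.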
Question: How is this related to a Lucene index?
If we want to search for a specific term (for example, "Cake" or "Cookie"), we have to go over each shard and look for it (let's put aside how shards are located and replicated on each node).
Scanning every document this way would take a lot of time, so we need an efficient data structure for the search. This is where Lucene's index comes into play.
Each Elasticsearch shard is a Lucene index, which stores statistics about terms in order to make term-based search more efficient.
(!) This is quite confusing because of the word "index": an Elasticsearch shard is a portion of an Elasticsearch index, BUT it is built on the data structure of a Lucene index.
As can be seen in the example below, Lucene's index stores the original document’s content plus additional information, such as the term dictionary and term frequencies, which make searching more efficient:
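| Term      | Document           | Frequency                        |
|-----------|--------------------|----------------------------------|
| Cake      | doc_id_1, doc_id_8 | 4 (2 in doc_id_1, 2 in doc_id_8) |
| Cookie    | doc_id_1, doc_id_6 | 3 (2 in doc_id_1, 1 in doc_id_6) |
| Spaghetti | doc_id_12          | 1 (1 in doc_id_12)               |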
Lucene's index belongs to the family of indexes known as inverted indexes. This is because it can list, for a term, the documents that contain it.
This is the inverse of the natural relationship, in which documents list terms.
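The term table above can be reproduced with a few lines of code. This is only a toy sketch of the inverted-index idea, not how Lucene stores its index on disk: the document contents are hypothetical (chosen so the counts match the table), and `build_inverted_index` and `lookup` are made-up helper names.

```python
from collections import defaultdict

# Hypothetical document contents, chosen only so the counts match the table.
documents = {
    "doc_id_1": "cake cookie cake cookie",
    "doc_id_6": "cookie",
    "doc_id_8": "cake cake",
    "doc_id_12": "spaghetti",
}


def build_inverted_index(docs):
    """Return {term: {doc_id: number of occurrences in that document}}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term][doc_id] = index[term].get(doc_id, 0) + 1
    return dict(index)


index = build_inverted_index(documents)


def lookup(term):
    """List the documents containing a term without scanning any document text."""
    return sorted(index.get(term.lower(), {}))


print(lookup("Cake"))                  # ['doc_id_1', 'doc_id_8']
print(sum(index["cookie"].values()))   # 3
```

The forward relationship is `documents` (document to terms); `index` is its inverse (term to documents), which is why a term search becomes a single dictionary lookup.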
(1) A shard is a directory of files which contains documents.
(2) A document is a sequence of fields.
(3) A field is a named sequence of terms.
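The three definitions above nest inside each other, which a small data model can make visible. The class names `Field` and `Document` and the sample values are hypothetical and are not Lucene's Java classes; a shard is represented here as a plain list rather than a directory of files.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Field:
    name: str          # e.g. "title"
    terms: List[str]   # the sequence of terms in this field


@dataclass
class Document:
    fields: List[Field]   # a document is a sequence of fields


# A shard holds many documents; a list stands in for it in this sketch.
shard = [
    Document(fields=[
        Field(name="title", terms=["chocolate", "cake"]),
        Field(name="body", terms=["mix", "flour", "and", "cocoa"]),
    ]),
]


def terms_of(doc: Document) -> List[Tuple[str, str]]:
    """Flatten a document into (field name, term) pairs."""
    return [(field.name, term) for field in doc.fields for term in field.terms]


print(terms_of(shard[0]))
```

Keeping the field name next to each term matters because the same word in two different fields is treated as two different terms.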
In addition to @Vineeth Mohan's words:
High Availability: Elasticsearch is distributed, so it can manage data replication, which means keeping multiple copies of your data in the cluster. This enables high availability.
Powerful Query DSL: Elasticsearch offers a JSON interface for reading and writing queries on top of Lucene. Thanks to Elasticsearch, you can write complex queries without knowing Lucene syntax.
Schemaless (Schema-Free): Fields (name-value pairs) do not have to be defined in a schema beforehand. When you index data, Elasticsearch can create the schema automatically at runtime, like magic.
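As a rough illustration of the last two points, here is what a Query DSL request body and a schema-free document might look like when built as plain Python dictionaries. The field names `title` and `price` are hypothetical, and nothing is sent to a cluster; the sketch only constructs the JSON that would go in the body of a search or index request.

```python
import json

# A Query DSL body: documents whose "title" matches "cake" and whose
# "price" is at most 10. The field names are hypothetical.
query = {
    "query": {
        "bool": {
            "must": [{"match": {"title": "cake"}}],
            "filter": [{"range": {"price": {"lte": 10}}}],
        }
    }
}

# A document to index. No mapping is declared for it up front; with dynamic
# mapping, Elasticsearch infers field types when the document is indexed.
document = {"title": "Chocolate cake", "price": 8.5}

query_body = json.dumps(query, indent=2)
document_body = json.dumps(document)

print(query_body)
print(document_body)
```

The whole query is ordinary JSON, so any HTTP client or language can build it without touching Lucene's own query classes.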
I'll answer from a usage perspective.
Lucene is a search engine library. You'd want to use it to build your own search engine: either a new Elasticsearch or Solr competitor or something narrow for your use-case (e.g. text analysis).
Elasticsearch is a search engine. Most people use it for log aggregation, product search, or a variant of these two (e.g. social media analysis or finding relevant people for some search criteria). It's built on top of Lucene, so it exposes most (though not all) of its features. It also adds a lot on top, most significantly:
Lucene is a Java library. You can include it in your project and refer to its functions using function calls.
Elasticsearch is a JSON-based, distributed web server built over Lucene. Though it's Lucene that does the actual work beneath, Elasticsearch provides a convenient layer over it. Each shard that gets created in Elasticsearch is a separate Lucene instance. So to summarize