Apache Drill vs Spark

Submitted by 蹲街弑〆低调 on 2019-12-03 02:15:54

Here's an article I came across that discusses some of the SQL technologies: http://www.zdnet.com/article/sql-and-hadoop-its-complicated/

Drill is fundamentally different in both the user's experience and the architecture. For example:

  • Drill is a schema-free query engine. For instance, you can point it at a directory of JSON or Parquet log files (on your local box, an NFS share, S3, HDFS, MapR-FS, etc.) and run a query. You don't have to load data, create and manage schemas or pre-process the data.
  • Drill uses a JSON document model internally, which allows it to query data of any structure. A lot of modern data is complex, meaning a record can contain nested structures and arrays, and field names may actually encode values such as timestamps or web page URLs. Drill allows normal BI tools to operate seamlessly on such data without requiring the data to be flattened in advance.
  • Drill works with a variety of non-relational datastores, including Hadoop, NoSQL databases (MongoDB, HBase) and cloud storage. Additional datastores will be added.
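To make the "point it at a directory and run a query" idea concrete, here is a minimal sketch in Python that submits SQL to Drill over its REST interface. It assumes a Drill instance listening on the default web port 8047 on localhost and a hypothetical directory `/tmp/logs` of JSON files; the helper names `build_drill_request` and `run_query` are made up for this illustration.

```python
import json
import urllib.request

# Assumption: a local Drill instance exposing its REST API on the default port 8047.
DRILL_URL = "http://localhost:8047/query.json"


def build_drill_request(sql, url=DRILL_URL):
    """Build (but do not send) an HTTP POST carrying one SQL statement."""
    payload = json.dumps({"queryType": "SQL", "query": sql}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def run_query(sql, url=DRILL_URL):
    """Send the query and return the decoded JSON response (needs a running Drill)."""
    with urllib.request.urlopen(build_drill_request(sql, url)) as resp:
        return json.load(resp)


# Hypothetical path: a directory of JSON log files, queried in place through
# the `dfs` storage plugin. No schema is declared and nothing is loaded first.
LOG_QUERY = "SELECT * FROM dfs.`/tmp/logs` LIMIT 10"
request = build_drill_request(LOG_QUERY)
```

Calling `run_query(LOG_QUERY)` against a running Drill would return the result set as JSON. The query text itself is the interesting part: the table is just a backtick-quoted path.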

Drill 1.0 was just released (May 19, 2015). You can easily download it onto your laptop and play with it without any infrastructure (Hadoop, NoSQL, etc.).

Drill lets you query different kinds of datasets with ANSI SQL. This makes it great for ad hoc data exploration and for connecting BI tools to datasets via ODBC. You can even use Drill to JOIN across different kinds of datasets. For example, you could join records in a MySQL table with rows in a JSON file, a CSV file, OpenTSDB, MapR-DB, and so on. Drill can connect to many different types of data sources.
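As a rough illustration of such a cross-source join, the sketch below assembles the SQL text in Python. Everything specific is an assumption: a storage plugin named `mysql` pointing at a database `shop` with a `customers` table, a JSON file at `/data/orders.json` reachable through the `dfs` plugin, and the column names used in the query.

```python
def file_table(path, plugin="dfs"):
    """Reference a file or directory as a table; Drill quotes such paths with backticks."""
    return f"{plugin}.`{path}`"


def cross_source_join(db_table, file_path):
    """Build a join between a relational table and a file (hypothetical columns)."""
    return (
        "SELECT c.name, o.total\n"
        f"FROM {db_table} AS c\n"
        f"JOIN {file_table(file_path)} AS o\n"
        "  ON c.id = o.customer_id"
    )


# Hypothetical names: plugin `mysql`, schema `shop`, table `customers`.
sql = cross_source_join("mysql.shop.customers", "/data/orders.json")
print(sql)
```

The point is that both sides of the join are addressed the same way, as `plugin.path`, so the query does not care that one side is a database table and the other is a file.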

When I reach for Spark, it's typically for RDDs (resilient distributed datasets), which make it easy to process a lot of data quickly. Spark also has a bunch of libraries for ML and streaming. Drill isn't a general-purpose processing framework like that; it is a SQL query engine that gives you access to the data. You could use Drill to pull data into Spark, TensorFlow, PySpark, Tableau, etc.
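One way to picture the hand-off to another tool: Drill's REST interface returns a JSON document with a `columns` list and a `rows` list of objects, which can be reshaped into plain tuples before being passed on. The sketch below uses a made-up sample response rather than a live query, and the helper name `rows_as_tuples` is hypothetical.

```python
def rows_as_tuples(response):
    """Turn a Drill-style JSON result into (column names, list of row tuples)."""
    columns = response["columns"]
    # Use .get so a field missing from a schema-free record becomes None.
    rows = [tuple(row.get(col) for col in columns) for row in response["rows"]]
    return columns, rows


# Made-up sample shaped like a Drill REST response; values arrive as strings here.
sample = {
    "columns": ["name", "total"],
    "rows": [
        {"name": "alice", "total": "12.5"},
        {"name": "bob", "total": "3"},
    ],
}

columns, rows = rows_as_tuples(sample)

# With PySpark installed, these could then be handed to Spark, for example:
#   spark.createDataFrame(rows, columns)
```

For large result sets, Drill's JDBC or ODBC driver is likely a better fit than the REST interface; this only shows the division of labour, with Drill fetching the data and the other tool doing the heavy processing.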
