Is querying against a Spark DataFrame based on CSV faster than one based on Parquet?


I have to load a CSV file from HDFS into a Spark DataFrame. I was wondering if there is a "performance" improvement (query speed) for a DataFrame based on CSV compared to one based on Parquet.

1 Answer

    CSV is a row-oriented format, while Parquet is a column-oriented format.

    Row-oriented formats are typically more efficient for queries that must access most of the columns, or that read only a small fraction of the rows (for example, point lookups by key). Column-oriented formats, on the other hand, are usually more efficient for queries that read most of the rows but touch only a fraction of the columns. Analytical queries typically fall into the latter category, while transactional queries more often fall into the former.
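
    To make this concrete in Spark, here is a minimal PySpark sketch (the HDFS paths and the "price" column are hypothetical). With Parquet, Spark's column pruning means only the selected column is read from disk; with CSV, every row must be read and parsed in full:

        from pyspark.sql import SparkSession

        spark = SparkSession.builder.appName("csv-vs-parquet").getOrCreate()

        # Parquet: column pruning lets Spark scan only the 'price' column chunks.
        parquet_df = spark.read.parquet("hdfs:///data/sales.parquet")
        parquet_df.select("price").agg({"price": "avg"}).show()

        # CSV: every byte of every row is read and parsed before the unused
        # columns can be discarded (inferSchema also costs an extra pass).
        csv_df = spark.read.csv("hdfs:///data/sales.csv", header=True, inferSchema=True)
        csv_df.select("price").agg({"price": "avg"}).show()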

    Additionally, CSV is a text-based format, which cannot be parsed as efficiently as a binary format; this makes CSV even slower. A typical column-oriented format, on the other hand, is not only binary but also allows more efficient compression, which leads to smaller disk usage and faster reads. I recommend reading the Introduction section of The Design and Implementation of Modern Column-Oriented Database Systems.
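
    If the data arrives as CSV, a common pattern is a one-time conversion so that all subsequent queries hit the binary, compressed format. A sketch, reusing the hypothetical csv_df from above (snappy is Spark's default Parquet codec):

        # One-time conversion: pay the CSV parsing cost once, then query Parquet.
        (csv_df.write
               .mode("overwrite")
               .option("compression", "snappy")
               .parquet("hdfs:///data/sales.parquet"))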

    Since the Hadoop ecosystem is geared toward analytical workloads, Parquet is generally a better choice for performance than CSV for Spark and other Hadoop applications.
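
    A rough way to see the difference yourself is to time the same aggregation against both sources (wall-clock timing of a single action, not a rigorous benchmark; assumes the DataFrames from the sketches above):

        import time

        def timed(label, df):
            # collect() forces full evaluation of the lazy query plan.
            start = time.time()
            df.selectExpr("avg(price)").collect()
            print(f"{label}: {time.time() - start:.2f}s")

        timed("csv", csv_df)
        timed("parquet", parquet_df)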
