I have to load a CSV file from HDFS into a DataFrame using Spark. I was wondering if there is a "performance" improvement (query speed) for a DataFrame backed by a Parquet file versus one backed by a CSV file.
CSV is a row-oriented format, while Parquet is a column-oriented format.
Typically, row-oriented formats are more efficient for queries that either must access most of the columns or only read a fraction of the rows. Column-oriented formats, on the other hand, are usually more efficient for queries that need to read most of the rows but only have to access a fraction of the columns. Analytical queries typically fall into the latter category, while transactional queries more often fall into the former.
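To make the distinction concrete, here is a minimal Spark sketch; the `events` DataFrame, its columns, and the HDFS path are all hypothetical:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{avg, col}

val spark = SparkSession.builder().appName("access-patterns").getOrCreate()

// Hypothetical wide dataset with many columns.
val events = spark.read.parquet("hdfs:///data/events")

// Analytical pattern: scans most rows but touches a single column.
// A columnar format can read just the "amount" column and skip the rest.
events.agg(avg(col("amount"))).show()

// Transactional pattern: touches all columns of a handful of rows.
// A row-oriented format can fetch each matching row in one contiguous read.
events.filter(col("user_id") === 42).show()
```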
Additionally, CSV is a text-based format, which cannot be parsed as efficiently as a binary format; this makes CSV even slower. A typical column-oriented format, on the other hand, is not only binary but also allows more efficient compression, which leads to smaller disk usage and faster access. I recommend reading the Introduction section of The Design and Implementation of Modern Column-Oriented Database Systems.
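As a rough way to see the compression effect yourself, you can write the same DataFrame in both formats and compare the on-disk footprint. A sketch, with made-up paths:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("size-comparison").getOrCreate()
val df = spark.read.option("header", "true").csv("hdfs:///data/input.csv")

// Same data, two formats: plain-text CSV vs. snappy-compressed binary Parquet.
df.write.mode("overwrite").option("header", "true").csv("hdfs:///tmp/as_csv")
df.write.mode("overwrite").option("compression", "snappy").parquet("hdfs:///tmp/as_parquet")

// Then compare sizes from the shell:
//   hdfs dfs -du -s -h /tmp/as_csv /tmp/as_parquet
```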
Since the Hadoop ecosystem is geared toward analytical queries, Parquet is generally a better performance choice than CSV for Hadoop applications.
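A common pattern that follows from this, sketched below with hypothetical paths: pay the CSV parsing cost once, persist the data as Parquet, and run all subsequent queries against the Parquet copy.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("csv-to-parquet").getOrCreate()

// One-off conversion: parse the CSV once (inferSchema does an extra pass,
// which is acceptable for a one-time job) and persist as Parquet.
spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .csv("hdfs:///data/input.csv")
  .write.mode("overwrite")
  .parquet("hdfs:///data/input_parquet")

// All later queries hit the binary, compressed, column-prunable Parquet copy.
val df = spark.read.parquet("hdfs:///data/input_parquet")
df.createOrReplaceTempView("input")
spark.sql("SELECT COUNT(*) FROM input").show()
```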