I have to load a CSV file from HDFS into a DataFrame using Spark. I was wondering if there is a "performance" improvement (query speed) for a DataFrame backed by a Parquet file versus one backed by a CSV file.
CSV is a row-oriented format, while Parquet is a column-oriented format.
Typically, row-oriented formats are more efficient for queries that either must access most of the columns or only read a fraction of the rows. Column-oriented formats, on the other hand, are usually more efficient for queries that need to read most of the rows but only have to access a fraction of the columns. Analytical queries typically fall into the latter category, while transactional queries more often fall into the former.
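To make the distinction concrete, here is a minimal Spark sketch; the `events` DataFrame, its columns, and the HDFS path are all hypothetical:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{avg, col}

val spark = SparkSession.builder().appName("access-patterns").getOrCreate()

// Hypothetical wide dataset with many columns.
val events = spark.read.parquet("hdfs:///data/events")

// Analytical pattern: scans most rows but touches a single column.
// A columnar format can read just the "amount" column and skip the rest.
events.agg(avg(col("amount"))).show()

// Transactional pattern: touches all columns of a handful of rows.
// A row-oriented format can fetch each matching row in one contiguous read.
events.filter(col("user_id") === 42).show()
```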
Additionally, CSV is a text-based format, which cannot be parsed as efficiently as a binary format; this makes CSV even slower. A typical column-oriented format, on the other hand, is not only binary but also allows more efficient compression, which leads to smaller disk usage and faster access. I recommend reading the Introduction section of The Design and Implementation of Modern Column-Oriented Database Systems.
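As a rough way to see the compression effect yourself, you can write the same DataFrame in both formats and compare the on-disk footprint. A sketch, with made-up paths:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("size-comparison").getOrCreate()
val df = spark.read.option("header", "true").csv("hdfs:///data/input.csv")

// Same data, two formats: plain-text CSV vs. snappy-compressed binary Parquet.
df.write.mode("overwrite").option("header", "true").csv("hdfs:///tmp/as_csv")
df.write.mode("overwrite").option("compression", "snappy").parquet("hdfs:///tmp/as_parquet")

// Then compare sizes from the shell:
//   hdfs dfs -du -s -h /tmp/as_csv /tmp/as_parquet
```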
Since the Hadoop ecosystem is geared toward analytical queries, Parquet is generally a better performance choice than CSV for Hadoop applications.
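A common pattern that follows from this, sketched below with hypothetical paths: pay the CSV parsing cost once, persist the data as Parquet, and run all subsequent queries against the Parquet copy.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("csv-to-parquet").getOrCreate()

// One-off conversion: parse the CSV once (inferSchema does an extra pass,
// which is acceptable for a one-time job) and persist as Parquet.
spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .csv("hdfs:///data/input.csv")
  .write.mode("overwrite")
  .parquet("hdfs:///data/input_parquet")

// All later queries hit the binary, compressed, column-prunable Parquet copy.
val df = spark.read.parquet("hdfs:///data/input_parquet")
df.createOrReplaceTempView("input")
spark.sql("SELECT COUNT(*) FROM input").show()
```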