I have an EMR cluster running Spark. In the first step, the CSV files are transformed into parquet.snappy format, partitioned by the date column, so I am left with
date