I have been using Dask + Pandas + PyArrow + HDFS + Parquet for a while in a project that stores tweets in Parquet files and then loads them as Dask/Pandas dataframes to perform so