df1 - large dataset
df2 = df1.sample(tiny_fraction)
df1 is written to disk as a parquet with snappy compression (~75GB)
df2 is written to disk as a parquet with sna