Can I read multiple files into a Spark Dataframe from S3, passing over nonexistent ones?

忘掉有多难 2020-12-08 12:46

I would like to read multiple parquet files into a dataframe from S3. Currently, I'm using the following method to do this:

files = ['s3a://dev/2017/01/03/
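
The snippet above is cut off here. As a rough sketch of the kind of setup being described (bucket name and dates are hypothetical, not the asker's actual paths), an explicit list of prefixes is passed to spark.read.parquet, which raises an error if any path is missing; one possible workaround is to test each prefix for existence first, using the internal spark._jvm / spark._jsc handles to reach the Hadoop FileSystem API:

    # Sketch only: bucket name and dates are hypothetical.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("read-existing-only").getOrCreate()

    candidate_paths = [
        "s3a://dev/2017/01/01/",
        "s3a://dev/2017/01/02/",
        "s3a://dev/2017/01/03/",   # some of these prefixes may not exist
    ]

    # Passing a missing path straight to spark.read.parquet(*candidate_paths)
    # fails with "Path does not exist", so filter the list first.
    jvm = spark._jvm
    conf = spark._jsc.hadoopConfiguration()

    def path_exists(path):
        # Resolve the filesystem for this URI and check the prefix exists.
        fs = jvm.org.apache.hadoop.fs.FileSystem.get(
            jvm.java.net.URI.create(path), conf)
        return fs.exists(jvm.org.apache.hadoop.fs.Path(path))

    existing = [p for p in candidate_paths if path_exists(p)]

    # Read only the paths that are actually present.
    df = spark.read.parquet(*existing)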
3 Answers
  •  有刺的猬
    2020-12-08 13:20

    Can I observe that, since glob-pattern matching involves a full recursive tree walk and pattern match over the paths, it is an absolute performance killer against object stores, especially S3. There's a special shortcut in Spark that recognises when your path has no glob characters in it, in which case it makes a more efficient choice.
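
    For example (hypothetical bucket), the first read below forces a glob expansion, i.e. a listing and pattern match over everything under the prefix, while the second passes plain paths that Spark can open directly:

        # Glob form: Spark must list and pattern-match the whole tree under s3a://dev/2017/
        df_glob = spark.read.parquet("s3a://dev/2017/*/*/")

        # Non-glob form: explicit paths, no tree walk or pattern matching needed
        df_plain = spark.read.parquet(
            "s3a://dev/2017/01/03/",
            "s3a://dev/2017/01/04/",
        )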

    Similarly, a very deep partitioning tree, as in that year/month/day layout, means many directories to scan, at a cost of hundreds of milliseconds (or worse) per directory.

    The layout suggested by Mariusz should be much more efficient, as it is a flatter directory tree; switching to it should have a bigger impact on performance on object stores than on real filesystems.
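
    Mariusz's layout isn't reproduced here, but as a generic illustration of a flatter tree (hypothetical paths and a hypothetical "date" partition column), a single partition level replaces the nested year/month/day one:

        # Write with one partition column instead of three nested levels.
        df.write.partitionBy("date").parquet("s3a://dev/events/")

        # Read back: one directory level to list, and the filter prunes partitions.
        events = (spark.read.parquet("s3a://dev/events/")
                  .where("date BETWEEN '2017-01-01' AND '2017-01-07'"))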
