Reading parquet files from multiple directories in PySpark

北恋 2020-12-03 15:21

I need to read parquet files from multiple paths that are not parent or child directories.

For example:

dir1 ---
       |
       ------- dir1_1
       |
       ------- dir1_2
dir2 ---
       |
       ------- dir2_1

I want to read the parquet files from, say, dir1_2 and dir2_1 in a single read, without loading each directory separately and merging the resulting DataFrames.
5 Answers
  • 2020-12-03 15:36

    Both the parquetFile method of SQLContext and the parquet method of DataFrameReader take multiple paths. So either of these works:

    df = sqlContext.parquetFile('/dir1/dir1_2', '/dir2/dir2_1')
    

    or

    df = sqlContext.read.parquet('/dir1/dir1_2', '/dir2/dir2_1')
    
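    Note that sqlContext.parquetFile was deprecated in Spark 1.4 in favor of the DataFrameReader form, and in Spark 2.x the SparkSession entry point supersedes SQLContext. A minimal sketch of the modern equivalent, using the same example directories (the app name is just a placeholder):

    from pyspark.sql import SparkSession

    # Get or create a SparkSession; the app name is a hypothetical placeholder.
    spark = SparkSession.builder.appName("multi_dir_parquet").getOrCreate()

    # read.parquet accepts any number of paths and reads them into one DataFrame.
    df = spark.read.parquet('/dir1/dir1_2', '/dir2/dir2_1')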
  • 2020-12-03 15:41

    A little late but I found this while I was searching and it may help someone else...

    You might also try unpacking the argument list to spark.read.parquet()

    paths = ['foo', 'bar']
    df = spark.read.parquet(*paths)
    

    This is convenient if you want to pass a few globs in the path argument:

    # basePath anchors partition discovery: everything below it that matches
    # the globs is read, and the partition columns are inferred from the paths.
    basePath = 's3://bucket/'
    paths = ['s3://bucket/partition_value1=*/partition_value2=2017-04-*',
             's3://bucket/partition_value1=*/partition_value2=2017-05-*']
    df = spark.read.option("basePath", basePath).parquet(*paths)
    

    This is cool because you don't need to list all the files under the basePath, and you still get partition inference.

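    With the basePath option set as above, partition_value1 and partition_value2 are recovered from the directory names and show up as regular columns, which you can verify (a small sketch; the inferred column types depend on the values and your Spark version):

    # Partition columns are inferred from the path segments under basePath.
    df.printSchema()
    df.select("partition_value1", "partition_value2").distinct().show()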
  • 2020-12-03 15:42

    If you already have a list of files, you can do:

    files = ['file1', 'file2',...]
    df = spark.read.parquet(*files)
    
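    If you still need to build that list, and the files live on a local filesystem, Python's standard glob module is one way to do it (a sketch with a made-up path pattern; for files on HDFS, see the walk-based answer below):

    import glob

    # Collect the matching parquet files; the pattern is a hypothetical example.
    files = sorted(glob.glob('/data/output/part-*.parquet'))
    df = spark.read.parquet(*files)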
  • 2020-12-03 15:54

    Just taking John Conley's answer, embellishing it a bit, and providing the full code (used in Jupyter PySpark), as I found his answer extremely useful.

    from hdfs import InsecureClient
    # 50070 is the default WebHDFS port of the NameNode in Hadoop 2.x.
    client = InsecureClient('http://localhost:50070')
    
    import posixpath as psp
    # Walk the HDFS tree and build fully qualified paths for Spark;
    # 9000 here is the NameNode RPC port from fs.defaultFS.
    fpaths = [
      psp.join("hdfs://localhost:9000" + dpath, fname)
      for dpath, _, fnames in client.walk('/eta/myHdfsPath')
      for fname in fnames
    ]
    # At this point fpaths contains all HDFS files
    
    parquetFile = sqlContext.read.parquet(*fpaths)
    
    
    # toPandas() collects the whole DataFrame to the driver as a pandas
    # DataFrame, so it needs pandas installed and enough driver memory.
    import pandas
    pdf = parquetFile.toPandas()
    # In Jupyter, a bare expression displays the contents nicely formatted.
    pdf
    
  • 2020-12-03 15:56

    For ORC

    spark.read.orc("/dir1/*","/dir2/*")
    

    Spark goes inside the dir1/ and dir2/ folders and loads all the ORC files.

    For Parquet,

    spark.read.parquet("/dir1/*","/dir2/*")
    
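    In Spark 3.0 and later there is also a recursiveFileLookup reader option that descends into nested subdirectories, which can replace the globs when the directory depth varies (a sketch; note that enabling it disables partition inference):

    df = spark.read.option("recursiveFileLookup", "true").parquet("/dir1", "/dir2")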