I need to read parquet files from multiple paths that are not parent or child directories.
for example,
dir1 ---
       |
       ------- dir1_1
       |
       ------- dir1_2
dir2 ---
       |
       ------- dir2_1
Both the parquetFile method of SQLContext and the parquet method of DataFrameReader accept multiple paths, so either of these works (note that SQLContext.parquetFile was deprecated in Spark 1.4 in favor of the DataFrameReader API):
df = sqlContext.parquetFile('/dir1/dir1_2', '/dir2/dir2_1')
or
df = sqlContext.read.parquet('/dir1/dir1_2', '/dir2/dir2_1')
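A related hedged sketch: on current PySpark versions the generic DataFrameReader.load also accepts a Python list of paths directly, so no unpacking is needed:

# Sketch: load() takes a list of paths as well as a single string
df = spark.read.format('parquet').load(['/dir1/dir1_2', '/dir2/dir2_1'])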
A little late, but I found this while I was searching and it may help someone else...
You might also try unpacking the argument list when calling spark.read.parquet():
paths = ['foo', 'bar']
df = spark.read.parquet(*paths)
This is convenient if you want to pass a few globs into the path argument:
basePath = 's3://bucket/'
paths = ['s3://bucket/partition_value1=*/partition_value2=2017-04-*',
         's3://bucket/partition_value1=*/partition_value2=2017-05-*']
df = spark.read.option("basePath", basePath).parquet(*paths)
This is cool because you don't need to list all the files under basePath, and you still get partition inference.
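To see the inference at work, a minimal sketch continuing from the read above:

# The partition directories Spark matched become ordinary, queryable columns
df.printSchema()  # now includes partition_value1 and partition_value2
df.select("partition_value1", "partition_value2").distinct().show()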
In case you have a list of files, you can do:
files = ['file1', 'file2', ...]
df = spark.read.parquet(*files)
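One way to build such a list, sketched under the assumption that the files live on the local filesystem (the directory and pattern below are hypothetical); for HDFS, see the walk-based answer below:

import glob

# Hypothetical local layout; glob expands the pattern into concrete file paths
files = sorted(glob.glob('/data/output/part-*.parquet'))
df = spark.read.parquet(*files)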
Just taking John Conley's answer and embellishing it a bit, providing the full code (as used in Jupyter PySpark), since I found his answer extremely useful.
from hdfs import InsecureClient

# Connect to the namenode's WebHDFS endpoint
client = InsecureClient('http://localhost:50070')

import posixpath as psp

# Recursively walk the directory tree and build fully qualified hdfs:// file paths
fpaths = [
    psp.join("hdfs://localhost:9000" + dpath, fname)
    for dpath, _, fnames in client.walk('/eta/myHdfsPath')
    for fname in fnames
]

# At this point fpaths contains all the HDFS files under the directory
parquetFile = sqlContext.read.parquet(*fpaths)

import pandas  # pandas must be installed for toPandas()
pdf = parquetFile.toPandas()
# display the contents nicely formatted
pdf
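A hedged alternative sketch to the hdfs package: Spark already bundles the Hadoop FileSystem API, reachable through its (internal) JVM gateway, so the same recursive listing can be done with no extra dependency. The namenode host/port and directory are the same assumptions as above.

# Same assumptions as above (namenode at localhost:9000, files under /eta/myHdfsPath)
# listFiles(path, True) iterates the directory tree recursively
hadoop = spark._jvm.org.apache.hadoop
fs = hadoop.fs.FileSystem.get(spark._jsc.hadoopConfiguration())
it = fs.listFiles(hadoop.fs.Path("hdfs://localhost:9000/eta/myHdfsPath"), True)
fpaths = []
while it.hasNext():
    fpaths.append(str(it.next().getPath()))
df = spark.read.parquet(*fpaths)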
For ORC:
spark.read.orc("/dir1/*", "/dir2/*")
Spark goes inside the dir1/ and dir2/ folders and loads all the ORC files.
For Parquet:
spark.read.parquet("/dir1/*", "/dir2/*")