I have many gzipped files stored on S3, organized by project and by hour of day. The file paths follow a pattern like:
s3:///proj
The underlying Hadoop API that Spark uses to access S3 allows you to specify input files using a glob expression.
From the Spark docs:
All of Spark's file-based input methods, including `textFile`, support running on directories, compressed files, and wildcards as well. For example, you can use `textFile("/my/directory")`, `textFile("/my/directory/*.txt")`, and `textFile("/my/directory/*.gz")`.
So in your case you should be able to open all those files as a single RDD using something like this:
rdd = sc.textFile("s3://bucket/project1/20141201/logtype1/logtype1.*.gz")
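If you need to read several log types (or days) in one job, you can build the glob strings programmatically and join them with commas before handing them to `sc.textFile`. This is a small sketch; `build_input_path` is a hypothetical helper, and the bucket, project, and log-type names are made up for illustration:

```python
def build_input_path(bucket, project, day, logtypes):
    # Build one glob per log type, matching the layout
    # s3://<bucket>/<project>/<day>/<logtype>/<logtype>.*.gz
    globs = [
        "s3://{0}/{1}/{2}/{3}/{3}.*.gz".format(bucket, project, day, lt)
        for lt in logtypes
    ]
    # sc.textFile accepts a comma-delimited list of paths/globs
    return ",".join(globs)

path = build_input_path("bucket", "project1", "20141201",
                        ["logtype1", "logtype2"])
# path can then be passed straight to sc.textFile(path)
```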
Just for the record, you can also specify files using a comma-delimited list, and you can even mix that with the `*` and `?` wildcards.
For example:
rdd = sc.textFile("s3://bucket/201412??/*/*.gz,s3://bucket/random-file.txt")
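If you want to convince yourself of what `?` matches here, Python's standard-library `fnmatch` follows essentially the same rules for a single path segment (a standalone illustration, not part of the Spark job; Hadoop's glob matcher additionally treats `/` as a separator and supports `{a,b}` alternation):

```python
from fnmatch import fnmatch

# '?' matches exactly one character, so '201412??' matches
# eight-character day strings beginning with '201412'.
days = ["20141201", "20141202", "20141231", "20150101", "2014120"]
december_2014 = [d for d in days if fnmatch(d, "201412??")]
# december_2014 == ["20141201", "20141202", "20141231"]
```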
Briefly, what this does is:
- `*` matches all strings, so in this case all `gz` files in all folders under `201412??` will be loaded.
- `?` matches a single character, so `201412??` will cover all days in December 2014 like `20141201`, `20141202`, and so forth.
- `,` lets you load separate files at once into the same RDD, like the `random-file.txt`
in this case.

Some notes about the appropriate URL scheme for S3 paths:
`s3a://` is the way to go. You should only use `s3n://` if you're running Spark on Hadoop 2.6 or older.
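For completeness, here is a minimal sketch of wiring up the `s3a://` connector when submitting a job, assuming a Spark build against Hadoop 2.7+. The `hadoop-aws` version must match your Hadoop version, and the credential values and job filename are placeholders:

```shell
# The spark.hadoop.* prefix copies properties into the Hadoop
# configuration, where the s3a connector reads its credentials.
spark-submit \
  --packages org.apache.hadoop:hadoop-aws:2.7.7 \
  --conf spark.hadoop.fs.s3a.access.key=YOUR_ACCESS_KEY \
  --conf spark.hadoop.fs.s3a.secret.key=YOUR_SECRET_KEY \
  your_job.py
```

In practice you may prefer instance profiles or environment variables over putting keys on the command line; the s3a connector picks those up through its default credential-provider chain.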