I have many gzipped files stored on S3, organized by project and by hour per day. The file paths follow this pattern:
s3:///proj
Note: Under Spark 1.2, the proper format would be as follows:

val rdd = sc.textFile("s3n://<bucket>/<foo>/bar.*.gz")

That's s3n://, not s3://. You'll also want to put your credentials in conf/spark-env.sh as AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY.
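As an alternative to spark-env.sh, the same credentials can be set on the Hadoop configuration at runtime. Here is a minimal PySpark sketch, assuming a live SparkContext named sc and using the internal _jsc handle (a common if unofficial idiom); the fs.s3n.* property names are the standard Hadoop ones, and the values are placeholders:

# Sketch: set the s3n credentials programmatically instead of via spark-env.sh.
hadoop_conf = sc._jsc.hadoopConfiguration()
hadoop_conf.set("fs.s3n.awsAccessKeyId", "YOUR_ACCESS_KEY_ID")
hadoop_conf.set("fs.s3n.awsSecretAccessKey", "YOUR_SECRET_ACCESS_KEY")
rdd = sc.textFile("s3n://<bucket>/<foo>/bar.*.gz")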
The underlying Hadoop API that Spark uses to access S3 allows you to specify input files using a glob expression.
From the Spark docs:

All of Spark’s file-based input methods, including textFile, support running on directories, compressed files, and wildcards as well. For example, you can use textFile("/my/directory"), textFile("/my/directory/*.txt"), and textFile("/my/directory/*.gz").
So in your case you should be able to open all those files as a single RDD using something like this:
rdd = sc.textFile("s3://bucket/project1/20141201/logtype1/logtype1.*.gz")
Just for the record, you can also specify files using a comma-delimited list, and you can even mix that with the * and ? wildcards. For example:
rdd = sc.textFile("s3://bucket/201412??/*/*.gz,s3://bucket/random-file.txt")
Briefly, what this does is:

- * matches all strings, so in this case all gz files in all folders under 201412?? will be loaded.
- ? matches a single character, so 201412?? will cover all days in December 2014 like 20141201, 20141202, and so forth.
- , lets you just load separate files at once into the same RDD, like the random-file.txt in this case.
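If you want to verify exactly which objects a glob actually matched, one quick (if heavyweight) check is wholeTextFiles, which returns (path, content) pairs; this is a sketch using the hypothetical bucket from the example above:

# Sketch: the keys of the pair RDD are the full paths of the matched files.
# Note this reads the file contents too, so use it only as a quick sanity check.
matched_paths = sc.wholeTextFiles("s3://bucket/201412??/*/*.gz").keys().collect()
for path in matched_paths:
    print(path)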
Some notes about the appropriate URL scheme for S3 paths: s3a:// is the way to go, as it supersedes s3n://. You should only use s3n:// if you're running Spark on Hadoop 2.6 or older.
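Note that s3a:// uses different Hadoop credential properties than s3n://. A minimal PySpark sketch, again assuming a live SparkContext sc and placeholder key values:

# Sketch: s3a credentials use the fs.s3a.* property names, not fs.s3n.*.
hadoop_conf = sc._jsc.hadoopConfiguration()
hadoop_conf.set("fs.s3a.access.key", "YOUR_ACCESS_KEY_ID")
hadoop_conf.set("fs.s3a.secret.key", "YOUR_SECRET_ACCESS_KEY")
rdd = sc.textFile("s3a://bucket/project1/20141201/logtype1/logtype1.*.gz")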
Using AWS EMR with Spark 2.0.0 and SparkR in RStudio, I've managed to read the gz-compressed Wikipedia stat files stored in S3 using the command below:

df <- read.text("s3://<bucket>/pagecounts-20110101-000000.gz")
Similarly, for all files under Jan 2011, you can use the same command with a glob:
df <- read.text("s3://<bucket>/pagecounts-201101??-*.gz")
See the SparkR API docs for more ways of doing it: https://spark.apache.org/docs/latest/api/R/read.text.html