Question
I have a list of S3 buckets partitioned by date: the first bucket is titled 2019-12-1, the second 2019-12-2, and so on.
Each of these buckets stores parquet files that I am reading into a pyspark dataframe. The pyspark dataframe generated from each of these buckets has the exact same schema. What I would like to do is iterate over these buckets, and store all of these parquet files into a single pyspark dataframe that has a date column denoting what bucket each entry in the dataframe actually came from.
Because the schema of the dataframe generated when importing each bucket separately is many layers deep (each row contains structs of arrays of structs, etc.), I imagine the only way to combine all the buckets into one dataframe is to have a dataframe with a single 'dates' column, where each row of the 'dates' column would hold the contents of the corresponding S3 bucket for that date.
I can read all the dates with this line:
df = spark.read.parquet("s3://my_bucket/*")
I've seen someone achieve what I'm describing by appending a 'withColumn' call to this line to create a 'dates' column, but I can't remember how.
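For reference, the explicit loop-and-union version of what I'm describing would look roughly like this (the bucket names are just placeholders):

from functools import reduce
from pyspark.sql import SparkSession
from pyspark.sql.functions import lit

spark = SparkSession.builder.getOrCreate()

# Placeholder bucket names; each bucket name doubles as the date label
buckets = ["2019-12-1", "2019-12-2", "2019-12-3"]

# Read each bucket on its own, tag it with its date, and union everything
frames = [spark.read.parquet(f"s3://{b}").withColumn("dates", lit(b)) for b in buckets]
df = reduce(lambda left, right: left.unionByName(right), frames)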
Answer 1:
Using input_file_name(), you can extract the S3 bucket name from each file's path:
df.withColumn("dates", split(regexp_replace(input_file_name(), "s3://", ""), "/").getItem(0))\
.show()
We strip the s3:// prefix, split the remaining path on /, and take the first element, which is the bucket name; for a file at s3://2019-12-1/part-00000.parquet, for example, this yields 2019-12-1.
This can also be done with the regex s3:\/\/(.+?)\/(.+), whose first capture group is the bucket name:
df.withColumn("dates", regexp_extract(input_file_name(), "s3:\/\/(.+?)\/(.+)", 1)).show()
Source: https://stackoverflow.com/questions/59349181/import-pyspark-dataframe-from-multiple-s3-buckets-with-a-column-denoting-which