Import pyspark dataframe from multiple S3 buckets, with a column denoting which bucket the entry came from

Submitted by *爱你&永不变心* on 2020-01-06 05:23:07

Question


I have a list of S3 buckets partitioned by date: the first bucket is titled 2019-12-1, the second 2019-12-2, and so on.

Each of these buckets stores parquet files that I am reading into a pyspark dataframe. The pyspark dataframe generated from each of these buckets has the exact same schema. What I would like to do is iterate over these buckets, and store all of these parquet files into a single pyspark dataframe that has a date column denoting what bucket each entry in the dataframe actually came from.

Because the schema of the dataframe generated when importing each bucket separately is many layers deep (i.e. each row contains structs of arrays of structs, etc.), I imagine the only way to combine all the buckets into one dataframe is to have a dataframe with a single 'dates' column, where each row of the 'dates' column holds the contents of the corresponding S3 bucket for that date.

I can read all the dates with this line:

df = spark.read.parquet("s3://my_bucket/*")

I've seen someone achieve what I'm describing by appending a 'withColumn' call to this line making a 'dates' column, but I can't remember how.
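For concreteness, here is a rough sketch of the iterate-and-union idea described above; the bucket names are illustrative and the nested schema is assumed to be identical across buckets:

from functools import reduce
from pyspark.sql import SparkSession
from pyspark.sql.functions import lit

spark = SparkSession.builder.getOrCreate()

# Illustrative list of date-named buckets.
dates = ["2019-12-1", "2019-12-2", "2019-12-3"]

# Read each bucket separately, tag its rows with the date, then union everything.
frames = [
    spark.read.parquet(f"s3://{d}/").withColumn("dates", lit(d))
    for d in dates
]
df = reduce(lambda a, b: a.unionByName(b), frames)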


Answer 1:


Using input_file_name(), you can extract the S3 bucket name from the file path:

df.withColumn("dates", split(regexp_replace(input_file_name(), "s3://", ""), "/").getItem(0))\
  .show()

We strip the s3:// prefix, split the remaining path on /, and take the first element, which corresponds to the bucket name.

This can also be done with the regex s3:\/\/(.+?)\/(.+), where the first capturing group is the bucket name:

df.withColumn("dates", regexp_extract(input_file_name(), "s3:\/\/(.+?)\/(.+)", 1)).show()


Source: https://stackoverflow.com/questions/59349181/import-pyspark-dataframe-from-multiple-s3-buckets-with-a-column-denoting-which
