Combining csv files with mismatched columns

Submitted by 有些话、适合烂在心里 on 2020-01-13 06:30:08

Question


I need to combine multiple csv files into one object (a dataframe, I assume) but they all have mismatched columns, like so:

CSV A

store_location_key | product_key | collector_key | trans_dt | sales | units | trans_key

CSV B

collector_key | trans_dt | store_location_key | product_key | sales | units | trans_key

CSV C

collector_key | trans_dt | store_location_key |product_key | sales | units | trans_id

On top of that, I need these to match with two additional csv files that have a matching column:

Location CSV

store_location_key | region | province | city | postal_code | banner | store_num

Product CSV

product_key | sku | item_name | item_description | department | category

The data types are all consistent, i.e., the sales column is always float, store_location_key is always int, etc. Even if I convert each csv to a dataframe first, I'm not sure that a join would work (except for the last two) because of the way that the columns need to match up.


Answer 1:


To merge the first three CSV files, first read them separately as DataFrames and then use union. The order and number of columns matter when using union, so first add any missing columns to each DataFrame, then use select to put the columns in the same order.

all_columns = ['collector_key', 'trans_dt', 'store_location_key', 'product_key', 'sales', 'units', 'trans_key', 'trans_id']

from pyspark.sql.functions import lit

dfA = (spark.read.csv("a.csv", header=True)
  .withColumn("trans_id", lit(None))
  .select(all_columns))
dfB = (spark.read.csv("b.csv", header=True)
  .withColumn("trans_id", lit(None))
  .select(all_columns))
dfC = (spark.read.csv("c.csv", header=True)
  .withColumn("trans_key", lit(None))
  .select(all_columns))

df = dfA.union(dfB).union(dfC)

Note: If the order and number of columns were the same across the CSV files, they could be combined with a single spark.read call. In Spark 3.1+, unionByName(other, allowMissingColumns=True) can also fill in missing columns by name automatically.
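The align-then-union step itself doesn't depend on Spark. As a quick sanity check, the same idea can be sketched with the standard library's csv module; the file contents are inlined as strings here, since a.csv and c.csv are just stand-ins:

```python
import csv
import io

all_columns = ['collector_key', 'trans_dt', 'store_location_key',
               'product_key', 'sales', 'units', 'trans_key', 'trans_id']

def read_aligned(text, columns):
    """Read CSV text and pad each row with None for any missing columns."""
    rows = []
    for rec in csv.DictReader(io.StringIO(text)):
        rows.append({col: rec.get(col) for col in columns})
    return rows

# Minimal stand-ins: "A" has trans_key, "C" has trans_id instead.
csv_a = "store_location_key,product_key,sales,units,trans_key\n1,10,9.99,2,TK1\n"
csv_c = "store_location_key,product_key,sales,units,trans_id\n2,20,4.50,1,TI7\n"

combined = read_aligned(csv_a, all_columns) + read_aligned(csv_c, all_columns)
```

Every row in `combined` now has the same keys in the same order, which is exactly what the select(all_columns) calls guarantee before the Spark union.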

After merging the first three CSVs, the rest is easy. Both the location and the product CSV can be combined with the rest using join.

df_location = spark.read.csv("location.csv", header=True)
df_product = spark.read.csv("product.csv", header=True)

# Joining on the column name (rather than an equality expression)
# keeps a single copy of the key column in the result.
df2 = df.join(df_location, on="store_location_key")
df3 = df2.join(df_product, on="product_key")
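For intuition, the join step is just a key lookup. A minimal dictionary-based sketch of what an inner join on store_location_key does, using made-up rows (the field values are hypothetical):

```python
# Hypothetical rows standing in for the combined transactions and location.csv.
transactions = [
    {"store_location_key": 1, "product_key": 10, "sales": 9.99},
    {"store_location_key": 2, "product_key": 20, "sales": 4.50},
]
locations = [
    {"store_location_key": 1, "city": "Toronto"},
    {"store_location_key": 2, "city": "Montreal"},
]

# Index the smaller table by the join key, then enrich each transaction row.
loc_by_key = {row["store_location_key"]: row for row in locations}
joined = [{**t, **loc_by_key[t["store_location_key"]]} for t in transactions]
```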


来源:https://stackoverflow.com/questions/48999381/combining-csv-files-with-mismatched-columns
