Identify Partition Key Column from a table using PySpark

Submitted by 此生再无相见时 on 2020-02-05 03:46:09

Question


I need help finding the unique partition column names for a Hive table using PySpark. The table might have multiple partition columns, and preferably the output should be a list of the partition columns for the Hive table.

It would be great if the result also included the datatypes of the partition columns.

Any suggestions will be helpful.


Answer 1:


It can be done using desc as shown below:

df = spark.sql("desc test_dev_db.partition_date_table")
>>> df.show(truncate=False)
+-----------------------+---------+-------+
|col_name               |data_type|comment|
+-----------------------+---------+-------+
|emp_id                 |int      |null   |
|emp_name               |string   |null   |
|emp_salary             |int      |null   |
|emp_date               |date     |null   |
|year                   |string   |null   |
|month                  |string   |null   |
|day                    |string   |null   |
|# Partition Information|         |       |
|# col_name             |data_type|comment|
|year                   |string   |null   |
|month                  |string   |null   |
|day                    |string   |null   |
+-----------------------+---------+-------+

Since this table is partitioned, you can see the partition column information, along with the datatypes, in the output above.

It seems you are interested in just the partition column names and their respective data types, so let's create a list of tuples.

partition_list = df.select(df.col_name, df.data_type).rdd.map(lambda x: (x[0], x[1])).collect()

>>> print(partition_list)
[(u'emp_id', u'int'), (u'emp_name', u'string'), (u'emp_salary', u'int'), (u'emp_date', u'date'), (u'year', u'string'), (u'month', u'string'), (u'day', u'string'), (u'# Partition Information', u''), (u'# col_name', u'data_type'), (u'year', u'string'), (u'month', u'string'), (u'day', u'string')]

partition_details = [partition_list[index + 1:] for index, item in enumerate(partition_list) if item[0] == '# col_name']

>>> print(partition_details)
[[(u'year', u'string'), (u'month', u'string'), (u'day', u'string')]]
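
Note that partition_details is a list of lists (one inner list per '# col_name' marker row, normally exactly one), so if you only need the flat list of partition column names you can unwrap it, as in this small follow-up using the variables above:

partition_cols = [name for name, dtype in partition_details[0]] if partition_details else []

>>> print(partition_cols)
[u'year', u'month', u'day']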

The comprehension returns an empty list if the table is not partitioned.
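
If you need this in more than one place, the whole recipe can be wrapped in a small helper. This is a minimal sketch, assuming an existing SparkSession named spark; the function name get_partition_columns is just for illustration:

def get_partition_columns(table_name):
    """Return [(column_name, data_type), ...] for the table's partition
    columns, or an empty list if the table is not partitioned."""
    # Assumes a SparkSession named `spark` already exists.
    rows = spark.sql("desc {0}".format(table_name)).collect()
    pairs = [(row.col_name, row.data_type) for row in rows]
    # Everything after the '# col_name' marker row describes the partition columns.
    for index, (name, _) in enumerate(pairs):
        if name == '# col_name':
            return pairs[index + 1:]
    return []

>>> print(get_partition_columns('test_dev_db.partition_date_table'))
[(u'year', u'string'), (u'month', u'string'), (u'day', u'string')]

Hope this helps.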



Source: https://stackoverflow.com/questions/57011003/identify-partition-key-column-from-a-table-using-pyspark
