Question
Given a Spark dataframe with duplicate column names (e.g. A) for which I cannot modify the upstream or source, how do I select, remove, or rename one of the columns so that I can retrieve its values?
df.select('A') shows me an ambiguous column error, as do filter, drop, and withColumnRenamed. How do I select one of the columns?
Answer 1:
The only way I found after hours of research is to rename the column set and then create another dataframe with the new set as the header.
For example, if you have:
>>> import pyspark
>>> from pyspark.sql import SQLContext
>>>
>>> sc = pyspark.SparkContext()
>>> sqlContext = SQLContext(sc)
>>> df = sqlContext.createDataFrame([(1, 2, 3), (4, 5, 6)], ['a', 'b', 'a'])
>>> df
DataFrame[a: bigint, b: bigint, a: bigint]
>>> df.columns
['a', 'b', 'a']
>>> df2 = df.toDF('a', 'b', 'c')
>>> df2.columns
['a', 'b', 'c']
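With unique names in place, the formerly ambiguous columns can be selected as usual; continuing the session above:
>>> df2.select('c').collect()
[Row(c=3), Row(c=6)]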
You can get the list of columns using df.columns and then use a loop to rename any duplicates, building the new column list (don't forget to pass *new_col_list instead of new_col_list to the toDF function, else it'll throw an invalid-count error); a sketch of such a loop follows.
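Here is a minimal sketch of that dedup loop, assuming the df from the example above; the counting approach and the _2-style suffixes are one arbitrary choice, not the only possible scheme:
counts = {}          # occurrences of each name seen so far
new_col_list = []
for c in df.columns:
    counts[c] = counts.get(c, 0) + 1
    # keep the first occurrence unchanged; suffix later duplicates: a, a_2, ...
    new_col_list.append(c if counts[c] == 1 else '{}_{}'.format(c, counts[c]))

df2 = df.toDF(*new_col_list)   # note the * — toDF takes the names unpacked
df2.select('a_2').show()       # the second 'a' column is now addressable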
Source: https://stackoverflow.com/questions/52205113/selecting-or-removing-duplicate-columns-from-spark-dataframe