In pandas, this can be done with column.name.
But how do you do the same when it is a column of a Spark DataFrame?
e.g. the calling program has a Spark DataFrame: spark_df
As @numeral correctly said, column._jc.toString() works fine in the case of unaliased columns.
In the case of aliased columns (i.e. column.alias("whatever")), the alias can be extracted even without regular expressions: str(column).split(" AS ")[1].split("`")[1].
I don't know the Scala syntax, but I'm sure the same can be done there.
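Putting the two cases together, here is a minimal PySpark sketch (the helper name column_name and the sample column are my own; _jc is a private attribute and the exact string format of toString() varies between Spark versions, hence stripping backticks rather than indexing into the split):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1.0,)], ["High"])

def column_name(column):
    # JVM toString() gives e.g. "High" or "High AS `whatever`"
    raw = column._jc.toString()
    if " AS " in raw:
        raw = raw.split(" AS ")[-1]
    return raw.strip("`")

print(column_name(df["High"]))                    # High
print(column_name(df["High"].alias("whatever")))  # whatever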
I found the answer to be very, very simple...
// This is Java, but it should be much the same in PySpark
Column col = ds.col("colName"); // the Column object
String theNameOftheCol = col.toString();
The variable theNameOftheCol now holds "colName".
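For reference, a rough PySpark analogue (assuming a DataFrame df with a column named "colName"; plain str(col_obj) wraps the name in Column<...>, so the JVM-level toString() used in the other answers is taken instead):

col_obj = df["colName"]                       # the Column object
the_name_of_the_col = col_obj._jc.toString()  # "colName"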
The only way is to go down a level to the underlying JVM object.
df.col._jc.toString().encode('utf8')
This is also how it is converted to a str
in the pyspark code itself.
From pyspark/sql/column.py:
def __repr__(self):
    return 'Column<%s>' % self._jc.toString().encode('utf8')
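Concretely, assuming a DataFrame df with a column "High" (my example name):

col_obj = df["High"]                          # a pyspark.sql.Column
print(col_obj._jc.toString())                 # High  (a plain str)
print(col_obj._jc.toString().encode('utf8'))  # b'High'  (bytes under Python 3)
print(repr(col_obj))                          # Column<'High'>  (exact format varies by Spark version)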
If you want the column names of your DataFrame, you can use the pyspark.sql.DataFrame class. I'm not sure whether the SDK supports explicitly indexing a DataFrame by column name. I received this traceback:
>>> df.columns['High']
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
TypeError: list indices must be integers, not str
However, accessing the columns attribute on your DataFrame, as you have done, will return a list of column names:
df.columns
will return ['Date', 'Open', 'High', 'Low', 'Close', 'Volume', 'Adj Close']
If you want the column datatypes, you can use the dtypes attribute:
df.dtypes
will return [('Date', 'timestamp'), ('Open', 'double'), ('High', 'double'), ('Low', 'double'), ('Close', 'double'), ('Volume', 'int'), ('Adj Close', 'double')]
If you want a particular column name, you'll need to access it by its positional index:
df.columns[2]
will return 'High'
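If you need the opposite direction, from a name to its positional index, df.columns is a plain Python list, so (assuming the same df as above) something like this works:

idx = df.columns.index('High')   # 2
df.columns[idx]                  # 'High'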
You can get the names from the schema by doing
spark_df.schema.names
Printing the schema can also be useful for visualizing it:
spark_df.printSchema()
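A small self-contained sketch (toy data and column names of my choosing) showing both calls:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
spark_df = spark.createDataFrame([(1, "a")], ["id", "label"])

print(spark_df.schema.names)   # ['id', 'label']
spark_df.printSchema()
# root
#  |-- id: long (nullable = true)
#  |-- label: string (nullable = true)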