In pandas, this can be done with column.name.
But how do you do the same when it is a column of a Spark DataFrame?
e.g. the calling program has a Spark DataFrame: spark_df
As @numeral correctly said, column._jc.toString() works fine in the case of unaliased columns.
In the case of aliased columns (i.e. column.alias("whatever")), the alias can be extracted even without regular expressions: str(column).split(" AS ")[1].split("`")[1].
I don't know the Scala syntax, but I'm sure the same can be done there.
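Putting the two cases together, here is a minimal PySpark sketch (the helper name column_name and the sample column are my own; _jc is a private attribute and the exact string format of toString() varies between Spark versions, hence stripping backticks rather than indexing into the split):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1.0,)], ["High"])

def column_name(column):
    # JVM toString() gives e.g. "High" or "High AS `whatever`"
    raw = column._jc.toString()
    if " AS " in raw:
        raw = raw.split(" AS ")[-1]
    return raw.strip("`")

print(column_name(df["High"]))                    # High
print(column_name(df["High"].alias("whatever")))  # whatever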
I found the answer to be very, very simple...
// This is Java, but it should be much the same in PySpark
Column col = ds.col("colName"); // the Column object
String theNameOftheCol = col.toString();
The variable theNameOftheCol now holds "colName".
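For reference, a rough PySpark analogue (assuming a DataFrame df with a column named "colName"; plain str(col_obj) wraps the name in Column<...>, so the JVM-level toString() used in the other answers is taken instead):

col_obj = df["colName"]                       # the Column object
the_name_of_the_col = col_obj._jc.toString()  # "colName"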
The only way is to go down a level to the underlying JVM object.
df.col._jc.toString().encode('utf8')
This is also how it is converted to a str
in the pyspark code itself.
From pyspark/sql/column.py:
def __repr__(self):
    return 'Column<%s>' % self._jc.toString().encode('utf8')
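Concretely, assuming a DataFrame df with a column "High" (my example name):

col_obj = df["High"]                          # a pyspark.sql.Column
print(col_obj._jc.toString())                 # High  (a plain str)
print(col_obj._jc.toString().encode('utf8'))  # b'High'  (bytes under Python 3)
print(repr(col_obj))                          # Column<'High'>  (exact format varies by Spark version)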
If you want the column names of your DataFrame, you can use the pyspark.sql.DataFrame class. I'm not sure whether the SDK supports explicitly indexing a DataFrame by column name. I received this traceback:
>>> df.columns['High']
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
TypeError: list indices must be integers, not str
However, accessing the columns attribute on your DataFrame, as you have done, will return a list of column names:
df.columns
will return ['Date', 'Open', 'High', 'Low', 'Close', 'Volume', 'Adj Close']
If you want the column datatypes, you can use the dtypes attribute:
df.dtypes
will return [('Date', 'timestamp'), ('Open', 'double'), ('High', 'double'), ('Low', 'double'), ('Close', 'double'), ('Volume', 'int'), ('Adj Close', 'double')]
If you want a particular column name, you'll need to access it by its positional index:
df.columns[2]
will return 'High'
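If you need the opposite direction, from a name to its positional index, df.columns is a plain Python list, so (assuming the same df as above) something like this works:

idx = df.columns.index('High')   # 2
df.columns[idx]                  # 'High'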
You can get the names from the schema by doing
spark_df.schema.names
Printing the schema can also be useful for visualizing it:
spark_df.printSchema()
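A small self-contained sketch (toy data and column names of my choosing) showing both calls:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
spark_df = spark.createDataFrame([(1, "a")], ["id", "label"])

print(spark_df.schema.names)   # ['id', 'label']
spark_df.printSchema()
# root
#  |-- id: long (nullable = true)
#  |-- label: string (nullable = true)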