How to get the name of a DataFrame column in PySpark?

深忆病人 2021-02-01 13:44

In pandas, this can be done with column.name.

But how can the same be done when it is a column of a Spark DataFrame?

e.g., the calling program has a Spark DataFrame: spark_df

5 Answers
  • 2021-02-01 14:03

    Python

    As @numeral correctly said, column._jc.toString() works fine in the case of unaliased columns.

    In the case of aliased columns (i.e. column.alias("whatever")), the alias can be extracted without the use of regular expressions: str(column).split(" AS ")[1].split("`")[1].

    I don't know the Scala syntax, but I'm sure the same can be done there.
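    A minimal PySpark sketch of both cases (the DataFrame and the alias "whatever" are illustrative; the backtick-based split assumes an older Spark where str(column) renders aliases with backticks, e.g. a AS `whatever`, so it may need adjusting on newer versions):

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame([(1, 2)], ["a", "b"])

    # Unaliased column: the JVM expression string is just the column name.
    print(df["a"]._jc.toString())  # a

    # Aliased column: extract the alias from the string form.
    aliased = F.col("a").alias("whatever")
    # Assumes the older repr containing backticks, e.g. a AS `whatever`
    print(str(aliased).split(" AS ")[1].split("`")[1])  # whatever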

  • 2021-02-01 14:07

    I found the answer is very simple...

    // This is Java, but the idea carries over to PySpark
    Column col = ds.col("colName"); // the Column object
    String theNameOftheCol = col.toString();

    The variable theNameOftheCol now holds "colName".
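    Note that in PySpark, str(col) wraps the expression in Column<...>, so to get the bare name you go through the underlying JVM object (as the next answer does). A sketch, with df and colName as illustrative names:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame([(1,)], ["colName"])

    col = df["colName"]        # the Column object
    print(str(col))            # Column<'colName'> -- wrapped (exact repr varies by Spark version)
    print(col._jc.toString())  # colName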

  • 2021-02-01 14:08

    The only way is to drop down to the underlying JVM level.

    df.col._jc.toString().encode('utf8')
    

    This is also how it is converted to a str in the PySpark source itself.

    From pyspark/sql/column.py:

    def __repr__(self):
        return 'Column<%s>' % self._jc.toString().encode('utf8')
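    A note on the .encode('utf8') call: the quoted snippet dates from an older, Python 2-era PySpark. On Python 3, _jc.toString() already comes back through py4j as a plain str, and encoding it produces bytes, so the encode can usually be dropped. A sketch (the column name a is illustrative):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame([(1,)], ["a"])

    print(df["a"]._jc.toString())                  # 'a' -- already a str on Python 3
    print(df["a"]._jc.toString().encode('utf8'))   # b'a' -- bytes, usually not what you want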
    
  • 2021-02-01 14:17

    If you want the column names of your DataFrame, you can use the pyspark.sql module. I'm not sure whether it supports explicitly indexing a DataFrame by column name; I received this traceback:

    >>> df.columns['High']
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    TypeError: list indices must be integers, not str

    However, the columns attribute of your DataFrame, which you have already used, will return a list of column names:

    df.columns will return ['Date', 'Open', 'High', 'Low', 'Close', 'Volume', 'Adj Close']

    If you want the column datatypes, use the dtypes attribute:

    df.dtypes will return [('Date', 'timestamp'), ('Open', 'double'), ('High', 'double'), ('Low', 'double'), ('Close', 'double'), ('Volume', 'int'), ('Adj Close', 'double')]

    If you want a particular column name, you'll need to access it by index:

    df.columns[2] will return 'High'
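    Since df.columns is just a Python list, standard list operations also recover an index from a name (a sketch using the column names from the output above):

    cols = ['Date', 'Open', 'High', 'Low', 'Close', 'Volume', 'Adj Close']  # i.e. df.columns
    print(cols[2])             # 'High'
    print(cols.index('High'))  # 2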

  • 2021-02-01 14:21

    You can get the names from the schema by doing

    spark_df.schema.names
    

    Printing the schema is also useful for visualizing it:

    spark_df.printSchema()
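    A minimal end-to-end sketch (the two-column schema is illustrative):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    spark_df = spark.createDataFrame([(1, "x")], ["id", "label"])

    print(spark_df.schema.names)  # ['id', 'label']
    spark_df.printSchema()
    # root
    #  |-- id: long (nullable = true)
    #  |-- label: string (nullable = true)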
    