I am looking for a way to select columns of my dataframe in PySpark. For the first row, I know I can use df.first(), but I am not sure how to do the same for columns.
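All of the snippets below assume a small DataFrame along these lines (a minimal sketch; col_1, col_2 and col_3 are placeholder column names):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Toy data with the placeholder column names used throughout the answer
df = spark.createDataFrame(
    [(1, 'a', 1.0), (2, 'b', 2.0)],
    ['col_1', 'col_2', 'col_3'],
)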
The select method accepts column names (strings), expressions (Column objects), or a list of either as parameters. To select columns you can use:
-- column names (strings):
df.select('col_1', 'col_2', 'col_3')
-- column objects:
import pyspark.sql.functions as F
df.select(F.col('col_1'), F.col('col_2'), F.col('col_3'))
# or
df.select(df.col_1, df.col_2, df.col_3)
# or
df.select(df['col_1'], df['col_2'], df['col_3'])
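Since select takes Column expressions, you can also transform or rename a column on the way. A small sketch (the doubled column and its alias are arbitrary):

df.select(F.col('col_1'), (F.col('col_3') * 2).alias('col_3_doubled'))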
-- a list of column names or column objects:
df.select(*['col_1', 'col_2', 'col_3'])
# or
df.select(*[F.col('col_1'), F.col('col_2'), F.col('col_3')])
# or
df.select(*[df.col_1, df.col_2, df.col_3])
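The list form is convenient when the column set is built at runtime, e.g. from df.columns. A sketch that keeps every column except one (the excluded name is arbitrary):

cols_to_keep = [c for c in df.columns if c != 'col_2']
df.select(*cols_to_keep)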
The star operator * can actually be omitted here, since select also accepts a list directly; it is used in the examples above to stay consistent with other functions, like drop, that don't accept a list as a parameter.
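For example, assuming the same placeholder columns, the following calls are equivalent, while drop needs the list unpacked:

df.select(['col_1', 'col_2', 'col_3'])  # select accepts a plain list
df.select('col_1', 'col_2', 'col_3')    # same result with the names unpacked
df.drop(*['col_1', 'col_2'])            # drop takes *cols, so the list must be unpacked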