How can I select a specific column of a Spark DataFrame by its position (column number) rather than by its name?
Like this in Pandas:
df = df.iloc[:,2]
You can always get the name of the column with df.columns[n] and then select it:
df = spark.createDataFrame([[1,2], [3,4]], ['a', 'b'])
To select the column at position n (positions are 0-based):
n = 1
df.select(df.columns[n]).show()
+---+
|  b|
+---+
|  2|
|  4|
+---+
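The same idea extends to several positions at once, because df.columns is a plain Python list of names. A minimal sketch, assuming a hypothetical helper names_at (not part of PySpark) and a hard-coded list standing in for df.columns:

```python
def names_at(columns, positions):
    """Return the column names found at the given 0-based positions."""
    return [columns[i] for i in positions]


# Stand-in for df.columns, which is a plain list of strings.
columns = ['a', 'b', 'c']
picked = names_at(columns, [0, 2])
print(picked)  # ['a', 'c']

# With a real DataFrame this could be used as:
# df.select(names_at(df.columns, [0, 2])).show()
```

Because the helper only touches the list of names, it can be checked without a running Spark session.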
To select all but column n:
n = 1
You can either use drop:
df.drop(df.columns[n]).show()
+---+
|  a|
+---+
|  1|
|  3|
+---+
Or select with manually constructed column names:
df.select(df.columns[:n] + df.columns[n+1:]).show()
+---+
|  a|
+---+
|  1|
|  3|
+---+
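If more than one position has to be excluded, the slicing above gets awkward. A possible sketch, assuming a hypothetical helper names_except (not part of PySpark) that filters the list of names by position:

```python
def names_except(columns, excluded):
    """Return all column names whose 0-based position is not in `excluded`."""
    excluded = set(excluded)
    return [name for i, name in enumerate(columns) if i not in excluded]


# Stand-in for df.columns, which is a plain list of strings.
columns = ['a', 'b', 'c', 'd']
kept = names_except(columns, [1, 3])
print(kept)  # ['a', 'c']

# With a real DataFrame this could be used as:
# df.select(names_except(df.columns, [1, 3])).show()
```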
Same solution as mirkhosro:
For a dataframe df, you can select column n using df[n], where n is the 0-based index of the column.
Example:
df = df.filter(df[3]!=0)
This removes the rows of df where the value in the fourth column (index 3, since indices start at 0) is 0.
Note that you can check the columns and their order using df.printSchema().
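If only the positions matter, pairing each name with its index can be quicker to read than the full schema. A small sketch, using a made-up list of names as a stand-in for df.columns:

```python
# Hypothetical stand-in for df.columns; replace with df.columns on a real DataFrame.
columns = ['id', 'name', 'age', 'score']

# Map each 0-based position to its column name.
index_to_name = dict(enumerate(columns))
for i, name in index_to_name.items():
    print(i, name)
```

Here index 3 maps to 'score', so df[3] would refer to that column.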