How to convert a column from string type to int in a PySpark DataFrame?

忘了有多久 2020-12-24 05:33

I have a DataFrame in PySpark. Some of its numerical columns contain 'nan', so when I read the data and check the schema of the DataFrame, those columns will have '

3 Answers
  • 2020-12-24 05:54

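    You could cast the column with cast('int') after replacing NaN with 0. The cast step looks like this (shown with 'float'; pass 'int' for an integer column):

    data_df = df.withColumn("Plays", df.call_time.cast('float'))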
    
  • 2020-12-24 05:59
    from pyspark.sql.types import IntegerType
    data_df = data_df.withColumn("Plays", data_df["Plays"].cast(IntegerType()))
    data_df = data_df.withColumn("drafts", data_df["drafts"].cast(IntegerType()))
    

    You can run a loop over the columns, but this is the simplest way to convert a string column to integer.

  • 2020-12-24 06:12

    Another way is to define the schema with StructField, which is useful if you have multiple fields that need to be modified.

    Ex:

    from pyspark.sql.types import StructField, IntegerType, StructType, StringType

    # Declare the target type of each column up front
    newDF = [StructField('CLICK_FLG', IntegerType(), True),
             StructField('OPEN_FLG', IntegerType(), True),
             StructField('I1_GNDR_CODE', StringType(), True),
             StructField('TRW_INCOME_CD_V4', StringType(), True),
             StructField('ASIAN_CD', IntegerType(), True),
             StructField('I1_INDIV_HHLD_STATUS_CODE', IntegerType(), True)
             ]
    finalStruct = StructType(fields=newDF)
    # Apply the schema while reading, so no per-column cast is needed afterwards
    df = spark.read.csv('ctor.csv', schema=finalStruct)
    

    Output:

    Before:

    root
     |-- CLICK_FLG: string (nullable = true)
     |-- OPEN_FLG: string (nullable = true)
     |-- I1_GNDR_CODE: string (nullable = true)
     |-- TRW_INCOME_CD_V4: string (nullable = true)
     |-- ASIAN_CD: integer (nullable = true)
     |-- I1_INDIV_HHLD_STATUS_CODE: string (nullable = true)
    

    After:

    root
     |-- CLICK_FLG: integer (nullable = true)
     |-- OPEN_FLG: integer (nullable = true)
     |-- I1_GNDR_CODE: string (nullable = true)
     |-- TRW_INCOME_CD_V4: string (nullable = true)
     |-- ASIAN_CD: integer (nullable = true)
     |-- I1_INDIV_HHLD_STATUS_CODE: integer (nullable = true)
    

    This is a slightly longer way to cast, but the advantage is that all the required fields can be converted at once.

    Note that if the schema lists only the fields you want to change, the resulting DataFrame will contain only those fields.
