How to convert a pyspark dataframe column to numpy array

无人共我 2021-01-23 17:18

I am trying to convert a column of a PySpark DataFrame with approximately 90 million rows into a NumPy array.

I need the array as an input for scipy.optimize.minimize.

1 answer
  • 2021-01-23 17:50

    #1

    One way or another, you will have to call .collect(), which brings every row to the driver. To create a NumPy array from the PySpark DataFrame, you can use:

    import numpy as np
    adoles = np.array(df.select("Adolescent").collect())  # add .reshape(-1) for a 1-D array
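
    A minimal sketch of why the reshape is needed, using plain tuples as a stand-in for the Row objects that .collect() returns. The helper name rows_to_1d is hypothetical; it uses np.fromiter to fill a preallocated 1-D array, which may help at this row count when combined with toLocalIterator() instead of .collect(). It assumes the column is numeric and contains no nulls.

```python
import numpy as np


def rows_to_1d(rows, n_rows, dtype=np.float64):
    """Copy the first field of each row into a preallocated 1-D array."""
    # count=n_rows lets NumPy allocate the result once instead of growing it.
    return np.fromiter((row[0] for row in rows), dtype=dtype, count=n_rows)


# Stand-in for collected Spark rows: each Row behaves like a one-element tuple.
fake_rows = [(1.0,), (2.5,), (4.0,)]

two_d = np.array(fake_rows)  # same shape np.array(df.select(...).collect()) gives
one_d = rows_to_1d(iter(fake_rows), n_rows=len(fake_rows))

print(two_d.shape)  # (3, 1)
print(one_d.shape)  # (3,)

# With a real DataFrame (not run here), the call might look like:
# col = df.select("Adolescent")
# adoles = rows_to_1d(col.toLocalIterator(), n_rows=col.count())

# Rough driver-memory estimate for 90 million float64 values.
print(90_000_000 * 8 / 1e6, "MB")  # 720.0 MB
```

    Note that col.count() triggers an extra Spark job, and the finished array still has to fit in driver memory.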
    

    #2

    You can convert it to a pandas DataFrame using toPandas(), and then convert that to a NumPy array using .values.

    pdf = df.toPandas()
    adoles = pdf["Adolescent"].values
    

    Or simply:

    adoles = df.select("Adolescent").toPandas().values #.reshape(-1) for 1-D array
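
    The pandas route could be wrapped roughly as follows. This is only a sketch: the function name column_to_numpy is hypothetical, `spark` is assumed to be an active SparkSession, and the Arrow setting uses the Spark 3.x config key (it only helps if pyarrow is installed).

```python
def column_to_numpy(spark, df, column):
    """Return one column of a Spark DataFrame as a 1-D NumPy array via pandas."""
    # Arrow-based transfer (Spark 3.x config key); needs pyarrow installed.
    spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", "true")
    # Select first so that only the requested column is sent to the driver.
    pdf = df.select(column).toPandas()
    # Indexing by name gives a Series, so the result is 1-D without a reshape.
    return pdf[column].to_numpy()


# Hypothetical usage with an active SparkSession `spark` and DataFrame `df`:
# adoles = column_to_numpy(spark, df, "Adolescent")
```

    Selecting the column before toPandas() matters here because toPandas() on the full DataFrame would transfer every column to the driver.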

    #3

    For distributed arrays, you can try Dask arrays.

    I haven't tested this, but I assume it works the same way as NumPy (there may be inconsistencies). Note that .collect() still brings all rows to the driver first:

    import dask.array as da
    adoles = da.array(df.select("Adolescent").collect()) #.reshape(-1) for 1-D array
    