Create single row dataframe from list of list PySpark

傲寒 2020-11-27 07:46

I have data like this:

data = [[1.1, 1.2], [1.3, 1.4], [1.5, 1.6]]

and I want to create a PySpark DataFrame in which each inner list becomes a single value in one column.

I already tried

dataframe = SQLContext(sc).createDataFrame(data, ['features'])

but it does not produce a single features column.
3 Answers
  • 2020-11-27 08:09

    I find it's useful to think of the argument to createDataFrame() as a list of tuples where each entry in the list corresponds to a row in the DataFrame and each element of the tuple corresponds to a column.

    You can get your desired output by making each element in the list a tuple:

    data = [([1.1, 1.2],), ([1.3, 1.4],), ([1.5, 1.6],)]
    dataframe = sqlCtx.createDataFrame(data, ['features'])
    dataframe.show()
    #+----------+
    #|  features|
    #+----------+
    #|[1.1, 1.2]|
    #|[1.3, 1.4]|
    #|[1.5, 1.6]|
    #+----------+
    

    Or if changing the source is cumbersome, you can equivalently do:

    data = [[1.1, 1.2], [1.3, 1.4], [1.5, 1.6]]
    dataframe = sqlCtx.createDataFrame(map(lambda x: (x, ), data), ['features'])
    dataframe.show()
    #+----------+
    #|  features|
    #+----------+
    #|[1.1, 1.2]|
    #|[1.3, 1.4]|
    #|[1.5, 1.6]|
    #+----------+
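
    If you are on Spark 2.x or later, where SparkSession replaces SQLContext, the same idea works. A minimal sketch, assuming a session named spark and using an explicit schema so the column type does not depend on inference (the session and schema are assumptions, not part of the question's setup):

    from pyspark.sql import SparkSession
    from pyspark.sql.types import StructType, StructField, ArrayType, DoubleType

    spark = SparkSession.builder.getOrCreate()

    data = [([1.1, 1.2],), ([1.3, 1.4],), ([1.5, 1.6],)]
    # One column named 'features' holding an array of doubles
    schema = StructType([StructField('features', ArrayType(DoubleType()), True)])
    spark.createDataFrame(data, schema).show()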
    
  • 2020-11-27 08:22

    You can use a map to wrap each inner list in a single-element row (one column per row) and pass the resulting RDD to createDataFrame:

    dataframe = sqlContext.createDataFrame(sc.parallelize(data).map(lambda x: [x]), ['features'])
    

    which gives the desired output:

    +----------+
    |  features|
    +----------+
    |[1.1, 1.2]|
    |[1.3, 1.4]|
    |[1.5, 1.6]|
    +----------+
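
    The same approach works on Spark 2.x+ if you only have a SparkSession. A minimal sketch, assuming the session is named spark (sc is then spark.sparkContext):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    data = [[1.1, 1.2], [1.3, 1.4], [1.5, 1.6]]
    # Wrap each inner list so it becomes the single 'features' value of its row
    rdd = spark.sparkContext.parallelize(data).map(lambda x: [x])
    spark.createDataFrame(rdd, ['features']).show()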
    
  • 2020-11-27 08:27

    You should use VectorAssembler. From your code I guess you are doing this to train a machine learning model, and VectorAssembler works best for that case. You can also add the assembler to a Pipeline. Note that it takes inputCols (plural) and expects data to be a DataFrame, not a plain list (see the sketch after the code):

    from pyspark.ml import Pipeline
    from pyspark.ml.feature import VectorAssembler

    assemble_feature = VectorAssembler(inputCols=data.columns, outputCol='features')
    pipeline = Pipeline(stages=[assemble_feature])
    pipeline.fit(data).transform(data)
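
    Since VectorAssembler needs a DataFrame with one numeric column per feature, here is a minimal sketch of the full flow; the column names x1 and x2 are hypothetical placeholders, not from the question:

    from pyspark.sql import SparkSession
    from pyspark.ml.feature import VectorAssembler

    spark = SparkSession.builder.getOrCreate()
    # Hypothetical wide layout: one numeric column per feature
    df = spark.createDataFrame([(1.1, 1.2), (1.3, 1.4), (1.5, 1.6)], ['x1', 'x2'])
    assembler = VectorAssembler(inputCols=df.columns, outputCol='features')
    assembler.transform(df).select('features').show()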
    