Load CSV data into a DataFrame and convert to an Array using Apache Spark (Java)

误落风尘 2021-01-21 17:20

I have a CSV file with the following data:

1,2,5  
2,4  
2,3 

I want to load it into a DataFrame with a schema of array of strings.

The expected output is a single column in which each row holds the values of one line as an array, e.g. [1, 2, 5].

2 Answers
  • 2021-01-21 17:31

    You can use the VectorAssembler class to assemble several columns into a single vector of features, which is particularly useful with pipelines:

    import org.apache.spark.ml.feature.VectorAssembler

    val assembler = new VectorAssembler()
      .setInputCols(Array("city", "status", "vendor"))
      .setOutputCol("features")
    

    https://spark.apache.org/docs/2.2.0/ml-features.html#vectorassembler
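    Since the question asks for Java, a rough Java equivalent of the Scala snippet above might look like this (a minimal sketch, assuming an existing DataFrame df with numeric columns city, status and vendor; the column names come from the snippet, not from the question's CSV):

    import org.apache.spark.ml.feature.VectorAssembler;
    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;

    // Combine the three input columns into a single "features" vector column.
    VectorAssembler assembler = new VectorAssembler()
            .setInputCols(new String[]{"city", "status", "vendor"})
            .setOutputCol("features");

    Dataset<Row> withFeatures = assembler.transform(df); // df is an assumed, pre-loaded DataFrame
    withFeatures.show(false);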

  • 2021-01-21 17:47

    Below is sample code in Java. You need to read your file using the spark.read().text(String path) method and then call the split function.

    import static org.apache.spark.sql.functions.split;

    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.SparkSession;

    public class SparkSample {
        public static void main(String[] args) {
            SparkSession spark = SparkSession
                    .builder()
                    .appName("SparkSample")
                    .master("local[*]")
                    .getOrCreate();
            // Read each line of the file as a single string column named "value"
            Dataset<Row> ds = spark.read().text("c://tmp//sample.csv").toDF("value");
            ds.show(false);
            // Split each line on "," to get an array<string> column named "new_value"
            Dataset<Row> ds1 = ds.select(split(ds.col("value"), ",")).toDF("new_value");
            ds1.show(false);
            ds1.printSchema();
        }
    }
    
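    If you also need the split values back on the driver as plain Java lists (as the title's "convert to Array" suggests), a minimal sketch along these lines should work, assuming ds1 from the code above and a dataset small enough to collect:

    import java.util.List;

    // Collect the array column and read each row's value as a java.util.List.
    for (Row row : ds1.collectAsList()) {
        List<String> values = row.getList(0); // column 0 is "new_value"
        System.out.println(values);           // e.g. [1, 2, 5]
    }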