I have a CSV file with the following data:
1,2,5
2,4
2,3
I want to load it into a DataFrame whose schema is a single column of type array of strings, so the expected output rows are [1, 2, 5], [2, 4], and [2, 3].
You can use the VectorAssembler class to combine several columns into a single column of features, which is particularly useful with pipelines (note that it produces a numeric Vector column, not an array of strings):
import org.apache.spark.ml.feature.VectorAssembler

val assembler = new VectorAssembler()
  .setInputCols(Array("city", "status", "vendor"))
  .setOutputCol("features")
https://spark.apache.org/docs/2.2.0/ml-features.html#vectorassembler
Below is sample code in Java. You need to read your file using the spark.read().text(String path) method and then call the split function.
import static org.apache.spark.sql.functions.split;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class SparkSample {
    public static void main(String[] args) {
        SparkSession spark = SparkSession
            .builder()
            .appName("SparkSample")
            .master("local[*]")
            .getOrCreate();
        // Read each line of the file into a single string column named "value"
        Dataset<Row> ds = spark.read().text("c://tmp//sample.csv").toDF("value");
        ds.show(false);
        // Split each line on "," to produce a column of type array<string>
        Dataset<Row> ds1 = ds.select(split(ds.col("value"), ",")).toDF("new_value");
        ds1.show(false);
        ds1.printSchema();
    }
}
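Spark's split column function treats its second argument as a Java regular expression, so the transformation above can be sanity-checked locally with plain String.split. Here is a minimal sketch using the sample rows from the question (no Spark dependency required; one difference to be aware of is that Java's String.split drops trailing empty strings by default while Spark's split keeps them, though the sample rows have no trailing commas):

```java
import java.util.Arrays;

public class SplitCheck {
    public static void main(String[] args) {
        // The sample rows from the question
        String[] rows = {"1,2,5", "2,4", "2,3"};

        for (String row : rows) {
            // Same separator the Spark job passes to split(ds.col("value"), ",")
            String[] parts = row.split(",");
            System.out.println(Arrays.toString(parts));
        }
    }
}
```

Each printed array corresponds to one value of the new_value column, which printSchema() will report as an array of string elements.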