I have a List
data. Something like:
[[dev, engg, 10000], [kar
Task can be completed without JSON, on Scala:
val data = List("dev, engg, 10000", "karthik, engg, 20000")
val intialRdd = sparkContext.parallelize(data)
val splittedRDD = intialRdd.map(current => {
val array = current.split(",")
(array(0), array(1), array(2))
})
import sqlContext.implicits._
val dataframe = splittedRDD.toDF("name", "degree", "salary")
dataframe.show()
Output is:
+-------+------+------+
| name|degree|salary|
+-------+------+------+
| dev| engg| 10000|
|karthik| engg| 20000|
+-------+------+------+
Note: (array(0), array(1), array(2)) is a Scala Tuple
DataFrame createNGramDataFrame(JavaRDD<String> lines) {
JavaRDD<Row> rows = lines.map(new Function<String, Row>(){
private static final long serialVersionUID = -4332903997027358601L;
@Override
public Row call(String line) throws Exception {
return RowFactory.create(line.split("\\s+"));
}
});
StructType schema = new StructType(new StructField[] {
new StructField("words",
DataTypes.createArrayType(DataTypes.StringType), false,
Metadata.empty()) });
DataFrame wordDF = new SQLContext(jsc).createDataFrame(rows, schema);
// build a bigram language model
NGram transformer = new NGram().setInputCol("words")
.setOutputCol("ngrams").setN(2);
DataFrame ngramDF = transformer.transform(wordDF);
ngramDF.show(10, false);
return ngramDF;
}
You can create DataFrame from List<String>
and then use selectExpr
and split
to get desired DataFrame.
public class SparkSample{
public static void main(String[] args) {
SparkConf conf = new SparkConf().setAppName("SparkSample").setMaster("local[*]");
JavaSparkContext jsc = new JavaSparkContext(conf);
SQLContext sqc = new SQLContext(jsc);
// sample data
List<String> data = new ArrayList<String>();
data.add("dev, engg, 10000");
data.add("karthik, engg, 20000");
// DataFrame
DataFrame df = sqc.createDataset(data, Encoders.STRING()).toDF();
df.printSchema();
df.show();
// Convert
DataFrame df1 = df.selectExpr("split(value, ',')[0] as name", "split(value, ',')[1] as degree","split(value, ',')[2] as salary");
df1.printSchema();
df1.show();
}
}
You will get below output.
root
|-- value: string (nullable = true)
+--------------------+
| value|
+--------------------+
| dev, engg, 10000|
|karthik, engg, 20000|
+--------------------+
root
|-- name: string (nullable = true)
|-- degree: string (nullable = true)
|-- salary: string (nullable = true)
+-------+------+------+
| name|degree|salary|
+-------+------+------+
| dev| engg| 10000|
|karthik| engg| 20000|
+-------+------+------+
The sample data you have provided has empty spaces. If you want to remove space and have the salary type as "integer" then you can use trim
and cast
function like below.
df1 = df1.select(trim(col("name")).as("name"),trim(col("degree")).as("degree"),trim(col("salary")).cast("integer").as("salary"));