Question
- Spark Version : 1.6.2
- Java Version: 7
I have a List<String> of data, something like:
[[dev, engg, 10000], [karthik, engg, 20000]..]
I know the schema for this data:
name (String)
degree (String)
salary (Integer)
I tried:
JavaRDD<String> data = new JavaSparkContext(sc).parallelize(datas);
DataFrame df = sqlContext.read().json(data);
df.printSchema();
df.show(false);
Output:
root
|-- _corrupt_record: string (nullable = true)
+-----------------------------+
|_corrupt_record |
+-----------------------------+
|[dev, engg, 10000] |
|[karthik, engg, 20000] |
+-----------------------------+
This fails because the List<String> is not proper JSON.
Do I need to create a proper JSON or is there any other way to do this?
Answer 1:
You can create a DataFrame from the List<String> and then use selectExpr and split to get the desired DataFrame.
import java.util.ArrayList;
import java.util.List;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.Encoders;
import org.apache.spark.sql.SQLContext;

public class SparkSample {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("SparkSample").setMaster("local[*]");
        JavaSparkContext jsc = new JavaSparkContext(conf);
        SQLContext sqc = new SQLContext(jsc);
        // sample data
        List<String> data = new ArrayList<String>();
        data.add("dev, engg, 10000");
        data.add("karthik, engg, 20000");
        // DataFrame with a single string column named "value"
        DataFrame df = sqc.createDataset(data, Encoders.STRING()).toDF();
        df.printSchema();
        df.show();
        // Split the "value" column into name, degree and salary columns
        DataFrame df1 = df.selectExpr(
                "split(value, ',')[0] as name",
                "split(value, ',')[1] as degree",
                "split(value, ',')[2] as salary");
        df1.printSchema();
        df1.show();
    }
}
You will get the following output:
root
|-- value: string (nullable = true)
+--------------------+
| value|
+--------------------+
| dev, engg, 10000|
|karthik, engg, 20000|
+--------------------+
root
|-- name: string (nullable = true)
|-- degree: string (nullable = true)
|-- salary: string (nullable = true)
+-------+------+------+
| name|degree|salary|
+-------+------+------+
| dev| engg| 10000|
|karthik| engg| 20000|
+-------+------+------+
The sample data you provided contains extra spaces. If you want to remove the spaces and have salary typed as integer, you can use the trim and cast functions like below.
df1 = df1.select(trim(col("name")).as("name"),trim(col("degree")).as("degree"),trim(col("salary")).cast("integer").as("salary"));
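This line relies on the trim and col functions from org.apache.spark.sql.functions, so it assumes static imports along these lines:

// Static imports assumed by the select(...) line above
import static org.apache.spark.sql.functions.col;
import static org.apache.spark.sql.functions.trim;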
Answer 2:
DataFrame createNGramDataFrame(JavaRDD<String> lines) {
    // Turn each line into a Row holding a single array-of-words column
    JavaRDD<Row> rows = lines.map(new Function<String, Row>() {
        private static final long serialVersionUID = -4332903997027358601L;

        @Override
        public Row call(String line) throws Exception {
            // Cast to Object so the String[] becomes one array value
            // instead of being spread across the varargs
            return RowFactory.create((Object) line.split("\\s+"));
        }
    });
    StructType schema = new StructType(new StructField[] {
        new StructField("words",
            DataTypes.createArrayType(DataTypes.StringType), false,
            Metadata.empty()) });
    // jsc is the enclosing JavaSparkContext
    DataFrame wordDF = new SQLContext(jsc).createDataFrame(rows, schema);
    // build a bigram language model
    NGram transformer = new NGram().setInputCol("words")
            .setOutputCol("ngrams").setN(2);
    DataFrame ngramDF = transformer.transform(wordDF);
    ngramDF.show(10, false);
    return ngramDF;
}
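For reference, a minimal calling sketch (not part of the original answer), assuming jsc is an existing JavaSparkContext and the method above is in scope:

// Hypothetical usage of createNGramDataFrame
JavaRDD<String> lines = jsc.parallelize(Arrays.asList("dev engg 10000", "karthik engg 20000"));
DataFrame ngrams = createNGramDataFrame(lines);
// "ngrams" holds the original "words" array plus the generated bigram column "ngrams"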
Answer 3:
The task can be done without JSON, in Scala:
val data = List("dev, engg, 10000", "karthik, engg, 20000")
val intialRdd = sparkContext.parallelize(data)
val splittedRDD = intialRdd.map(current => {
val array = current.split(",")
(array(0), array(1), array(2))
})
import sqlContext.implicits._
val dataframe = splittedRDD.toDF("name", "degree", "salary")
dataframe.show()
Output is:
+-------+------+------+
| name|degree|salary|
+-------+------+------+
| dev| engg| 10000|
|karthik| engg| 20000|
+-------+------+------+
Note: (array(0), array(1), array(2)) is a Scala tuple.
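If you need the same row-splitting approach from Java 7 (the version in the question), a rough sketch with an explicit schema might look like the following; it assumes jsc, sqc and data are the JavaSparkContext, SQLContext and List<String> from Answer 1, and it is not taken from the original answers:

// Split each line into a Row with typed fields (sketch only)
JavaRDD<Row> rows = jsc.parallelize(data).map(new Function<String, Row>() {
    @Override
    public Row call(String line) throws Exception {
        String[] parts = line.split(",");
        return RowFactory.create(parts[0].trim(), parts[1].trim(), Integer.valueOf(parts[2].trim()));
    }
});
// Declare the schema explicitly so salary is an integer from the start
StructType schema = DataTypes.createStructType(new StructField[] {
    DataTypes.createStructField("name", DataTypes.StringType, false),
    DataTypes.createStructField("degree", DataTypes.StringType, false),
    DataTypes.createStructField("salary", DataTypes.IntegerType, false)
});
DataFrame df = sqc.createDataFrame(rows, schema);
df.show();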
Source: https://stackoverflow.com/questions/43633696/dataframe-from-liststring-in-java