DataFrame from List in Java

盖世英雄少女心 2021-01-15 17:23
  • Spark Version : 1.6.2
  • Java Version: 7

I have data in a List, something like:

[[dev, engg, 10000], [kar         


        
3 Answers
  • 2021-01-15 17:39

    The task can be done without JSON. In Scala:

    val data = List("dev, engg, 10000", "karthik, engg, 20000")
    val initialRdd = sparkContext.parallelize(data)
    // Split each line on commas and trim the spaces left around each field
    val splitRdd = initialRdd.map(current => {
      val array = current.split(",").map(_.trim)
      (array(0), array(1), array(2))
    })
    // Brings toDF into scope for RDDs of tuples
    import sqlContext.implicits._
    val dataframe = splitRdd.toDF("name", "degree", "salary")
    dataframe.show()
    

    Output is:

    +-------+------+------+
    |   name|degree|salary|
    +-------+------+------+
    |    dev|  engg| 10000|
    |karthik|  engg| 20000|
    +-------+------+------+
    

    Note: (array(0), array(1), array(2)) is a Scala tuple (a Tuple3), so toDF can name its three fields as columns.

  • 2021-01-15 17:44
    DataFrame createNGramDataFrame(JavaRDD<String> lines) {
     JavaRDD<Row> rows = lines.map(new Function<String, Row>(){
        private static final long serialVersionUID = -4332903997027358601L;
    
        @Override
        public Row call(String line) throws Exception {
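            // Cast to Object so the String[] becomes the single "words" array column
            // instead of being spread across the varargs as separate columns.
            return RowFactory.create((Object) line.split("\\s+"));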
        }
     });
     StructType schema = new StructType(new StructField[] {
            new StructField("words",
                    DataTypes.createArrayType(DataTypes.StringType), false,
                    Metadata.empty()) });
     DataFrame wordDF = new SQLContext(jsc).createDataFrame(rows, schema);
     // build a bigram language model
     NGram transformer = new NGram().setInputCol("words")
            .setOutputCol("ngrams").setN(2);
     DataFrame ngramDF = transformer.transform(wordDF);
     ngramDF.show(10, false);
     return ngramDF;
    }
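
A note on the RowFactory.create call in the snippet above: it is declared with an Object... varargs parameter, and the schema declares a single array column ("words"). A String[] passed without a cast is used as the varargs array itself, so every word would become its own column; casting to Object keeps the array as one value. The sketch below shows the difference in plain Java. countValues is a hypothetical stand-in with the same parameter shape, not a Spark method.

```java
class VarargsPitfall {
    // Hypothetical stand-in for a method shaped like RowFactory.create(Object... values).
    // It only reports how many separate values it received.
    static int countValues(Object... values) {
        return values.length;
    }

    public static void main(String[] args) {
        String[] words = "the quick brown fox".split("\\s+");

        // The array is used as the varargs array itself: one value per word.
        // javac resolves a bare countValues(words) the same way, with a warning.
        int spread = countValues((Object[]) words);

        // Casting to Object wraps the whole array as a single value.
        int single = countValues((Object) words);

        System.out.println(spread + " values vs " + single + " value");
    }
}
```

With a one-column schema, the second form is the one that matches.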
    
  • 2021-01-15 17:54

    You can create a DataFrame from a List<String> and then use selectExpr and split to get the desired DataFrame.

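    import java.util.ArrayList;
    import java.util.List;

    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaSparkContext;
    import org.apache.spark.sql.DataFrame;
    import org.apache.spark.sql.Encoders;
    import org.apache.spark.sql.SQLContext;

    public class SparkSample {
        public static void main(String[] args) {
            SparkConf conf = new SparkConf().setAppName("SparkSample").setMaster("local[*]");
            JavaSparkContext jsc = new JavaSparkContext(conf);
            SQLContext sqc = new SQLContext(jsc);
            // sample data
            List<String> data = new ArrayList<String>();
            data.add("dev, engg, 10000");
            data.add("karthik, engg, 20000");
            // DataFrame with a single string column named "value"
            DataFrame df = sqc.createDataset(data, Encoders.STRING()).toDF();
            df.printSchema();
            df.show();
            // Convert: split each line on commas into three named columns
            DataFrame df1 = df.selectExpr("split(value, ',')[0] as name", "split(value, ',')[1] as degree", "split(value, ',')[2] as salary");
            df1.printSchema();
            df1.show();
        }
    }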
    

    You will get the output below.

    root
     |-- value: string (nullable = true)
    
    +--------------------+
    |               value|
    +--------------------+
    |    dev, engg, 10000|
    |karthik, engg, 20000|
    +--------------------+
    
    root
     |-- name: string (nullable = true)
     |-- degree: string (nullable = true)
     |-- salary: string (nullable = true)
    
    +-------+------+------+
    |   name|degree|salary|
    +-------+------+------+
    |    dev|  engg| 10000|
    |karthik|  engg| 20000|
    +-------+------+------+
    

    The sample data you provided contains spaces after the commas. To remove them and make the salary column an integer, use the trim and cast functions as below.

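    // needs: import static org.apache.spark.sql.functions.col; import static org.apache.spark.sql.functions.trim;
    df1 = df1.select(trim(col("name")).as("name"), trim(col("degree")).as("degree"), trim(col("salary")).cast("integer").as("salary"));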
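
To see what the trim and cast step does to each record, here is a minimal plain-Java sketch of the same per-line logic, with no Spark involved. RecordCleaner and clean are hypothetical names used only for illustration. One difference to keep in mind: Integer.parseInt throws on a malformed salary, whereas Spark's cast, as far as I know, yields null in that case.

```java
import java.util.Arrays;
import java.util.List;

class RecordCleaner {
    // Splits one "name, degree, salary" line on commas, trims each field,
    // and parses the salary as an int.
    static Object[] clean(String line) {
        String[] parts = line.split(",");
        if (parts.length != 3) {
            throw new IllegalArgumentException("Expected 3 fields: " + line);
        }
        String name = parts[0].trim();
        String degree = parts[1].trim();
        int salary = Integer.parseInt(parts[2].trim());
        return new Object[] { name, degree, salary };
    }

    public static void main(String[] args) {
        List<String> data = Arrays.asList("dev, engg, 10000", "karthik, engg, 20000");
        for (String line : data) {
            // prints [dev, engg, 10000] then [karthik, engg, 20000]
            System.out.println(Arrays.toString(clean(line)));
        }
    }
}
```

Without the trim, the second and third fields would keep their leading space (" engg", " 10000"), which show() hides because it right-aligns the values.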
    