Parse CSV as DataFrame/DataSet with Apache Spark and Java

灰色年华 · 2020-12-07 16:54

I am new to Spark, and I want to use group-by & reduce to find the following from a CSV (one line per employee):

  Department, Designation, costToCompany, State
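
For instance, the input rows might look like this (hypothetical sample data; the actual file contents are not shown in the question):

  Sales,Trainee,12000,UP
  Sales,Lead,32000,AP
  Marketing,Associate,18000,TN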


        
4 Answers
  •  囚心锁ツ
    2020-12-07 17:30

    Procedure

    • Create a class (schema) to encapsulate your structure (it is not required for approach B, but it will make your code easier to read if you are using Java)

      public class Record implements Serializable {
        String department;
        String designation;
        long costToCompany;
        String state;
        // constructor, getters and setters
      }
      
    • Load the CSV (or JSON) file

      import org.apache.spark.SparkConf;
      import org.apache.spark.api.java.JavaRDD;
      import org.apache.spark.api.java.JavaSparkContext;
      import org.apache.spark.api.java.function.Function;
      import org.apache.spark.sql.SQLContext;

      SparkConf conf = new SparkConf().setAppName("csv-example");
      JavaSparkContext sc = new JavaSparkContext(conf);
      JavaRDD<String> data = sc.textFile("path/input.csv");
      //JavaSQLContext sqlContext = new JavaSQLContext(sc); // for Spark versions before 1.3
      SQLContext sqlContext = new SQLContext(sc); // in Spark 1.3 the Java API and Scala API were unified

      JavaRDD<Record> rdd_records = data.map(
        new Function<String, Record>() {
            public Record call(String line) throws Exception {
               // For JSON input you could parse with Gson instead:
               // Gson gson = new Gson();
               // return gson.fromJson(line, Record.class);
               String[] fields = line.split(",");
               // costToCompany is declared as long, so parse it
               return new Record(fields[0], fields[1],
                                 Long.parseLong(fields[2].trim()), fields[3]);
            }
      });
      

    At this point you have 2 approaches:

    A. SparkSQL

    • Register a table (using your defined schema class)

      DataFrame table = sqlContext.createDataFrame(rdd_records, Record.class); // Spark 1.3+
      // JavaSchemaRDD table = sqlContext.applySchema(rdd_records, Record.class); // Spark < 1.3
      table.registerTempTable("record_table"); // registerAsTable in older versions
      table.printSchema();
      
    • Query the table with your desired group-by

      DataFrame res = sqlContext.sql(
          "select department, designation, state, sum(costToCompany), count(*) "
        + "from record_table "
        + "group by department, designation, state");
      
    • Here you can also run any other query you need, using the SQL approach; a minimal sketch of inspecting the result follows.
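
    A minimal sketch of looking at the result on the driver (assuming the Spark 1.3+ DataFrame API used above; for large results prefer writing to storage over collect()):

      res.show(); // print the first 20 rows to the console

      for (Row row : res.collect()) { // Row is org.apache.spark.sql.Row
          System.out.println(row);
      }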

    B. Spark

    • Mapping using a composite key: Department,Designation,State

      // needs: org.apache.spark.api.java.JavaPairRDD, scala.Tuple2,
      //        org.apache.spark.api.java.function.PairFunction
      JavaPairRDD<String, Tuple2<Long, Integer>> records_JPRDD =
        rdd_records.mapToPair(
          new PairFunction<Record, String, Tuple2<Long, Integer>>() {
            public Tuple2<String, Tuple2<Long, Integer>> call(Record record) {
              // use a separator so distinct key parts cannot collide
              return new Tuple2<String, Tuple2<Long, Integer>>(
                record.department + "," + record.designation + "," + record.state,
                new Tuple2<Long, Integer>(record.costToCompany, 1));
            }
        });

    • reduceByKey on the composite key, summing the costToCompany column and counting the number of records per key

      JavaPairRDD<String, Tuple2<Long, Integer>> final_rdd_records =
        records_JPRDD.reduceByKey(
          new Function2<Tuple2<Long, Integer>, Tuple2<Long, Integer>, Tuple2<Long, Integer>>() {
            public Tuple2<Long, Integer> call(Tuple2<Long, Integer> v1,
                                              Tuple2<Long, Integer> v2) throws Exception {
              // sum the costs, add up the record counts
              return new Tuple2<Long, Integer>(v1._1 + v2._1, v1._2 + v2._2);
            }
        });
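
    A minimal sketch of reading the aggregated pairs back on the driver (fine while the number of distinct keys is small):

      for (Tuple2<String, Tuple2<Long, Integer>> entry : final_rdd_records.collect()) {
          System.out.println(entry._1 + " -> sum(costToCompany)=" + entry._2._1
                             + ", count=" + entry._2._2);
      }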
      

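    As a side note: on Spark 2.x and later you can skip the hand-written parsing entirely, because CSV is a built-in DataFrame source. A sketch, assuming the same headerless four-column file (the column names below are supplied by hand):

      import org.apache.spark.sql.Dataset;
      import org.apache.spark.sql.Row;
      import org.apache.spark.sql.SparkSession;
      import static org.apache.spark.sql.functions.*;

      SparkSession spark = SparkSession.builder().appName("csv-example").getOrCreate();
      Dataset<Row> df = spark.read().csv("path/input.csv")
          .toDF("department", "designation", "costToCompany", "state")
          .withColumn("costToCompany", col("costToCompany").cast("long"));
      df.groupBy("department", "designation", "state")
        .agg(sum("costToCompany"), count(lit(1)))
        .show();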