Parse CSV as DataFrame/DataSet with Apache Spark and Java

灰色年华 · 2020-12-07 16:54

I am new to Spark, and I want to use group-by & reduce to find the following from a CSV (one line per employee):

  Department, Designation, costToCompany, State
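
For instance, the input rows might look like this (hypothetical sample data; the actual file contents are not shown in the question):

  Sales,Trainee,12000,UP
  Sales,Lead,32000,AP
  Marketing,Associate,18000,TN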


        
4 Answers
  •  囚心锁ツ
    2020-12-07 17:30

    Procedure

    • Create a class (schema) to encapsulate your structure (it is not required for approach B, but it will make your code easier to read if you are using Java)

      public class Record implements Serializable {
        String department;
        String designation;
        long costToCompany;
        String state;
        // constructor, getters and setters
      }
      
    • Load the CSV (or JSON) file

      import org.apache.spark.SparkConf;
      import org.apache.spark.api.java.JavaRDD;
      import org.apache.spark.api.java.JavaSparkContext;
      import org.apache.spark.api.java.function.Function;
      import org.apache.spark.sql.SQLContext;

      SparkConf conf = new SparkConf().setAppName("csv-example");
      JavaSparkContext sc = new JavaSparkContext(conf);
      JavaRDD<String> data = sc.textFile("path/input.csv");
      //JavaSQLContext sqlContext = new JavaSQLContext(sc); // for Spark versions before 1.3
      SQLContext sqlContext = new SQLContext(sc); // in Spark 1.3 the Java API and Scala API were unified

      JavaRDD<Record> rdd_records = data.map(
        new Function<String, Record>() {
            public Record call(String line) throws Exception {
               // For JSON input you could parse with Gson instead:
               // Gson gson = new Gson();
               // return gson.fromJson(line, Record.class);
               String[] fields = line.split(",");
               // costToCompany is declared as long, so parse it
               return new Record(fields[0], fields[1],
                                 Long.parseLong(fields[2].trim()), fields[3]);
            }
      });
      

    At this point you have 2 approaches:

    A. SparkSQL

    • Register a table (using your defined schema class)

      DataFrame table = sqlContext.createDataFrame(rdd_records, Record.class); // Spark 1.3+
      // JavaSchemaRDD table = sqlContext.applySchema(rdd_records, Record.class); // Spark < 1.3
      table.registerTempTable("record_table"); // registerAsTable in older versions
      table.printSchema();
      
    • Query the table with your desired group-by

      DataFrame res = sqlContext.sql(
          "select department, designation, state, sum(costToCompany), count(*) "
        + "from record_table "
        + "group by department, designation, state");
      
    • Here you can also run any other query you need, using the SQL approach; a minimal sketch of inspecting the result follows.
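
    A minimal sketch of looking at the result on the driver (assuming the Spark 1.3+ DataFrame API used above; for large results prefer writing to storage over collect()):

      res.show(); // print the first 20 rows to the console

      for (Row row : res.collect()) { // Row is org.apache.spark.sql.Row
          System.out.println(row);
      }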

    B. Spark

    • Mapping using a composite key: Department,Designation,State

      // needs: org.apache.spark.api.java.JavaPairRDD, scala.Tuple2,
      //        org.apache.spark.api.java.function.PairFunction
      JavaPairRDD<String, Tuple2<Long, Integer>> records_JPRDD =
        rdd_records.mapToPair(
          new PairFunction<Record, String, Tuple2<Long, Integer>>() {
            public Tuple2<String, Tuple2<Long, Integer>> call(Record record) {
              // use a separator so distinct key parts cannot collide
              return new Tuple2<String, Tuple2<Long, Integer>>(
                record.department + "," + record.designation + "," + record.state,
                new Tuple2<Long, Integer>(record.costToCompany, 1));
            }
        });

    • reduceByKey on the composite key, summing the costToCompany column and counting the number of records per key

      JavaPairRDD<String, Tuple2<Long, Integer>> final_rdd_records =
        records_JPRDD.reduceByKey(
          new Function2<Tuple2<Long, Integer>, Tuple2<Long, Integer>, Tuple2<Long, Integer>>() {
            public Tuple2<Long, Integer> call(Tuple2<Long, Integer> v1,
                                              Tuple2<Long, Integer> v2) throws Exception {
              // sum the costs, add up the record counts
              return new Tuple2<Long, Integer>(v1._1 + v2._1, v1._2 + v2._2);
            }
        });
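
    A minimal sketch of reading the aggregated pairs back on the driver (fine while the number of distinct keys is small):

      for (Tuple2<String, Tuple2<Long, Integer>> entry : final_rdd_records.collect()) {
          System.out.println(entry._1 + " -> sum(costToCompany)=" + entry._2._1
                             + ", count=" + entry._2._2);
      }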
      

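    As a side note: on Spark 2.x and later you can skip the hand-written parsing entirely, because CSV is a built-in DataFrame source. A sketch, assuming the same headerless four-column file (the column names below are supplied by hand):

      import org.apache.spark.sql.Dataset;
      import org.apache.spark.sql.Row;
      import org.apache.spark.sql.SparkSession;
      import static org.apache.spark.sql.functions.*;

      SparkSession spark = SparkSession.builder().appName("csv-example").getOrCreate();
      Dataset<Row> df = spark.read().csv("path/input.csv")
          .toDF("department", "designation", "costToCompany", "state")
          .withColumn("costToCompany", col("costToCompany").cast("long"));
      df.groupBy("department", "designation", "state")
        .agg(sum("costToCompany"), count(lit(1)))
        .show();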