How to improve the performance of my Spark job that loads data into a Cassandra table?

帅比萌擦擦* submitted on 2019-12-13 03:48:48

Question


I am using spark-sql 2.4.1 and spark-cassandra-connector_2.11-2.4.1 with Java 8 and Apache Cassandra 3.0.

My spark-submit / Spark cluster environment is set up as below to load 2 billion records:

--executor-cores 3 
--executor-memory 9g 
--num-executors 5 
--driver-cores 2 
--driver-memory 4g 

I am using a 6-node Cassandra cluster with the following settings:

cassandra.output.consistency.level=ANY
cassandra.concurrent.writes=1500
cassandra.output.batch.size.bytes=2056
cassandra.output.batch.grouping.key=partition 
cassandra.output.batch.grouping.buffer.size=3000
cassandra.output.throughput_mb_per_sec=128
cassandra.connection.keep_alive_ms=30000
cassandra.read.timeout_ms=600000
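
For reference, these keys correspond to the connector's spark.cassandra.* properties; assuming they are passed through the SparkSession config (a sketch, not part of the original setup; the app name is a placeholder), they would look like:

import org.apache.spark.sql.SparkSession;

// Sketch only: the same settings expressed as spark.cassandra.* properties
// (property names as documented for spark-cassandra-connector 2.4).
SparkSession spark = SparkSession.builder()
        .appName("cassandra-load")   // hypothetical app name
        .config("spark.cassandra.output.consistency.level", "ANY")
        .config("spark.cassandra.output.concurrent.writes", "1500")
        .config("spark.cassandra.output.batch.size.bytes", "2056")
        .config("spark.cassandra.output.batch.grouping.key", "partition")
        .config("spark.cassandra.output.batch.grouping.buffer.size", "3000")
        .config("spark.cassandra.output.throughput_mb_per_sec", "128")
        .config("spark.cassandra.connection.keep_alive_ms", "30000")
        .config("spark.cassandra.read.timeout_ms", "600000")
        .getOrCreate();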

I am loading the data into Cassandra tables using a Spark DataFrame. After reading into a Spark Dataset, I group by certain columns as below.

Dataset<Row> dataDf = // read data from the source, i.e. HDFS files already partitioned by "load_date", "fiscal_year", "fiscal_quarter", "id", "type", "type_code"

Dataset<Row> groupedDf = dataDf.groupBy("id", "type", "value", "load_date", "fiscal_year", "fiscal_quarter", "create_user_txt", "create_date")



groupedDf.write().format("org.apache.spark.sql.cassandra")
    .option("table", "product")
    .option("keyspace", "dataload")
    .mode(SaveMode.Append)
    .save();

Cassandra table (partial definition):

CREATE TABLE dataload.product (
    ...
    PRIMARY KEY (( id, type, value, item_code ), load_date)
) WITH CLUSTERING ORDER BY ( load_date DESC );

Basically I group by the "id", "type", "value", "load_date" columns. Since the other columns ("fiscal_year", "fiscal_quarter", "create_user_txt", "create_date") must also be available for storing into the Cassandra table, I have to include them in the groupBy clause as well.

1) Frankly speaking, I don't know how to get those columns into the resulting DataFrame (i.e. groupedDf) after the groupBy so that I can store them. Any advice on how to tackle this, please?

2) With the above process/steps, my Spark loading job is pretty slow due to a lot of shuffling, i.e. read-shuffle and write-shuffle.

What should I do here to improve the speed?

While reading from the source (into dataDf), do I need to do anything to improve performance? The source is already partitioned.

Do I still need to do any partitioning? If so, what is the best way/approach given the above Cassandra table?
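
For context, "partitioning" here would mean something like repartitioning on the table's Cassandra partition key columns before the write, so that rows destined for the same Cassandra partition land in the same Spark partition (a sketch only, not benchmarked; key columns taken from the PRIMARY KEY above):

import static org.apache.spark.sql.functions.col;

// Sketch: co-locate rows that share the Cassandra partition key
// (id, type, value, item_code) so the connector can batch them together.
Dataset<Row> writePartitioned = dataDf.repartition(
        col("id"), col("type"), col("value"), col("item_code"));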

HDFS file columns

"id","type","value","type_code","load_date","item_code","fiscal_year","fiscal_quarter","create_date","last_update_date","create_user_txt","update_user_txt"

Pivoting

I am using groupBy because of pivoting, as below:

Dataset<Row> pivot_model_vals_unpersist_df = model_vals_df
        .groupBy("id", "type", "value", "type_code", "load_date", "item_code", "fiscal_year", "fiscal_quarter", "create_date")
        .pivot("type_code")
        .agg(first( /* business logic */ ));

Please advise. Your advice/feedback would be highly appreciated.


Answer 1:


So, as I gathered from the comments, your task is as follows:

  1. Take 2 billion rows from HDFS.

  2. Save these rows into Cassandra, with some conversion.

  3. The schema of the Cassandra table is not the same as the schema of the HDFS dataset.

First of all, you definitely don't need groupBy. groupBy doesn't group columns; it groups rows and applies some aggregate function such as sum, avg, max, etc. to each group. The semantics are the same as SQL's GROUP BY, so it is not what you need here. What you really need is to make your "to save" dataset fit the desired Cassandra schema.
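
In the simplest case that is just a select of the columns the target table expects, written back with the DataFrame writer the question already uses (a sketch, assuming the table's column names match the HDFS field names):

// Sketch: project exactly the columns the Cassandra table expects,
// then reuse the writer options from the question.
Dataset<Row> toSave = dataDf.select(
        "id", "type", "value", "item_code", "load_date",
        "fiscal_year", "fiscal_quarter", "create_user_txt", "create_date");

toSave.write().format("org.apache.spark.sql.cassandra")
        .option("table", "product")
        .option("keyspace", "dataload")
        .mode(SaveMode.Append)
        .save();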

In Java this is a little trickier than in Scala. First, I suggest defining a bean that represents a Cassandra row.

public class MyClass {

    // Remember to declare a no-args constructor
    public MyClass() { }

    private Long id;
    private String type;
    // other fields, getters, setters, etc.
}

Your data is a Dataset<Row>; you need to convert it into a JavaRDD<MyClass>, so you need a converter.

public class MyClassFabric {
    public static MyClass fromRow(Row row) {
        MyClass myClass = new MyClass();
        // Row has no getInt(String) overload: look the field up by name instead
        myClass.setId(row.getAs("id"));
        // ....
        return myClass;
    }
}

As a result, we would have something like this:

JavaRDD<MyClass> rdd = dataDf.toJavaRDD().map(MyClassFabric::fromRow);
javaFunctions(rdd)
    .writerBuilder("keyspace", "table", mapToRow(MyClass.class))
    .saveToCassandra();
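
For reference, the snippet above relies on these imports (package layout as in the spark-cassandra-connector 2.4 Java API):

import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

// javaFunctions and mapToRow are static helpers on CassandraJavaUtil
import static com.datastax.spark.connector.japi.CassandraJavaUtil.javaFunctions;
import static com.datastax.spark.connector.japi.CassandraJavaUtil.mapToRow;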

For additional info you can take a look at https://github.com/datastax/spark-cassandra-connector/blob/master/doc/7_java_api.md



Source: https://stackoverflow.com/questions/57684972/how-to-improve-performance-my-spark-job-here-to-load-data-into-cassandra-table
