Distributed loading of a wide row into Spark from Cassandra


Let's assume we have a Cassandra cluster with RF = N and a table containing wide rows.

Our table could have a primary key something like this: pk / ck1 / ck2 / ...

1 Answer
  • For the sake of future reference, I will explain how I solved this.

    I actually used a slightly different method to the one outlined above, one which does not involve calling Cassandra from inside Spark tasks.

    I started off with ck_list, a list of distinct values of the first clustering key where pk = PK. The exact code for that step is not shown here, but I downloaded this list directly from Cassandra in the Spark driver using CQL; a rough sketch of one way to do this follows.
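
    A minimal sketch of that driver-side CQL call, assuming the CassandraConnector API from the spark-cassandra-connector and reusing the KS, TBL and PK placeholders from the code further down (this is illustrative, not the author's original code):

    import com.datastax.spark.connector.cql.CassandraConnector
    import scala.collection.JavaConverters._

    // Fetch the distinct ck1 values of partition PK on the driver.
    // CQL's SELECT DISTINCT only applies to partition key columns, so we
    // read ck1 for the whole partition and de-duplicate driver-side.
    // Assumes the partition's ck1 values fit comfortably in driver memory.
    val ck_list: Seq[AnyRef] =
      CassandraConnector(sc.getConf).withSessionDo { session =>
        session.execute(s"SELECT ck1 FROM $KS.$TBL WHERE pk = ?", PK)
          .all().asScala
          .map(_.getObject("ck1"))
          .distinct
          .toSeq
      }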

    I then transformed ck_list into a list of RDDs and combined those RDDs (each one representing a Cassandra row slice) into one unified RDD (wide_row).

    The cast on the CassandraRDD is necessary because union returns a plain org.apache.spark.rdd.RDD, so without it the types in the reduce would not line up.

    After running the job I was able to verify that wide_row had x partitions, where x is the size of ck_list. A useful side effect is that wide_row is partitioned by the first clustering key, which is also the key I want to reduce by, so even more shuffling is avoided (see the aggregation sketch after the code below).

    I don't know if this is the best way to achieve what I wanted, but it certainly works.

    // ck_list: the distinct ck1 values where pk = PK (fetched on the driver as sketched above)

    import com.datastax.spark.connector._   // brings in sc.cassandraTable and CassandraRow

    val wide_row: org.apache.spark.rdd.RDD[CassandraRow] = ck_list.map( ck =>
      sc.cassandraTable(KS, TBL)
        .select("c1", "c2").where("pk = ? and ck1 = ?", PK, ck)
        .asInstanceOf[org.apache.spark.rdd.RDD[CassandraRow]]   // union returns a plain RDD, so unify the type here
    ).reduce( (x, y) => x.union(y) )
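
    To illustrate the partition count and the per-ck1 reduction mentioned above, here is a hypothetical follow-up (not part of the original answer). It assumes ck1 is added to the .select(...) so it is available on each row, that c1 holds a numeric value worth aggregating, and Spark 1.3+ so the pair-RDD implicits are in scope:

    // Each slice of wide_row holds exactly one ck1 value, so the map-side
    // combine in reduceByKey collapses every key locally before any data
    // crosses the network.
    println(wide_row.partitions.length)   // expected: ck_list.size

    val per_ck_totals = wide_row
      .map(row => (row.getString("ck1"), row.getDouble("c1")))   // assumes ck1 is in the .select(...)
      .reduceByKey(_ + _)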
    