Spark 2.2.0 - How to write/read DataFrame to DynamoDB

前端未结

关注

 2  534

温柔的废话 2021-02-15 13:31

I want my Spark application to read a table from DynamoDB, do stuff, then write the result in DynamoDB.

Read the table into a DataFrame

Right now, I can read

2条回答

清歌不尽 (楼主)

2021-02-15 14:33

I was following that "Using Spark SQL for ETL" link, and found the same "illegal cyclic reference" exception. The solution for that exception is quite simple (but it cost me 2 days to figure out) as below. The key point is to use map function on the RDD of the dataframe, not the dataframe itself.

val ddbConf = new JobConf(spark.sparkContext.hadoopConfiguration)
ddbConf.set("dynamodb.output.tableName", "")
ddbConf.set("dynamodb.throughput.write.percent", "1.5")
ddbConf.set("mapred.input.format.class", "org.apache.hadoop.dynamodb.read.DynamoDBInputFormat")
ddbConf.set("mapred.output.format.class", "org.apache.hadoop.dynamodb.write.DynamoDBOutputFormat")


val df_ddb =  spark.read.option("header","true").parquet("")
val schema_ddb = df_ddb.dtypes

var ddbInsertFormattedRDD = df_ddb.rdd.map(a => {
    val ddbMap = new HashMap[String, AttributeValue]()

    for (i <- 0 to schema_ddb.length - 1) {
        val value = a.get(i)
        if (value != null) {
            val att = new AttributeValue()
            att.setS(value.toString)
            ddbMap.put(schema_ddb(i)._1, att)
        }
    }

    val item = new DynamoDBItemWritable()
    item.setItem(ddbMap)

    (new Text(""), item)
}
)

ddbInsertFormattedRDD.saveAsHadoopDataset(ddbConf)

0 讨论(0)

查看其它2个回答