I want my Spark application to read a table from DynamoDB, do stuff, then write the result back to DynamoDB.
Right now, I can read
I was following the "Using Spark SQL for ETL" post and hit the same "illegal cyclic reference" exception. The fix is quite simple (though it cost me 2 days to figure out) and is shown below. The key point is to call map on the RDD of the dataframe (df.rdd), not on the dataframe itself.
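import java.util.HashMap
import com.amazonaws.services.dynamodbv2.model.AttributeValue
import org.apache.hadoop.dynamodb.DynamoDBItemWritable
import org.apache.hadoop.io.Text
import org.apache.hadoop.mapred.JobConf

// Hadoop job configuration for the EMR DynamoDB connector
val ddbConf = new JobConf(spark.sparkContext.hadoopConfiguration)
ddbConf.set("dynamodb.output.tableName", "")
ddbConf.set("dynamodb.throughput.write.percent", "1.5")
ddbConf.set("mapred.input.format.class", "org.apache.hadoop.dynamodb.read.DynamoDBInputFormat")
ddbConf.set("mapred.output.format.class", "org.apache.hadoop.dynamodb.write.DynamoDBOutputFormat")

// Source data to write
val df_ddb = spark.read.option("header","true").parquet("")
// Array of (column name, type name) pairs; only the names are used below
val schema_ddb = df_ddb.dtypes

// Map over df_ddb.rdd, not over df_ddb itself
val ddbInsertFormattedRDD = df_ddb.rdd.map(a => {
  val ddbMap = new HashMap[String, AttributeValue]()
  for (i <- 0 until schema_ddb.length) {
    val value = a.get(i)
    // Skip null columns; every other value is stored as a string (S) attribute
    if (value != null) {
      val att = new AttributeValue()
      att.setS(value.toString)
      ddbMap.put(schema_ddb(i)._1, att)
    }
  }
  val item = new DynamoDBItemWritable()
  item.setItem(ddbMap)
  // Each record is written as a (Text, DynamoDBItemWritable) pair
  (new Text(""), item)
})

// Write the pair RDD to DynamoDB using the output format configured above
ddbInsertFormattedRDD.saveAsHadoopDataset(ddbConf)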
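
If you want to check the per-row conversion without a cluster or a DynamoDB table, the loop inside the map can be pulled out into a plain function. This is only a sketch under some assumptions: toAttributeStrings is a hypothetical helper name, and it returns a java.util.HashMap[String, String] as a stand-in for the HashMap[String, AttributeValue] used above, so it runs with the standard library alone. It follows the same rule as the code above: null columns are skipped and everything else is stored as its string form.

```scala
import java.util.HashMap

// Hypothetical helper: mirrors the loop inside df_ddb.rdd.map above,
// with plain String values standing in for AttributeValue.
def toAttributeStrings(columns: Seq[String], values: Seq[Any]): HashMap[String, String] = {
  val out = new HashMap[String, String]()
  for ((name, value) <- columns.zip(values)) {
    // Null columns are left out of the item, as in the original loop
    if (value != null) out.put(name, value.toString)
  }
  out
}

// Example row with one null column
val item = toAttributeStrings(Seq("id", "name", "score"), Seq(42, null, 3.5))
```

In the real job, the put would wrap the string in an AttributeValue via setS, exactly as in the code above.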