Spark 2.2.0 - How to write/read DataFrame to DynamoDB

温柔的废话 2021-02-15 13:31

I want my Spark application to read a table from DynamoDB, process it, and then write the result back to DynamoDB.

Read the table into a DataFrame

Right now, I can read

2 Answers
  • 2021-02-15 14:10

    Here is a somewhat simpler working example.

    For writing to DynamoDB from a Kinesis stream, for example, using a Hadoop RDD:

    https://github.com/kali786516/Spark2StructuredStreaming/blob/master/src/main/scala/com/dataframe/part11/kinesis/consumer/KinesisSaveAsHadoopDataSet/TransactionConsumerDstreamToDynamoDBHadoopDataSet.scala

    For reading from DynamoDB using a Hadoop RDD and querying the result with Spark SQL, without regex:

    import org.apache.hadoop.dynamodb.DynamoDBItemWritable
    import org.apache.hadoop.dynamodb.read.DynamoDBInputFormat
    import org.apache.hadoop.io.Text
    import org.apache.hadoop.mapred.JobConf
    import org.apache.spark.rdd.RDD

    // Hadoop job configuration for the DynamoDB connector
    val ddbConf = new JobConf(spark.sparkContext.hadoopConfiguration)
    //ddbConf.set("dynamodb.output.tableName", "student")
    ddbConf.set("dynamodb.input.tableName", "student")
    ddbConf.set("dynamodb.throughput.write.percent", "1.5")
    ddbConf.set("dynamodb.endpoint", "dynamodb.us-east-1.amazonaws.com")
    ddbConf.set("dynamodb.regionid", "us-east-1")
    ddbConf.set("dynamodb.servicename", "dynamodb")
    ddbConf.set("dynamodb.throughput.read", "1")
    ddbConf.set("dynamodb.throughput.read.percent", "1")
    ddbConf.set("mapred.input.format.class", "org.apache.hadoop.dynamodb.read.DynamoDBInputFormat")
    ddbConf.set("mapred.output.format.class", "org.apache.hadoop.dynamodb.write.DynamoDBOutputFormat")
    //ddbConf.set("dynamodb.awsAccessKeyId", credentials.getAWSAccessKeyId)
    //ddbConf.set("dynamodb.awsSecretAccessKey", credentials.getAWSSecretKey)

    // Read the table as an RDD of (Text, DynamoDBItemWritable) pairs
    val data = spark.sparkContext.hadoopRDD(ddbConf, classOf[DynamoDBInputFormat], classOf[Text], classOf[DynamoDBItemWritable])

    // Keep only the string form of each item so it can be fed to the JSON reader
    val simple2: RDD[String] = data.map { case (_, dbwritable) => dbwritable.toString }

    // registerTempTable is deprecated in Spark 2.x; createOrReplaceTempView replaces it
    spark.read.json(simple2).createOrReplaceTempView("gooddata")

    spark.sql("select replace(replace(split(cast(address as string),',')[0],']',''),'[','') as housenumber from gooddata").show(false)
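
    The nested replace/split expression in that query is dense, so here is a plain-Scala sketch of the same string steps. The helper name houseNumber and the sample value are hypothetical; the sketch assumes that the address column, once cast to a string, looks like a bracketed, comma-separated list, which the answer itself does not confirm.

    ```scala
    // Hypothetical helper mirroring the SQL expression step by step:
    // split on ',', take the first element, then strip ']' and '['.
    def houseNumber(address: String): String =
      address.split(",")(0).replace("]", "").replace("[", "")

    // The sample input is an assumption about what the address column holds.
    println(houseNumber("[12, Main Street]"))
    ```

    If the column has a different shape, the split delimiter and the characters being stripped would need to change accordingly.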
    
    
  • 2021-02-15 14:33

    I was following the "Using Spark SQL for ETL" link and hit the same "illegal cyclic reference" exception. The fix is quite simple (though it took me two days to figure out) and is shown below. The key point is to call map on the DataFrame's underlying RDD (df.rdd), not on the DataFrame itself.

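    import java.util.HashMap

    import com.amazonaws.services.dynamodbv2.model.AttributeValue
    import org.apache.hadoop.dynamodb.DynamoDBItemWritable
    import org.apache.hadoop.io.Text
    import org.apache.hadoop.mapred.JobConf

    // Hadoop job configuration for the DynamoDB connector
    val ddbConf = new JobConf(spark.sparkContext.hadoopConfiguration)
    ddbConf.set("dynamodb.output.tableName", "<myTableName>")
    ddbConf.set("dynamodb.throughput.write.percent", "1.5")
    ddbConf.set("mapred.input.format.class", "org.apache.hadoop.dynamodb.read.DynamoDBInputFormat")
    ddbConf.set("mapred.output.format.class", "org.apache.hadoop.dynamodb.write.DynamoDBOutputFormat")

    val df_ddb = spark.read.parquet("<myInputFile>")
    // Array of (columnName, typeName) pairs
    val schema_ddb = df_ddb.dtypes

    // Map over df_ddb.rdd, not over the DataFrame itself
    val ddbInsertFormattedRDD = df_ddb.rdd.map { row =>
        val ddbMap = new HashMap[String, AttributeValue]()

        for (i <- schema_ddb.indices) {
            val value = row.get(i)
            // Skip nulls; every non-null value is stored as a DynamoDB string
            if (value != null) {
                val att = new AttributeValue()
                att.setS(value.toString)
                ddbMap.put(schema_ddb(i)._1, att)
            }
        }

        val item = new DynamoDBItemWritable()
        item.setItem(ddbMap)

        // The output format expects (Text, DynamoDBItemWritable) pairs
        (new Text(""), item)
    }

    ddbInsertFormattedRDD.saveAsHadoopDataset(ddbConf)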
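
    The loop above stores every column as a DynamoDB string (setS). If numeric columns should be stored as DynamoDB numbers instead, one option is to choose the attribute type from the Spark type name that df.dtypes returns. This is only a sketch: ddbTypeTag is a hypothetical helper, the column names are made up, and the mapping is an assumption about which Spark types you want treated as numbers or booleans.

    ```scala
    // Hypothetical helper: map a Spark type name (as returned by df.dtypes)
    // to a DynamoDB attribute type tag: "N" (number), "BOOL", or "S" (string).
    def ddbTypeTag(sparkType: String): String = sparkType match {
      case "ByteType" | "ShortType" | "IntegerType" | "LongType" |
           "FloatType" | "DoubleType"       => "N"
      case t if t.startsWith("DecimalType") => "N"
      case "BooleanType"                    => "BOOL"
      case _                                => "S" // anything else falls back to a string
    }

    // Example with a dtypes-like array; these columns are invented for illustration.
    val sampleDtypes = Array(("id", "LongType"), ("name", "StringType"), ("price", "DecimalType(10,2)"))
    sampleDtypes.foreach { case (col, tpe) => println(s"$col -> ${ddbTypeTag(tpe)}") }
    ```

    Inside the loop, an "N" tag would then correspond to att.setN(value.toString) and a "BOOL" tag to att.setBOOL(...), with setS kept as the fallback.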
    