How can I get spark on emr-5.2.1 to write to dynamodb?

Submitted by 浪子不回头ぞ on 2021-01-29 03:16:29

Question


According to this article here, when I create an AWS EMR cluster that will use Spark to pipe data to DynamoDB, I need to include the connector jar when launching the shell:

spark-shell --jars /usr/share/aws/emr/ddb/lib/emr-ddb-hadoop.jar

This line appears in numerous references, including from the Amazon devs themselves. However, when I run create-cluster with the added --jars flag, I get this error:

Exception in thread "main" java.io.FileNotFoundException: File file:/usr/share/aws/emr/ddb/lib/emr-ddb-hadoop.jar does not exist
at org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:616)
at org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:829)
at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:606)
at org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:431)
at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:337)
at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:289)
...

There's an answer at this SO question saying the library should already be included in emr-5.2.1, so I tried running my code without the extra --jars flag:

ERROR ApplicationMaster: User class threw exception: java.lang.NoClassDefFoundError: org/apache/hadoop/dynamodb/DynamoDBItemWritable
java.lang.NoClassDefFoundError: org/apache/hadoop/dynamodb/DynamoDBItemWritable
at CopyS3ToDynamoApp$.main(CopyS3ToDynamo.scala:113)
at CopyS3ToDynamoApp.main(CopyS3ToDynamo.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:627)
Caused by: java.lang.ClassNotFoundException: org.apache.hadoop.dynamodb.DynamoDBItemWritable
at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
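For reference, the write in CopyS3ToDynamo.scala follows the usual emr-dynamodb-connector pattern, roughly like this (trimmed, with placeholder table name, region, and records; assumes an existing SparkContext sc):

import org.apache.hadoop.dynamodb.DynamoDBItemWritable
import org.apache.hadoop.dynamodb.write.DynamoDBOutputFormat
import org.apache.hadoop.io.Text
import org.apache.hadoop.mapred.JobConf
import com.amazonaws.services.dynamodbv2.model.AttributeValue
import scala.collection.JavaConverters._

// Configure the Hadoop job for the DynamoDB connector (placeholder values).
val jobConf = new JobConf(sc.hadoopConfiguration)
jobConf.set("dynamodb.output.tableName", "my-table")
jobConf.set("dynamodb.regionid", "us-east-1")
jobConf.set("mapred.output.format.class", classOf[DynamoDBOutputFormat].getName)

// Each record becomes a (Text, DynamoDBItemWritable) pair for the output format.
val items = sc.parallelize(Seq("a" -> 1, "b" -> 2)).map { case (id, n) =>
  val item = new DynamoDBItemWritable()
  item.setItem(Map(
    "id"    -> new AttributeValue().withS(id),
    "count" -> new AttributeValue().withN(n.toString)
  ).asJava)
  (new Text(""), item)
}
items.saveAsHadoopDataset(jobConf)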

Just for grins, I tried the alternative proposed by the other answer to that question, adding --driver-class-path,/usr/share/aws/emr/ddb/lib/emr-ddb-hadoop.jar to my step's argument list, and got told:

Exception in thread "main" java.lang.RuntimeException: java.lang.ClassNotFoundException: Class org.apache.hadoop.fs.s3a.S3AFileSystem not found
at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2195)
at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2702)
at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2715)
at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:93)

Not being able to find s3a.S3AFileSystem seems like a big one, especially since I have other jobs that read from S3 just fine; apparently reading from S3 and writing to DynamoDB is tricky. Any idea how to solve this problem?

Update: I figured out that S3 wasn't being found because I was overriding the classpath and dropping all the other libraries, so I updated the classpath like so:

class_path = "/usr/lib/hadoop-lzo/lib/*:/usr/lib/hadoop/hadoop-aws.jar:" \
             "/usr/share/aws/aws-java-sdk/*:/usr/share/aws/emr/emrfs/conf:" \
             "/usr/share/aws/emr/emrfs/lib/*:/usr/share/aws/emr/emrfs/auxlib/*:" \
             "/usr/share/aws/emr/security/conf:/usr/share/aws/emr/security/lib/*:" \
             "/usr/share/aws/emr/ddb/lib/*"

And now I get this error:

 diagnostics: User class threw exception: java.lang.NoClassDefFoundError: org/apache/hadoop/dynamodb/DynamoDBItemWritable
 ApplicationMaster host: 10.178.146.133
 ApplicationMaster RPC port: 0
 queue: default
 start time: 1484852731196
 final status: FAILED
 tracking URL: http://ip-10-178-146-68.syseng.tmcs:20888/proxy/application_1484852606881_0001/

So it looks like the library isn't in the location specified by the AWS documentation. Has anyone gotten this to work?


Answer 1:


OK, figuring this out took me days, so I'll spare whoever comes along next the same search.

The reason these methods fail is that the path specified by the AWS folks does not exist on emr-5.2.1 clusters (and maybe not on any EMR 5.0 cluster at all).

So instead, I downloaded version 4.2 of the emr-dynamodb-hadoop jar from Maven.

Because the jar is not on the EMR cluster, you're going to need to include it in your own jar. If you're using sbt, you can use sbt-assembly; see the sketch below. If you don't want such a monolithic jar (and don't want to sort out the conflict resolution between version 1.7 and 1.8 of netbeans), you can also just merge jars as part of your build process. That way you have one jar for your EMR step that you can put on S3 for easy create-cluster-based on-demand Spark jobs.
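For what it's worth, here is a minimal build.sbt sketch of that setup (the coordinates are the ones the awslabs emr-dynamodb-connector project publishes; the Spark version is my assumption of what emr-5.2.1 ships):

// build.sbt sketch — bundle the connector, leave Spark itself "provided"
// so it stays out of the assembled jar (the cluster supplies it).
libraryDependencies += "com.amazon.emr" % "emr-dynamodb-hadoop" % "4.2"
libraryDependencies += "org.apache.spark" %% "spark-core" % "2.0.2" % "provided"

With the sbt-assembly plugin enabled, `sbt assembly` then produces the single jar you upload to S3 for the EMR step.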




Answer 2:


I have used https://github.com/audienceproject/spark-dynamodb for connecting Spark to DynamoDB on EMR. There are a lot of issues if you try to use a Scala 2.12.x version; the configuration below is what worked for me.

Spark 2.3.3, Scala 2.11.12, spark-dynamodb_2.11 0.4.4, guava 14.0.1.

This works on EMR emr-5.22.0 without any issue.
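For reference, a build.sbt sketch matching those versions (the coordinates are the ones spark-dynamodb publishes to Maven Central; pinning guava via dependencyOverrides is my assumption about how the 14.0.1 requirement was enforced):

// build.sbt sketch — versions taken from the list above
scalaVersion := "2.11.12"

libraryDependencies += "org.apache.spark" %% "spark-sql" % "2.3.3" % "provided"
libraryDependencies += "com.audienceproject" %% "spark-dynamodb" % "0.4.4"

// Pin guava to the version noted above (assumed to be done via an override).
dependencyOverrides += "com.google.guava" % "guava" % "14.0.1"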

Sample code:

import org.apache.spark.sql.{DataFrame, Row, SparkSession}
import org.apache.spark.sql.functions.{col, to_timestamp}
import org.apache.spark.sql.types._
import com.audienceproject.spark.dynamodb.implicits._

def main(args: Array[String]): Unit = {

  val spark = SparkSession.builder
    .appName("DynamoController1")
    .master("local[*]") // for local testing; on EMR, let spark-submit set the master
    .getOrCreate()

  val someData = Seq(
    Row(313080991, 1596115553835L, "U", "Insert", "455 E 520th Ave qqqqq", "AsutoshC", "paridaC", 1592408065L),
    Row(313080881, 1596115553835L, "I", "Insert", "455 E 520th Ave qqqqq", "AsutoshC", "paridaC", 1592408060L),
    Row(313080771, 1596115664774L, "U", "Update", "455 E 520th Ave odisha", "NishantC", "KanungoC", 1592408053L)
  )

  val candidateSchema = StructType(Array(
    StructField("candidateId", IntegerType, nullable = false),
    StructField("repoCreateDate", LongType, nullable = true),
    StructField("accessType", StringType, nullable = true),
    StructField("action", StringType, nullable = true),
    StructField("address1", StringType, nullable = true),
    StructField("firstName", StringType, nullable = true),
    StructField("lastName", StringType, nullable = true),
    StructField("updateDate", LongType, nullable = true)
  ))

  var someDF = spark.createDataFrame(spark.sparkContext.parallelize(someData), candidateSchema)

  // updateDate is epoch seconds; casting to timestamp gives a readable column.
  someDF = someDF.withColumn("datetype_timestamp", to_timestamp(col("updateDate")))
  someDF.createOrReplaceTempView("rawData")

  val sourceCount = someDF.select(someDF.schema.head.name).count()
  println(s"step [1.0.1] Fetched $sourceCount")
  someDF.show()

  val compressedDF: DataFrame = spark.sql(
    "select candidateId, repoCreateDate, accessType, action, address1, firstName, lastName, updateDate from rawData")
  compressedDF.show(20)

  // Write to DynamoDB and read it back via the spark-dynamodb implicits.
  compressedDF.write.dynamodb("xcloud.Candidate")
  val dynamoDf = spark.read.dynamodb("xcloud.Candidate")
  dynamoDf.show()
}
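As a follow-up, spark-dynamodb can also read straight into a case class via dynamodbAs (a sketch; this Candidate class and its field subset are mine, for illustration):

case class Candidate(candidateId: Int, firstName: String, lastName: String)

// Typed read: returns Dataset[Candidate] instead of an untyped DataFrame.
val typedCandidates = spark.read.dynamodbAs[Candidate]("xcloud.Candidate")
typedCandidates.show()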

Hope this helps someone!



Source: https://stackoverflow.com/questions/41735060/how-can-i-get-spark-on-emr-5-2-1-to-write-to-dynamodb
