Kinesis Stream with Empty Records in Google Dataproc with Spark 1.6.1 Hadoop 2.7.2


Question


I am trying to connect to Amazon Kinesis Stream from Google Dataproc but am only getting Empty RDDs.

Command: spark-submit --verbose --packages org.apache.spark:spark-streaming-kinesis-asl_2.10:1.6.2 demo_kinesis_streaming.py --awsAccessKeyId XXXXX --awsSecretKey XXXX

Detailed Log: https://gist.github.com/sshrestha-datalicious/e3fc8ebb4916f27735a97e9fcc42136c

More Details
Spark 1.6.1
Hadoop 2.7.2
Assembly Used: /usr/lib/spark/lib/spark-assembly-1.6.1-hadoop2.7.2.jar

Surprisingly, it works when I download the assembly containing Spark 1.6.1 with Hadoop 2.6.0 and use it with the following command.

Command: SPARK_HOME=/opt/spark-1.6.1-bin-hadoop2.6 spark-submit --verbose --packages org.apache.spark:spark-streaming-kinesis-asl_2.10:1.6.2 demo_kinesis_streaming.py --awsAccessKeyId XXXXX --awsSecretKey XXXX

I am not sure whether there is a version conflict between the two Hadoop versions and the Kinesis ASL, or whether it has to do with custom settings in Google Dataproc.

Any help would be appreciated.

Thanks
Suren


Answer 1:


Our team was in a similar situation and we managed to solve it:

We are running on the same environment:

  • DataProc Image Version 1 with Spark 1.6.1 and Hadoop 2.7
  • A simple Spark Streaming Kinesis script that boils down to this:

    # Run the script as
    # spark-submit  \
    #    --packages org.apache.spark:spark-streaming-kinesis-asl_2.10:1.6.1\
    #    demo_kinesis_streaming.py\
    #    --awsAccessKeyId FOO\
    #    --awsSecretKey BAR\
    #    ... 
    
    import argparse
    
    from pyspark import SparkContext
    from pyspark.streaming import StreamingContext
    from pyspark.storagelevel import StorageLevel
    
    from pyspark.streaming.kinesis import KinesisUtils, InitialPositionInStream
    
    ap = argparse.ArgumentParser()
    ap.add_argument('--awsAccessKeyId', required=True)
    ap.add_argument('--awsSecretKey', required=True)
    ap.add_argument('--stream_name')
    ap.add_argument('--region')
    ap.add_argument('--app_name')
    ap = ap.parse_args()
    
    kinesis_application_name = ap.app_name
    kinesis_stream_name = ap.stream_name
    kinesis_region = ap.region
    kinesis_endpoint_url = 'https://kinesis.{}.amazonaws.com'.format(ap.region)
    
    spark_context = SparkContext(appName=kinesis_application_name)
    streamingContext = StreamingContext(spark_context, 60)
    
    kinesisStream = KinesisUtils.createStream(
        ssc=streamingContext,
        kinesisAppName=kinesis_application_name,
        streamName=kinesis_stream_name,
        endpointUrl=kinesis_endpoint_url,
        regionName=kinesis_region,
        initialPositionInStream=InitialPositionInStream.TRIM_HORIZON,
        checkpointInterval=60,
        storageLevel=StorageLevel.MEMORY_AND_DISK_2,
        awsAccessKeyId=ap.awsAccessKeyId,
        awsSecretKey=ap.awsSecretKey
    )
    
    kinesisStream.pprint()
    
    streamingContext.start()
    streamingContext.awaitTermination()
    
  • The code had been tested and worked on AWS EMR and in a local environment using the same Spark 1.6.1 with Hadoop 2.7 setup.

  • On DataProc, the script returns empty RDDs without printing any error, even though there is data in the Kinesis stream.
  • We tested it on DataProc in the following environments, and none of them worked:
    1. Submitting the job via the gcloud command;
    2. SSHing into the cluster master node and running in yarn-client mode;
    3. SSHing into the cluster master node and running as local[*].

Upon enabling verbose logging by updating /etc/spark/conf/log4j.properties with the following values:

    log4j.rootCategory=DEBUG, console
    log4j.appender.console=org.apache.log4j.ConsoleAppender
    log4j.appender.console.target=System.err
    log4j.appender.console.layout=org.apache.log4j.PatternLayout
    log4j.appender.console.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c: %m%n
    log4j.logger.org.eclipse.jetty=ERROR
    log4j.logger.org.eclipse.jetty.util.component.AbstractLifeCycle=ERROR
    log4j.logger.org.apache.spark.repl.SparkIMain$exprTyper=DEBUG
    log4j.logger.org.apache.spark.repl.SparkILoop$SparkILoopInterpreter=DEBUG
    log4j.logger.org.apache.spark=DEBUG 
    log4j.logger.org.apache.hadoop.conf.Configuration.deprecation=DEBUG
    log4j.logger.org.spark-project.jetty.server.handler.ContextHandler=DEBUG
    log4j.logger.org.apache=DEBUG
    log4j.logger.com.amazonaws=DEBUG

We noticed something weird in the log (note that spark-streaming-kinesis-asl_2.10:1.6.1 depends on aws-sdk-java/1.9.37, while somehow aws-sdk-java/1.7.4 was used, as the user-agent suggests):

    16/07/10 06:30:16 DEBUG com.amazonaws.services.kinesis.clientlibrary.lib.worker.ShardConsumer: PROCESS task encountered execution exception:
    java.util.concurrent.ExecutionException: java.lang.NoSuchMethodError: com.amazonaws.services.kinesis.model.GetRecordsResult.getMillisBehindLatest()Ljava/lang/Long;
        at java.util.concurrent.FutureTask.report(FutureTask.java:122)
        at java.util.concurrent.FutureTask.get(FutureTask.java:192)
        at com.amazonaws.services.kinesis.clientlibrary.lib.worker.ShardConsumer.checkAndSubmitNextTask(ShardConsumer.java:137)
        at com.amazonaws.services.kinesis.clientlibrary.lib.worker.ShardConsumer.consumeShard(ShardConsumer.java:126)
        at com.amazonaws.services.kinesis.clientlibrary.lib.worker.Worker.run(Worker.java:334)
        at org.apache.spark.streaming.kinesis.KinesisReceiver$$anon$1.run(KinesisReceiver.scala:174)

    Caused by: java.lang.NoSuchMethodError: com.amazonaws.services.kinesis.model.GetRecordsResult.getMillisBehindLatest()Ljava/lang/Long;
        at com.amazonaws.services.kinesis.clientlibrary.lib.worker.ProcessTask.call(ProcessTask.java:119)
        at com.amazonaws.services.kinesis.clientlibrary.lib.worker.MetricsCollectingTaskDecorator.call(MetricsCollectingTaskDecorator.java:48)
        at com.amazonaws.services.kinesis.clientlibrary.lib.worker.MetricsCollectingTaskDecorator.call(MetricsCollectingTaskDecorator.java:23)
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)

    content-length:282
    content-type:application/x-amz-json-1.1
    host:kinesis.ap-southeast-2.amazonaws.com
    user-agent:SparkDemo,amazon-kinesis-client-library-java-1.4.0, aws-sdk-java/1.7.4 Linux/3.16.0-4-amd64 OpenJDK_64-Bit_Server_VM/25.91-b14/1.8.0_91
    x-amz-date:20160710T063016Z
    x-amz-target:Kinesis_20131202.GetRecords

It appears that DataProc builds its own Spark with a much older AWS SDK as a dependency, and it blows up when used in conjunction with code that requires a much newer version of the AWS SDK, although we are not sure exactly which module caused this error.
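
To confirm which JAR a conflicting class is actually loaded from, a check along these lines can help (a minimal sketch using py4j reflection from a pyspark shell on the master node; the class name is taken from the stack trace above):

    # Run inside a pyspark shell; `sc` is the SparkContext pyspark creates.
    clazz = sc._jvm.java.lang.Class.forName(
        'com.amazonaws.services.kinesis.model.GetRecordsResult')
    # Prints the URL of the JAR the driver's classloader resolved, which
    # should reveal whether the Hadoop-bundled aws-java-sdk 1.7.4 is
    # shadowing the 1.9.37 version pulled in via --packages.
    print(clazz.getProtectionDomain().getCodeSource().getLocation())

Note that this only inspects the driver's classpath; the executors may resolve classes differently.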

Update: Based on @DennisHuo's comment, this behaviour is caused by Hadoop's leaky classpath: https://github.com/apache/hadoop/blob/branch-2.7.2/hadoop-project/pom.xml#L650
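
The Hadoop-bundled SDK can also be located on disk. A hypothetical check on the cluster master node (exact paths vary by DataProc image):

    # Look for the AWS SDK JAR that ships with the Hadoop installation:
    find /usr/lib/hadoop* -name 'aws-java-sdk*.jar'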

To make things worse, AWS KCL 1.4.0 (used by Spark 1.6.1) silently suppresses runtime errors instead of throwing a RuntimeException, which caused a lot of headaches while debugging.


Eventually, our solution was to build org.apache.spark:spark-streaming-kinesis-asl_2.10:1.6.1 with all of its com.amazonaws.* classes shaded.

We built the JAR with the following pom (replacing spark/extras/kinesis-asl/pom.xml) and shipped the new JAR via the --jars flag in spark-submit:

<?xml version="1.0" encoding="UTF-8"?>

<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
  <modelVersion>4.0.0</modelVersion>
  <parent>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-parent_2.10</artifactId>
    <version>1.6.1</version>
    <relativePath>../../pom.xml</relativePath>
  </parent>

  <!-- Kinesis integration is not included by default due to ASL-licensed code. -->
  <groupId>org.apache.spark</groupId>
  <artifactId>spark-streaming-kinesis-asl_2.10</artifactId>
  <packaging>jar</packaging>
  <name>Spark Kinesis Integration</name>

  <properties>
    <sbt.project.name>streaming-kinesis-asl</sbt.project.name>
  </properties>

  <dependencies>
    <dependency>
      <groupId>org.apache.spark</groupId>
      <artifactId>spark-streaming_${scala.binary.version}</artifactId>
      <version>${project.version}</version>
    </dependency>
    <dependency>
      <groupId>org.apache.spark</groupId>
      <artifactId>spark-core_${scala.binary.version}</artifactId>
      <version>${project.version}</version>
      <type>test-jar</type>
      <scope>test</scope>
    </dependency>
    <dependency>
      <groupId>org.apache.spark</groupId>
      <artifactId>spark-streaming_${scala.binary.version}</artifactId>
      <version>${project.version}</version>
      <type>test-jar</type>
      <scope>test</scope>
    </dependency>
    <dependency>
      <groupId>com.amazonaws</groupId>
      <artifactId>amazon-kinesis-client</artifactId>
      <version>${aws.kinesis.client.version}</version>
    </dependency>
    <dependency>
      <groupId>com.amazonaws</groupId>
      <artifactId>amazon-kinesis-producer</artifactId>
      <version>${aws.kinesis.producer.version}</version>
      <scope>test</scope>
    </dependency>
    <dependency>
      <groupId>org.mockito</groupId>
      <artifactId>mockito-core</artifactId>
      <scope>test</scope>
    </dependency>
    <dependency>
      <groupId>org.scalacheck</groupId>
      <artifactId>scalacheck_${scala.binary.version}</artifactId>
      <scope>test</scope>
    </dependency>
    <dependency>
      <groupId>org.apache.spark</groupId>
      <artifactId>spark-test-tags_${scala.binary.version}</artifactId>
    </dependency>
  </dependencies>

  <build>
    <outputDirectory>target/scala-${scala.binary.version}/classes</outputDirectory>
    <testOutputDirectory>target/scala-${scala.binary.version}/test-classes</testOutputDirectory>

    <plugins>
        <plugin>
          <groupId>org.apache.maven.plugins</groupId>
          <artifactId>maven-shade-plugin</artifactId>
          <configuration>
            <shadedArtifactAttached>false</shadedArtifactAttached>

            <artifactSet>
              <includes>
                <!-- At a minimum we must include this to force effective pom generation -->
                <include>org.spark-project.spark:unused</include>
                <include>com.amazonaws:*</include>
              </includes>
            </artifactSet>

            <relocations>
              <relocation>
                <pattern>com.amazonaws</pattern>
                <shadedPattern>foo.bar.YO.com.amazonaws</shadedPattern>
                <includes>
                  <include>com.amazonaws.**</include>
                </includes>
              </relocation>
            </relocations>

          </configuration>
          <executions>
            <execution>
              <phase>package</phase>
              <goals>
                <goal>shade</goal>
              </goals>
            </execution>
          </executions>
        </plugin>
    </plugins>
  </build>
</project>
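
For reference, the build-and-submit flow looked roughly like this (a sketch only; the module path and artifact name follow the Spark 1.6.1 source layout and may need adjusting for your checkout):

    # From the root of the Spark 1.6.1 source tree, build the Kinesis ASL
    # module (-am also builds the in-tree modules it depends on):
    mvn -pl extras/kinesis-asl -am clean package -DskipTests

    # Submit with the shaded JAR instead of --packages, so the relocated
    # com.amazonaws classes are used rather than the Hadoop-bundled SDK:
    spark-submit --verbose \
        --jars extras/kinesis-asl/target/spark-streaming-kinesis-asl_2.10-1.6.1.jar \
        demo_kinesis_streaming.py --awsAccessKeyId XXXXX --awsSecretKey XXXX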


Source: https://stackoverflow.com/questions/38237345/kinesis-stream-with-empty-records-in-google-dataproc-with-spark-1-6-1-hadoop-2-7
