Question
I am trying to connect to Amazon Kinesis Stream from Google Dataproc but am only getting Empty RDDs.
Command: spark-submit --verbose --packages org.apache.spark:spark-streaming-kinesis-asl_2.10:1.6.2 demo_kinesis_streaming.py --awsAccessKeyId XXXXX --awsSecretKey XXXX
Detailed Log: https://gist.github.com/sshrestha-datalicious/e3fc8ebb4916f27735a97e9fcc42136c
More Details
Spark 1.6.1
Hadoop 2.7.2
Assembly Used: /usr/lib/spark/lib/spark-assembly-1.6.1-hadoop2.7.2.jar
Surprisingly, it works when I download and use the assembly containing Spark 1.6.1 built against Hadoop 2.6.0, with the following command.
Command: SPARK_HOME=/opt/spark-1.6.1-bin-hadoop2.6 spark-submit --verbose --packages org.apache.spark:spark-streaming-kinesis-asl_2.10:1.6.2 demo_kinesis_streaming.py --awsAccessKeyId XXXXX --awsSecretKey XXXX
I am not sure if there is a version conflict between the two Hadoop versions and the Kinesis ASL, or whether it has to do with custom settings in Google Dataproc.
Any help would be appreciated.
Thanks
Suren
Answer 1:
Our team was in a similar situation and we managed to solve it:
We are running on the same environment:
- DataProc Image Version 1 with Spark 1.6.1 with Hadoop 2.7
A simple Spark Streaming Kinesis script that boils down to this:
# Run the script as
# spark-submit \
#     --packages org.apache.spark:spark-streaming-kinesis-asl_2.10:1.6.1 \
#     demo_kinesis_streaming.py \
#     --awsAccessKeyId FOO \
#     --awsSecretKey BAR \
#     ...

import argparse

from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.storagelevel import StorageLevel
from pyspark.streaming.kinesis import KinesisUtils, InitialPositionInStream

ap = argparse.ArgumentParser()
ap.add_argument('--awsAccessKeyId', required=True)
ap.add_argument('--awsSecretKey', required=True)
ap.add_argument('--stream_name')
ap.add_argument('--region')
ap.add_argument('--app_name')
ap = ap.parse_args()

kinesis_application_name = ap.app_name
kinesis_stream_name = ap.stream_name
kinesis_region = ap.region
kinesis_endpoint_url = 'https://kinesis.{}.amazonaws.com'.format(ap.region)

spark_context = SparkContext(appName=kinesis_application_name)
streamingContext = StreamingContext(spark_context, 60)

kinesisStream = KinesisUtils.createStream(
    ssc=streamingContext,
    kinesisAppName=kinesis_application_name,
    streamName=kinesis_stream_name,
    endpointUrl=kinesis_endpoint_url,
    regionName=kinesis_region,
    initialPositionInStream=InitialPositionInStream.TRIM_HORIZON,
    checkpointInterval=60,
    storageLevel=StorageLevel.MEMORY_AND_DISK_2,
    awsAccessKeyId=ap.awsAccessKeyId,
    awsSecretKey=ap.awsSecretKey
)

kinesisStream.pprint()

streamingContext.start()
streamingContext.awaitTermination()
The code has been tested and works on AWS EMR and in a local environment using the same Spark 1.6.1 with Hadoop 2.7 setup.
- On DataProc, the script returns empty RDDs without printing any error, even though there is data in the Kinesis stream.
- We've tested it on DataProc in the following ways, and none of them worked:
  - submitting the job via the gcloud command;
  - ssh-ing into the cluster master node and running in yarn client mode;
  - ssh-ing into the cluster master node and running as local[*].
Upon enabling verbose logging by updating /etc/spark/conf/log4j.properties with the following values:
log4j.rootCategory=DEBUG, console
log4j.appender.console=org.apache.log4j.ConsoleAppender
log4j.appender.console.target=System.err
log4j.appender.console.layout=org.apache.log4j.PatternLayout
log4j.appender.console.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c: %m%n
log4j.logger.org.eclipse.jetty=ERROR
log4j.logger.org.eclipse.jetty.util.component.AbstractLifeCycle=ERROR
log4j.logger.org.apache.spark.repl.SparkIMain$exprTyper=DEBUG
log4j.logger.org.apache.spark.repl.SparkILoop$SparkILoopInterpreter=DEBUG
log4j.logger.org.apache.spark=DEBUG
log4j.logger.org.apache.hadoop.conf.Configuration.deprecation=DEBUG
log4j.logger.org.spark-project.jetty.server.handler.ContextHandler=DEBUG
log4j.logger.org.apache=DEBUG
log4j.logger.com.amazonaws=DEBUG
We noticed something odd in the log. (Note that spark-streaming-kinesis-asl_2.10:1.6.1 uses aws-sdk-java/1.9.37 as a dependency, yet somehow aws-sdk-java/1.7.4 was being used, as suggested by the user-agent):
16/07/10 06:30:16 DEBUG com.amazonaws.services.kinesis.clientlibrary.lib.worker.ShardConsumer: PROCESS task encountered execution exception:
java.util.concurrent.ExecutionException: java.lang.NoSuchMethodError: com.amazonaws.services.kinesis.model.GetRecordsResult.getMillisBehindLatest()Ljava/lang/Long;
at java.util.concurrent.FutureTask.report(FutureTask.java:122)
at java.util.concurrent.FutureTask.get(FutureTask.java:192)
at com.amazonaws.services.kinesis.clientlibrary.lib.worker.ShardConsumer.checkAndSubmitNextTask(ShardConsumer.java:137)
at com.amazonaws.services.kinesis.clientlibrary.lib.worker.ShardConsumer.consumeShard(ShardConsumer.java:126)
at com.amazonaws.services.kinesis.clientlibrary.lib.worker.Worker.run(Worker.java:334)
at org.apache.spark.streaming.kinesis.KinesisReceiver$$anon$1.run(KinesisReceiver.scala:174)
Caused by: java.lang.NoSuchMethodError: com.amazonaws.services.kinesis.model.GetRecordsResult.getMillisBehindLatest()Ljava/lang/Long;
at com.amazonaws.services.kinesis.clientlibrary.lib.worker.ProcessTask.call(ProcessTask.java:119)
at com.amazonaws.services.kinesis.clientlibrary.lib.worker.MetricsCollectingTaskDecorator.call(MetricsCollectingTaskDecorator.java:48)
at com.amazonaws.services.kinesis.clientlibrary.lib.worker.MetricsCollectingTaskDecorator.call(MetricsCollectingTaskDecorator.java:23)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
content-length:282
content-type:application/x-amz-json-1.1
host:kinesis.ap-southeast-2.amazonaws.com
user-agent:SparkDemo,amazon-kinesis-client-library-java-1.4.0, aws-sdk-java/1.7.4 Linux/3.16.0-4-amd64 OpenJDK_64-Bit_Server_VM/25.91-b14/1.8.0_91
x-amz-date:20160710T063016Z
x-amz-target:Kinesis_20131202.GetRecords
It appears that Dataproc builds its own Spark with a much older AWS SDK as a dependency, and it blows up when used in conjunction with code that requires a much newer version of the AWS SDK, although we are not sure exactly which module caused this error.
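One way to confirm a conflict like this is to scan the JAR directories on a cluster node for duplicate copies of the class named in the stack trace. This is a hedged sketch, not something from the original answer; the directory paths in the comment are assumptions for a typical Dataproc node:

```python
# Sketch: find which JARs bundle the class from the NoSuchMethodError, e.g.
#   jars_containing(TARGET, ['/usr/lib/spark/lib', '/usr/lib/hadoop/lib'])
# More than one hit means two AWS SDK versions are competing on the classpath,
# and whichever loads first wins.
import glob
import os
import zipfile

TARGET = 'com/amazonaws/services/kinesis/model/GetRecordsResult.class'

def jars_containing(target, jar_dirs):
    """Return every JAR under jar_dirs that bundles the target class entry."""
    hits = []
    for jar_dir in jar_dirs:
        for jar_path in sorted(glob.glob(os.path.join(jar_dir, '*.jar'))):
            try:
                with zipfile.ZipFile(jar_path) as jar:
                    if target in jar.namelist():
                        hits.append(jar_path)
            except zipfile.BadZipFile:
                continue  # skip files matching *.jar that are not valid archives
    return hits
```

If both the Spark assembly and the `--packages`-downloaded Kinesis ASL JAR show up, the older SDK baked into the assembly is shadowing the newer one.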
Update: Based on @DennisHuo's comment, this behaviour is caused by Hadoop's leaky classpath: https://github.com/apache/hadoop/blob/branch-2.7.2/hadoop-project/pom.xml#L650
To make things worse, AWS KCL 1.4.0 (used by Spark 1.6.1) silently suppresses any runtime error instead of throwing a RuntimeException, which caused a lot of headaches while debugging.
Eventually, our solution was to build our own org.apache.spark:spark-streaming-kinesis-asl_2.10:1.6.1 with all of its com.amazonaws.* classes shaded.
We built the JAR with the following pom (updating spark/extra/kinesis-asl/pom.xml) and shipped the new JAR via the --jars flag in spark-submit:
<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
<modelVersion>4.0.0</modelVersion>
<parent>
<groupId>org.apache.spark</groupId>
<artifactId>spark-parent_2.10</artifactId>
<version>1.6.1</version>
<relativePath>../../pom.xml</relativePath>
</parent>
<!-- Kinesis integration is not included by default due to ASL-licensed code. -->
<groupId>org.apache.spark</groupId>
<artifactId>spark-streaming-kinesis-asl_2.10</artifactId>
<packaging>jar</packaging>
<name>Spark Kinesis Integration</name>
<properties>
<sbt.project.name>streaming-kinesis-asl</sbt.project.name>
</properties>
<dependencies>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-streaming_${scala.binary.version}</artifactId>
<version>${project.version}</version>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-core_${scala.binary.version}</artifactId>
<version>${project.version}</version>
<type>test-jar</type>
<scope>test</scope>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-streaming_${scala.binary.version}</artifactId>
<version>${project.version}</version>
<type>test-jar</type>
<scope>test</scope>
</dependency>
<dependency>
<groupId>com.amazonaws</groupId>
<artifactId>amazon-kinesis-client</artifactId>
<version>${aws.kinesis.client.version}</version>
</dependency>
<dependency>
<groupId>com.amazonaws</groupId>
<artifactId>amazon-kinesis-producer</artifactId>
<version>${aws.kinesis.producer.version}</version>
<scope>test</scope>
</dependency>
<dependency>
<groupId>org.mockito</groupId>
<artifactId>mockito-core</artifactId>
<scope>test</scope>
</dependency>
<dependency>
<groupId>org.scalacheck</groupId>
<artifactId>scalacheck_${scala.binary.version}</artifactId>
<scope>test</scope>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-test-tags_${scala.binary.version}</artifactId>
</dependency>
</dependencies>
<build>
<outputDirectory>target/scala-${scala.binary.version}/classes</outputDirectory>
<testOutputDirectory>target/scala-${scala.binary.version}/test-classes</testOutputDirectory>
<plugins>
<plugin>
<groupId>org.apache.maven.plugins</groupId>
<artifactId>maven-shade-plugin</artifactId>
<configuration>
<shadedArtifactAttached>false</shadedArtifactAttached>
<artifactSet>
<includes>
<!-- At a minimum we must include this to force effective pom generation -->
<include>org.spark-project.spark:unused</include>
<include>com.amazonaws:*</include>
</includes>
</artifactSet>
<relocations>
<relocation>
<pattern>com.amazonaws</pattern>
<shadedPattern>foo.bar.YO.com.amazonaws</shadedPattern>
<includes>
<include>com.amazonaws.**</include>
</includes>
</relocation>
</relocations>
</configuration>
<executions>
<execution>
<phase>package</phase>
<goals>
<goal>shade</goal>
</goals>
</execution>
</executions>
</plugin>
</plugins>
</build>
</project>
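After building, one way to sanity-check that the shade-plugin relocation actually took effect is to list the shaded JAR's entries and confirm the AWS SDK classes now live under the relocated package. A minimal sketch (the `foo/bar/YO` prefix follows the pom above; the JAR path you pass in is up to you):

```python
# Sketch: verify the maven-shade relocation by checking that AWS SDK classes
# appear under the relocated prefix and none are left at the original one.
import zipfile

def check_relocation(jar_path,
                     relocated_prefix='foo/bar/YO/com/amazonaws/',
                     original_prefix='com/amazonaws/'):
    """Return (relocated, leftover) lists of class entries in the shaded JAR."""
    with zipfile.ZipFile(jar_path) as jar:
        names = [n for n in jar.namelist() if n.endswith('.class')]
    relocated = [n for n in names if n.startswith(relocated_prefix)]
    leftover = [n for n in names if n.startswith(original_prefix)]
    return relocated, leftover
```

A correctly shaded JAR should report plenty of relocated entries and an empty leftover list; any leftover entries mean the relocation includes/excludes need adjusting.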
Source: https://stackoverflow.com/questions/38237345/kinesis-stream-with-empty-records-in-google-dataproc-with-spark-1-6-1-hadoop-2-7