spark-streaming

Spark streaming data pipelines on Dataproc experiencing sudden frequent socket timeouts

Submitted by 不打扰是莪最后的温柔 on 2020-01-14 19:26:27
Question: I am using Spark Streaming on Google Cloud Dataproc to execute a framework (written in Python) that consists of several continuous pipelines, each representing a single job on Dataproc, which basically read from Kafka queues and write the transformed output to Bigtable. All pipelines combined handle several gigabytes of data per day via two clusters, one with 3 worker nodes and one with 4. Running this Spark Streaming framework on top of Dataproc had been fairly stable until the beginning …

Custom source/sink configurations not getting recognized

Submitted by 天大地大妈咪最大 on 2020-01-14 07:40:12
Question: I've written my custom metrics source/sink for my Spark Streaming app and I am trying to initialize it from metrics.properties, but that doesn't work from executors. I don't have control over the machines in the Spark cluster, so I can't copy the properties file to $SPARK_HOME/conf/ in the cluster. I have it in the fat jar where my app lives, but by the time my fat jar is downloaded on the worker nodes in the cluster, the executors are already started and their metrics system is already initialized, thus not …
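
One workaround often suggested for this situation (a hedged sketch, not necessarily the asker's eventual solution) is to skip metrics.properties entirely and pass the same keys through the Spark configuration, which executors receive at launch. This assumes a Spark 2.x+ version whose MetricsConfig also reads properties prefixed with spark.metrics.conf.*; the sink class name below is hypothetical.

    import org.apache.spark.SparkConf

    // Equivalent of a metrics.properties entry such as
    //   *.sink.custom.class=org.example.metrics.MyCustomSink
    // expressed as Spark conf keys, which are shipped to executors before their
    // metrics system starts. org.example.metrics.MyCustomSink is a placeholder.
    val conf = new SparkConf()
      .setAppName("streaming-app-with-custom-metrics")
      .set("spark.metrics.conf.*.sink.custom.class", "org.example.metrics.MyCustomSink")
      .set("spark.metrics.conf.*.sink.custom.period", "10")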

Real-time data standardization / normalization with Spark structured streaming

Submitted by 大憨熊 on 2020-01-13 19:00:46
问题 Standardizing / normalizing data is an essential, if not a crucial, point when it comes to implementing machine learning algorithms. Doing so on a real time manner using Spark structured streaming has been a problem I've been trying to tackle for the past couple of weeks. Using the StandardScaler estimator ((value(i)-mean) /standard deviation) on historical data proved to be great, and in my use case it is the best, to get reasonable clustering results, but I'm not sure how to fit
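
One commonly suggested pattern (a sketch under assumptions, not the asker's confirmed solution) is to compute the scaling statistics once on historical batch data and then apply them to the stream as plain column arithmetic. Column names, paths and the Kafka topic below are placeholders.

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions._

    val spark = SparkSession.builder.appName("streaming-standardization").getOrCreate()
    import spark.implicits._

    // Fit the statistics on historical data (path and column name are hypothetical).
    val historical = spark.read.parquet("/data/historical")
    val row = historical.agg(mean($"value").as("mu"), stddev($"value").as("sigma")).first()
    val (mu, sigma) = (row.getDouble(0), row.getDouble(1))

    // Apply the same (value - mean) / stddev transformation to the live stream.
    val stream = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "broker:9092")
      .option("subscribe", "events")
      .load()
      .selectExpr("CAST(value AS STRING) AS raw")
      .select($"raw".cast("double").as("value"))

    val standardized = stream.withColumn("value_scaled", ($"value" - lit(mu)) / lit(sigma))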

How to populate the cache in CachedSchemaRegistryClient without making a call to register a new schema?

Submitted by 扶醉桌前 on 2020-01-13 11:58:13
Question: We have a Spark Streaming application which integrates with Kafka. I'm trying to optimize it because it makes excessive calls to the Schema Registry to download schemas. The Avro schema for our data rarely changes, yet currently our application calls the Schema Registry whenever a record comes in, which is far too often. I ran into CachedSchemaRegistryClient from Confluent, and it looked promising, though after looking into its implementation I'm not sure how to use its built-in cache to reduce the …
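
For reference, a hedged sketch of how the client is typically used, assuming a Confluent client version where getById(int) is still available (newer versions rename it getSchemaById); the URL and capacity are placeholders.

    import io.confluent.kafka.schemaregistry.client.CachedSchemaRegistryClient
    import org.apache.avro.Schema

    val schemaRegistryUrl = "http://schema-registry:8081"   // placeholder URL
    val maxCachedSchemas  = 1000                             // identity-map capacity

    // Build the client once per JVM/executor and reuse it: lookups by schema id are
    // read-only (no registration) and are memoised inside the client, so only the
    // first request for a given id goes over the wire.
    lazy val registryClient = new CachedSchemaRegistryClient(schemaRegistryUrl, maxCachedSchemas)

    def schemaFor(id: Int): Schema = registryClient.getById(id)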

How to specify a consumer group in Kafka Spark Streaming using the direct stream

Submitted by 时光总嘲笑我的痴心妄想 on 2020-01-12 08:38:10
Question: How do I specify a consumer group id for Kafka Spark Streaming using the direct stream API?

    HashMap<String, String> kafkaParams = new HashMap<String, String>();
    kafkaParams.put("metadata.broker.list", brokers);
    kafkaParams.put("auto.offset.reset", "largest");
    kafkaParams.put("group.id", "app1");
    JavaPairInputDStream<String, String> messages = KafkaUtils.createDirectStream(
        jssc, String.class, String.class, StringDecoder.class, StringDecoder.class,
        kafkaParams, topicsSet);

Though I have specified the …
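
For context (hedged, since the excerpt is cut off): the 0-8 direct stream API shown above manages offsets itself and ignores group.id, whereas the spark-streaming-kafka-0-10 integration does honour it. A minimal Scala sketch of the 0-10 style, with broker and topic names as placeholders:

    import org.apache.kafka.common.serialization.StringDeserializer
    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe
    import org.apache.spark.streaming.kafka010.KafkaUtils
    import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent

    val ssc = new StreamingContext(new SparkConf().setAppName("direct-stream-group"), Seconds(5))

    val kafkaParams = Map[String, Object](
      "bootstrap.servers"  -> "broker1:9092",
      "key.deserializer"   -> classOf[StringDeserializer],
      "value.deserializer" -> classOf[StringDeserializer],
      "group.id"           -> "app1",        // the consumer group Kafka tracks offsets for
      "auto.offset.reset"  -> "latest"
    )

    val messages = KafkaUtils.createDirectStream[String, String](
      ssc,
      PreferConsistent,
      Subscribe[String, String](Set("my-topic"), kafkaParams)
    )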

Unable to submit jobs to spark cluster (cluster-mode)

Submitted by 我们两清 on 2020-01-12 07:47:47
Question: Spark version 1.3.0. Error while submitting jobs to the Spark cluster in cluster mode:

    ./spark-submit --class org.apache.spark.examples.streaming.JavaDirectKafkaWordCount --deploy-mode cluster wordcount-0.1.jar 172.20.5.174:9092,172.20.9.50:9092,172.20.7.135:9092 log

This yields:

    Spark assembly has been built with Hive, including Datanucleus jars on classpath
    Running Spark using the REST application submission protocol.
    Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
    15…
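
The excerpt cuts off before the actual error, so the following is only a hedged sketch of what a complete cluster-mode submission against a standalone master typically looks like; the master host is a placeholder, and the notable difference is the explicit --master pointing at the cluster's REST submission port, which the quoted command omits.

    ./spark-submit \
        --class org.apache.spark.examples.streaming.JavaDirectKafkaWordCount \
        --master spark://<master-host>:6066 \
        --deploy-mode cluster \
        wordcount-0.1.jar \
        172.20.5.174:9092,172.20.9.50:9092,172.20.7.135:9092 log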

How do I stop a spark streaming job?

Submitted by 帅比萌擦擦* on 2020-01-11 15:49:09
Question: I have a Spark Streaming job which has been running continuously. How do I stop the job gracefully? I have read the usual recommendations of attaching a shutdown hook in the job monitoring and sending a SIGTERM to the job.

    sys.ShutdownHookThread {
      logger.info("Gracefully stopping Application...")
      ssc.stop(stopSparkContext = true, stopGracefully = true)
      logger.info("Application stopped gracefully")
    }

It seems to work but does not look like the cleanest way to stop the job. Am I missing …
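
A hedged alternative sketch: Spark 1.4+ exposes spark.streaming.stopGracefullyOnShutdown, which lets Spark's own JVM shutdown hook stop the StreamingContext gracefully on SIGTERM, so no custom hook is needed. The app name below is a placeholder.

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val conf = new SparkConf()
      .setAppName("my-streaming-app")
      .set("spark.streaming.stopGracefullyOnShutdown", "true")

    val ssc = new StreamingContext(conf, Seconds(10))
    // ... define the DStream pipeline here ...
    ssc.start()
    ssc.awaitTermination()   // a SIGTERM (e.g. yarn kill / kill <pid>) now triggers a graceful stop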

How to handle this in Spark

Submitted by 寵の児 on 2020-01-10 06:16:44
Question: I am using spark-sql 2.4.x and the DataStax spark-cassandra-connector for Cassandra 3.x, along with Kafka. I have a scenario where some finance data comes in from a Kafka topic. The data (base dataset) contains companyId, year, and prev_year fields. If year === prev_year, I need to join with a different table, i.e. exchange_rates; if year =!= prev_year, I need to return the base dataset itself. How do I do this in spark-sql? Answer 1: You can refer to the approach below for …
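
A hedged sketch of one way to express this (not necessarily the answer the excerpt was leading to): a single left outer join whose condition only fires when year equals prev_year, so non-matching rows pass through unchanged.

    // baseDs and exchangeRates stand for the base dataset and the exchange_rates table
    // from the question; the join keys ("companyId", "year") are assumptions.
    val result = baseDs.join(
      exchangeRates,
      baseDs("year") === baseDs("prev_year") &&
        baseDs("companyId") === exchangeRates("companyId") &&
        baseDs("year") === exchangeRates("year"),
      "left_outer"
    )
    // Rows where year === prev_year pick up the matching exchange-rate columns;
    // rows where year =!= prev_year fail the condition and come back with nulls,
    // i.e. effectively the base dataset unchanged.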

Spark SQL removing white spaces

Submitted by 守給你的承諾、 on 2020-01-10 06:08:51
Question: I have a simple Spark program which reads a JSON file and emits a CSV file. In the JSON data the values contain leading and trailing white spaces; when I emit the CSV, the leading and trailing white spaces are gone. Is there a way I can retain the spaces? I tried many options like ignoreTrailingWhiteSpace and ignoreLeadingWhiteSpace, but no luck.

input.json

    {"key" : "k1", "value1": "Good String", "value2": "Good String"}
    {"key" : "k1", "value1": "With Spaces ", "value2": "With Spaces "}
    {"key" : …
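
For reference, the writer-side options are the ones that matter here: from Spark 2.2 the CSV writer trims values by default, so ignoreLeadingWhiteSpace and ignoreTrailingWhiteSpace have to be set to false at write time rather than on the reader. A minimal sketch, with placeholder paths:

    val df = spark.read.json("/path/input.json")

    df.write
      .option("ignoreLeadingWhiteSpace", "false")   // keep leading spaces in values
      .option("ignoreTrailingWhiteSpace", "false")  // keep trailing spaces in values
      .option("header", "true")
      .csv("/path/output")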