spark-streaming

Create a new dataset based on a given operation column

Submitted by 对着背影说爱祢 on 2020-06-30 08:39:12
Question: I am using spark-sql-2.3.1v and have the below scenario. Given a dataset:

val ds = Seq(
  (1, "x1", "y1", "0.1992019"),
  (2, null, "y2", "2.2500000"),
  (3, "x3", null, "15.34567"),
  (4, null, "y4", null),
  (5, "x4", "y4", "0")
).toDF("id", "col_x", "col_y", "value")

i.e.

+---+-----+-----+---------+
| id|col_x|col_y|    value|
+---+-----+-----+---------+
|  1|   x1|   y1|0.1992019|
|  2| null|   y2|2.2500000|
|  3|   x3| null| 15.34567|
|  4| null|   y4|     null|
|  5|   x4|   y4|        0|
+---+-----+-----+---------+

Requirement: I
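The requirement text is cut off in this excerpt, so any concrete answer is guesswork. As a purely illustrative Scala sketch of the building blocks such a derivation usually needs (the column names value_num and col_xy below are invented, not from the question), null handling with when/otherwise plus a cast of the string value column would look like:

import org.apache.spark.sql.functions.{col, lit, when}

// Illustration only: derive a numeric value (treating null as 0) and a
// combined column that falls back to col_y when col_x is null.
val derived = ds
  .withColumn("value_num",
    when(col("value").isNull, lit(0.0)).otherwise(col("value").cast("double")))
  .withColumn("col_xy",
    when(col("col_x").isNotNull, col("col_x")).otherwise(col("col_y")))

derived.show(false)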

Combining Two Spark Streams On Key

Submitted by 回眸只為那壹抹淺笑 on 2020-06-27 11:22:32
Question: I have two Kafka streams that contain results for two parallel operations, and I need a way to combine both streams so I can process the results in a single Spark transform. Is this possible? (Illustration below.)

Stream 1: {id:1, result1:True}
Stream 2: {id:1, result2:False}
JOIN(Stream 1, Stream 2, On "id") -> Output Stream {id:1, result1:True, result2:False}

Current code that isn't working:

kvs1 = KafkaUtils.createStream(sparkstreamingcontext, ZOOKEEPER, NAME+"_stream", {"test_join_1": 1})
kvs2 =
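The code excerpt is cut off, but conceptually this is a keyed join of two DStreams within each micro-batch. A minimal Scala sketch, assuming both streams have already been parsed into (id, result) pairs (the parsing itself is not shown here):

import org.apache.spark.streaming.dstream.DStream

// Join the two result streams on their id key. Only records that land in the
// same micro-batch are matched by a plain DStream join.
def joinById(result1Stream: DStream[(Int, Boolean)],
             result2Stream: DStream[(Int, Boolean)]): DStream[(Int, (Boolean, Boolean))] =
  result1Stream.join(result2Stream)

If the two results for an id can arrive in different micro-batches, a plain join is not enough; stateful buffering (e.g. mapWithState) or a stream-stream join in Structured Streaming would be needed instead.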

How to see the dataframe in the console (equivalent of .show() for structured streaming)?

Submitted by 白昼怎懂夜的黑 on 2020-06-17 13:35:08
Question: I'm trying to see what's coming in as my DataFrame. Here is the Spark code:

from pyspark.sql import SparkSession
import pyspark.sql.functions as psf
import logging
import time

spark = SparkSession \
    .builder \
    .appName("Console Example") \
    .getOrCreate()

logging.info("started to listen to the host..")

lines = spark \
    .readStream \
    .format("socket") \
    .option("host", "127.0.0.1") \
    .option("port", 9999) \
    .load()

data = lines.selectExpr("CAST(value AS STRING)")
query1 = data.writeStream.format
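For reference, the structured-streaming counterpart of show() is the console sink, which prints each micro-batch to stdout. A minimal Scala sketch mirroring the socket source above (host, port and the CAST come from the question; the truncate and outputMode settings are just reasonable defaults):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("Console Example").getOrCreate()

// Same socket source as in the question.
val lines = spark.readStream
  .format("socket")
  .option("host", "127.0.0.1")
  .option("port", 9999)
  .load()

// The console sink prints every micro-batch, which is the streaming
// equivalent of calling show() on a static DataFrame.
val query = lines.selectExpr("CAST(value AS STRING)").writeStream
  .format("console")
  .option("truncate", "false")
  .outputMode("append")
  .start()

query.awaitTermination()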

Bulk Insert Data in HBase using Structured Spark Streaming

Submitted by 淺唱寂寞╮ on 2020-06-09 19:01:29
Question: I'm reading data coming from Kafka (100,000 lines per second) using Structured Spark Streaming, and I'm trying to insert all the data into HBase. I'm on Cloudera Hadoop 2.6 and I'm using Spark 2.3. I tried something like I've seen here.

eventhubs.writeStream
  .foreach(new MyHBaseWriter[Row])
  .option("checkpointLocation", checkpointDir)
  .start()
  .awaitTermination()

MyHBaseWriter looks like this:

class AtomeHBaseWriter[RECORD] extends HBaseForeachWriter[Row] {
  override def toPut(record: Row): Put
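HBaseForeachWriter in the excerpt is a custom helper whose code is not shown. As a rough Scala sketch of the underlying ForeachWriter pattern it wraps (the table name "my_table", the column family "cf" and the row layout are assumptions):

import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
import org.apache.hadoop.hbase.client.{Connection, ConnectionFactory, Put, Table}
import org.apache.hadoop.hbase.util.Bytes
import org.apache.spark.sql.{ForeachWriter, Row}

class SimpleHBaseWriter extends ForeachWriter[Row] {
  // One HBase connection per partition: opened in open(), released in close().
  @transient private var connection: Connection = _
  @transient private var table: Table = _

  override def open(partitionId: Long, epochId: Long): Boolean = {
    connection = ConnectionFactory.createConnection(HBaseConfiguration.create())
    table = connection.getTable(TableName.valueOf("my_table")) // assumed table name
    true
  }

  override def process(row: Row): Unit = {
    // Assumed row layout: field 0 is the row key, field 1 is the value to store.
    val put = new Put(Bytes.toBytes(row.getString(0)))
    put.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("value"), Bytes.toBytes(row.getString(1)))
    table.put(put)
  }

  override def close(errorOrNull: Throwable): Unit = {
    if (table != null) table.close()
    if (connection != null) connection.close()
  }
}

At 100,000 events per second, writing one Put per process() call is usually too slow; buffering Puts inside the writer and flushing them in batches (Table.put also accepts a list of Puts), or using a BufferedMutator, tends to matter more than anything else here.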

Spark Streaming Kinesis partition key and sequence number log in java

Submitted by 房东的猫 on 2020-06-09 04:54:07
Question: We are using Spark 2.4.3 in Java. We would like to log the partition key and the sequence number of every event. The overloaded createStream function of KinesisUtils always throws a compilation error.

Function<Record, Record> printSeq = s -> s;
KinesisUtils.createStream(
    jssc, appName, streamName, endPointUrl, regionName,
    InitialPositionInStream.TRIM_HORIZON, kinesisCheckpointInterval,
    StorageLevel.MEMORY_AND_DISK_SER(), printSeq, Record.class);

The exception is as follows: no suitable
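A frequent cause of a "no suitable method" error on this overload is that the handler is typed as java.util.function.Function while the Java API expects Spark's own org.apache.spark.api.java.function.Function (and the handler must be serializable); that is worth checking first, although the full error message is cut off above. Independently of the overload question, the metadata extraction itself can be sketched in Scala as a message handler that keeps the partition key and sequence number next to the payload (the KinesisEvent case class is invented here for illustration):

import com.amazonaws.services.kinesis.model.Record
import java.nio.charset.StandardCharsets

// Carries the Kinesis metadata we want to log alongside the decoded payload.
case class KinesisEvent(partitionKey: String, sequenceNumber: String, payload: String)

// Passed to the createStream overload that accepts a message handler.
val extractMetadata: Record => KinesisEvent = record =>
  KinesisEvent(
    record.getPartitionKey,
    record.getSequenceNumber,
    StandardCharsets.UTF_8.decode(record.getData).toString)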

Spark fails with NoClassDefFoundError for org.apache.kafka.common.serialization.StringDeserializer

Submitted by 最后都变了- on 2020-06-08 15:04:04
Question: I am developing a generic Spark application that listens to a Kafka stream using Spark and Java. I am using kafka_2.11-0.10.2.2 and spark-2.3.2-bin-hadoop2.7; I also tried several other Kafka/Spark combinations before posting this question. The code fails at loading the StringDeserializer class:

SparkConf sparkConf = new SparkConf().setAppName("JavaDirectKafkaWordCount");
JavaStreamingContext jssc = new JavaStreamingContext(sparkConf, Durations.seconds(2));
Set<String> topicsSet = new HashSet<>();
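A NoClassDefFoundError for org.apache.kafka.common.serialization.StringDeserializer is normally a classpath problem rather than a code problem: the kafka-clients jar, which the spark-streaming-kafka-0-10 connector pulls in transitively, is missing at runtime on the driver or executors. A sketch of the sbt dependencies that typically resolve it, assuming Scala 2.11 and Spark 2.3.2 as in the question and a fat-jar deployment:

// build.sbt (sketch): spark-streaming is provided by the cluster, while the
// Kafka connector and its transitive kafka-clients dependency must be packaged
// with the application jar.
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-streaming" % "2.3.2" % "provided",
  "org.apache.spark" %% "spark-streaming-kafka-0-10" % "2.3.2"
)

Alternatively, the connector can be supplied at submit time with --packages org.apache.spark:spark-streaming-kafka-0-10_2.11:2.3.2, which also downloads kafka-clients.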

Spark not able to find checkpointed data in HDFS after executor fails

Submitted by 泪湿孤枕 on 2020-05-28 03:29:15
Question: I am streaming data from Kafka as below:

final JavaPairDStream<String, Row> transformedMessages = rtStream
    .mapToPair(record -> new Tuple2<String, GenericDataModel>(record.key(), record.value()))
    .mapWithState(StateSpec.function(updateDataFunc).numPartitions(32)).stateSnapshots()
    .foreachRDD(rdd -> {
        // logic goes here
    });

I have four worker threads and multiple executors for this application, and I am trying to check the fault tolerance of Spark. Since we are using mapWithState, Spark is
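mapWithState requires checkpointing, and recovering that state after a failure only works when the streaming context itself is rebuilt from the checkpoint directory. A minimal Scala sketch of the getOrCreate pattern (the HDFS path, app name and batch interval are assumptions):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val checkpointDir = "hdfs:///tmp/app-checkpoint" // assumed path

def createContext(): StreamingContext = {
  val conf = new SparkConf().setAppName("StatefulApp")
  val ssc = new StreamingContext(conf, Seconds(10))
  ssc.checkpoint(checkpointDir) // mapWithState state and DStream metadata go here
  // ... build the Kafka DStream and the mapWithState pipeline here ...
  ssc
}

// On a clean start this calls createContext(); after a failure the context,
// including the mapWithState state, is reconstructed from the checkpoint.
val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)
ssc.start()
ssc.awaitTermination()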

In Spark streaming, Is it possible to upsert batch data from kafka to Hive?

Submitted by 旧城冷巷雨未停 on 2020-05-17 08:52:05
Question: My plan is:

1. Use Spark Streaming to load data from Kafka every period, e.g. every 1 minute.
2. Convert the data loaded every 1 minute into a DataFrame.
3. Upsert the DataFrame into a Hive table (a table storing all history data).

Currently, I have successfully implemented steps 1-2, and I want to know if there is any practical way to realize step 3. In detail:

1. Load the latest history table with a certain partition in Spark Streaming.
2. Use the batch DataFrame to join the history table/DataFrame with
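The excerpt stops mid-sentence, but the join-then-rewrite approach it starts to describe can be sketched in Scala as follows (the table name, the id and dt columns, and the assumption that the batch has the same schema as the history partition are all illustrative, not from the question):

import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.col

// Merge one micro-batch into the relevant history partition: keep the history
// rows the batch does not replace, then add the batch rows on top.
def mergeBatch(spark: SparkSession, batch: DataFrame, partitionValue: String): DataFrame = {
  val history = spark.table("history_db.history_table") // assumed table name
    .filter(col("dt") === partitionValue)                // assumed partition column

  val unchanged = history.join(batch, Seq("id"), "left_anti") // rows with no newer version
  unchanged.unionByName(batch)
}

The merged result is then typically written to a staging table or path and moved into the target partition with INSERT OVERWRITE as a separate step, because Spark will not overwrite a table that the same plan is still reading from.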