spark-streaming

Create a new dataset based on a given operation column

Submitted by 对着背影说爱祢 on 2020-06-30 08:39:12
Question: I am using spark-sql-2.3.1v and have the below scenario. Given a dataset:

val ds = Seq(
  (1, "x1", "y1", "0.1992019"),
  (2, null, "y2", "2.2500000"),
  (3, "x3", null, "15.34567"),
  (4, null, "y4", null),
  (5, "x4", "y4", "0")
).toDF("id", "col_x", "col_y", "value")

i.e.

+---+-----+-----+---------+
| id|col_x|col_y|    value|
+---+-----+-----+---------+
|  1|   x1|   y1|0.1992019|
|  2| null|   y2|2.2500000|
|  3|   x3| null| 15.34567|
|  4| null|   y4|     null|
|  5|   x4|   y4|        0|
+---+-----+-----+---------+

Requirement: I
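The requirement text is cut off in this excerpt, so any concrete answer is guesswork. As a purely illustrative Scala sketch of the building blocks such a derivation usually needs (the column names value_num and col_xy below are invented, not from the question), null handling with when/otherwise plus a cast of the string value column would look like:

import org.apache.spark.sql.functions.{col, lit, when}

// Illustration only: derive a numeric value (treating null as 0) and a
// combined column that falls back to col_y when col_x is null.
val derived = ds
  .withColumn("value_num",
    when(col("value").isNull, lit(0.0)).otherwise(col("value").cast("double")))
  .withColumn("col_xy",
    when(col("col_x").isNotNull, col("col_x")).otherwise(col("col_y")))

derived.show(false)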

Combining Two Spark Streams On Key

Submitted by 回眸只為那壹抹淺笑 on 2020-06-27 11:22:32
Question: I have two Kafka streams that contain results for two parallel operations, and I need a way to combine both streams so I can process the results in a single Spark transform. Is this possible? (Illustration below.)

Stream 1: {id:1, result1:True}
Stream 2: {id:1, result2:False}
JOIN(Stream 1, Stream 2, On "id") -> Output Stream {id:1, result1:True, result2:False}

Current code that isn't working:

kvs1 = KafkaUtils.createStream(sparkstreamingcontext, ZOOKEEPER, NAME+"_stream", {"test_join_1": 1})
kvs2 =
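The code excerpt is cut off, but conceptually this is a keyed join of two DStreams within each micro-batch. A minimal Scala sketch, assuming both streams have already been parsed into (id, result) pairs (the parsing itself is not shown here):

import org.apache.spark.streaming.dstream.DStream

// Join the two result streams on their id key. Only records that land in the
// same micro-batch are matched by a plain DStream join.
def joinById(result1Stream: DStream[(Int, Boolean)],
             result2Stream: DStream[(Int, Boolean)]): DStream[(Int, (Boolean, Boolean))] =
  result1Stream.join(result2Stream)

If the two results for an id can arrive in different micro-batches, a plain join is not enough; stateful buffering (e.g. mapWithState) or a stream-stream join in Structured Streaming would be needed instead.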

How to see the dataframe in the console (equivalent of .show() for structured streaming)?

Submitted by 白昼怎懂夜的黑 on 2020-06-17 13:35:08
Question: I'm trying to see what's coming in as my DataFrame. Here is the Spark code:

from pyspark.sql import SparkSession
import pyspark.sql.functions as psf
import logging
import time

spark = SparkSession \
    .builder \
    .appName("Console Example") \
    .getOrCreate()

logging.info("started to listen to the host..")

lines = spark \
    .readStream \
    .format("socket") \
    .option("host", "127.0.0.1") \
    .option("port", 9999) \
    .load()

data = lines.selectExpr("CAST(value AS STRING)")
query1 = data.writeStream.format
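For reference, the structured-streaming counterpart of show() is the console sink, which prints each micro-batch to stdout. A minimal Scala sketch mirroring the socket source above (host, port and the CAST come from the question; the truncate and outputMode settings are just reasonable defaults):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("Console Example").getOrCreate()

// Same socket source as in the question.
val lines = spark.readStream
  .format("socket")
  .option("host", "127.0.0.1")
  .option("port", 9999)
  .load()

// The console sink prints every micro-batch, which is the streaming
// equivalent of calling show() on a static DataFrame.
val query = lines.selectExpr("CAST(value AS STRING)").writeStream
  .format("console")
  .option("truncate", "false")
  .outputMode("append")
  .start()

query.awaitTermination()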

Bulk Insert Data in HBase using Structured Spark Streaming

Submitted by 淺唱寂寞╮ on 2020-06-09 19:01:29
Question: I'm reading data coming from Kafka (100,000 lines per second) using Structured Spark Streaming, and I'm trying to insert all the data into HBase. I'm on Cloudera Hadoop 2.6 and I'm using Spark 2.3. I tried something like I've seen here.

eventhubs.writeStream
  .foreach(new MyHBaseWriter[Row])
  .option("checkpointLocation", checkpointDir)
  .start()
  .awaitTermination()

MyHBaseWriter looks like this:

class AtomeHBaseWriter[RECORD] extends HBaseForeachWriter[Row] {
  override def toPut(record: Row): Put
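HBaseForeachWriter in the excerpt is a custom helper whose code is not shown. As a rough Scala sketch of the underlying ForeachWriter pattern it wraps (the table name "my_table", the column family "cf" and the row layout are assumptions):

import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
import org.apache.hadoop.hbase.client.{Connection, ConnectionFactory, Put, Table}
import org.apache.hadoop.hbase.util.Bytes
import org.apache.spark.sql.{ForeachWriter, Row}

class SimpleHBaseWriter extends ForeachWriter[Row] {
  // One HBase connection per partition: opened in open(), released in close().
  @transient private var connection: Connection = _
  @transient private var table: Table = _

  override def open(partitionId: Long, epochId: Long): Boolean = {
    connection = ConnectionFactory.createConnection(HBaseConfiguration.create())
    table = connection.getTable(TableName.valueOf("my_table")) // assumed table name
    true
  }

  override def process(row: Row): Unit = {
    // Assumed row layout: field 0 is the row key, field 1 is the value to store.
    val put = new Put(Bytes.toBytes(row.getString(0)))
    put.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("value"), Bytes.toBytes(row.getString(1)))
    table.put(put)
  }

  override def close(errorOrNull: Throwable): Unit = {
    if (table != null) table.close()
    if (connection != null) connection.close()
  }
}

At 100,000 events per second, writing one Put per process() call is usually too slow; buffering Puts inside the writer and flushing them in batches (Table.put also accepts a list of Puts), or using a BufferedMutator, tends to matter more than anything else here.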

Spark Streaming Kinesis partition key and sequence number log in java

Submitted by 房东的猫 on 2020-06-09 04:54:07
Question: We are using Spark 2.4.3 in Java. We would like to log the partition key and the sequence number of every event. The overloaded createStream function of KinesisUtils always throws a compilation error.

Function<Record, Record> printSeq = s -> s;
KinesisUtils.createStream(
    jssc, appName, streamName, endPointUrl, regionName,
    InitialPositionInStream.TRIM_HORIZON, kinesisCheckpointInterval,
    StorageLevel.MEMORY_AND_DISK_SER(), printSeq, Record.class);

The exception is as follows: no suitable
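A frequent cause of a "no suitable method" error on this overload is that the handler is typed as java.util.function.Function while the Java API expects Spark's own org.apache.spark.api.java.function.Function (and the handler must be serializable); that is worth checking first, although the full error message is cut off above. Independently of the overload question, the metadata extraction itself can be sketched in Scala as a message handler that keeps the partition key and sequence number next to the payload (the KinesisEvent case class is invented here for illustration):

import com.amazonaws.services.kinesis.model.Record
import java.nio.charset.StandardCharsets

// Carries the Kinesis metadata we want to log alongside the decoded payload.
case class KinesisEvent(partitionKey: String, sequenceNumber: String, payload: String)

// Passed to the createStream overload that accepts a message handler.
val extractMetadata: Record => KinesisEvent = record =>
  KinesisEvent(
    record.getPartitionKey,
    record.getSequenceNumber,
    StandardCharsets.UTF_8.decode(record.getData).toString)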

Spark fails with NoClassDefFoundError for org.apache.kafka.common.serialization.StringDeserializer

Submitted by 最后都变了- on 2020-06-08 15:04:04
Question: I am developing a generic Spark application that listens to a Kafka stream using Spark and Java. I am using kafka_2.11-0.10.2.2 and spark-2.3.2-bin-hadoop2.7; I also tried several other Kafka/Spark combinations before posting this question. The code fails at loading the StringDeserializer class:

SparkConf sparkConf = new SparkConf().setAppName("JavaDirectKafkaWordCount");
JavaStreamingContext jssc = new JavaStreamingContext(sparkConf, Durations.seconds(2));
Set<String> topicsSet = new HashSet<>();
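A NoClassDefFoundError for org.apache.kafka.common.serialization.StringDeserializer is normally a classpath problem rather than a code problem: the kafka-clients jar, which the spark-streaming-kafka-0-10 connector pulls in transitively, is missing at runtime on the driver or executors. A sketch of the sbt dependencies that typically resolve it, assuming Scala 2.11 and Spark 2.3.2 as in the question and a fat-jar deployment:

// build.sbt (sketch): spark-streaming is provided by the cluster, while the
// Kafka connector and its transitive kafka-clients dependency must be packaged
// with the application jar.
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-streaming" % "2.3.2" % "provided",
  "org.apache.spark" %% "spark-streaming-kafka-0-10" % "2.3.2"
)

Alternatively, the connector can be supplied at submit time with --packages org.apache.spark:spark-streaming-kafka-0-10_2.11:2.3.2, which also downloads kafka-clients.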

Spark not able to find checkpointed data in HDFS after executor fails

Submitted by 泪湿孤枕 on 2020-05-28 03:29:15
Question: I am streaming data from Kafka as below:

final JavaPairDStream<String, Row> transformedMessages = rtStream
    .mapToPair(record -> new Tuple2<String, GenericDataModel>(record.key(), record.value()))
    .mapWithState(StateSpec.function(updateDataFunc).numPartitions(32)).stateSnapshots()
    .foreachRDD(rdd -> {
        // logic goes here
    });

I have four worker threads and multiple executors for this application, and I am trying to check the fault tolerance of Spark. Since we are using mapWithState, Spark is
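mapWithState requires checkpointing, and recovering that state after a failure only works when the streaming context itself is rebuilt from the checkpoint directory. A minimal Scala sketch of the getOrCreate pattern (the HDFS path, app name and batch interval are assumptions):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val checkpointDir = "hdfs:///tmp/app-checkpoint" // assumed path

def createContext(): StreamingContext = {
  val conf = new SparkConf().setAppName("StatefulApp")
  val ssc = new StreamingContext(conf, Seconds(10))
  ssc.checkpoint(checkpointDir) // mapWithState state and DStream metadata go here
  // ... build the Kafka DStream and the mapWithState pipeline here ...
  ssc
}

// On a clean start this calls createContext(); after a failure the context,
// including the mapWithState state, is reconstructed from the checkpoint.
val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)
ssc.start()
ssc.awaitTermination()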

In Spark streaming, Is it possible to upsert batch data from kafka to Hive?

Submitted by 旧城冷巷雨未停 on 2020-05-17 08:52:05
Question: My plan is:

1. Use Spark Streaming to load data from Kafka every period, e.g. every 1 minute.
2. Convert the data loaded every 1 minute into a DataFrame.
3. Upsert the DataFrame into a Hive table (a table storing all history data).

Currently, I have successfully implemented steps 1-2, and I want to know if there is any practical way to realize step 3. In detail:

1. Load the latest history table with a certain partition in Spark Streaming.
2. Use the batch DataFrame to join the history table/DataFrame with
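The excerpt stops mid-sentence, but the join-then-rewrite approach it starts to describe can be sketched in Scala as follows (the table name, the id and dt columns, and the assumption that the batch has the same schema as the history partition are all illustrative, not from the question):

import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.col

// Merge one micro-batch into the relevant history partition: keep the history
// rows the batch does not replace, then add the batch rows on top.
def mergeBatch(spark: SparkSession, batch: DataFrame, partitionValue: String): DataFrame = {
  val history = spark.table("history_db.history_table") // assumed table name
    .filter(col("dt") === partitionValue)                // assumed partition column

  val unchanged = history.join(batch, Seq("id"), "left_anti") // rows with no newer version
  unchanged.unionByName(batch)
}

The merged result is then typically written to a staging table or path and moved into the target partition with INSERT OVERWRITE as a separate step, because Spark will not overwrite a table that the same plan is still reading from.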