spark-streaming

Spark 2.1 Structured Streaming - Using Kafka as source with Python (pyspark)

Submitted by 烂漫一生 on 2020-01-04 08:22:05
Question: With Apache Spark version 2.1, I would like to use Kafka (0.10.0.2.5) as the source for Structured Streaming with pyspark.

kafka_app.py:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("TestKakfa").getOrCreate()

kafka = spark.readStream.format("kafka") \
    .option("kafka.bootstrap.servers", "localhost:6667") \
    .option("subscribe", "mytopic").load()

I launched the app in the following way:

./bin/spark-submit kafka_app.py --master local[4] --jars spark-streaming-kafka-0-10
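
For what it's worth, the Structured Streaming Kafka source is provided by the spark-sql-kafka-0-10 artifact (spark-streaming-kafka-0-10 targets the older DStream API), typically added with something like --packages org.apache.spark:spark-sql-kafka-0-10_2.11:2.1.0. Below is a minimal sketch of the same pipeline, written in Scala for consistency with the other snippets on this page; the broker address and topic name are taken from the question.

```scala
import org.apache.spark.sql.SparkSession

object KafkaSourceSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("TestKafka").getOrCreate()

    // Structured Streaming Kafka source; requires the spark-sql-kafka-0-10 connector on the classpath
    val kafka = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "localhost:6667")
      .option("subscribe", "mytopic")
      .load()

    // Kafka records arrive as binary key/value columns; cast them to strings before use
    val messages = kafka.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")

    // Write each micro-batch to the console for inspection
    val query = messages.writeStream.format("console").start()
    query.awaitTermination()
  }
}
```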

Spark Streaming application and Kafka log4j appender issue

Submitted by 倖福魔咒の on 2020-01-04 07:17:20
Question: I am testing my Spark Streaming application, and I have multiple functions in my code: some of them operate on a DStream[RDD[XXX]], and some on RDD[XXX] (after I call DStream.foreachRDD). I use the Kafka log4j appender to log business cases that occur within my functions, both those that operate on DStream[RDD] and those that operate on the RDD itself. But data gets appended to Kafka only from the functions that operate on RDD; it doesn't work when I want to append data to Kafka from my functions that operate on DStream.
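
A detail that often explains this behaviour: the body of foreachRDD runs on the driver, while closures passed to RDD transformations and actions run on the executors, so each JVM needs its own log4j configuration (including the Kafka appender) for its log calls to reach Kafka. The Scala sketch below only illustrates where each piece of code executes; the DStream, logger name, and log messages are placeholders, not the poster's code.

```scala
import org.apache.log4j.Logger
import org.apache.spark.streaming.dstream.DStream

// Illustrative only: shows which JVM runs each logging call.
def logSketch(stream: DStream[String]): Unit = {
  stream.foreachRDD { rdd =>
    // Runs on the driver: the driver's log4j configuration (and any
    // Kafka appender configured there) applies to this call.
    Logger.getLogger("business").info(s"batch size = ${rdd.count()}")

    rdd.foreachPartition { records =>
      // Runs on the executors: each executor needs its own log4j
      // configuration with the Kafka appender for these calls to reach Kafka.
      val executorLogger = Logger.getLogger("business")
      records.foreach(r => executorLogger.info(s"record: $r"))
    }
  }
}
```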

Apache Zeppelin 0.6.1: Run Spark 2.0 Twitter Stream App

Submitted by 送分小仙女 on 2020-01-04 06:34:32
Question: I have a cluster with Spark 2.0 and Zeppelin 0.6.1 installed. Since the class TwitterUtils.scala was moved from the Spark project to Apache Bahir, I can't use TwitterUtils in my Zeppelin notebook anymore. Here are the snippets from my notebook.

Dependency loading:

%dep
z.reset
z.load("org.apache.bahir:spark-streaming-twitter_2.11:2.0.0")

Output:

DepInterpreter(%dep) deprecated. Remove dependencies and repositories through GUI interpreter menu instead.
DepInterpreter(%dep) deprecated. Load dependency through
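
Once the Bahir artifact is available to the Spark interpreter (in Zeppelin 0.6.x this is done through the interpreter dependency settings rather than the deprecated %dep interpreter), the Twitter stream can be used much as before. The sketch below is a minimal notebook paragraph, assuming sc is the SparkContext provided by Zeppelin and that twitter4j OAuth credentials are supplied via system properties (hence None for the auth argument):

```scala
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.twitter.TwitterUtils  // provided by org.apache.bahir:spark-streaming-twitter_2.11

val ssc = new StreamingContext(sc, Seconds(10))   // `sc` is injected by the Zeppelin Spark interpreter
val tweets = TwitterUtils.createStream(ssc, None) // None => read OAuth credentials from twitter4j system properties
tweets.map(_.getText).print()                     // print a sample of tweet texts each batch

ssc.start()
```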

Filter lines by two words in Spark Streaming

Submitted by 陌路散爱 on 2020-01-04 05:36:28
Question: Is there a way to filter, with a single expression, the lines containing either the word "word1" or the word "word2", something like:

val res = lines.filter(line => line.contains("word1" or "word2"))

because this expression doesn't work. Thank you in advance.

Answer 1: If line is a String, the optimal choice would be a regexp:

val pattern = "word1|word2".r
lines.filter(line => pattern.findFirstIn(line).isDefined)

otherwise (for another sequence type) you can use Seq.exists:

lines.filter(line => Seq("foo", "bar").exists(s => line
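
A small, self-contained version of the regex approach from the answer, runnable in a plain Scala REPL with an ordinary collection standing in for the stream:

```scala
// The same pattern works unchanged on an RDD[String] or DStream[String].
val pattern = "word1|word2".r

val lines = Seq("word1 appears here", "nothing relevant", "also word2 here")
val res = lines.filter(line => pattern.findFirstIn(line).isDefined)
// res: Seq[String] = List("word1 appears here", "also word2 here")
```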

Unresolved dependencies path for SBT project in IntelliJ

Submitted by ▼魔方 西西 on 2020-01-03 17:47:09
Question: I'm using IntelliJ to develop a Spark application, and I'm following this instruction on how to make IntelliJ work nicely with an SBT project. Since my whole team is using IntelliJ, we can just modify build.sbt, but we got this unresolved dependencies error:

Error:Error while importing SBT project:
[info] Resolving org.apache.thrift#libfb303;0.9.2 ...
[info] Resolving org.apache.spark#spark-streaming_2.10;2.1.0 ...
[info] Resolving org.apache.spark#spark-streaming_2.10;2.1.0 ...
[info] Resolving org
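
For reference, a hypothetical build.sbt matching the artifacts shown in the resolution log (Scala 2.10, Spark 2.1.0); the project name and the use of the provided scope are assumptions, not the poster's actual file:

```scala
name := "spark-app"

scalaVersion := "2.10.6"

// Spark artifacts are published to Maven Central, so no extra resolvers should be needed.
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core"      % "2.1.0" % "provided",
  "org.apache.spark" %% "spark-streaming" % "2.1.0" % "provided"
)
```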

Apache Phoenix (4.3.1 and 4.4.0-HBase-0.98) on Spark 1.3.1 ClassNotFoundException

Submitted by 北战南征 on 2020-01-03 17:00:57
Question: I'm trying to connect to Phoenix via Spark, and I keep getting the following exception when opening a connection via the JDBC driver (cut for brevity, full stacktrace below):

Caused by: java.lang.ClassNotFoundException: org.apache.hadoop.hbase.ipc.controller.ClientRpcControllerFactory
    at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
    at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
    at java.security.AccessController.doPrivileged(Native Method)

The class in question is provided
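
For context, org.apache.hadoop.hbase.ipc.controller.ClientRpcControllerFactory is shipped with Phoenix (despite the hbase package name), so this ClassNotFoundException generally means the Phoenix client jar is missing from the driver and/or executor classpath. A minimal sketch of opening the connection, with a placeholder ZooKeeper quorum:

```scala
import java.sql.DriverManager

// Optional with JDBC 4 auto-loading, but harmless and makes the dependency explicit.
Class.forName("org.apache.phoenix.jdbc.PhoenixDriver")

// URL format is jdbc:phoenix:<zookeeper quorum>:<port>:<hbase znode>; values below are placeholders.
val conn = DriverManager.getConnection("jdbc:phoenix:zk-host:2181:/hbase")
try {
  val rs = conn.createStatement().executeQuery("SELECT TABLE_NAME FROM SYSTEM.CATALOG LIMIT 5")
  while (rs.next()) println(rs.getString(1))
} finally {
  conn.close()
}
```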

How to update an RDD periodically in Spark Streaming

Submitted by 风流意气都作罢 on 2020-01-03 15:26:12
Question: My code is something like:

sc = SparkContext()
ssc = StreamingContext(sc, 30)

initRDD = sc.parallelize('path_to_data')
lines = ssc.socketTextStream('localhost', 9999)

res = lines.transform(lambda x: x.join(initRDD))
res.pprint()

My question is that initRDD needs to be updated every day at midnight. I tried it this way:

sc = SparkContext()
ssc = StreamingContext(sc, 30)

lines = ssc.socketTextStream('localhost', 9999)

def func(rdd):
    initRDD = rdd.context.parallelize('path_to_data')
    return rdd
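
One common way to express this, sketched here in Scala for consistency with the other snippets on this page: the function passed to transform() is evaluated on the driver for every batch, so the lookup RDD can be rebuilt there when it becomes stale. The path, host, port, and the simple elapsed-time check standing in for "at midnight" are all placeholder assumptions.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.{Seconds, StreamingContext}

object RefreshLookupSketch {
  def main(args: Array[String]): Unit = {
    val sc  = new SparkContext(new SparkConf().setAppName("RefreshLookup"))
    val ssc = new StreamingContext(sc, Seconds(30))

    // Lookup data loaded once up front and cached; keyed so it can be joined.
    var lookup: RDD[(String, String)] =
      sc.textFile("path_to_data").map(l => (l, l)).cache()
    var lastLoad = System.currentTimeMillis()

    val lines = ssc.socketTextStream("localhost", 9999).map(l => (l, l))

    // transform() runs this block on the driver for each batch, so the
    // cached lookup RDD can be swapped out here before the join is planned.
    val res = lines.transform { rdd =>
      if (System.currentTimeMillis() - lastLoad > 24 * 60 * 60 * 1000L) {
        lookup.unpersist()
        lookup = rdd.sparkContext.textFile("path_to_data").map(l => (l, l)).cache()
        lastLoad = System.currentTimeMillis()
      }
      rdd.join(lookup)
    }

    res.print()
    ssc.start()
    ssc.awaitTermination()
  }
}
```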

How to read a .csv file using spark-shell

Submitted by 。_饼干妹妹 on 2020-01-03 06:36:23
Question: I am using a standalone Spark build, prebuilt for Hadoop. I was wondering what library I should import in order to read a .csv file. I found one library on GitHub: https://github.com/tototoshi/scala-csv. But when I typed import com.github.tototoshi.csv._ as illustrated in the readme, it doesn't work. Should I do something else before importing it, maybe something like building it using sbt first? I tried to build it using sbt and it doesn't work either (what I did is following the step in the last
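
A side note that may help: scala-csv is a plain Scala library for parsing CSV files locally, while reading CSV into Spark is normally done through Spark's own data source API. On Spark 2.x the CSV reader is built in, so nothing needs to be imported in spark-shell; on Spark 1.x the usual route is launching spark-shell with --packages com.databricks:spark-csv_2.10:1.5.0. A minimal Spark 2.x sketch, with a placeholder path:

```scala
// Paste directly into spark-shell; `spark` is the SparkSession it provides.
val df = spark.read
  .option("header", "true")       // treat the first line as column names
  .option("inferSchema", "true")  // guess column types instead of all-strings
  .csv("/path/to/file.csv")

df.printSchema()
df.show(5)
```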

Null Pointer Exception When Trying to Use Persisted Table in Spark Streaming

Submitted by 强颜欢笑 on 2020-01-03 05:46:26
Question: I am creating "gpsLookUpTable" at the beginning and persisting it so that I do not need to pull it over and over again for the mapping. However, when I try to access it inside foreach I get a null pointer exception. Any help is appreciated, thanks. Below is the code snippet:

def main(args: Array[String]): Unit = {
  val conf = new SparkConf() ...
  val sc = new SparkContext(conf)
  val ssc = new StreamingContext(sc, Seconds(20))
  val sqc = new SQLContext(sc)
  //////Trying to cache table here to use it below
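
A typical cause of this NullPointerException is that the SQLContext and any registered table only exist on the driver, so referencing them inside closures that run on the executors (foreach, map, and so on) fails. One common workaround, sketched below with hypothetical column names and dummy data, is to collect the small lookup table into a map and broadcast it before using it inside the streaming closures:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext
import org.apache.spark.streaming.{Seconds, StreamingContext}

object BroadcastLookupSketch {
  def main(args: Array[String]): Unit = {
    val sc  = new SparkContext(new SparkConf().setAppName("GpsLookup"))
    val ssc = new StreamingContext(sc, Seconds(20))
    val sqc = new SQLContext(sc)
    import sqc.implicits._

    // Stand-in for the real gpsLookUpTable; columns "id" and "location" are assumptions.
    Seq(("a", "loc-a"), ("b", "loc-b")).toDF("id", "location")
      .registerTempTable("gpsLookUpTable")

    // Driver side: materialise the lookup table and broadcast it to the executors.
    val lookupMap = sqc.table("gpsLookUpTable").rdd
      .map(r => (r.getAs[String]("id"), r.getAs[String]("location")))
      .collectAsMap()
    val lookupBc = sc.broadcast(lookupMap)

    val lines = ssc.socketTextStream("localhost", 9999)
    lines.foreachRDD { rdd =>
      rdd.foreach { id =>
        // Executor side: use the broadcast map, not the SQLContext or the table.
        println(lookupBc.value.getOrElse(id, "unknown"))
      }
    }

    ssc.start()
    ssc.awaitTermination()
  }
}
```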