spark-streaming

Spark 2.1 Structured Streaming - Using Kafka as source with Python (pyspark)

Submitted by 烂漫一生 on 2020-01-04 08:22:05
Question: With Apache Spark version 2.1, I would like to use Kafka (0.10.0.2.5) as the source for Structured Streaming with pyspark.

kafka_app.py:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("TestKakfa").getOrCreate()

kafka = spark.readStream.format("kafka") \
    .option("kafka.bootstrap.servers", "localhost:6667") \
    .option("subscribe", "mytopic").load()

I launched the app in the following way:

./bin/spark-submit kafka_app.py --master local[4] --jars spark-streaming-kafka-0-10
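
For what it's worth, the Structured Streaming Kafka source is provided by the spark-sql-kafka-0-10 artifact (spark-streaming-kafka-0-10 targets the older DStream API), typically added with something like --packages org.apache.spark:spark-sql-kafka-0-10_2.11:2.1.0. Below is a minimal sketch of the same pipeline, written in Scala for consistency with the other snippets on this page; the broker address and topic name are taken from the question.

```scala
import org.apache.spark.sql.SparkSession

object KafkaSourceSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("TestKafka").getOrCreate()

    // Structured Streaming Kafka source; requires the spark-sql-kafka-0-10 connector on the classpath
    val kafka = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "localhost:6667")
      .option("subscribe", "mytopic")
      .load()

    // Kafka records arrive as binary key/value columns; cast them to strings before use
    val messages = kafka.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")

    // Write each micro-batch to the console for inspection
    val query = messages.writeStream.format("console").start()
    query.awaitTermination()
  }
}
```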

Spark Streaming application and Kafka log4j appender issue

Submitted by 倖福魔咒の on 2020-01-04 07:17:20
Question: I am testing my Spark Streaming application, and I have multiple functions in my code: some of them operate on a DStream[RDD[XXX]], and some on RDD[XXX] (after I call DStream.foreachRDD). I use the Kafka log4j appender to log business cases that occur within my functions, both those that operate on DStream[RDD] and those that operate on the RDD itself. But data gets appended to Kafka only from the functions that operate on RDD; it doesn't work when I want to append data to Kafka from my functions that operate on DStream.
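
A detail that often explains this behaviour: the body of foreachRDD runs on the driver, while closures passed to RDD transformations and actions run on the executors, so each JVM needs its own log4j configuration (including the Kafka appender) for its log calls to reach Kafka. The Scala sketch below only illustrates where each piece of code executes; the DStream, logger name, and log messages are placeholders, not the poster's code.

```scala
import org.apache.log4j.Logger
import org.apache.spark.streaming.dstream.DStream

// Illustrative only: shows which JVM runs each logging call.
def logSketch(stream: DStream[String]): Unit = {
  stream.foreachRDD { rdd =>
    // Runs on the driver: the driver's log4j configuration (and any
    // Kafka appender configured there) applies to this call.
    Logger.getLogger("business").info(s"batch size = ${rdd.count()}")

    rdd.foreachPartition { records =>
      // Runs on the executors: each executor needs its own log4j
      // configuration with the Kafka appender for these calls to reach Kafka.
      val executorLogger = Logger.getLogger("business")
      records.foreach(r => executorLogger.info(s"record: $r"))
    }
  }
}
```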

Apache Zeppelin 0.6.1: Run Spark 2.0 Twitter Stream App

Submitted by 送分小仙女 on 2020-01-04 06:34:32
Question: I have a cluster with Spark 2.0 and Zeppelin 0.6.1 installed. Since the class TwitterUtils.scala was moved from the Spark project to Apache Bahir, I can't use TwitterUtils in my Zeppelin notebook anymore. Here are the snippets from my notebook.

Dependency loading:

%dep
z.reset
z.load("org.apache.bahir:spark-streaming-twitter_2.11:2.0.0")

Output:

DepInterpreter(%dep) deprecated. Remove dependencies and repositories through GUI interpreter menu instead.
DepInterpreter(%dep) deprecated. Load dependency through
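
Once the Bahir artifact is available to the Spark interpreter (in Zeppelin 0.6.x this is done through the interpreter dependency settings rather than the deprecated %dep interpreter), the Twitter stream can be used much as before. The sketch below is a minimal notebook paragraph, assuming sc is the SparkContext provided by Zeppelin and that twitter4j OAuth credentials are supplied via system properties (hence None for the auth argument):

```scala
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.twitter.TwitterUtils  // provided by org.apache.bahir:spark-streaming-twitter_2.11

val ssc = new StreamingContext(sc, Seconds(10))   // `sc` is injected by the Zeppelin Spark interpreter
val tweets = TwitterUtils.createStream(ssc, None) // None => read OAuth credentials from twitter4j system properties
tweets.map(_.getText).print()                     // print a sample of tweet texts each batch

ssc.start()
```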

Filter lines by two words in Spark Streaming

Submitted by 陌路散爱 on 2020-01-04 05:36:28
Question: Is there a way to filter, with a single expression, the lines containing either the word "word1" or the word "word2", something like:

val res = lines.filter(line => line.contains("word1" or "word2"))

because this expression doesn't work. Thank you in advance.

Answer 1: If line is a String, the optimal choice would be a regexp:

val pattern = "word1|word2".r
lines.filter(line => pattern.findFirstIn(line).isDefined)

otherwise (for another sequence type) you can use Seq.exists:

lines.filter(line => Seq("foo", "bar").exists(s => line
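
A small, self-contained version of the regex approach from the answer, runnable in a plain Scala REPL with an ordinary collection standing in for the stream:

```scala
// The same pattern works unchanged on an RDD[String] or DStream[String].
val pattern = "word1|word2".r

val lines = Seq("word1 appears here", "nothing relevant", "also word2 here")
val res = lines.filter(line => pattern.findFirstIn(line).isDefined)
// res: Seq[String] = List("word1 appears here", "also word2 here")
```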

Unresolved dependencies path for SBT project in IntelliJ

Submitted by ▼魔方 西西 on 2020-01-03 17:47:09
Question: I'm using IntelliJ to develop a Spark application, and I'm following this instruction on how to make IntelliJ work nicely with an SBT project. Since my whole team is using IntelliJ, we can just modify build.sbt, but we got this unresolved dependencies error:

Error:Error while importing SBT project:
[info] Resolving org.apache.thrift#libfb303;0.9.2 ...
[info] Resolving org.apache.spark#spark-streaming_2.10;2.1.0 ...
[info] Resolving org.apache.spark#spark-streaming_2.10;2.1.0 ...
[info] Resolving org
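
For reference, a hypothetical build.sbt matching the artifacts shown in the resolution log (Scala 2.10, Spark 2.1.0); the project name and the use of the provided scope are assumptions, not the poster's actual file:

```scala
name := "spark-app"

scalaVersion := "2.10.6"

// Spark artifacts are published to Maven Central, so no extra resolvers should be needed.
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core"      % "2.1.0" % "provided",
  "org.apache.spark" %% "spark-streaming" % "2.1.0" % "provided"
)
```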

Apache Phoenix (4.3.1 and 4.4.0-HBase-0.98) on Spark 1.3.1 ClassNotFoundException

Submitted by 北战南征 on 2020-01-03 17:00:57
Question: I'm trying to connect to Phoenix via Spark, and I keep getting the following exception when opening a connection via the JDBC driver (cut for brevity, full stacktrace below):

Caused by: java.lang.ClassNotFoundException: org.apache.hadoop.hbase.ipc.controller.ClientRpcControllerFactory
    at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
    at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
    at java.security.AccessController.doPrivileged(Native Method)

The class in question is provided
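
For context, org.apache.hadoop.hbase.ipc.controller.ClientRpcControllerFactory is shipped with Phoenix (despite the hbase package name), so this ClassNotFoundException generally means the Phoenix client jar is missing from the driver and/or executor classpath. A minimal sketch of opening the connection, with a placeholder ZooKeeper quorum:

```scala
import java.sql.DriverManager

// Optional with JDBC 4 auto-loading, but harmless and makes the dependency explicit.
Class.forName("org.apache.phoenix.jdbc.PhoenixDriver")

// URL format is jdbc:phoenix:<zookeeper quorum>:<port>:<hbase znode>; values below are placeholders.
val conn = DriverManager.getConnection("jdbc:phoenix:zk-host:2181:/hbase")
try {
  val rs = conn.createStatement().executeQuery("SELECT TABLE_NAME FROM SYSTEM.CATALOG LIMIT 5")
  while (rs.next()) println(rs.getString(1))
} finally {
  conn.close()
}
```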

How to update an RDD periodically in Spark Streaming

Submitted by 风流意气都作罢 on 2020-01-03 15:26:12
Question: My code is something like:

sc = SparkContext()
ssc = StreamingContext(sc, 30)

initRDD = sc.parallelize('path_to_data')
lines = ssc.socketTextStream('localhost', 9999)

res = lines.transform(lambda x: x.join(initRDD))
res.pprint()

My question is that initRDD needs to be updated every day at midnight. I tried it this way:

sc = SparkContext()
ssc = StreamingContext(sc, 30)

lines = ssc.socketTextStream('localhost', 9999)

def func(rdd):
    initRDD = rdd.context.parallelize('path_to_data')
    return rdd
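
One common way to express this, sketched here in Scala for consistency with the other snippets on this page: the function passed to transform() is evaluated on the driver for every batch, so the lookup RDD can be rebuilt there when it becomes stale. The path, host, port, and the simple elapsed-time check standing in for "at midnight" are all placeholder assumptions.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.{Seconds, StreamingContext}

object RefreshLookupSketch {
  def main(args: Array[String]): Unit = {
    val sc  = new SparkContext(new SparkConf().setAppName("RefreshLookup"))
    val ssc = new StreamingContext(sc, Seconds(30))

    // Lookup data loaded once up front and cached; keyed so it can be joined.
    var lookup: RDD[(String, String)] =
      sc.textFile("path_to_data").map(l => (l, l)).cache()
    var lastLoad = System.currentTimeMillis()

    val lines = ssc.socketTextStream("localhost", 9999).map(l => (l, l))

    // transform() runs this block on the driver for each batch, so the
    // cached lookup RDD can be swapped out here before the join is planned.
    val res = lines.transform { rdd =>
      if (System.currentTimeMillis() - lastLoad > 24 * 60 * 60 * 1000L) {
        lookup.unpersist()
        lookup = rdd.sparkContext.textFile("path_to_data").map(l => (l, l)).cache()
        lastLoad = System.currentTimeMillis()
      }
      rdd.join(lookup)
    }

    res.print()
    ssc.start()
    ssc.awaitTermination()
  }
}
```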

How to read a .csv file using spark-shell

Submitted by 。_饼干妹妹 on 2020-01-03 06:36:23
Question: I am using a standalone Spark build, prebuilt for Hadoop. I was wondering what library I should import in order to read a .csv file. I found one library on GitHub: https://github.com/tototoshi/scala-csv. But when I typed import com.github.tototoshi.csv._ as illustrated in the readme, it doesn't work. Should I do something else before importing it, maybe something like building it using sbt first? I tried to build it using sbt and it doesn't work either (what I did is following the step in the last
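
A side note that may help: scala-csv is a plain Scala library for parsing CSV files locally, while reading CSV into Spark is normally done through Spark's own data source API. On Spark 2.x the CSV reader is built in, so nothing needs to be imported in spark-shell; on Spark 1.x the usual route is launching spark-shell with --packages com.databricks:spark-csv_2.10:1.5.0. A minimal Spark 2.x sketch, with a placeholder path:

```scala
// Paste directly into spark-shell; `spark` is the SparkSession it provides.
val df = spark.read
  .option("header", "true")       // treat the first line as column names
  .option("inferSchema", "true")  // guess column types instead of all-strings
  .csv("/path/to/file.csv")

df.printSchema()
df.show(5)
```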

Null Pointer Exception When Trying to Use Persisted Table in Spark Streaming

Submitted by 强颜欢笑 on 2020-01-03 05:46:26
Question: I am creating "gpsLookUpTable" at the beginning and persisting it so that I do not need to pull it over and over again for the mapping. However, when I try to access it inside foreach I get a null pointer exception. Any help is appreciated, thanks. Below is the code snippet:

def main(args: Array[String]): Unit = {
  val conf = new SparkConf() ...
  val sc = new SparkContext(conf)
  val ssc = new StreamingContext(sc, Seconds(20))
  val sqc = new SQLContext(sc)
  //////Trying to cache table here to use it below
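
A typical cause of this NullPointerException is that the SQLContext and any registered table only exist on the driver, so referencing them inside closures that run on the executors (foreach, map, and so on) fails. One common workaround, sketched below with hypothetical column names and dummy data, is to collect the small lookup table into a map and broadcast it before using it inside the streaming closures:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext
import org.apache.spark.streaming.{Seconds, StreamingContext}

object BroadcastLookupSketch {
  def main(args: Array[String]): Unit = {
    val sc  = new SparkContext(new SparkConf().setAppName("GpsLookup"))
    val ssc = new StreamingContext(sc, Seconds(20))
    val sqc = new SQLContext(sc)
    import sqc.implicits._

    // Stand-in for the real gpsLookUpTable; columns "id" and "location" are assumptions.
    Seq(("a", "loc-a"), ("b", "loc-b")).toDF("id", "location")
      .registerTempTable("gpsLookUpTable")

    // Driver side: materialise the lookup table and broadcast it to the executors.
    val lookupMap = sqc.table("gpsLookUpTable").rdd
      .map(r => (r.getAs[String]("id"), r.getAs[String]("location")))
      .collectAsMap()
    val lookupBc = sc.broadcast(lookupMap)

    val lines = ssc.socketTextStream("localhost", 9999)
    lines.foreachRDD { rdd =>
      rdd.foreach { id =>
        // Executor side: use the broadcast map, not the SQLContext or the table.
        println(lookupBc.value.getOrElse(id, "unknown"))
      }
    }

    ssc.start()
    ssc.awaitTermination()
  }
}
```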