spark-streaming

PySpark: processing stream data and saving the processed data to a file

Posted by 坚强是说给别人听的谎言 on 2019-12-25 08:04:31
Question: I am trying to replicate a device that streams its location coordinates, then process the data and save it to a text file. I am using Kafka and Spark Streaming (on PySpark). This is my architecture:

1. A Kafka producer emits data to a topic named test as strings in the format "LG float LT float", for example: LG 8100.25191107 LT 8406.43141483

Producer code:

from kafka import KafkaProducer
import random

producer = KafkaProducer(bootstrap_servers='localhost:9092')
for i in range(0
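
A minimal sketch of the consuming side of this architecture, assuming the broker at localhost:9092, the topic test, and the spark-streaming-kafka-0-8 package on the classpath; the parser for the "LG float LT float" format and the output path are illustrative:

from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils

sc = SparkContext(appName="CoordinateConsumer")
ssc = StreamingContext(sc, 5)  # 5-second micro-batches

# Direct stream over the producer's topic; records arrive as (key, value) pairs
stream = KafkaUtils.createDirectStream(
    ssc, ['test'], {"bootstrap.servers": "localhost:9092"})

def parse(line):
    # "LG 8100.25191107 LT 8406.43141483" -> (8100.25191107, 8406.43141483)
    parts = line.split()
    return (float(parts[1]), float(parts[3]))

coords = stream.map(lambda kv: kv[1]).map(parse)

# Each micro-batch is written as a set of text files under this prefix
coords.saveAsTextFiles("/tmp/coords/batch")

ssc.start()
ssc.awaitTermination()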

spark-submit ClassNotFoundException with Maven

Posted by 我只是一个虾纸丫 on 2019-12-25 08:02:35
Question: I realize there are related questions to this one, but I just can't get my code to work. I am running a Spark Streaming application in standalone mode, with the master node on my Windows host and a worker in an Ubuntu virtual machine. Here is the problem: when I run spark-submit, this is what shows up:

>spark-submit --master spark://192.168.56.1:7077 --class spark.example.Main C:/Users/Manuel Mourato/xxx/target/ParkMonitor-1.0-SNAPSHOT.jar
Warning: Skip remote jar C:/Users/Manuel.
java.lang

Note that the warning cuts the jar path off exactly at the space in "Manuel Mourato", which suggests the unquoted path is being split into separate arguments; quoting the jar path is likely the first thing to check.

Cartesian product of two DStreams in Spark

Posted by 笑着哭i on 2019-12-25 07:27:16
Question: How can I compute the Cartesian product of two DStreams in Spark Streaming, like cartesian(RDD<U>), which when called on datasets of types T and U returns a dataset of (T, U) pairs (all pairs of elements)? One solution is using join as follows, which doesn't seem good:

JavaPairDStream<Integer, String> xx = DStream_A.mapToPair(s -> {
    return new Tuple2<>(1, s);
});
JavaPairDStream<Integer, String> yy = DStream_B.mapToPair(e -> {
    return new Tuple2<>(1, e);
});
DStream_A_product_B = xx.join(yy);

Is there any better solution?
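
One cleaner route (sketched here in PySpark for consistency with the other examples; the Java API exposes the same transformWith hook) is to drop down to the per-batch RDDs and call cartesian directly instead of simulating it with a constant-key join. The stream sources are illustrative, and note that a Cartesian product is quadratic in the batch sizes:

from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext(appName="DStreamCartesian")
ssc = StreamingContext(sc, 5)

# Two illustrative input streams
stream_a = ssc.socketTextStream("localhost", 9999)
stream_b = ssc.socketTextStream("localhost", 9998)

# transformWith exposes the two RDDs that make up each batch,
# so RDD.cartesian gives all (a, b) pairs directly
product = stream_a.transformWith(
    lambda rdd_a, rdd_b: rdd_a.cartesian(rdd_b), stream_b)

product.pprint()
ssc.start()
ssc.awaitTermination()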

Spark Streaming: java.io.FileNotFoundException: File does not exist: <input_filename>._COPYING_

Posted by 半世苍凉 on 2019-12-25 07:07:33
Question: I am writing a Spark Streaming application which reads input from HDFS. I submit the Spark application to YARN and then run a script which copies data from the local FS to HDFS. But the Spark application starts throwing a FileNotFoundException. I believe this is happening because Spark picks up files before they have been fully copied onto HDFS. Following is part of the exception trace:

java.io.FileNotFoundException: File does not exist: <filename>._COPYING_
    at org.apache.hadoop.hdfs.server.namenode
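
The diagnosis in the question matches the symptom: hdfs dfs -put writes through a <filename>._COPYING_ temporary, and a file-based stream can pick that name up before the copy finishes. A common workaround is to upload into a staging directory on the same HDFS filesystem and then move the file into the watched directory, since a rename within one HDFS filesystem is atomic. A sketch of such an upload script (all paths are illustrative assumptions):

import subprocess

LOCAL_FILE = "/data/local/events.log"     # hypothetical local file
STAGING_DIR = "/user/me/staging"          # not watched by the streaming job
WATCHED_DIR = "/user/me/input"            # directory the streaming job monitors

# Step 1: copy into staging; the transient ._COPYING_ file is harmless here
subprocess.check_call(["hdfs", "dfs", "-put", LOCAL_FILE, STAGING_DIR])

# Step 2: atomic rename into the watched directory, so the streaming job
# only ever sees fully written files
subprocess.check_call(["hdfs", "dfs", "-mv",
                       STAGING_DIR + "/events.log",
                       WATCHED_DIR + "/events.log"])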

How to implement LEFT or RIGHT JOIN using spark-cassandra-connector

Posted by 不羁的心 on 2019-12-25 06:36:35
Question: I have a Spark Streaming job and use Cassandra as the datastore. I have a stream that needs to be joined with a Cassandra table. I am using spark-cassandra-connector; there is a great method, joinWithCassandraTable, which as far as I can understand implements an inner join with a Cassandra table:

val source: DStream[...] = ...
source.foreachRDD { rdd =>
  rdd.joinWithCassandraTable("keyspace", "table").map { ... }
}

So the question is: how can I implement a left outer join with a Cassandra table? Thanks in advance.
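
Worth checking first: newer releases of spark-cassandra-connector add a leftJoinWithCassandraTable counterpart to joinWithCassandraTable. Failing that, the DataFrame API can express the same left outer join through the connector's data source; a sketch in PySpark, where the keyspace, table, join key, and connection host are illustrative assumptions and the connector package must be on the classpath:

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("CassandraLeftJoin")
         .config("spark.cassandra.connection.host", "localhost")  # assumed host
         .getOrCreate())

# Load the Cassandra table through the connector's DataFrame source
cassandra_df = (spark.read
                .format("org.apache.spark.sql.cassandra")
                .options(keyspace="keyspace", table="table")
                .load())

# Stand-in for one micro-batch of the stream, converted to a DataFrame
stream_df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "payload"])

# Left outer join: every stream row survives; Cassandra columns are
# null where no matching row exists
stream_df.join(cassandra_df, on="id", how="left_outer").show()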

Structured Streaming: different schemas in nested JSON

Posted by 风格不统一 on 2019-12-25 03:19:13
Question: Hi, I have a scenario where the incoming message is JSON with a header, say tableName, and a data part that holds the table's column data. Now I want to write this out as Parquet to separate folders, say /emp and /dept. I can achieve this in regular streaming by aggregating rows based on the tableName, but in Structured Streaming I am unable to split the stream. How can I achieve this in Structured Streaming?

{"tableName":"employee","data":{"empid":"1","empname":"john","dept":"CS"}}
{"tableName":"employee",
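
One way to split a single input stream into per-table Parquet folders is to start one streaming query per table, each filtering on tableName and applying its own payload schema. A sketch, assuming Spark 2.x Structured Streaming; the source, the paths, and the dept branch are illustrative (only the employee schema appears in the sample):

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json, get_json_object
from pyspark.sql.types import StructType, StructField, StringType

spark = SparkSession.builder.appName("SplitByTableName").getOrCreate()

raw = (spark.readStream
       .format("socket")  # illustrative source; a Kafka source works the same way
       .option("host", "localhost").option("port", 9999)
       .load())

# Pull the header out and keep the payload as a raw JSON string,
# so each branch can apply its own schema after the split
parsed = raw.select(
    get_json_object(col("value"), "$.tableName").alias("tableName"),
    get_json_object(col("value"), "$.data").alias("data"))

emp_schema = StructType([
    StructField("empid", StringType()),
    StructField("empname", StringType()),
    StructField("dept", StringType()),
])

# One independent query per target folder
emp_query = (parsed.filter(col("tableName") == "employee")
             .select(from_json(col("data"), emp_schema).alias("d"))
             .select("d.*")
             .writeStream.format("parquet")
             .option("path", "/emp")
             .option("checkpointLocation", "/tmp/ckpt/emp")
             .start())

# A second query filtering tableName == "department" would write to /dept
spark.streams.awaitAnyTermination()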

How do I consume a Kafka topic inside a Spark Streaming app?

Posted by ぐ巨炮叔叔 on 2019-12-25 01:14:04
Question: When I create a stream from a Kafka topic and print its content:

import os
os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages org.apache.spark:spark-streaming-kafka-0-8_2.11:2.0.2 pyspark-shell'

from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils

sc = SparkContext(appName="PythonStreamingKafkaWords")
ssc = StreamingContext(sc, 10)
lines = KafkaUtils.createDirectStream(ssc, ['sample_topic'], {"bootstrap.servers": 'localhost
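
For reference, a runnable version of the same snippet, with the truncated broker string assumed to be localhost:9092 and the missing output and lifecycle calls filled in:

import os
os.environ['PYSPARK_SUBMIT_ARGS'] = ('--packages '
    'org.apache.spark:spark-streaming-kafka-0-8_2.11:2.0.2 pyspark-shell')

from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils

sc = SparkContext(appName="PythonStreamingKafkaWords")
ssc = StreamingContext(sc, 10)  # 10-second batches

lines = KafkaUtils.createDirectStream(
    ssc, ['sample_topic'],
    {"bootstrap.servers": "localhost:9092"})  # assumed broker address

# Records arrive as (key, value) pairs; print a sample of each batch's values
lines.map(lambda kv: kv[1]).pprint()

ssc.start()
ssc.awaitTermination()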

Unable to see messages from Kafka Stream in Spark

Posted by 落爺英雄遲暮 on 2019-12-25 00:08:24
Question: I just started testing a Kafka stream to Spark using the PySpark library. I have been running the whole setup in a Jupyter Notebook. I am trying to get data from the Twitter stream. Twitter streaming code:

import json
import tweepy
from uuid import uuid4
import time
from kafka import KafkaConsumer
from kafka import KafkaProducer

auth = tweepy.OAuthHandler("key", "key")
auth.set_access_token("token", "token")
api = tweepy.API(auth, wait_on_rate_limit=True, retry_count=3, retry_delay=5,
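
A hedged sketch of how the rest of such a producer is typically wired up: a tweepy listener forwards each raw tweet JSON string into a Kafka topic for the Spark side to consume. This assumes tweepy 3.x (tweepy 4 removed StreamListener), a broker at localhost:9092, and an illustrative topic name and track filter:

import tweepy
from kafka import KafkaProducer

auth = tweepy.OAuthHandler("key", "key")       # placeholder credentials
auth.set_access_token("token", "token")

producer = KafkaProducer(bootstrap_servers="localhost:9092")  # assumed broker

class TweetForwarder(tweepy.StreamListener):
    def on_data(self, data):
        # 'data' is the raw tweet JSON string; forward it unchanged
        producer.send("twitter_topic", data.encode("utf-8"))  # illustrative topic
        return True  # keep the stream open

    def on_error(self, status_code):
        # Disconnect on rate limiting (HTTP 420) instead of hammering the API
        return status_code != 420

stream = tweepy.Stream(auth=auth, listener=TweetForwarder())
stream.filter(track=["spark"])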

Connecting Spark Streaming to Tableau

Posted by ☆樱花仙子☆ on 2019-12-24 21:16:52
Question: I am streaming tweets from a Twitter app into Spark for analysis. I want to output the resulting Spark SQL table to Tableau for real-time analysis locally. I have already tried connecting to Databricks to run the program, but I haven't been able to connect the Twitter app to a Databricks notebook. My code for writing the stream looks like this:

activityQuery = output.writeStream.trigger(processingTime='1 seconds').queryName("Places")\
    .format("memory")\
    .start()

Source: https://stackoverflow.com
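
For context, the memory sink registers the query name as an in-memory temporary table that is only visible inside the same Spark session, so Tableau cannot reach it directly; bridging to Tableau usually means exposing the results through something it can connect to, for example the Spark Thrift Server or an external database. A sketch of polling the table from the same session, continuing from the snippet above and assuming the SparkSession is named spark:

import time

# "Places" is the temp view the memory sink maintains for this query
while activityQuery.isActive:
    spark.sql("SELECT * FROM Places").show()
    time.sleep(1)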

Spark program finding popular hashtags from Twitter

Posted by ぐ巨炮叔叔 on 2019-12-24 19:04:47
Question: I am trying to run this Spark program, which will get me the hashtags currently popular on Twitter and show only the top 10. I have supplied the Twitter access token and secret, plus the consumer key and secret, via a text file.

import org.apache.spark.streaming.StreamingContext
import org.apache.spark.streaming.Seconds
import org.apache.spark.streaming.twitter.TwitterUtils

object PopularHashtags {
  def setupLogging() = {
    import org.apache.log4j.{Level, Logger}
    val rootLogger = Logger
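
TwitterUtils has no PySpark binding, so a Python equivalent would take tweets from another source (for example the Kafka bridge from the earlier questions). For consistency with the other examples here, a hedged PySpark sketch of just the top-10 hashtag logic, assuming tweet text arrives as lines on a DStream; the source and window sizes are illustrative:

from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext(appName="PopularHashtags")
ssc = StreamingContext(sc, 1)
ssc.checkpoint("/tmp/hashtag-ckpt")  # required by the windowed reduce below

tweets = ssc.socketTextStream("localhost", 9999)  # illustrative tweet-text source

# Count hashtags over a sliding 5-minute window, updated every second
counts = (tweets.flatMap(lambda line: line.split())
          .filter(lambda word: word.startswith("#"))
          .map(lambda tag: (tag, 1))
          .reduceByKeyAndWindow(lambda a, b: a + b, lambda a, b: a - b, 300, 1))

# Sort each window's counts descending and print the top 10
top = counts.transform(
    lambda rdd: rdd.sortBy(lambda kv: kv[1], ascending=False))
top.pprint(10)

ssc.start()
ssc.awaitTermination()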