spark-streaming

PySpark: processing stream data and saving the processed data to a file

Posted by 坚强是说给别人听的谎言 on 2019-12-25 08:04:31
Question: I am trying to replicate a device that streams its location coordinates, then process the data and save it to a text file. I am using Kafka and Spark Streaming (on PySpark). This is my architecture:

1. A Kafka producer emits data to a topic named test as strings in the format "LG float LT float", for example: LG 8100.25191107 LT 8406.43141483

Producer code:

from kafka import KafkaProducer
import random

producer = KafkaProducer(bootstrap_servers='localhost:9092')
for i in range(0
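
A minimal sketch of the consuming side of this architecture, assuming the broker at localhost:9092, the topic test, and the spark-streaming-kafka-0-8 package on the classpath; the parser for the "LG float LT float" format and the output path are illustrative:

from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils

sc = SparkContext(appName="CoordinateConsumer")
ssc = StreamingContext(sc, 5)  # 5-second micro-batches

# Direct stream over the producer's topic; records arrive as (key, value) pairs
stream = KafkaUtils.createDirectStream(
    ssc, ['test'], {"bootstrap.servers": "localhost:9092"})

def parse(line):
    # "LG 8100.25191107 LT 8406.43141483" -> (8100.25191107, 8406.43141483)
    parts = line.split()
    return (float(parts[1]), float(parts[3]))

coords = stream.map(lambda kv: kv[1]).map(parse)

# Each micro-batch is written as a set of text files under this prefix
coords.saveAsTextFiles("/tmp/coords/batch")

ssc.start()
ssc.awaitTermination()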

spark-submit ClassNotFoundException with Maven

Posted by 我只是一个虾纸丫 on 2019-12-25 08:02:35
Question: I realize there are related questions to this one, but I just can't get my code to work. I am running a Spark Streaming application in standalone mode, with the master node on my Windows host and a worker in an Ubuntu virtual machine. Here is the problem: when I run spark-submit, this is what shows up:

>spark-submit --master spark://192.168.56.1:7077 --class spark.example.Main C:/Users/Manuel Mourato/xxx/target/ParkMonitor-1.0-SNAPSHOT.jar
Warning: Skip remote jar C:/Users/Manuel.
java.lang

Note that the warning cuts the jar path off exactly at the space in "Manuel Mourato", which suggests the unquoted path is being split into separate arguments; quoting the jar path is likely the first thing to check.

Cartesian product of two DStreams in Spark

Posted by 笑着哭i on 2019-12-25 07:27:16
Question: How can I compute the Cartesian product of two DStreams in Spark Streaming, like cartesian(RDD<U>), which when called on datasets of types T and U returns a dataset of (T, U) pairs (all pairs of elements)? One solution is using join as follows, which doesn't seem good:

JavaPairDStream<Integer, String> xx = DStream_A.mapToPair(s -> {
    return new Tuple2<>(1, s);
});
JavaPairDStream<Integer, String> yy = DStream_B.mapToPair(e -> {
    return new Tuple2<>(1, e);
});
DStream_A_product_B = xx.join(yy);

Is there any better solution?
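
One cleaner route (sketched here in PySpark for consistency with the other examples; the Java API exposes the same transformWith hook) is to drop down to the per-batch RDDs and call cartesian directly instead of simulating it with a constant-key join. The stream sources are illustrative, and note that a Cartesian product is quadratic in the batch sizes:

from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext(appName="DStreamCartesian")
ssc = StreamingContext(sc, 5)

# Two illustrative input streams
stream_a = ssc.socketTextStream("localhost", 9999)
stream_b = ssc.socketTextStream("localhost", 9998)

# transformWith exposes the two RDDs that make up each batch,
# so RDD.cartesian gives all (a, b) pairs directly
product = stream_a.transformWith(
    lambda rdd_a, rdd_b: rdd_a.cartesian(rdd_b), stream_b)

product.pprint()
ssc.start()
ssc.awaitTermination()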

Spark Streaming: java.io.FileNotFoundException: File does not exist: <input_filename>._COPYING_

Posted by 半世苍凉 on 2019-12-25 07:07:33
Question: I am writing a Spark Streaming application which reads input from HDFS. I submit the Spark application to YARN and then run a script which copies data from the local FS to HDFS. But the Spark application starts throwing a FileNotFoundException. I believe this is happening because Spark picks up files before they have been fully copied onto HDFS. Following is part of the exception trace:

java.io.FileNotFoundException: File does not exist: <filename>._COPYING_
    at org.apache.hadoop.hdfs.server.namenode
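
The diagnosis in the question matches the symptom: hdfs dfs -put writes through a <filename>._COPYING_ temporary, and a file-based stream can pick that name up before the copy finishes. A common workaround is to upload into a staging directory on the same HDFS filesystem and then move the file into the watched directory, since a rename within one HDFS filesystem is atomic. A sketch of such an upload script (all paths are illustrative assumptions):

import subprocess

LOCAL_FILE = "/data/local/events.log"     # hypothetical local file
STAGING_DIR = "/user/me/staging"          # not watched by the streaming job
WATCHED_DIR = "/user/me/input"            # directory the streaming job monitors

# Step 1: copy into staging; the transient ._COPYING_ file is harmless here
subprocess.check_call(["hdfs", "dfs", "-put", LOCAL_FILE, STAGING_DIR])

# Step 2: atomic rename into the watched directory, so the streaming job
# only ever sees fully written files
subprocess.check_call(["hdfs", "dfs", "-mv",
                       STAGING_DIR + "/events.log",
                       WATCHED_DIR + "/events.log"])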

How to implement LEFT or RIGHT JOIN using spark-cassandra-connector

Posted by 不羁的心 on 2019-12-25 06:36:35
Question: I have a Spark Streaming job and use Cassandra as the datastore. I have a stream that needs to be joined with a Cassandra table. I am using spark-cassandra-connector; there is a great method, joinWithCassandraTable, which as far as I can understand implements an inner join with a Cassandra table:

val source: DStream[...] = ...
source.foreachRDD { rdd =>
  rdd.joinWithCassandraTable("keyspace", "table").map { ... }
}

So the question is: how can I implement a left outer join with a Cassandra table? Thanks in advance.
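
Worth checking first: newer releases of spark-cassandra-connector add a leftJoinWithCassandraTable counterpart to joinWithCassandraTable. Failing that, the DataFrame API can express the same left outer join through the connector's data source; a sketch in PySpark, where the keyspace, table, join key, and connection host are illustrative assumptions and the connector package must be on the classpath:

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("CassandraLeftJoin")
         .config("spark.cassandra.connection.host", "localhost")  # assumed host
         .getOrCreate())

# Load the Cassandra table through the connector's DataFrame source
cassandra_df = (spark.read
                .format("org.apache.spark.sql.cassandra")
                .options(keyspace="keyspace", table="table")
                .load())

# Stand-in for one micro-batch of the stream, converted to a DataFrame
stream_df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "payload"])

# Left outer join: every stream row survives; Cassandra columns are
# null where no matching row exists
stream_df.join(cassandra_df, on="id", how="left_outer").show()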

Structured Streaming: different schemas in nested JSON

Posted by 风格不统一 on 2019-12-25 03:19:13
Question: Hi, I have a scenario where the incoming message is JSON with a header, say tableName, and a data part that holds the table's column data. Now I want to write this out as Parquet to separate folders, say /emp and /dept. I can achieve this in regular streaming by aggregating rows based on the tableName, but in Structured Streaming I am unable to split the stream. How can I achieve this in Structured Streaming?

{"tableName":"employee","data":{"empid":"1","empname":"john","dept":"CS"}}
{"tableName":"employee",
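
One way to split a single input stream into per-table Parquet folders is to start one streaming query per table, each filtering on tableName and applying its own payload schema. A sketch, assuming Spark 2.x Structured Streaming; the source, the paths, and the dept branch are illustrative (only the employee schema appears in the sample):

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json, get_json_object
from pyspark.sql.types import StructType, StructField, StringType

spark = SparkSession.builder.appName("SplitByTableName").getOrCreate()

raw = (spark.readStream
       .format("socket")  # illustrative source; a Kafka source works the same way
       .option("host", "localhost").option("port", 9999)
       .load())

# Pull the header out and keep the payload as a raw JSON string,
# so each branch can apply its own schema after the split
parsed = raw.select(
    get_json_object(col("value"), "$.tableName").alias("tableName"),
    get_json_object(col("value"), "$.data").alias("data"))

emp_schema = StructType([
    StructField("empid", StringType()),
    StructField("empname", StringType()),
    StructField("dept", StringType()),
])

# One independent query per target folder
emp_query = (parsed.filter(col("tableName") == "employee")
             .select(from_json(col("data"), emp_schema).alias("d"))
             .select("d.*")
             .writeStream.format("parquet")
             .option("path", "/emp")
             .option("checkpointLocation", "/tmp/ckpt/emp")
             .start())

# A second query filtering tableName == "department" would write to /dept
spark.streams.awaitAnyTermination()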

How do I consume a Kafka topic inside a Spark Streaming app?

Posted by ぐ巨炮叔叔 on 2019-12-25 01:14:04
Question: When I create a stream from a Kafka topic and print its content:

import os
os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages org.apache.spark:spark-streaming-kafka-0-8_2.11:2.0.2 pyspark-shell'

from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils

sc = SparkContext(appName="PythonStreamingKafkaWords")
ssc = StreamingContext(sc, 10)
lines = KafkaUtils.createDirectStream(ssc, ['sample_topic'], {"bootstrap.servers": 'localhost
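
For reference, a runnable version of the same snippet, with the truncated broker string assumed to be localhost:9092 and the missing output and lifecycle calls filled in:

import os
os.environ['PYSPARK_SUBMIT_ARGS'] = ('--packages '
    'org.apache.spark:spark-streaming-kafka-0-8_2.11:2.0.2 pyspark-shell')

from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils

sc = SparkContext(appName="PythonStreamingKafkaWords")
ssc = StreamingContext(sc, 10)  # 10-second batches

lines = KafkaUtils.createDirectStream(
    ssc, ['sample_topic'],
    {"bootstrap.servers": "localhost:9092"})  # assumed broker address

# Records arrive as (key, value) pairs; print a sample of each batch's values
lines.map(lambda kv: kv[1]).pprint()

ssc.start()
ssc.awaitTermination()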

Unable to see messages from Kafka Stream in Spark

Posted by 落爺英雄遲暮 on 2019-12-25 00:08:24
Question: I just started testing a Kafka stream to Spark using the PySpark library. I have been running the whole setup in a Jupyter Notebook. I am trying to get data from the Twitter stream. Twitter streaming code:

import json
import tweepy
from uuid import uuid4
import time
from kafka import KafkaConsumer
from kafka import KafkaProducer

auth = tweepy.OAuthHandler("key", "key")
auth.set_access_token("token", "token")
api = tweepy.API(auth, wait_on_rate_limit=True, retry_count=3, retry_delay=5,
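
A hedged sketch of how the rest of such a producer is typically wired up: a tweepy listener forwards each raw tweet JSON string into a Kafka topic for the Spark side to consume. This assumes tweepy 3.x (tweepy 4 removed StreamListener), a broker at localhost:9092, and an illustrative topic name and track filter:

import tweepy
from kafka import KafkaProducer

auth = tweepy.OAuthHandler("key", "key")       # placeholder credentials
auth.set_access_token("token", "token")

producer = KafkaProducer(bootstrap_servers="localhost:9092")  # assumed broker

class TweetForwarder(tweepy.StreamListener):
    def on_data(self, data):
        # 'data' is the raw tweet JSON string; forward it unchanged
        producer.send("twitter_topic", data.encode("utf-8"))  # illustrative topic
        return True  # keep the stream open

    def on_error(self, status_code):
        # Disconnect on rate limiting (HTTP 420) instead of hammering the API
        return status_code != 420

stream = tweepy.Stream(auth=auth, listener=TweetForwarder())
stream.filter(track=["spark"])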

Connecting Spark Streaming to Tableau

Posted by ☆樱花仙子☆ on 2019-12-24 21:16:52
Question: I am streaming tweets from a Twitter app into Spark for analysis. I want to output the resulting Spark SQL table to Tableau for real-time analysis locally. I have already tried connecting to Databricks to run the program, but I haven't been able to connect the Twitter app to a Databricks notebook. My code for writing the stream looks like this:

activityQuery = output.writeStream.trigger(processingTime='1 seconds').queryName("Places")\
    .format("memory")\
    .start()

Source: https://stackoverflow.com
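
For context, the memory sink registers the query name as an in-memory temporary table that is only visible inside the same Spark session, so Tableau cannot reach it directly; bridging to Tableau usually means exposing the results through something it can connect to, for example the Spark Thrift Server or an external database. A sketch of polling the table from the same session, continuing from the snippet above and assuming the SparkSession is named spark:

import time

# "Places" is the temp view the memory sink maintains for this query
while activityQuery.isActive:
    spark.sql("SELECT * FROM Places").show()
    time.sleep(1)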

Spark program finding popular hashtags from Twitter

Posted by ぐ巨炮叔叔 on 2019-12-24 19:04:47
Question: I am trying to run this Spark program, which will get me the hashtags currently popular on Twitter and show only the top 10. I have supplied the Twitter access token and secret, plus the consumer key and secret, via a text file.

import org.apache.spark.streaming.StreamingContext
import org.apache.spark.streaming.Seconds
import org.apache.spark.streaming.twitter.TwitterUtils

object PopularHashtags {
  def setupLogging() = {
    import org.apache.log4j.{Level, Logger}
    val rootLogger = Logger
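
TwitterUtils has no PySpark binding, so a Python equivalent would take tweets from another source (for example the Kafka bridge from the earlier questions). For consistency with the other examples here, a hedged PySpark sketch of just the top-10 hashtag logic, assuming tweet text arrives as lines on a DStream; the source and window sizes are illustrative:

from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext(appName="PopularHashtags")
ssc = StreamingContext(sc, 1)
ssc.checkpoint("/tmp/hashtag-ckpt")  # required by the windowed reduce below

tweets = ssc.socketTextStream("localhost", 9999)  # illustrative tweet-text source

# Count hashtags over a sliding 5-minute window, updated every second
counts = (tweets.flatMap(lambda line: line.split())
          .filter(lambda word: word.startswith("#"))
          .map(lambda tag: (tag, 1))
          .reduceByKeyAndWindow(lambda a, b: a + b, lambda a, b: a - b, 300, 1))

# Sort each window's counts descending and print the top 10
top = counts.transform(
    lambda rdd: rdd.sortBy(lambda kv: kv[1], ascending=False))
top.pprint(10)

ssc.start()
ssc.awaitTermination()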