PySpark: processing stream data and saving the processed data to a file

Question

I am trying to simulate a device that streams its location coordinates, then process the data and save it to a text file. I am using Kafka and Spark Streaming (with PySpark). This is my architecture:

1- A Kafka producer emits data to a topic named test in the following string format:

"LG float LT float", for example: LG 8100.25191107 LT 8406.43141483

Producer code:

from kafka import KafkaProducer
import random

producer = KafkaProducer(bootstrap_servers='localhost:9092')

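for i in range(0, 10000):
    lg_value = str(random.uniform(5000, 10000))
    lt_value = str(random.uniform(5000, 10000))
    # Send one message per iteration; kafka-python expects the value as bytes.
    producer.send('test', ('LG ' + lg_value + ' LT ' + lt_value).encode('utf-8'))

# Block until all buffered messages have been delivered.
producer.flush()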

The producer works fine, and I receive the streamed data in the consumer (and even in Spark).

2- Spark Streaming receives this stream; I can even pprint() it.

Spark Streaming processing code:

from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils

ssc = StreamingContext(sc, 1)
kvs = KafkaUtils.createDirectStream(ssc, ["test"], {"bootstrap.servers": "localhost:9092"})

lines = kvs.map(lambda x: x[1])

words      = lines.flatMap(lambda line: line.split(" "))
words.pprint()
word_pairs = words.map(lambda word: (word, 1))
counts     = word_pairs.reduceByKey(lambda a, b: a+b)
results    = counts.foreachRDD(lambda word: word.saveAsTextFile("C:\path\spark_test.txt"))
# I also tried kvs.saveAsTextFiles('C:\path\spark_test.txt'),
# which copies the whole stream and works fine.
ssc.start()
ssc.awaitTermination()

I get this error:

16/12/26 00:51:53 ERROR Executor: Exception in task 0.0 in stage 0.0 (TID 0)
org.apache.spark.SparkException: Python worker did not connect back in time

There are other exceptions as well.

What I actually want is to save each "LG float LT float" entry as JSON in a file, but first I simply want to save the coordinates to a file, and I can't seem to make that happen. Any ideas?

I can provide the full stack trace if needed.

Answer 1:

I solved this by writing a function that saves each RDD to the file. This is the code that solved my problem:

from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils

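# `sc` is assumed to be an existing SparkContext (for example, the one the pyspark shell provides).
ssc = StreamingContext(sc, 1)  # 1-second batches
kvs = KafkaUtils.createDirectStream(ssc, ["test"], {"bootstrap.servers": "localhost:9092"})

# Each Kafka message arrives as a (key, value) pair; keep only the value.
coords = kvs.map(lambda x: x[1])

def appendCoord(rec):
    # "LG 8100.25 LT 8406.43" is written as "{LG:8100.25,LT:8406.43},"
    parts = rec.split(" ")
    with open(r"C:\path\spark_test.txt", "a") as f:
        f.write("{" + parts[0] + ":" + parts[1] + "," + parts[2] + ":" + parts[3] + "},\n")

def saveCoord(rdd):
    # foreach runs on the executors, so the path must be writable where the tasks run.
    rdd.foreach(appendCoord)

coords.foreachRDD(saveCoord)
coords.pprint()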

ssc.start()
ssc.awaitTermination()
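
The lines written above look like JSON but are not valid JSON (the keys are unquoted and each line ends with a comma). Since the question asks for JSON output, here is a minimal sketch of one way to format a record with the standard json module. The helper name coord_to_json is hypothetical and not part of the original answer, and it assumes every record has exactly the "LG float LT float" shape.

```python
import json

def coord_to_json(record):
    """Convert 'LG <float> LT <float>' into a JSON object string."""
    parts = record.split()
    if len(parts) != 4:
        raise ValueError("expected 'LG <float> LT <float>', got %r" % record)
    # parts is [name, value, name, value]; parse the values as floats.
    obj = {parts[0]: float(parts[1]), parts[2]: float(parts[3])}
    # sort_keys keeps the key order stable from one line to the next.
    return json.dumps(obj, sort_keys=True)

if __name__ == "__main__":
    print(coord_to_json("LG 8100.25191107 LT 8406.43141483"))
```

In the streaming job, this could be applied with coords.map(coord_to_json) before saving, so that each output line is one self-contained JSON object.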

Source: https://stackoverflow.com/questions/41325355/pyspark-processing-stream-data-and-saving-processed-data-to-file

Tags

python-2.7

apache-spark

pyspark

spark-streaming

kafka-python