PySpark Structured Streaming processing

萌比男神i 2021-01-28 04:10

I am trying to build a Structured Streaming application with Spark. The main idea is to read from a Kafka source, process the input, and write the result back to another topic. I have successful

1 answer
  •  滥情空心
    2021-01-28 04:44

    To process data before writing it to a Kafka sink in PySpark Structured Streaming, you can apply a UDF, which can carry out arbitrarily complex transformations.

    Example code is below. It reads JSON messages from a Kafka topic, parses each message, converts it from JSON to CSV format, and writes the result to another topic. You can substitute any other transformation for the 'json_formatted' function.

    import json

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import udf
    from pyspark.sql.types import StringType

    #  Spark session (Structured Streaming does not need a StreamingContext).
    #  The Kafka source and sink require the spark-sql-kafka-0-10 package on the classpath.

    spark = SparkSession.builder.appName('pda_inst_monitor_status_update').getOrCreate()


    #  Creating the readStream DataFrame from the source topic:

    df = spark \
      .readStream \
      .format("kafka") \
      .option("kafka.bootstrap.servers", "localhost:9092") \
      .option("subscribe", "KafkaStreamingSource") \
      .load()

    #  Kafka delivers the payload as binary, so cast it to a string first.
    df1 = df.selectExpr("CAST(value AS STRING)")


    #  Transformation: pull the fields out of the "after" object and
    #  join them into one comma-separated line.

    def json_formatted(s):
        val_dict = json.loads(s)
        return str([
                        val_dict["after"]["ID"]
                    ,   val_dict["after"]["INST_NAME"]
                    ,   val_dict["after"]["DB_UNIQUE_NAME"]
                    ,   val_dict["after"]["DBNAME"]
                    ,   val_dict["after"]["MON_START_TIME"]
                    ,   val_dict["after"]["MON_END_TIME"]
                    ]).strip('[]').replace("'","").replace('"','')

    json_formatted_udf = udf(json_formatted, StringType())

    #  The Kafka sink expects the payload in a column named "value".
    df2 = df1.select(json_formatted_udf("value").alias("value"))


    #  Writing the transformed stream to the sink topic:

    query = df2.coalesce(1) \
       .writeStream \
       .outputMode("update") \
       .format("kafka") \
       .option("kafka.bootstrap.servers", "localhost:9092") \
       .option("topic", "StreamSink") \
       .option("checkpointLocation", "./testdir") \
       .start()

    #  Block until the streaming query stops or fails.
    query.awaitTermination()
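
Because the transformation is an ordinary Python function, it can be tried out without Spark or Kafka before being wrapped in `udf(...)`. The following is a minimal sketch of that idea, not the answer's exact code: `json_to_csv_line`, `FIELDS`, and the sample message are hypothetical names and data introduced here for illustration. It assumes the same `"after"` layout as above, uses the standard `csv` module so that values containing commas are quoted, and returns `None` for messages it cannot parse instead of raising.

```python
import csv
import io
import json

# Field order taken from the json_formatted function in the answer.
FIELDS = ["ID", "INST_NAME", "DB_UNIQUE_NAME", "DBNAME",
          "MON_START_TIME", "MON_END_TIME"]


def json_to_csv_line(message):
    """Return one CSV line for a JSON message, or None if it cannot be parsed."""
    try:
        after = json.loads(message)["after"]
        row = [after[name] for name in FIELDS]
    except (TypeError, ValueError, KeyError):
        # Covers a null message, invalid JSON, and a missing "after" or field key.
        return None
    buffer = io.StringIO()
    # lineterminator="" keeps the result to a single line without a trailing newline.
    csv.writer(buffer, lineterminator="").writerow(row)
    return buffer.getvalue()


# Made-up sample message for illustration only.
sample = json.dumps({"after": {
    "ID": 7,
    "INST_NAME": "orcl1",
    "DB_UNIQUE_NAME": "orcl, primary",
    "DBNAME": "ORCL",
    "MON_START_TIME": "2021-01-28 04:00:00",
    "MON_END_TIME": "2021-01-28 04:10:00",
}})

print(json_to_csv_line(sample))
print(json_to_csv_line("not json"))
```

If this variant is used in the stream, it can be wrapped the same way as `json_formatted`, for example `udf(json_to_csv_line, StringType())`. Rows where the result is null can then be filtered out before the write if malformed messages should be skipped.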
    
