Not able to write Data in Parquet File using Spark Structured Streaming

不思量自难忘° 2021-02-06 15:14

I have a Spark Structured Streaming:

val df = spark
      .readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "localhost:9092")
2 Answers
  • 2021-02-06 15:28

    I had a similar problem, but for different reasons; posting here in case someone hits the same issue. When writing your output stream to files in append mode with watermarking, Structured Streaming has an interesting behavior: it won't actually write any data until a time bucket is older than the watermark. If you're testing Structured Streaming with an hour-long watermark, you won't see any output for at least an hour.
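    For anyone testing this, here is a minimal sketch of that behavior, assuming a Kafka source like the one in the question; the topic, output paths, and 10-minute window are hypothetical:

    import org.apache.spark.sql.functions._
    
    // Hypothetical example: read from Kafka, count events per 10-minute window with a
    // 1-hour watermark, and write to Parquet in append mode. In append mode a window's
    // row is only written once the watermark moves past the window's end, so no files
    // appear for at least an hour of event time.
    val events = spark
      .readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "localhost:9092")
      .option("subscribe", "topic")
      .load()
      .selectExpr("CAST(value AS STRING) AS value", "timestamp AS eventTime")
    
    val counts = events
      .withWatermark("eventTime", "1 hour")
      .groupBy(window(col("eventTime"), "10 minutes"))
      .count()
    
    val query = counts
      .writeStream
      .outputMode("append")
      .format("parquet")
      .option("checkpointLocation", "checkpoint-agg") // hypothetical path
      .start("data-agg")                              // hypothetical path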

  • 2021-02-06 15:45

    I resolved this issue. When I tried to run the Structured Streaming code in spark-shell, it gave an error saying that endingOffsets is not valid in streaming queries:

    val df = spark
          .readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "localhost:9092")
          .option("startingOffsets", "earliest")
          .option("endingOffsets", "latest")
          .option("subscribe", "topic")
          .load()
    
    
    java.lang.IllegalArgumentException: ending offset not valid in streaming queries
      at org.apache.spark.sql.kafka010.KafkaSourceProvider$$anonfun$validateStreamOptions$1.apply(KafkaSourceProvider.scala:374)
      at org.apache.spark.sql.kafka010.KafkaSourceProvider$$anonfun$validateStreamOptions$1.apply(KafkaSourceProvider.scala:373)
      at scala.Option.map(Option.scala:146)
      at org.apache.spark.sql.kafka010.KafkaSourceProvider.validateStreamOptions(KafkaSourceProvider.scala:373)
      at org.apache.spark.sql.kafka010.KafkaSourceProvider.sourceSchema(KafkaSourceProvider.scala:60)
      at org.apache.spark.sql.execution.datasources.DataSource.sourceSchema(DataSource.scala:199)
      at org.apache.spark.sql.execution.datasources.DataSource.sourceInfo$lzycompute(DataSource.scala:87)
      at org.apache.spark.sql.execution.datasources.DataSource.sourceInfo(DataSource.scala:87)
      at org.apache.spark.sql.execution.streaming.StreamingRelation$.apply(StreamingRelation.scala:30)
      at org.apache.spark.sql.streaming.DataStreamReader.load(DataStreamReader.scala:124)
      ... 48 elided
    

    So I removed endingOffsets from the streaming query:

    val df = spark
          .readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "localhost:9092")
          .option("startingOffsets", "earliest")
          .option("subscribe", "topic")
          .load()
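
    For reference, endingOffsets is only meaningful for batch queries (spark.read rather than spark.readStream). A minimal batch-read sketch where the option is valid, assuming the same broker and topic, would look like this:

    // endingOffsets is accepted here because this is a batch query, not a streaming one
    val batchDf = spark
          .read
          .format("kafka")
          .option("kafka.bootstrap.servers", "localhost:9092")
          .option("subscribe", "topic")
          .option("startingOffsets", "earliest")
          .option("endingOffsets", "latest")
          .load()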
    

    Then I tried to save the streaming query's result to Parquet files, and learned that a checkpoint location must be specified:

    val query = df
              .writeStream
              .outputMode("append")
              .format("parquet")
              .start("data")
    
    org.apache.spark.sql.AnalysisException: checkpointLocation must be specified either through option("checkpointLocation", ...) or SparkSession.conf.set("spark.sql.streaming.checkpointLocation", ...);
      at org.apache.spark.sql.streaming.StreamingQueryManager$$anonfun$3.apply(StreamingQueryManager.scala:207)
      at org.apache.spark.sql.streaming.StreamingQueryManager$$anonfun$3.apply(StreamingQueryManager.scala:204)
      at scala.Option.getOrElse(Option.scala:121)
      at org.apache.spark.sql.streaming.StreamingQueryManager.createQuery(StreamingQueryManager.scala:203)
      at org.apache.spark.sql.streaming.StreamingQueryManager.startQuery(StreamingQueryManager.scala:269)
      at org.apache.spark.sql.streaming.DataStreamWriter.start(DataStreamWriter.scala:262)
      at org.apache.spark.sql.streaming.DataStreamWriter.start(DataStreamWriter.scala:206)
      ... 48 elided
    

    So I added the checkpointLocation option:

    val query = df
              .writeStream
              .outputMode("append")
              .format("parquet")
              .option("checkpointLocation", "checkpoint")
              .start("data")
    

    After making these modifications, I was able to save the streaming query's results in Parquet files.
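    Putting the pieces together, here is a minimal end-to-end sketch of the working pipeline. The paths checkpoint and data are placeholders, and the casts are just one common way to make the Kafka columns readable; as the error message notes, the checkpoint location can also be set globally via the session config instead of per query:

    // Read from Kafka as a stream (no endingOffsets) and write to Parquet with a checkpoint.
    val df = spark
          .readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "localhost:9092")
          .option("startingOffsets", "earliest")
          .option("subscribe", "topic")
          .load()
    
    // Alternative to the per-query option, as suggested by the error message:
    // spark.conf.set("spark.sql.streaming.checkpointLocation", "checkpoint")
    
    val query = df
          .selectExpr("CAST(key AS STRING) AS key", "CAST(value AS STRING) AS value")
          .writeStream
          .outputMode("append")
          .format("parquet")
          .option("checkpointLocation", "checkpoint") // placeholder path
          .start("data")                              // placeholder output directory
    
    query.awaitTermination()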

    But it is strange that when I ran the same code as an sbt application it didn't throw any errors, whereas running it in spark-shell threw the errors above. I think Apache Spark should throw these errors when run via an sbt/maven application too. It seems like a bug to me!
