Unable to write data to Parquet files using Spark Structured Streaming

不思量自难忘° 2021-02-06 15:14

I have a Spark Structured Streaming query:

val df = spark
      .readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "localhost:9092")

2 answers
  •  小鲜肉 (original poster)
    2021-02-06 15:45

    I resolved this issue. When I ran the Structured Streaming query in spark-shell, it failed with an error saying that ending offsets are not valid in streaming queries:

    val df = spark
          .readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "localhost:9092")
          .option("startingOffsets", "earliest")
          .option("endingOffsets", "latest")
          .option("subscribe", "topic")
          .load()
    
    
    java.lang.IllegalArgumentException: ending offset not valid in streaming queries
      at org.apache.spark.sql.kafka010.KafkaSourceProvider$$anonfun$validateStreamOptions$1.apply(KafkaSourceProvider.scala:374)
      at org.apache.spark.sql.kafka010.KafkaSourceProvider$$anonfun$validateStreamOptions$1.apply(KafkaSourceProvider.scala:373)
      at scala.Option.map(Option.scala:146)
      at org.apache.spark.sql.kafka010.KafkaSourceProvider.validateStreamOptions(KafkaSourceProvider.scala:373)
      at org.apache.spark.sql.kafka010.KafkaSourceProvider.sourceSchema(KafkaSourceProvider.scala:60)
      at org.apache.spark.sql.execution.datasources.DataSource.sourceSchema(DataSource.scala:199)
      at org.apache.spark.sql.execution.datasources.DataSource.sourceInfo$lzycompute(DataSource.scala:87)
      at org.apache.spark.sql.execution.datasources.DataSource.sourceInfo(DataSource.scala:87)
      at org.apache.spark.sql.execution.streaming.StreamingRelation$.apply(StreamingRelation.scala:30)
      at org.apache.spark.sql.streaming.DataStreamReader.load(DataStreamReader.scala:124)
      ... 48 elided
    

    So I removed endingOffsets from the streaming query:

    val df = spark
          .readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "localhost:9092")
          .option("startingOffsets", "earliest")
          .option("subscribe", "topic")
          .load()
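
    For comparison, the Kafka source documents endingOffsets as a batch-only option, so a bounded read of the same topic would go through spark.read rather than spark.readStream. A minimal sketch, assuming the same broker and topic as above, the spark-sql-kafka-0-10 package on the classpath, and a local session (the app name is made up for illustration):

    ```scala
    import org.apache.spark.sql.SparkSession

    // Assumption: local session; inside spark-shell getOrCreate() returns the existing `spark`.
    val spark = SparkSession.builder()
      .appName("kafka-batch-read") // hypothetical app name
      .master("local[*]")
      .getOrCreate()

    // Batch (bounded) read: startingOffsets and endingOffsets are both accepted here.
    val batchDf = spark
      .read
      .format("kafka")
      .option("kafka.bootstrap.servers", "localhost:9092")
      .option("subscribe", "topic")
      .option("startingOffsets", "earliest")
      .option("endingOffsets", "latest")
      .load()

    // Kafka rows carry key and value as binary columns; cast them to strings to inspect.
    batchDf
      .selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
      .show(5, truncate = false)
    ```

    The streaming reader has no end by definition, which is presumably why it rejects the option instead of ignoring it.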
    

    Then I tried to save the streaming query's results to Parquet files and found that a checkpoint location must be specified:

    val query = df
              .writeStream
              .outputMode("append")
              .format("parquet")
              .start("data")
    
    org.apache.spark.sql.AnalysisException: checkpointLocation must be specified either through option("checkpointLocation", ...) or SparkSession.conf.set("spark.sql.streaming.checkpointLocation", ...);
      at org.apache.spark.sql.streaming.StreamingQueryManager$$anonfun$3.apply(StreamingQueryManager.scala:207)
      at org.apache.spark.sql.streaming.StreamingQueryManager$$anonfun$3.apply(StreamingQueryManager.scala:204)
      at scala.Option.getOrElse(Option.scala:121)
      at org.apache.spark.sql.streaming.StreamingQueryManager.createQuery(StreamingQueryManager.scala:203)
      at org.apache.spark.sql.streaming.StreamingQueryManager.startQuery(StreamingQueryManager.scala:269)
      at org.apache.spark.sql.streaming.DataStreamWriter.start(DataStreamWriter.scala:262)
      at org.apache.spark.sql.streaming.DataStreamWriter.start(DataStreamWriter.scala:206)
      ... 48 elided
    

    So I added checkpointLocation:

    val query = df
              .writeStream
              .outputMode("append")
              .format("parquet")
              .option("checkpointLocation", "checkpoint")
              .start("data")
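
    The error message also names a second route: setting spark.sql.streaming.checkpointLocation on the session instead of on each query. A minimal sketch of that variant, assuming the same broker and topic as above; the directory name "checkpoints" and the query name are made up for illustration, and to my understanding Spark then keeps each query's checkpoint in a subdirectory named after the query:

    ```scala
    import org.apache.spark.sql.SparkSession

    // Assumption: local session; inside spark-shell getOrCreate() returns the existing `spark`.
    val spark = SparkSession.builder()
      .appName("checkpoint-conf") // hypothetical app name
      .master("local[*]")
      .getOrCreate()

    // Session-wide default, used by any streaming query that sets no checkpointLocation option.
    // "checkpoints" is a hypothetical directory.
    spark.conf.set("spark.sql.streaming.checkpointLocation", "checkpoints")

    val df = spark
      .readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "localhost:9092")
      .option("startingOffsets", "earliest")
      .option("subscribe", "topic")
      .load()

    // No per-query checkpointLocation here; the session default above applies.
    val query = df
      .writeStream
      .queryName("kafka_to_parquet") // hypothetical query name
      .outputMode("append")
      .format("parquet")
      .start("data")
    ```

    The per-query option is usually the safer choice when a checkpoint directory has to stay tied to one specific query across restarts.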
    

    After these modifications, I was able to save the streaming query's results to Parquet files.

    Strangely, when I ran the same code as an sbt application it didn't throw any errors, but when I ran it in spark-shell these errors were thrown. I think Apache Spark should throw these errors when run as an sbt/Maven application too. It seems to be a bug to me!
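
    One way to compare the two environments is to run the corrected query as a standalone application and block on it, so that any failure inside the query reaches the driver. A minimal sketch, assuming spark-sql and spark-sql-kafka-0-10 are declared as dependencies in the build; the object name and the local master setting are made up for illustration:

    ```scala
    import org.apache.spark.sql.SparkSession

    // Hypothetical object name for a standalone sbt/Maven application.
    object KafkaToParquet {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .appName("KafkaToParquet")
          .master("local[*]") // assumption: local run, comparable to spark-shell
          .getOrCreate()

        // Same source as in the answer, without endingOffsets.
        val df = spark
          .readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "localhost:9092")
          .option("startingOffsets", "earliest")
          .option("subscribe", "topic")
          .load()

        // Same sink as in the answer, with an explicit checkpoint location.
        val query = df
          .writeStream
          .outputMode("append")
          .format("parquet")
          .option("checkpointLocation", "checkpoint")
          .start("data")

        // Block until the query stops; a failure inside the query is
        // rethrown here as a StreamingQueryException.
        query.awaitTermination()
      }
    }
    ```

    This does not explain the difference reported above; it only gives a baseline in which the driver stays alive long enough for errors to surface.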
