Why does streaming query with Summarizer fail with “requirement failed: Nothing has been added to this summarizer”?


Question


This is a follow-up to How to generate summary statistics (using Summarizer.metrics) in streaming query?

I am running a Python script that generates summary statistics for the micro-batches of a streaming query.

The Python code I am currently running:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.functions import from_json, col
from pyspark.sql.types import StructType, StructField, StringType, DoubleType, DecimalType
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.stat import Summarizer

spark = SparkSession.builder.appName("spark_streaming_sql").getOrCreate()
spark.sparkContext.setLogLevel("WARN")
spark.streams.active

data = spark.readStream \
    .format("kafka") \
    .option("kafka.bootstrap.servers", "<kafka_broker>:<port_number>") \
    .option("subscribe", "<topic_name>") \
    .option("startingOffsets", "latest") \
    .load()

schema = StructType([
    StructField("id", DoubleType()),
    StructField("timestamp_value", DoubleType()), 
    StructField("device_id", DoubleType()), 
    StructField("temperature_value", DoubleType()),
    StructField("comment", StringType())])

telemetry_dataframe = data.selectExpr("CAST(value AS STRING)") \
    .select(from_json(col("value").cast("string"), schema).alias("tmp")) \
    .select("tmp.*")

telemetry_dataframe.printSchema()

temperature_value_selection = telemetry_dataframe.select("temperature_value")

temperature_value_selection_new = temperature_value_selection.withColumn("device_temperature", temperature_value_selection["temperature_value"].cast(DecimalType()))

temperature_value_selection_new.printSchema()

assembler = VectorAssembler(
  inputCols=["device_temperature"], outputCol="temperatures"
)

assembled = assembler.transform(temperature_value_selection_new)

assembled_new = assembled.withColumn("timestamp", F.current_timestamp())

assembled_new.printSchema()

summarizer = Summarizer.metrics("max", "min", "variance", "mean", "count")

descriptive_table_one = assembled_new.select(summarizer.summary(assembled_new.temperatures))

# -------------------------------------------------------------------------------------

#########################################
#               QUERIES                 #
#########################################

query_4_1 = descriptive_table_one.writeStream \
    .outputMode("complete") \
    .format("console") \
    .start() \
    .awaitTermination()

For some reason, I get the following exception:

java.lang.IllegalArgumentException: requirement failed: Nothing has been added to this summarizer.
        at scala.Predef$.require(Predef.scala:224)
        at org.apache.spark.ml.stat.SummaryBuilderImpl$SummarizerBuffer.max(Summarizer.scala:513)
        at org.apache.spark.ml.stat.SummaryBuilderImpl$MetricsAggregate$$anonfun$10.apply(Summarizer.scala:582)
        at org.apache.spark.ml.stat.SummaryBuilderImpl$MetricsAggregate$$anonfun$10.apply(Summarizer.scala:577)
        at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
        at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
        at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
        at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
        at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
        at scala.collection.AbstractTraversable.map(Traversable.scala:104)
        at org.apache.spark.ml.stat.SummaryBuilderImpl$MetricsAggregate.eval(Summarizer.scala:577)
        at org.apache.spark.ml.stat.SummaryBuilderImpl$MetricsAggregate.eval(Summarizer.scala:567)
        at org.apache.spark.sql.catalyst.expressions.aggregate.TypedImperativeAggregate.eval(interfaces.scala:543)
        at org.apache.spark.sql.execution.aggregate.AggregationIterator$$anonfun$generateResultProjection$1.apply(AggregationIterator.scala:231)
        at org.apache.spark.sql.execution.aggregate.AggregationIterator$$anonfun$generateResultProjection$1.apply(AggregationIterator.scala:223)
        at org.apache.spark.sql.execution.aggregate.ObjectAggregationIterator.next(ObjectAggregationIterator.scala:86)
        at org.apache.spark.sql.execution.aggregate.ObjectAggregationIterator.next(ObjectAggregationIterator.scala:33)
        at org.apache.spark.sql.execution.datasources.v2.DataWritingSparkTask$$anonfun$run$3.apply(WriteToDataSourceV2Exec.scala:118)
        at org.apache.spark.sql.execution.datasources.v2.DataWritingSparkTask$$anonfun$run$3.apply(WriteToDataSourceV2Exec.scala:116)
        at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1394)
        at org.apache.spark.sql.execution.datasources.v2.DataWritingSparkTask$.run(WriteToDataSourceV2Exec.scala:146)
        at org.apache.spark.sql.execution.datasources.v2.WriteToDataSourceV2Exec$$anonfun$doExecute$2.apply(WriteToDataSourceV2Exec.scala:67)
        at org.apache.spark.sql.execution.datasources.v2.WriteToDataSourceV2Exec$$anonfun$doExecute$2.apply(WriteToDataSourceV2Exec.scala:66)
        at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
        at org.apache.spark.scheduler.Task.run(Task.scala:123)
        at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
        at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)

This causes the whole Spark job to abort.

Also, note that I generate random data with a separate Python script, and I start that data generation only after the Spark streaming program has started. Thus, in Batch 0 there is no data to aggregate and show.
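One workaround I am considering (not verified) is to compute the summary per micro-batch with foreachBatch and simply skip empty batches, so the Summarizer aggregate is never evaluated on zero rows. A minimal sketch, reusing the assembled_new DataFrame and the summarizer defined above:

def summarize_batch(batch_df, batch_id):
    # Skip empty micro-batches (e.g. Batch 0, before the data generator has started).
    if batch_df.rdd.isEmpty():
        return
    # Compute max/min/variance/mean/count over the "temperatures" vector column of this batch.
    batch_df.select(summarizer.summary(batch_df.temperatures)).show(truncate=False)

query_4_1 = assembled_new.writeStream \
    .outputMode("append") \
    .foreachBatch(summarize_batch) \
    .start()

query_4_1.awaitTermination()

I have not confirmed that this avoids the exception, so I would still like to understand why the complete-mode console query above fails.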

Kindly check my Spark logs by following the link below: Spark-error logs

Source: https://stackoverflow.com/questions/59422139/why-does-streaming-query-with-summarizer-fail-with-requirement-failed-nothing
