Question
I am building a Spark Structured Streaming job that does the following.
Streaming source:
val small_df = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker:9092") // placeholder brokers
  .option("subscribe", "my_topic")                  // placeholder topic
  .load()
small_df.createOrReplaceTempView("small_df")
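For completeness: the raw Kafka rows only expose binary key/value columns, so the join key has to be parsed out before the join. A minimal sketch of that step, assuming the payload is JSON with an id field (the schema and column names here are placeholders):

import org.apache.spark.sql.functions.{col, from_json}
import org.apache.spark.sql.types.{StringType, StructField, StructType}

// Hypothetical payload schema -- replace with the real message layout.
val msgSchema = StructType(Seq(StructField("id", StringType)))

// Kafka delivers value as binary, so cast it to string and parse the JSON
// to surface the join key as a real column before registering the view.
val keyed_df = small_df
  .selectExpr("CAST(value AS STRING) AS json")
  .select(from_json(col("json"), msgSchema).as("data"))
  .select("data.id")
keyed_df.createOrReplaceTempView("small_df")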
A static DataFrame loaded from Phoenix:
val phoenixDF = spark.read.format("org.apache.phoenix.spark")
  .option("table", "my_table")
  .option("zkUrl", "zk")
  .load()
phoenixDF.createOrReplaceTempView("phoenix_tbl")
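As a point of comparison for the observations below, a static filter with a literal predicate gets pushed down by the Phoenix connector and served as a range scan. A minimal sketch; the key bounds are placeholders:

import org.apache.spark.sql.functions.col

// A literal predicate on the primary key is pushed down to Phoenix and
// served as a range scan rather than a full table scan.
val rangeScanDF = phoenixDF.filter(col("id") >= "100" && col("id") < "200")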
Then, a Spark SQL statement joins the two on the primary key so the streaming records filter the Phoenix table:
val filteredDF = spark.sql("select phoenix_tbl.* from small_df join phoenix_tbl on small_df.id = phoenix_tbl.id")
Observations:
Spark does a full table scan for the join, but a range scan for a filter.
Since small_df is a streaming dataset, I cannot express the lookup as a filter and have to rely on the join to select records from the Phoenix table, but that ends up as a full table scan, which is not feasible.
More details on the requirement:
How can I perform a range scan in this case?
I am doing something similar to what is discussed here, but the only difference is that my small_df is a streaming dataset.
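For concreteness, here is the per-micro-batch shape I have in mind, in case it helps frame the question: with foreachBatch each batch is a static DataFrame (keyed_df from the parsing sketch above), so its keys could in principle be collected and pushed into the Phoenix read as a predicate. This is an untested sketch; the checkpoint path is a placeholder and I am not sure the isin() predicate is actually pushed down, hence the question:

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col

val query = keyed_df.writeStream
  .foreachBatch { (batchDF: DataFrame, batchId: Long) =>
    // Collect the (assumed small) set of join keys for this micro-batch.
    val ids = batchDF.select("id").distinct().collect().map(_.getString(0))

    if (ids.nonEmpty) {
      // If the connector pushes isin() down, this becomes point scans on
      // the primary key instead of a full table scan.
      val filteredDF = phoenixDF.filter(col("id").isin(ids: _*))
      // ... process filteredDF for this batch ...
    }
  }
  .option("checkpointLocation", "/tmp/checkpoint") // placeholder path
  .start()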
Source: https://stackoverflow.com/questions/62746964/spark-structured-streaming-filter-phoenix-table-by-streaming-dataset