Spark structured streaming - Filter Phoenix table by streaming dataset


Question


I am building a Spark structured streaming job that does the below,

Streaming source,

val small_df = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker:9092") // placeholder broker and topic
  .option("subscribe", "my_topic")
  .load() // note: extracting an id column from the binary Kafka value is omitted here

small_df.createOrReplaceTempView("small_df")

A static DataFrame - Phoenix load

val phoenixDF = spark.read.format("org.apache.phoenix.spark")
  .option("table", "my_table")
  .option("zkUrl", "zk")
  .load()

phoenixDF.createOrReplaceTempView("phoenix_tbl")

Then, a Spark SQL statement joins the two views on the primary key to filter records from the Phoenix table.

val filteredDF = spark.sql("select phoenix_tbl.* from small_df join phoenix_tbl on small_df.id = phoenix_tbl.id")

Observations:

Spark performs a full table scan on the Phoenix table for the join, but a range scan when the primary key is filtered directly.

Since small_df is a streaming Dataset, I cannot build a static filter from its keys and instead rely on the join to filter records from the Phoenix table, but that ends up as a full table scan, which is not feasible.
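
For reference, this is the static behavior I mean (the ids list below is hypothetical): when the key set is a plain Scala collection, the predicate can be pushed down to Phoenix and served as a point/range scan over the primary key instead of a full table scan.

// Hypothetical static key set: a pushed-down filter on the primary key
// lets Phoenix do a point/range scan rather than scanning the whole table.
val ids = Seq(1L, 2L, 3L)
val rangeScanDF = phoenixDF.filter(phoenixDF("id").isin(ids: _*))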

More details on the requirement

How can I perform range scan in this case?

I am doing something similar to what is discussed here, except that my small_df is a streaming dataset.
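
One direction I have been sketching, but have not verified: use foreachBatch (Spark 2.4+) so each micro-batch becomes a static DataFrame whose keys can be collected and pushed down as a primary-key filter. This assumes the per-batch key set is small, that an id column has already been extracted from the Kafka value, and the sink at the end is hypothetical.

import org.apache.spark.sql.DataFrame

val query = small_df.writeStream
  .foreachBatch { (batchDF: DataFrame, batchId: Long) =>
    // Each micro-batch is a static DataFrame, so its keys can be
    // collected on the driver (assumes the per-batch id set is small).
    val ids = batchDF.select("id").distinct().collect().map(_.get(0))
    if (ids.nonEmpty) {
      // isin on the primary key should be pushed down to Phoenix as a
      // point/range scan instead of the full scan the join produces.
      val filteredDF = phoenixDF.filter(phoenixDF("id").isin(ids: _*))
      filteredDF.write.mode("append").parquet("/tmp/filtered_output") // hypothetical sink
    }
  }
  .start()

Checking the Phoenix explain plan for a batch would confirm whether the IN predicate actually translates into a point/range scan.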

Source: https://stackoverflow.com/questions/62746964/spark-structured-streaming-filter-phoenix-table-by-streaming-dataset
