How to refresh a table and do it concurrently?

难免孤独 2021-02-08 16:18

I'm using Spark Streaming 2.1. I'd like to periodically refresh some cached tables (loaded through a Spark-provided DataSource such as Parquet or MySQL, or through a user-defined data source).

2 answers
  • 2021-02-08 17:02

    I had a problem reading a table from Hive with a SparkSession, specifically with the table method, i.e. spark.table(table_name). Every time I wrote the table and then tried to read it, I got this error:

    java.io.FileNotFoundException ... The underlying files may have been updated. You can explicitly invalidate the cache in Spark by running 'REFRESH TABLE tableName' command in SQL or by recreating the Dataset/DataFrame involved.

    I tried refreshing the table with spark.catalog.refreshTable(table_name) and also through sqlContext; neither worked.

    My solution was to write the table and then read it back directly from its path:

    val usersDF = spark.read.load(s"/path/table_name")

    It works fine.

    Is this a problem? Maybe the data on HDFS has not been updated yet?

  • 2021-02-08 17:17

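    Spark lets you refresh the cached metadata of a table after it has been updated by Hive or some other external tool. The Spark 2.2.0 documentation describes this, and spark.catalog.refreshTable itself has been part of the Catalog API since Spark 2.0.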

    You can do this with the following API:

    spark.catalog.refreshTable("my_table")
    

    This API will update the metadata for that table to keep it consistent.
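
Since the question asks for a periodic refresh, one possible way to drive the call above from a background thread is sketched below. This is only an illustration, not something taken from the answers: `PeriodicRefresh` and `start` are hypothetical names, the refresh action is passed in as a plain function so the helper itself does not depend on Spark, and whether it is safe to refresh a table while streaming batches are reading it is not verified here.

```scala
import java.util.concurrent.{Executors, ScheduledExecutorService, TimeUnit}
import scala.util.control.NonFatal

// Hypothetical helper: runs a refresh action at a fixed rate on one background thread.
object PeriodicRefresh {

  /** Starts running `refresh` immediately and then every `period` `unit`s.
    * The returned scheduler uses a non-daemon thread, so the caller should
    * call shutdown() on it when the application stops.
    */
  def start(period: Long, unit: TimeUnit)(refresh: () => Unit): ScheduledExecutorService = {
    val scheduler = Executors.newSingleThreadScheduledExecutor()
    val task = new Runnable {
      override def run(): Unit =
        try refresh()
        catch {
          // scheduleAtFixedRate suppresses all later runs once a run throws,
          // so log the failure here and let the next tick try again.
          case NonFatal(e) =>
            System.err.println(s"Table refresh failed: ${e.getMessage}")
        }
    }
    scheduler.scheduleAtFixedRate(task, 0L, period, unit)
    scheduler
  }
}

// Assumed usage with an existing SparkSession named `spark`:
//   val scheduler = PeriodicRefresh.start(60L, TimeUnit.SECONDS) { () =>
//     spark.catalog.refreshTable("my_table")
//   }
//   ...
//   scheduler.shutdown()
```

A single-threaded scheduler keeps refreshes from overlapping each other. If several tables need refreshing, the same action could loop over their names, at the cost of refreshing them one after another.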
