Question
I am using Spark with Scala, and I have a directory that contains multiple files.
In this directory I have Parquet files generated by Spark and other files generated by Spark Streaming.
Spark Streaming also generates a _spark_metadata directory.
The problem I am facing is that when I read the directory with Spark (sparksession.read.load), it reads only the data generated by Spark Streaming, as if the other data did not exist.
Does anyone know how to resolve this issue? I think there should be a property that forces Spark to ignore the _spark_metadata directory.
Thank you for your help.
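A minimal sketch of the setup being described (the directory layout and file names below are hypothetical):

/data/output/
    _spark_metadata/              written by the Structured Streaming file sink
    part-00000-<uuid>.parquet     streaming output, listed in _spark_metadata
    batch-00000.parquet           written by a plain batch job, not listed

val df = sparksession.read.load("/data/output")
// Because /data/output contains _spark_metadata, Spark reads only the
// files listed in that log; the batch files are silently skipped.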
Answer 1:
I have the same problem (Spark 2.4.0), and the only workaround I am aware of is to load the files using a mask/glob pattern, something like this:
sparksession.read.format("parquet").load("/path/*.parquet")
As far as I know, there is no way to make Spark ignore this directory: if _spark_metadata exists, Spark will use it.
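A slightly more general sketch of the same idea, which lists the Parquet files explicitly so that Spark never resolves the directory itself (the path /data/output is hypothetical, and this assumes all data files end in .parquet; exact behavior may vary across Spark versions):

import org.apache.hadoop.fs.Path
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("read-mixed-dir").getOrCreate()

// List the data files ourselves instead of letting Spark resolve the
// directory, which is what triggers the _spark_metadata log.
val dir = new Path("/data/output")
val fs = dir.getFileSystem(spark.sparkContext.hadoopConfiguration)
val parquetFiles = fs.listStatus(dir)
  .filter(_.isFile)
  .map(_.getPath.toString)
  .filter(_.endsWith(".parquet"))

// Passing explicit file paths avoids the directory-level metadata lookup.
val df = spark.read.parquet(parquetFiles: _*)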
Source: https://stackoverflow.com/questions/53479585/spark-metadata-causing-problems