Question
I am using Spark with Scala, and I have a directory that contains multiple files.
In this directory I have Parquet files generated by Spark and other files generated by Spark Streaming.
Spark Streaming also generates a _spark_metadata directory.
The problem I am facing is that when I read the directory with Spark (sparksession.read.load), it reads only the data generated by Spark Streaming, as if the other data did not exist.
Does anyone know how to resolve this issue? I think there should be a property that forces Spark to ignore the _spark_metadata directory.
Thank you for your help.
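A minimal sketch of the setup being described (the directory layout and file names below are hypothetical):

/data/output/
    _spark_metadata/              written by the Structured Streaming file sink
    part-00000-<uuid>.parquet     streaming output, listed in _spark_metadata
    batch-00000.parquet           written by a plain batch job, not listed

val df = sparksession.read.load("/data/output")
// Because /data/output contains _spark_metadata, Spark reads only the
// files listed in that log; the batch files are silently skipped.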
Answer 1:
I have the same problem (Spark 2.4.0), and the only workaround I am aware of is to load the files using a mask/glob pattern, something like this:
sparksession.read.format("parquet").load("/path/*.parquet")
As far as I know, there is no way to make Spark ignore this directory: if _spark_metadata exists, Spark will use it.
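A slightly more general sketch of the same idea, which lists the Parquet files explicitly so that Spark never resolves the directory itself (the path /data/output is hypothetical, and this assumes all data files end in .parquet; exact behavior may vary across Spark versions):

import org.apache.hadoop.fs.Path
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("read-mixed-dir").getOrCreate()

// List the data files ourselves instead of letting Spark resolve the
// directory, which is what triggers the _spark_metadata log.
val dir = new Path("/data/output")
val fs = dir.getFileSystem(spark.sparkContext.hadoopConfiguration)
val parquetFiles = fs.listStatus(dir)
  .filter(_.isFile)
  .map(_.getPath.toString)
  .filter(_.endsWith(".parquet"))

// Passing explicit file paths avoids the directory-level metadata lookup.
val df = spark.read.parquet(parquetFiles: _*)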
Source: https://stackoverflow.com/questions/53479585/spark-metadata-causing-problems