Question
My program continuously reads streams from a Hadoop folder (say /hadoopPath/). It currently picks up all the files in that folder. Can I pick only specific file types from this folder (e.g. /hadoopPath/*.log)?
I have another question related to Spark Streaming: does Spark Streaming work with both "cp" and "mv"?
Answer 1:
I've been struggling with the same problem for a couple of hours and although it seemed so easy, I could not find anything online about it. Finally, I found a solution that worked in my case. I am putting it here to save some time for others with the same issue.
Suppose you only want to read the files matching the pattern "path-to-hadoop-folder/*.csv". By default, when you point Spark at the folder it reads all the files in it (including in-flight files such as ones ending in .csv.COPYING), which in my case resulted in an error. All you need to do is specify this pattern in the .csv method when defining the readStream. An example in Python would look like this:
activity = spark \
    .readStream \
    .option("sep", ",") \
    .schema(userSchema) \
    .csv("path-to-hadoop-folder/*.csv")
This way Spark only considers files matching the *.csv pattern and ignores all other files in the folder. I have tested it on Spark 2.0.0 and Hadoop 2.6. (P.S. I have only tested it with CSV files, but I guess text files would have a similar solution.) You can find the same approach in the Spark DataStreamReader guide.
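To see which filenames a pattern like *.csv actually admits (and why an in-flight .csv.COPYING file gets skipped), here is a minimal plain-Python illustration using standard glob-style matching — this is just a sketch of the matching semantics, not Spark itself, and the filenames are made up for the example:

```python
from fnmatch import fnmatch

# Hypothetical directory listing, including an in-flight copy
files = ["data1.csv", "data2.csv", "data1.csv.COPYING", "notes.txt"]
pattern = "*.csv"

# Only names that end in .csv match; the .COPYING file is excluded
matched = [f for f in files if fnmatch(f, pattern)]
print(matched)  # ['data1.csv', 'data2.csv']
```

For the *.log case from the question, the same idea should apply with the text source, e.g. spark.readStream.text("/hadoopPath/*.log"), since the path argument accepts glob patterns as well.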
Source: https://stackoverflow.com/questions/36351457/can-spark-streaming-pick-specific-files