Can Spark Streaming pick specific files?

梦想的初衷 posted on 2019-12-11 10:43:44

Question


My program continuously reads streams from a Hadoop folder (say /hadoopPath/). It picks up all the files in that folder. Can I pick only specific file types from this folder (like /hadoopPath/*.log)?

I have another question related to Spark Streaming: does Spark Streaming work with both "cp" and "mv"?


Answer 1:


I've been struggling with the same problem for a couple of hours and although it seemed so easy, I could not find anything online about it. Finally, I found a solution that worked in my case. I am putting it here to save some time for others with the same issue.
Suppose you only want to read the files matching the pattern "path-to-hadoop-folder/*.csv". By default, when you point Spark at the folder, it reads all the files in it (e.g. .csv._COPYING_), which in my case resulted in an error. All you need to do is specify this pattern in the .csv() call when defining the readStream. An example in Python looks like this:

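# spark is an existing SparkSession; userSchema is a StructType for the CSV columns, defined elsewhere
activity = spark \
    .readStream \
    .option("sep", ",") \
    .schema(userSchema) \
    .csv("path-to-hadoop-folder/*.csv")  # glob pattern: only files ending in .csv are read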

This way Spark only considers files matching the *.csv pattern and ignores all other files in the folder. I have tested it on Spark 2.0.0 and Hadoop 2.6. (P.S. I have only tested it with CSV files, but I guess text files would have a similar solution.) You can find the same solution in the Spark DataStreamReader guide.
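To get a feel for which file names a pattern like *.csv would select, here is a small sketch that uses Python's standard fnmatch module as a stand-in for the glob matching done on the Hadoop side. The directory listing and the helper name matches_pattern are made up for illustration, and Hadoop's own glob rules may differ from fnmatch in edge cases, so treat it as an approximation rather than what Spark does internally.

```python
from fnmatch import fnmatchcase


def matches_pattern(file_names, pattern):
    """Return the file names that match a shell-style glob pattern."""
    return [name for name in file_names if fnmatchcase(name, pattern)]


# Hypothetical listing of a folder while a copy is still in progress.
listing = ["a.csv", "b.csv._COPYING_", "notes.log", "c.csv"]

# Only names that end in ".csv" are kept; the half-copied file is skipped.
selected = matches_pattern(listing, "*.csv")
print(selected)
```

With this listing, the in-progress file b.csv._COPYING_ does not end in .csv, so it is left out, which is the behaviour described above.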



Source: https://stackoverflow.com/questions/36351457/can-spark-streaming-pick-specific-files
