In Spark Streaming, how to process old data and delete processed data

别等时光非礼了梦想 · Submitted on 2019-12-11 10:35:24

Question


We are running a Spark Streaming job that retrieves files from a directory (using textFileStream). One concern is the case where the job is down but files are still being added to the directory. Once the job starts up again, those files are not picked up (since they were not created or modified while the job was running), but we would like them to be processed.
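For reference, a minimal sketch of the kind of job described, assuming a 30-second batch interval and a hypothetical input directory (neither is given in the original post):

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    object DirectoryStreamJob {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf().setAppName("DirectoryStreamJob")
        val ssc  = new StreamingContext(conf, Seconds(30))       // batch interval is an assumption

        // textFileStream only sees files that appear (or are modified) while the
        // job is running, which is exactly the behaviour at issue here.
        val lines = ssc.textFileStream("hdfs:///data/incoming")  // hypothetical directory
        lines.foreachRDD(rdd => println(s"batch contained ${rdd.count()} lines"))

        ssc.start()
        ssc.awaitTermination()
      }
    }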

1) Is there a solution for that? Is there a way to keep track of which files have been processed, and can we "force" older files to be picked up?

2) Is there a way to delete the processed files?


Answer 1:


The article below pretty much covers all your questions.

https://blog.yanchen.ca/2016/06/28/fileinputdstream-in-spark-streaming/

1) Is there a solution for that? Is there a way to keep track of which files have been processed, and can we "force" older files to be picked up?

The stream reader initializes its batch window from the system clock when the job/application is launched, so any files created before that point are ignored. Try enabling checkpointing.
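A minimal sketch of what enabling checkpointing could look like, using StreamingContext.getOrCreate so a restarted driver recovers its state from the checkpoint directory instead of starting from a fresh clock; the directory paths and batch interval here are assumptions:

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    object FileStreamJob {
      val checkpointDir = "hdfs:///tmp/filestream-checkpoint"   // hypothetical path
      val inputDir      = "hdfs:///data/incoming"               // hypothetical path

      def createContext(): StreamingContext = {
        val conf = new SparkConf().setAppName("FileStreamJob")
        val ssc  = new StreamingContext(conf, Seconds(30))
        ssc.checkpoint(checkpointDir)                           // enable metadata checkpointing

        ssc.textFileStream(inputDir)
          .foreachRDD(rdd => println(s"processed ${rdd.count()} lines"))
        ssc
      }

      def main(args: Array[String]): Unit = {
        // On a restart, the context (including which batches/files were already handled
        // within the remember window) is recovered from the checkpoint instead of rebuilt.
        val ssc = StreamingContext.getOrCreate(checkpointDir, () => createContext())
        ssc.start()
        ssc.awaitTermination()
      }
    }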

2) Is there a way to delete the processed files?

Deleting the files may be unnecessary. With checkpointing working, Spark itself keeps track of which files have not yet been processed. If the files really must be deleted, implement a custom input format and record reader (see the article above) to capture the file name, and use that information as appropriate. But I wouldn't recommend this approach.
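A sketch of the custom input format / record reader idea might look like the following: a Hadoop FileInputFormat whose key is the file path instead of the byte offset, so every record carries the name of the file it came from (the class name and directory are made up for illustration):

    import org.apache.hadoop.io.Text
    import org.apache.hadoop.mapreduce.{InputSplit, RecordReader, TaskAttemptContext}
    import org.apache.hadoop.mapreduce.lib.input.{FileInputFormat, FileSplit, LineRecordReader}

    // Emits (file path, line) pairs instead of the usual (byte offset, line).
    class FileNameTextInputFormat extends FileInputFormat[Text, Text] {
      override def createRecordReader(split: InputSplit, ctx: TaskAttemptContext): RecordReader[Text, Text] =
        new RecordReader[Text, Text] {
          private val lines = new LineRecordReader()
          private var fileName: Text = _

          override def initialize(split: InputSplit, ctx: TaskAttemptContext): Unit = {
            lines.initialize(split, ctx)
            fileName = new Text(split.asInstanceOf[FileSplit].getPath.toString)
          }
          override def nextKeyValue(): Boolean = lines.nextKeyValue()
          override def getCurrentKey: Text     = fileName
          override def getCurrentValue: Text   = lines.getCurrentValue
          override def getProgress: Float      = lines.getProgress
          override def close(): Unit           = lines.close()
        }
    }

    // Usage with fileStream instead of textFileStream (directory is hypothetical):
    // val withNames = ssc.fileStream[Text, Text, FileNameTextInputFormat]("hdfs:///data/incoming")
    //   .map { case (file, line) => (file.toString, line.toString) }

The per-batch file names could then be collected in a foreachRDD and handed to a cleanup step that deletes (or moves) files only after the batch has completed.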



Source: https://stackoverflow.com/questions/47677772/in-spark-streaming-how-to-process-old-data-and-delete-processed-data
