Get HDFS file path in PySpark for files in sequence file format
Question

My data on HDFS is in SequenceFile format. I am using PySpark (Spark 1.6) and trying to achieve two things:

1. The data path contains a timestamp in yyyy/mm/dd/hh format that I would like to bring into the data itself. I tried SparkContext.wholeTextFiles, but I think it might not support the SequenceFile format.
2. How do I handle the point above if I want to crunch a full day of data and still bring the date into the data? In this case I would be loading data with a path like yyyy/mm/dd/*.

Appreciate any
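One possible approach, sketched under several assumptions: list the matching files through the Hadoop FileSystem API, read each file with `sc.sequenceFile`, and tag every record with the timestamp parsed from that file's path. The helper names (`extract_timestamp`, `list_files`, `load_with_timestamp`) are hypothetical, `sc._jvm` and `sc._jsc` are internal PySpark attributes rather than public API, and the sketch assumes a Hadoop 2.x client and keys and values that PySpark can convert automatically.

```python
import re

# Matches a yyyy/mm/dd/hh segment anywhere in a path (layout taken from the question).
_TS_PATTERN = re.compile(r"(?<!\d)(\d{4})/(\d{2})/(\d{2})/(\d{2})(?=/|$)")


def extract_timestamp(path):
    """Return the 'yyyy/mm/dd/hh' part of a path, or None if it has none."""
    match = _TS_PATTERN.search(path)
    return "/".join(match.groups()) if match else None


def list_files(sc, glob_pattern):
    """List files matching a glob using the Hadoop FileSystem API.

    Assumption: relies on the internal sc._jvm / sc._jsc handles.
    """
    hadoop_path = sc._jvm.org.apache.hadoop.fs.Path(glob_pattern)
    fs = hadoop_path.getFileSystem(sc._jsc.hadoopConfiguration())
    statuses = fs.globStatus(hadoop_path) or []  # globStatus may return null
    return [s.getPath().toString() for s in statuses if s.isFile()]


def load_with_timestamp(sc, glob_pattern):
    """Build an RDD of (timestamp, key, value) from SequenceFiles under a glob."""
    rdds = []
    for path in list_files(sc, glob_pattern):
        ts = extract_timestamp(path)
        # Bind ts as a default argument so each lambda keeps its own value.
        rdds.append(sc.sequenceFile(path).map(lambda kv, ts=ts: (ts, kv[0], kv[1])))
    return sc.union(rdds) if rdds else sc.emptyRDD()
```

For one hour the glob would end in `yyyy/mm/dd/hh/*`; for a whole day it would end in `yyyy/mm/dd/*/*`, assuming the files sit directly under the hour directories. This creates one RDD per file, so it may be slow when a day contains a very large number of small files.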