PySpark: read, map and reduce from multiline record textfile with newAPIHadoopFile
I'm trying to solve a problem that is similar to this post. My original data is a text file that contains values (observations) from several sensors. Each observation has a timestamp, but the sensor name is given only once, not on every line, and one file contains several sensors:

```
Time                 MHist::852-YF-007
2016-05-10 00:00:00  0
2016-05-09 23:59:00  0
2016-05-09 23:58:00  0
2016-05-09 23:57:00  0
2016-05-09 23:56:00  0
2016-05-09 23:55:00  0
2016-05-09 23:54:00  0
2016-05-09 23:53:00  0
2016-05-09 23:52:00  0
2016-05-09 23:51:00  0
2016-05-09 23:50:00  0
2016-05-09 23:49:00  0
2016-05-09
```
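One common approach is to make Hadoop's `TextInputFormat` split the file on the sensor header instead of on newlines, by passing `textinputformat.record.delimiter` to `sc.newAPIHadoopFile`, so that each multiline record (one sensor plus its rows) arrives as a single string. The parsing step that would then run per record can be sketched in plain Python; the function name `parse_record` and the exact line layout are assumptions based on the sample above:

```python
def parse_record(record):
    """Turn one multiline sensor record into (sensor, timestamp, value) tuples.

    Assumes the first non-empty line is the sensor name (e.g. '852-YF-007')
    and every following line has the form '<date> <time> <value>'.
    """
    lines = [ln.strip() for ln in record.strip().splitlines() if ln.strip()]
    if not lines:
        return []
    sensor = lines[0]
    rows = []
    for line in lines[1:]:
        parts = line.split()
        if len(parts) == 3:           # skip truncated/partial lines
            date, time, value = parts
            rows.append((sensor, f"{date} {time}", float(value)))
    return rows

sample = """852-YF-007
2016-05-10 00:00:00 0
2016-05-09 23:59:00 0
"""
rows = parse_record(sample)
# each element: (sensor_name, 'YYYY-MM-DD HH:MM:SS', value)
```

In PySpark this parser could then be applied with something like `rdd.flatMap(parse_record)` after reading the file via `sc.newAPIHadoopFile(path, "org.apache.hadoop.mapreduce.lib.input.TextInputFormat", "org.apache.hadoop.io.LongWritable", "org.apache.hadoop.io.Text", conf={"textinputformat.record.delimiter": "Time\tMHist::"})` — the exact delimiter string depends on how the header is actually formatted in the file.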