GCP Dataprep - moving window

此生再无相见时 submitted on 2019-12-24 20:30:27

Question


I have a CSV file in the following format that I am trying to wrangle with GCP Dataprep.

   Timestamp                 Tag           Value
   2018-05-01 09:00:00     Temperature     40.1
   2018-05-01 09:00:00     Humidity        80
   2018-05-01 09:05:00     Temperature     40.2
   2018-05-01 09:05:00     Humidity        80
   2018-05-01 09:10:00     Temperature     40.0
   2018-05-01 09:10:00     Humidity        82

The data is recorded at 5-minute intervals over 2 weeks. I would like to transform it so that at each 10-minute interval I display the average (or min/max/median) of the previous 10 minutes, and also pivot it, so that the end result is as follows:

   Timestamp               Temperature     Humidity
   2018-05-01 09:10:00     40.1            80.67

So essentially, the value at 09:10 is the average of the values at 09:00, 09:05 and 09:10.

And the value at 09:25 would be the average of the values at 09:15, 09:20 and 09:25.

I have tried window functions and aggregate, but seem to be unable to get it to work.

Thank you for your input!


Answer 1:


I would start by reformatting the data so that each line is of the form:

  Timestamp                Temperature      Humidity 
  2018-05-01 09:00:00      40.1             80
  2018-05-01 09:05:00      40.2             80
  2018-05-01 09:10:00      40               82
  2018-05-01 09:15:00      41               81
  2018-05-01 09:20:00      40.8             81
  2018-05-01 09:25:00      40.2             84

This halves the number of rows in your data and reduces the amount of storage space you are using. You can use the ‘convert values to columns’ function in Dataprep, which creates one column for Temperature and one for Humidity.
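
To make the reshaping step concrete outside of Dataprep, here is a minimal Python sketch of the same long-to-wide pivot. It is only an illustration, not Dataprep's implementation; the names `long_rows` and `pivot_long_to_wide` are hypothetical, and the sample values are the ones from the question.

```python
from collections import defaultdict

# Sample rows in the long (Timestamp, Tag, Value) layout from the question.
long_rows = [
    ("2018-05-01 09:00:00", "Temperature", 40.1),
    ("2018-05-01 09:00:00", "Humidity", 80.0),
    ("2018-05-01 09:05:00", "Temperature", 40.2),
    ("2018-05-01 09:05:00", "Humidity", 80.0),
    ("2018-05-01 09:10:00", "Temperature", 40.0),
    ("2018-05-01 09:10:00", "Humidity", 82.0),
]


def pivot_long_to_wide(rows):
    """Group values by timestamp so that each tag becomes its own column."""
    wide = defaultdict(dict)
    for timestamp, tag, value in rows:
        wide[timestamp][tag] = value
    # Timestamps in this format sort chronologically when sorted as text.
    return [{"Timestamp": ts, **wide[ts]} for ts in sorted(wide)]


wide_rows = pivot_long_to_wide(long_rows)
for row in wide_rows:
    print(row)
```

Six long rows become three wide rows, one per timestamp, each carrying both a Temperature and a Humidity value.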

Once the data is in that format, you can use the window function ROLLINGAVERAGE to compute the averages. Make the function compute the average over the current row and the two rows that precede it by setting the following parameters:

rollingaverage(Temperature, 2, 0)

The first argument is the column name, the second is the number of preceding rows, and the last is the number of following rows over which the average is computed. Repeat the above function for the Humidity column. The result should be two new columns holding the rolling averages of Temperature and Humidity.
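
As a rough illustration of what the three arguments mean, the following Python sketch computes a trailing average over the Temperature values from the table above. The helper `rolling_average` is hypothetical and is not Dataprep's code. It assumes that rows near the start, which have fewer than two predecessors, are averaged over whatever rows are available; Dataprep's handling of those edge rows may differ.

```python
def rolling_average(values, rows_before, rows_after):
    """Average each value with up to rows_before earlier and rows_after later values."""
    result = []
    for i in range(len(values)):
        start = max(0, i - rows_before)              # clamp at the first row
        end = min(len(values), i + rows_after + 1)   # clamp at the last row
        window = values[start:end]
        result.append(sum(window) / len(window))
    return result


# Temperature column from the wide table, in timestamp order.
temperature = [40.1, 40.2, 40.0, 41.0, 40.8, 40.2]

# Same window as rollingaverage(Temperature, 2, 0): current row plus two before.
temperature_avg = rolling_average(temperature, 2, 0)
print([round(v, 2) for v in temperature_avg])
```

The third entry is the average of 40.1, 40.2 and 40.0, which is the value wanted at 09:10, and the sixth is the average of 41.0, 40.8 and 40.2, which is the value wanted at 09:25.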

Next, under the filter rows option, use the ‘rows at regular intervals’ function to keep every third row, starting from the third entry.

Remove the two original Temperature and Humidity columns, keeping the rolling averages under those names, and you will get the following:

  Timestamp                Temperature      Humidity 
  2018-05-01 09:10:00      40.1             80.67
  2018-05-01 09:25:00      40.67            82
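
The rolling average and the every-third-row filter can be checked together with a small Python sketch. This is a sketch under the same assumptions as the steps above, not Dataprep code; `trailing_mean`, `ROWS_BEFORE` and `KEEP_EVERY` are hypothetical names, and the input rows are the wide-format sample data from this answer.

```python
# Wide-format sample rows: (timestamp, temperature, humidity).
rows = [
    ("2018-05-01 09:00:00", 40.1, 80.0),
    ("2018-05-01 09:05:00", 40.2, 80.0),
    ("2018-05-01 09:10:00", 40.0, 82.0),
    ("2018-05-01 09:15:00", 41.0, 81.0),
    ("2018-05-01 09:20:00", 40.8, 81.0),
    ("2018-05-01 09:25:00", 40.2, 84.0),
]

ROWS_BEFORE = 2  # mirrors rollingaverage(column, 2, 0)
KEEP_EVERY = 3   # keep every third row, starting from the third


def trailing_mean(values, index, rows_before):
    """Mean of values[index] and up to rows_before values before it."""
    window = values[max(0, index - rows_before): index + 1]
    return sum(window) / len(window)


temperatures = [r[1] for r in rows]
humidities = [r[2] for r in rows]

summary = []
for i in range(KEEP_EVERY - 1, len(rows), KEEP_EVERY):
    summary.append((
        rows[i][0],
        round(trailing_mean(temperatures, i, ROWS_BEFORE), 2),
        round(trailing_mean(humidities, i, ROWS_BEFORE), 2),
    ))

for line in summary:
    print(line)
```

Because every kept row averages exactly the three rows of its own block, the windows never overlap, so each 5-minute reading contributes to one output row only.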


Source: https://stackoverflow.com/questions/50807410/gcp-dataprep-moving-window
