问题
I have a CSV file of the following format that I am trying to wrangle with GCP dataprep.
Timestamp Tag Value
2018-05-01 09:00:00 Temperature 40.1
2018-05-01 09:00:00 Humidity 80
2018-05-01 09:05:00 Temperature 40.2
2018-05-01 09:05:00 Humidity 80
2018-05-01 09:10:00 Temperature 40.0
2018-05-01 09:10:00 Humidity 82
The data extends in 5 minutes interval for 2 weeks. I would like to transform it such that at each 10 minute interval, I am displaying the average(or min/max/median) of the previous 10 minutes and also pivot it,so that the end result I get is as follows:
Timestamp Temperature Humidity
2018-05-01 09:10:00 40.1 80.06
So essentially, the value at 09:10 is the average of the values at 09:00, 09:05 and 09:10.
And the value at 09:25 would be the average at 09:15,09:20,09:25.
I have tried window functions and aggregate, but seem to be unable to get it to work.
Thank you for your input!
回答1:
I would start by reformatting the data were each line is of the form:
Timestamp Temperature Humidity
2018-05-01 09:00:00 40.1 80
2018-05-01 09:05:00 40.2 80
2018-05-01 09:10:00 40 82
2018-05-01 09:15:00 41 81
2018-05-01 09:20:00 40.8 81
2018-05-01 09:25:00 40.2 84
This will reduce the size of your data in half and reduce the amount of storage space you are utilizing. You can use the ‘convert values to columns’ function in Dataprep. This will create a column for Temperature and one for Humidity.
Once in that format, you can use the window function ROLLINGAVERAGE to compute the averages. Make the function compute the average for the two rows which precedes it by setting the following parameters:
rollingaverage(Temperature, 2, 0)
The first argument is the column name, the second is the preceding rows, and the last one is the number of following rows on which the average is computed. Repeat the above function for the Humidity column. The results should resemble the following:
Next, under the filter rows option, use the ‘rows at regular intervals’ function to keep every third row starting from the third entry. Set the parameter as follows:
Remove the two original Temperature and Humidity rows and you will get the following:
Timestamp Temperature Humidity
2018-05-01 09:10:00 40.1 80.67
2018-05-01 09:25:00 41 82
来源:https://stackoverflow.com/questions/50807410/gcp-dataprep-moving-window