Azure Data Lake Analytics: Combine overlapping time duration using U-SQL


I want to remove overlapping time duration from CSV data placed in Azure Data Lake Store using U-SQL and combine those rows. Data set contains start time and end time with sever

1 Answer

刺人心

2021-01-23 00:00
It looks like you want to aggregate all the data for the rows that provide overlapping timeframes? Or what do you want to do with the data in the other columns?

At first glance, I would suggest that you use a user-defined REDUCER or a user-defined aggregator, depending on what you want to achieve with the other data.

However, one problem I see is that you may need fixed-point recursion to create the common overlapping ranges. Unfortunately, there is no fixed-point recursion in U-SQL (or in Hive), because recursion cannot be processed efficiently at scale-out.

UPDATE AFTER CLARIFICATION:

That is easier, I think. You just take the minimum of the begin times and the maximum of the end times, grouped by the key value:
```
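// Read the ranges: begin time, end time, and the key column to group on.
@r = EXTRACT begin DateTime, end DateTime,
             data string
     FROM "/temp/ranges.txt"
     USING Extractors.Text(delimiter:'-');

// One output row per key: earliest begin and latest end.
@r = SELECT MIN(begin) AS begin,
            MAX(end) AS end,
            data
     FROM @r
     GROUP BY data;

// Write the combined ranges as CSV.
OUTPUT @r
TO "/temp/result.csv"
USING Outputters.Csv();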
```
Note this works only if your ranges are on the same day and do not span over midnight.
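
To see what this query computes per key, here is a small Python sketch (Python rather than U-SQL, purely for illustration; the function name and the sample rows are made up). It mimics the MIN/MAX with GROUP BY and shows that the result is always a single range per key, so any gap between a key's ranges ends up inside that range.

```python
from datetime import datetime

def min_max_per_key(rows):
    """Mimic SELECT MIN(begin), MAX(end) ... GROUP BY data on (begin, end, key) tuples."""
    result = {}
    for begin, end, key in rows:
        if key in result:
            lo, hi = result[key]
            result[key] = (min(lo, begin), max(hi, end))
        else:
            result[key] = (begin, end)
    return result

# Hypothetical sample: two overlapping ranges and one separate range for key "A".
rows = [
    (datetime(2016, 6, 8, 8, 0), datetime(2016, 6, 8, 9, 0), "A"),
    (datetime(2016, 6, 8, 8, 30), datetime(2016, 6, 8, 10, 0), "A"),
    (datetime(2016, 6, 8, 14, 0), datetime(2016, 6, 8, 15, 0), "A"),
]
collapsed = min_max_per_key(rows)
# collapsed["A"] spans 08:00 to 15:00, including the gap between 10:00 and 14:00.
```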

UPDATED WITH A SOLUTION THAT HANDLES DISJOINT RANGES FOR A USER:

You can solve it with a user-defined reducer. The following blog post explains the details of the solution and provides links to the GitHub code: https://blogs.msdn.microsoft.com/mrys/2016/06/08/how-do-i-combine-overlapping-ranges-using-u-sql-introducing-u-sql-reducer-udos/
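
The code below is not the implementation from the blog post. It is a minimal Python sketch of one common way to merge ranges while keeping disjoint ones apart, assuming the rows are (begin, end, key) tuples: sort by key and begin time, then sweep through each key's rows, extending the current range while the next one overlaps it. The function name and the sample data are hypothetical.

```python
from datetime import datetime
from itertools import groupby
from operator import itemgetter

def merge_ranges(rows):
    """Merge overlapping (begin, end, key) ranges per key; disjoint ranges stay separate."""
    merged = []
    # Sort by key, then by begin time, so each key's ranges are swept in order.
    ordered = sorted(rows, key=lambda r: (r[2], r[0]))
    for key, group in groupby(ordered, key=itemgetter(2)):
        cur_begin, cur_end = None, None
        for begin, end, _ in group:
            if cur_begin is None:
                # First range for this key starts the current merged range.
                cur_begin, cur_end = begin, end
            elif begin <= cur_end:
                # Overlaps (or touches) the current range: extend it if needed.
                cur_end = max(cur_end, end)
            else:
                # Gap found: emit the current range and start a new one.
                merged.append((cur_begin, cur_end, key))
                cur_begin, cur_end = begin, end
        # Emit the last open range for this key.
        merged.append((cur_begin, cur_end, key))
    return merged

# Hypothetical sample: key "A" has two overlapping ranges plus one disjoint range.
sample = [
    (datetime(2016, 6, 8, 14, 0), datetime(2016, 6, 8, 15, 0), "A"),
    (datetime(2016, 6, 8, 8, 0), datetime(2016, 6, 8, 9, 0), "A"),
    (datetime(2016, 6, 8, 9, 0), datetime(2016, 6, 8, 9, 30), "B"),
    (datetime(2016, 6, 8, 8, 30), datetime(2016, 6, 8, 10, 0), "A"),
]
result = merge_ranges(sample)
```

In this sketch, ranges that merely touch (one begins exactly when the previous ends) are merged too; change `<=` to `<` if they should stay separate.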