I want to remove overlapping time duration from CSV data placed in Azure Data Lake Store using U-SQL and combine those rows. Data set contains start time and end time with sever
It looks like you want to aggregate all the data for the rows that provide overlapping timeframes? Or what do you want to do with the data in the other columns?
At first glance, I would suggest that you use a user-defined REDUCER or a user-defined aggregator, depending on what you want to achieve with the other data.
However, one problem I see is that you may need fixpoint recursion to build the common overlapping ranges. Unfortunately, there is no fixpoint recursion in U-SQL (nor in Hive), because recursion cannot be processed efficiently at scale-out.
UPDATE AFTER CLARIFICATION:
That is easier, I think. You just take the minimum of the begin times and the maximum of the end times, grouped by the key value:
    @r = EXTRACT begin DateTime,
                 end DateTime,
                 data string
         FROM "/temp/ranges.txt"
         USING Extractors.Text(delimiter:'-');

    @r = SELECT MIN(begin) AS begin,
                MAX(end) AS end,
                data
         FROM @r
         GROUP BY data;

    OUTPUT @r
    TO "/temp/result.csv"
    USING Outputters.Csv();
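Note that this works only if your ranges are on the same day and do not span midnight. Because the query produces exactly one row per data value, it also merges disjoint ranges for the same key into a single range that covers the gap between them.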
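To make the behaviour of the MIN/MAX grouping concrete, here is a small sketch in Python (not U-SQL) that mimics what the GROUP BY computes. The sample rows, the key names "A" and "B", and the helper names `min_max_by_key` and `t` are made up for illustration.

```python
from datetime import datetime

def min_max_by_key(rows):
    """Mimic SELECT MIN(begin), MAX(end) ... GROUP BY key.

    rows: iterable of (begin, end, key) tuples.
    Returns a dict mapping key -> (earliest begin, latest end).
    """
    spans = {}
    for begin, end, key in rows:
        if key in spans:
            lo, hi = spans[key]
            spans[key] = (min(lo, begin), max(hi, end))
        else:
            spans[key] = (begin, end)
    return spans

def t(hhmm):
    # Hypothetical helper: build a timestamp on one fixed, arbitrary day.
    return datetime.strptime("2016-06-08 " + hhmm, "%Y-%m-%d %H:%M")

rows = [
    (t("09:00"), t("10:00"), "A"),
    (t("09:30"), t("11:00"), "A"),  # overlaps the previous range for A
    (t("14:00"), t("15:00"), "B"),
    (t("17:00"), t("18:00"), "B"),  # does not overlap the previous range for B
]

result = min_max_by_key(rows)
```

For key "A" the two overlapping ranges are combined as intended. For key "B" the two separate ranges are folded into one span from 14:00 to 18:00, which includes time that neither input range covered.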
UPDATED WITH A SOLUTION THAT HANDLES DISJOINT RANGES FOR A USER:
You can solve it with a user-defined reducer. The following blog post explains the details of the solution and provides links to the GitHub code: https://blogs.msdn.microsoft.com/mrys/2016/06/08/how-do-i-combine-overlapping-ranges-using-u-sql-introducing-u-sql-reducer-udos/
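As a rough illustration of the kind of logic such a reducer can apply, here is a sketch in Python of a single-pass merge over ranges sorted by start time within each key. This is not the code from the blog post, and an actual U-SQL reducer would be written in C#; the function name `merge_ranges`, the helper `ts`, and the sample data are assumptions for the example.

```python
from datetime import datetime
from itertools import groupby

def merge_ranges(rows):
    """Merge overlapping (begin, end) ranges separately for each key.

    rows: iterable of (begin, end, key) tuples.
    Returns a list of (begin, end, key) tuples, ordered by key and begin.
    """
    merged = []
    # Sort by key, then by begin, so each key's ranges arrive in start order.
    ordered = sorted(rows, key=lambda r: (r[2], r[0]))
    for key, group in groupby(ordered, key=lambda r: r[2]):
        cur_begin = cur_end = None
        for begin, end, _ in group:
            if cur_begin is None:
                # First range for this key starts the current run.
                cur_begin, cur_end = begin, end
            elif begin <= cur_end:
                # Overlaps (or touches) the current run: extend it.
                cur_end = max(cur_end, end)
            else:
                # Gap found: emit the finished run and start a new one.
                merged.append((cur_begin, cur_end, key))
                cur_begin, cur_end = begin, end
        merged.append((cur_begin, cur_end, key))
    return merged

def ts(hhmm):
    # Hypothetical helper: build a timestamp on one fixed, arbitrary day.
    return datetime.strptime("2016-06-08 " + hhmm, "%Y-%m-%d %H:%M")

sample = [
    (ts("13:00"), ts("14:00"), "A"),
    (ts("09:00"), ts("10:00"), "A"),
    (ts("09:30"), ts("11:00"), "A"),
    (ts("14:00"), ts("15:00"), "B"),
]

merged_sample = merge_ranges(sample)
```

Because the ranges are processed in start order, one pass per key is enough and no recursion is needed. Note that this sketch treats ranges that merely touch (one ends exactly when the next begins) as overlapping; change `<=` to `<` if you want to keep them separate.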