Using merge to combine matching records


Question


I'm trying to combine matching records from a table into a single record in another table. I know this can be done with group by and sum(), max(), etc. My difficulty is that the columns that are not part of the group by are varchars that I need to concatenate.

I'm using Sybase ASE 15, so I do not have a function like MySQL's group_concat or similar.
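Just to illustrate the goal, this is roughly the kind of query I would write if something like group_concat were available (MySQL syntax, not valid in ASE, shown against an equivalent non-temp source_t table):

-- MySQL-style aggregation, shown only to illustrate the desired result;
-- ASE 15 has no equivalent built-in function.
select account, group_concat(event separator ' | ') as event_list
from source_t
group by account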

I tried using merge without luck; the target table ended up with the same number of records as the source table.

create table #source_t(account varchar(10), event varchar(10))

Insert into #source_t(account, event) values ('account1','event 1')
Insert into #source_t(account, event) values ('account1','event 2')
Insert into #source_t(account, event) values ('account1','event 3')

create table #target(account varchar(10), event_list varchar(2048))

merge into #target as t
    using #source_t as s
    on t.account = s.account
    when     matched then update set event_list = t.event_list + ' | ' + s.event
    when not matched then insert(account, event_list) values (s.account, s.event)

select * from #target

drop table #target

drop table #source_t

Considering the above tables, I wanted to have one record per account, with all the events of the account concatenated in the second column.

account, event_list
'account1', 'event 1 | event 2 | event 3'

However, all I got was the same set of records as in #source_t.

It seems to me that the match in merge is attempted against the "state" of the table at the beginning of statement execution, so the when matched never executes. Is there a way of telling the DBMS to match against the updated target table?

I managed to obtain the results I needed by using a cursor, so the merge statement is executed n times, n being the number of records in #source_t; run that way, the merge actually executes the when matched part.
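Simplified, the slow cursor version I used looks something like this (the staging table #one_row is just illustrative; the point is that the merge runs once per source row):

create table #one_row(account varchar(10), event varchar(10))

declare slow_ciclo cursor for
    select account, event from #source_t
    for read only

declare @account varchar(10), @event varchar(10)

open slow_ciclo
fetch slow_ciclo into @account, @event

while @@sqlstatus = 0
begin
    -- stage the current row so the merge sees exactly one source record
    delete #one_row
    insert into #one_row(account, event) values (@account, @event)

    -- running the merge per row lets "when matched" see the rows
    -- inserted or updated by the previous iterations
    merge into #target as t
        using #one_row as s
        on t.account = s.account
        when     matched then update set event_list = t.event_list + ' | ' + s.event
        when not matched then insert(account, event_list) values (s.account, s.event)

    fetch slow_ciclo into @account, @event
end

close slow_ciclo
deallocate cursor slow_ciclo

drop table #one_row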

The problem with it is performance: removing duplicates this way takes about 5 minutes to combine 63K records into 42K.

Is there a faster way of achieving this?


Answer 1:


There's a little-known (poorly documented?) aspect of the UPDATE statement when using it to update a @variable which allows you to accumulate/concatenate values in the @variable as part of a set-based UPDATE operation.

This is easier to 'explain' with an example:

create table source
(account  varchar(10)
,event    varchar(10)
)
go

insert source values ('account1','event 1')
insert source values ('account1','event 2')
insert source values ('account1','event 3')

insert source values ('account2','event 1')

insert source values ('account3','event 1')
insert source values ('account3','event 2')
go

declare @account      varchar(10),
        @event_list   varchar(40)   -- increase the size to your expected max length 

select  @account = 'account1'

-- allow our UPDATE statement to cycle through the events for 'account1',
-- appending each successive event to @event_list

update  source
set     @event_list = @event_list + 
                      case when @event_list is not NULL then ' | ' end + 
                      event
from    source
where   account = @account

-- we'll display as a single-row result set; we could also use a 'print' statement ... 
-- just depends on what format the calling process is looking for

select  @account     as account,
        @event_list  as event_list
go

 account    event_list
 ---------- ----------------------------------------
 account1   event 1 | event 2 | event 3

PRO:

  • single UPDATE statement to process a single account value

CON:

  • still need a cursor to process a series of account values
  • if your desired final output is a single result set then you'll need to store the intermediate results (e.g., @account and @event_list) in a (temp) table, then run a final SELECT against this (temp) table to produce the desired result set
  • while you're not actually updating the physical table, you may run into problems if you don't have access to 'update' the table

NOTE: You could put the cursor/UPDATE logic in a stored proc, call the proc through a proxy table, and this would allow the output from a series of 'select @account, @event_list' statements to be returned to the calling process as a single result set ... but that's a whole 'nother topic on a (somewhat) convoluted coding method.

For your process you'll need a cursor to loop through your unique set of account values, but you'll be able to eliminate the cursor overhead for looping through the list of events for a given account. Net result is that you should see some improvement in the time it takes to run your process.
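A rough sketch of that combination follows (the account cursor and the #results table are illustrative names, not tested code):

create table #results
(account     varchar(10)
,event_list  varchar(40)   -- again, size this for your expected max length
)
go

declare acct_cursor cursor for
    select distinct account from source
    for read only
go

declare @account varchar(10), @event_list varchar(40)

open acct_cursor
fetch acct_cursor into @account

while @@sqlstatus = 0
begin
    -- reset the accumulator for each account
    select @event_list = NULL

    -- same variable-accumulating UPDATE as above, run once per account
    update  source
    set     @event_list = @event_list +
                          case when @event_list is not NULL then ' | ' end +
                          event
    from    source
    where   account = @account

    insert into #results values (@account, @event_list)

    fetch acct_cursor into @account
end

close acct_cursor
deallocate cursor acct_cursor

-- single result set for the calling process
select * from #results order by account
go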




Answer 2:


After applying the given suggestions, and also speaking with our DBA, the winning idea was to ditch the merge and use conditional logic inside the loop.

Adding begin/commit seemed to reduce execution time by 1.5 to 3 seconds.

Adding a primary key to the target table gave the biggest reduction, bringing execution time down to about 13 seconds.

Converting the merge to conditional logic was the best option in this case, achieving the result in about 8 seconds.

When using the conditionals, the primary key on the target table is slightly detrimental (around 1 second), but having it drastically reduces time afterwards, since this table is only a preliminary step for a big join. (That is, the result of this record-merging is later used in a join with 11+ tables.) So I kept the P.K.
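In the simplified example below the key is only noted in a comment; in practice it can be added with something like this (index name is illustrative):

-- assuming #target is created as in the example below, a unique index
-- on the match column serves as the key for the later join
create unique index ix_target_account on #target(account)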

Since there seems to be no solution without a cursor loop, I used the conditionals to merge the values using variables, issuing only inserts to the target table, thus eliminating the need to seek the record to update or to check whether it exists.

Here is a simplified example.

create table #source_t(account varchar(10), event varchar(10));

Insert into #source_t(account, event) values ('account1','event 1');
Insert into #source_t(account, event) values ('account1','event 2');
Insert into #source_t(account, event) values ('account1','event 3');

Insert into #source_t(account, event) values ('account2','came');
Insert into #source_t(account, event) values ('account2','saw');
Insert into #source_t(account, event) values ('account2','conquered');

create table #target(
    account varchar(10), -- make primary key if the result is to be joined afterwards.
    event_list varchar(2048)
);

declare ciclo cursor for
select account, event
from #source_t c
order by account --,...
for read only;

declare @account varchar(10), @event varchar(40), @last_account varchar(10), @event_list varchar(1000)

open ciclo

fetch ciclo into @account, @event

set @last_account = @account, @event_list = null

begin tran

    while @@sqlstatus = 0 BEGIN 

        if @last_account <> @account begin  -- if current record's account is different from previous, insert into table the concatenated event string  
            insert into #target(account, event_list) values (@last_account, @event_list)        
            set @event_list = null -- Empty the string for the next account
        end

        set @last_account = @account -- Copy current account to the variable that holds the previous one
        set @event_list = case when @event_list is null then @event else @event_list + ' | ' + @event end -- Concatenate events with separator (searched case so the null test is reliable)

        fetch ciclo into @account, @event
    END

    -- after the last fetch, @@sqlstatus changes to <> 0, the values remain in the variables but the loop ends, leaving the last record unprocessed.
    insert into #target(account, event_list) values (@last_account, @event_list)

commit tran

close ciclo

deallocate cursor ciclo;

select * from #target;

drop table #target;

drop table #source_t;

Result:

account |event_list                 |
--------|---------------------------|
account1|event 1 | event 2 | event 3|
account2|saw | came | conquered     |

This code ran fast enough in my real use case. However, it could be further optimized by filtering the source table so it holds only the values that are actually needed for the later join. For that, I saved the final joined result set (minus the join with #target) in another temp table, leaving some columns blank. Then #source_t was filled using only the accounts present in that result set, processed into #target as above, and finally #target was used to update the final result. With all that, execution time in the production environment dropped to around 8 seconds (including all steps).
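Roughly, the filtering and final update steps looked like this (table and column names here are illustrative, not the real production ones):

-- fill #source_t only with events for accounts that actually appear
-- in the pre-built result set (#final_result is a hypothetical name)
insert into #source_t(account, event)
select s.account, s.event
from   real_source_table s
where  exists (select 1
               from   #final_result r
               where  r.account = s.account)

-- then, after #target is built by the cursor loop above,
-- copy the concatenated events back into the result set
update #final_result
set    event_list = t.event_list
from   #final_result, #target t
where  #final_result.account = t.account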

UDF solutions have solved this kind of problem for me before, but in ASE 15 they have to be table-specific and you need to write one function per column. Also, that is only possible in the development environment; it is not authorized in production because of read-only privileges.

In conclusion, a cursor loop combined with a merge statement is a simple solution for combining records by concatenating certain values. A primary key, or an index covering the columns used for the match, is required to keep performance reasonable.

Conditional logic gives even better performance, but at the cost of more complex code (the more you code, the more error-prone it gets).



Source: https://stackoverflow.com/questions/56812800/using-merge-to-combine-matching-records
