B-trees, databases, sequential vs. random inserts, and speed. Random is winning

EDIT

@Remus corrected my test pattern. You can see the corrected version on his answer below.

I took the suggestion of replacing the INT with DECIMAL(29,0) and the results were:

Decimal: 2133
GUID: 1836

Random inserts are still winning, even with a fractionally bigger row.

Despite explanations that indicate random inserts are slower than sequential ones, these benchmarks show they are apparently faster. The explanations I'm getting aren't agreeing with the benchmarks. Therefore, my question remains focused on b-trees, sequential inserts, and speed.

...

I know from experience that b-trees have awful performance when data is added to them sequentially (regardless of the direction). However, when data is added randomly, best performance is obtained.

This is easy to demonstrate with the likes of an RB-Tree. Sequential writes cause a maximum number of tree balances to be performed.

I know very few databases use binary trees, but rather used n-order balanced trees. I logically assume they suffer a similar fate to binary trees when it comes to sequential inputs.

This sparked my curiosity.

If this is so, then one could deduce that writing sequential IDs (such as in IDENTITY(1,1)) would cause multiple re-balances of the tree to occur. I have seen many posts argue against GUIDs as "these will cause random writes". I never use GUIDs, but it struck me that this "bad" point was in fact a good point.

So I decided to test it. Here is my code:

SET ANSI_NULLS ON
GO
SET QUOTED_IDENTIFIER ON
GO
CREATE TABLE [dbo].[T1](
    [ID] [int] NOT NULL
 CONSTRAINT [T1_1] PRIMARY KEY CLUSTERED ([ID] ASC) 
)
GO

CREATE TABLE [dbo].[T2](
    [ID] [uniqueidentifier] NOT NULL
 CONSTRAINT [T2_1] PRIMARY KEY CLUSTERED ([ID] ASC)
)

GO

declare @i int, @t1 datetime, @t2 datetime, @t3 datetime, @c char(300)

set @t1 = GETDATE()
set @i = 1

while @i < 2000 begin
    insert into T2 values (NEWID(), @c)
    set @i = @i + 1
end

set @t2 = GETDATE()
WAITFOR delay '0:0:10'
set @t3 = GETDATE()
set @i = 1

while @i < 2000 begin
    insert into T1 values (@i, @c)
    set @i = @i + 1
end

select DATEDIFF(ms, @t1, @t2) AS [Int], DATEDIFF(ms, @t3, getdate()) AS [GUID]

drop table T1
drop table T2

Note that I am not subtracting any time for the creation of the GUID nor for the considerably extra size of the row. The results on my machine were as follows:

Int: 17,340 ms GUID: 6,746 ms

This means that in this test, random inserts of 16 bytes was almost 3 times faster than sequential inserts of 4 bytes.

Would anyone like to comment on this?

Ps. I get that this isn't a question. It's an invite to discussion, and that is relevant to learning optimum programming.

flip the operation and the int is faster..have you taken into account log and data file growth? Run each separately

declare @i int, @t1 datetime, @t2 datetime

set @t1 = GETDATE()
set @i = 1

while @i < 10000 begin
    insert into T2 values (NEWID())
    set @i = @i + 1
END


set @t2 = GETDATE()
set @i = 1

while @i < 10000 begin
    insert into T1 values (@i)
    set @i = @i + 1
end



select DATEDIFF(ms, @t1, @t2) AS [UID], DATEDIFF(ms, @t2, getdate()) AS [Int]

the problem with UUIDs is when clustering on them and not using NEWSEQUENTIALID() is that they cause page breaks and fragmentation of the table

now try like this and you see it is almost the same

declare @i int, @t1 datetime, @t2 datetime

set @t1 = GETDATE()
set @i = 1

while @i < 10000 begin
    insert into T2 values (NEWID())
    set @i = @i + 1
END
select DATEDIFF(ms, @t1, getdate()) 

set @t1 = GETDATE()
set @i = 1

while @i < 10000 begin
    insert into T1 values (@i)
    set @i = @i + 1
end



select DATEDIFF(ms, @t1, getdate())

And reversed

declare @i int, @t1 datetime, @t2 datetime



set @t1 = GETDATE()
set @i = 1

while @i < 10000 begin
    insert into T1 values (@i)
    set @i = @i + 1
end

set @t1 = GETDATE()
set @i = 1

while @i < 10000 begin
    insert into T2 values (NEWID())
    set @i = @i + 1
END
select DATEDIFF(ms, @t1, getdate())

You are not measuring the INSERT speed. You are measuring your log flush performance. Since you commit after each INSERT, all those tests are doing are sitting around waiting for commit to harden the log. That is hardly relevant for INSERT performance. And please don't post 'performance' measurements when SET NOCOUNT is OFF...

So lets try this without unnecessary server-client chatter, with a properly sized data, batch commits and pre-grown databases:

:setvar dbname testdb
:setvar testsize 1000000
:setvar batchsize 1000

use master;
go

if db_id('$(dbname)') is not null
begin
    drop database [$(dbname)];
end
go

create database [$(dbname)] 
    on (name='test_data', filename='c:\temp\test_data.mdf', size=10gb)
    log on (name='test_log', filename='c:\temp\test_log.ldf', size=100mb);
go

use [$(dbname)];
go  

CREATE TABLE [dbo].[T1](
    [ID] [int] NOT NULL
 CONSTRAINT [T1_1] PRIMARY KEY CLUSTERED ([ID] ASC) 
)
GO

CREATE TABLE [dbo].[T2](
    [ID] [uniqueidentifier] NOT NULL
 CONSTRAINT [T2_1] PRIMARY KEY CLUSTERED ([ID] ASC)
)
GO

set nocount on;
go

declare @i int, @t1 datetime, @t2 datetime

set @t1 = GETDATE()
set @i = 1

begin transaction;
while @i < $(testsize) begin
    insert into T1 values (@i)
    set @i = @i + 1
    if @i % $(batchsize) = 0
    begin
        commit;
        begin transaction;
    end
end
commit

set @t2 = GETDATE()
set @i = 1
begin transaction
while @i < $(testsize) begin
    insert into T2 values (NEWID())
    set @i = @i + 1
    if @i % $(batchsize) = 0
    begin
        commit;
        begin transaction;
    end
end
commit

select DATEDIFF(ms, @t1, @t2) AS [Int], DATEDIFF(ms, @t2, getdate()) AS [UID]

drop table T1
drop table T2

INTS: 18s
GUIDS: 23s

QED

I expect in a real database rebalancing of an index being a minor problem, because lots of index entries will fit in a single block and as long.

What might become more of an issue might be contention to that single block containing all the new entries. Oracle has a feature to store the bytes of the key in reverse order to spread new entries out over all blocks: http://oracletoday.blogspot.com/2006/09/there-is-option-to-create-index.html Don't know about other databases.

来源：https://stackoverflow.com/questions/4598094/b-trees-databases-sequential-vs-random-inserts-and-speed-random-is-winning

标签

sql

sql-server

benchmarking

clustered-index