Design pattern to convert tree-rules from Weka into SQL query

别等时光非礼了梦想. Submitted on 2019-12-23 03:17:33

Question


I have some output from Weka that looks like this:

fac_a < 64
|   fac_d < 71.5
|   |   fac_a < 49.5
|   |   |   fac_d < 23.5 : 19.44 (13/43.71) [13/77.47]
|   |   |   fac_d >= 23.5 : 24.25 (32/23.65) [16/49.15]
|   |   fac_a >= 49.5 : 30.8 (10/17.68) [5/22.44]
|   fac_d >= 71.5 : 33.6 (25/53.05) [15/47.35]
fac_a >= 64
|   fac_d < 83.5
|   |   fac_a < 91
|   |   |   fac_e < 93.5
|   |   |   |   fac_d < 45 : 31.9 (16/23.25) [3/64.14]
|   |   |   |   fac_d >= 45
|   |   |   |   |   fac_e < 21.5 : 44.1 (5/16.58) [2/21.39]
|   |   |   |   |   fac_e >= 21.5
|   |   |   |   |   |   fac_a < 77.5 : 33.45 (4/2.89) [1/0.03]
|   |   |   |   |   |   fac_a >= 77.5 : 39.46 (7/10.21) [1/11.69]
|   |   |   fac_e >= 93.5 : 45.97 (2/8.03) [1/107.71]
|   |   fac_a >= 91 : 42.26 (9/9.57) [4/69.03]
|   fac_d >= 83.5 : 47.1 (9/30.24) [6/40.15]

I want to add a column to my dataset (in MSSQL) that holds the predicted value of the response variable according to these rules. It's relatively easy to convert the above into a set of n queries (where n is the number of leaves in my tree), with each WHERE clause generated automatically from the branch information:

-- Rule 1
UPDATE table_name
SET prediction=value1
WHERE 
    fac_a < 64 AND 
    fac_d < 71.5 AND 
    fac_a < 49.5 AND 
    fac_d < 23.5
;

-- Rule 2
UPDATE table_name
SET prediction=value2
WHERE 
    fac_a < 64 AND 
    fac_d < 71.5 AND 
    fac_a < 49.5 AND 
    fac_d >= 23.5
;

etc. for each rule

But this doesn't scale well when I have complex trees (ca. 100 leaf nodes) and 100,000+ rows. Is there a design pattern for the SQL query that can apply this tree classification that will allow me to calculate the prediction more efficiently?


Answer 1:


Here's a thought: put the rules into a hierarchical table, then package the lookup in a recursive user-defined scalar function. See below. (For some reason, SQL Fiddle isn't happy with the user-defined function, but I tested it on SQL Server 2012, and it should work on 2008.)

It's fast on the sample data you provided and a 1000-row table of facts. At the least, it might be easier to manage than what you have now. There are also variations on this approach that might be better, but see what you think.

If your decision tree is more than 32 levels deep (or if you didn't populate the table of rules correctly, so the lookup never reaches a leaf), you'll hit SQL Server's nesting limit of 32 levels for functions and stored procedures. That limit is fixed: OPTION (MAXRECURSION n) applies only to recursive common table expressions, so it can't raise the limit for a recursive function.

create table facdata (
  fac_a decimal(10,4),
  fac_b decimal(10,4),
  fac_c decimal(10,4),
  fac_d decimal(10,4),
  fac_e decimal(10,4),
  val   decimal(10,4)
);

with v(i) as (
  select 40 union all select 50 union all select 70
  union all select 80 union all select 90 union all select 100
)
insert facdata
  select a.i, 30, c.i, d.i, e.i, null
  from v as a, v as c, v as d, v as e
go

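-- One row per tree node. did is the path from the root:
-- child 0 is the "<" branch, child 1 is the ">=" branch.
create table decisions (
  did hierarchyid primary key,
  fac char,               -- factor tested at this node (NULL at a leaf)
  split decimal(10,4),    -- split threshold (NULL at a leaf)
  val decimal(10,4)       -- prediction (non-NULL only at a leaf)
);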

insert decisions values
  (cast('/0/' as hierarchyid), 'a', 64,null),
  (cast('/0/0/' as hierarchyid), 'd', 71.5,null),
  (cast('/0/0/0/' as hierarchyid), 'a', 49.5,null),
  (cast('/0/0/0/0/' as hierarchyid), 'd', 23.5,null),
  (cast('/0/0/0/0/0/' as hierarchyid), NULL, NULL,19.44),
  (cast('/0/0/0/0/1/' as hierarchyid), NULL, NULL, 24.25),
  (cast('/0/0/0/1/' as hierarchyid), NULL, NULL, 30.8),
  (cast('/0/0/1/' as hierarchyid), NULL, NULL, 33.6),
  (cast('/0/1/' as hierarchyid), 'd', 83.5,null),
  (cast('/0/1/0/' as hierarchyid), 'a', 91,null),
  (cast('/0/1/1/' as hierarchyid), NULL, NULL, 47.1),
  (cast('/0/1/0/0/' as hierarchyid), 'e', 93.5,null),
  (cast('/0/1/0/0/0/' as hierarchyid), 'd', 45,null),
  (cast('/0/1/0/0/0/0/' as hierarchyid), null,null,31.9),
  (cast('/0/1/0/0/0/1/' as hierarchyid), 'e', 21.5,null),
  (cast('/0/1/0/0/0/1/0/' as hierarchyid), null,null,44.1),
  (cast('/0/1/0/0/0/1/1/' as hierarchyid), 'a', 77.5,null),
  (cast('/0/1/0/0/0/1/1/0/' as hierarchyid), NULL,NULL,33.45),
  (cast('/0/1/0/0/0/1/1/1/' as hierarchyid), NULL,NULL,39.46),
  (cast('/0/1/0/0/1/' as hierarchyid), NULL,NULL,45.97),
  (cast('/0/1/0/1/' as hierarchyid), NULL,NULL, 42.26);
go

create function dbo.findvalfrom(
  @h hierarchyid,          -- node to start from; pass '/0/' for the root
  @val_a decimal(10,4),
  @val_b decimal(10,4),
  @val_c decimal(10,4),
  @val_d decimal(10,4),
  @val_e decimal(10,4)
) returns decimal(10,4) as begin
    declare @c char;            -- factor tested at this node ('a'..'e')
    declare @s decimal(10,4);   -- split threshold
    declare @v decimal(10,4);   -- prediction, non-null only at a leaf
    select
      @c = fac, @s = split, @v = val
    from decisions
    where did = @h;
    -- Leaf reached: return its prediction.
    if @v is not null return @v;

    -- Pick this row's value for the factor the node tests.
    declare @val decimal(10,4);
    set @val = case when @c='a' then @val_a
                    when @c='b' then @val_b
                    when @c='c' then @val_c
                    when @c='d' then @val_d
                    when @c='e' then @val_e end;

    -- Descend to child 0 when the value is below the split, child 1 otherwise, then recurse.
    set @h = cast (@h.ToString()+case when @val<@s then '0/' else '1/' end as hierarchyid);
    return dbo.findvalfrom(@h,@val_a,@val_b,@val_c,@val_d,@val_e);
  end;
go
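
If the nesting depth of the recursive function is a worry, the same walk down the decisions table can be written as a loop. This is only a sketch: the name dbo.findval_iter is made up for illustration, it assumes the decisions table above with its root at '/0/', and I haven't timed it against the recursive version.

```sql
-- Hypothetical non-recursive variant of dbo.findvalfrom (the name is invented).
-- Assumes the decisions table defined above, rooted at '/0/'.
create function dbo.findval_iter(
  @val_a decimal(10,4),
  @val_b decimal(10,4),
  @val_c decimal(10,4),
  @val_d decimal(10,4),
  @val_e decimal(10,4)
) returns decimal(10,4) as begin
    declare @h hierarchyid = cast('/0/' as hierarchyid);
    declare @c char, @s decimal(10,4), @v decimal(10,4), @val decimal(10,4);

    while 1 = 1
    begin
      -- Reset first: a select that finds no row leaves the variables unchanged.
      select @c = null, @s = null, @v = null;
      select @c = fac, @s = split, @v = val
      from decisions
      where did = @h;

      if @v is not null return @v;   -- reached a leaf
      if @c is null return null;     -- node missing from the table: give up

      set @val = case @c when 'a' then @val_a
                         when 'b' then @val_b
                         when 'c' then @val_c
                         when 'd' then @val_d
                         when 'e' then @val_e end;

      -- Child 0 is the "<" branch, child 1 the ">=" branch.
      set @h = cast(@h.ToString() + case when @val < @s then '0/' else '1/' end as hierarchyid);
    end;

    return null;  -- not reached; T-SQL requires a function to end with RETURN
  end;
go

update facdata set
  val = dbo.findval_iter(fac_a,fac_b,fac_c,fac_d,fac_e);
go
```

Unlike the recursive version, a badly populated rules table makes this return NULL for the affected rows instead of raising a nesting error.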


update facdata set
  val = dbo.findvalfrom('/0/',fac_a,fac_b,fac_c,fac_d,fac_e);
go
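
One variation that avoids the per-row function call entirely is to generate a single UPDATE whose nested CASE expression mirrors the tree. The sketch below is hand-written from the Weka output in the question; in practice you would generate it from the same branch information used for the per-leaf WHERE clauses. Note that SQL Server allows only 10 levels of nesting in a CASE expression, so a deeper tree would need to be flattened into one WHEN per leaf instead.

```sql
-- Single pass over facdata; each CASE level corresponds to one split in the tree.
update facdata set val =
  case when fac_a < 64 then
    case when fac_d < 71.5 then
      case when fac_a < 49.5 then
        case when fac_d < 23.5 then 19.44 else 24.25 end
      else 30.8 end
    else 33.6 end
  else
    case when fac_d < 83.5 then
      case when fac_a < 91 then
        case when fac_e < 93.5 then
          case when fac_d < 45 then 31.9
          else
            case when fac_e < 21.5 then 44.1
            else
              case when fac_a < 77.5 then 33.45 else 39.46 end
            end
          end
        else 45.97 end
      else 42.26 end
    else 47.1 end
  end;
go
```

As with the function, a NULL factor value falls through to the ELSE (">=") branch at each split that tests it.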

select * from facdata;
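
Another variation, again only a sketch: flatten each leaf into a range per factor and apply all the rules with one join. The table name leaf_rules and its column names are invented for illustration; the bounds are worked out by hand from the tree in the question, with NULL meaning the leaf puts no bound on that side.

```sql
-- Hypothetical table: one row per leaf, lower bounds inclusive, upper bounds exclusive.
create table leaf_rules (
  a_lo decimal(10,4), a_hi decimal(10,4),
  d_lo decimal(10,4), d_hi decimal(10,4),
  e_lo decimal(10,4), e_hi decimal(10,4),
  val  decimal(10,4) not null
);

insert leaf_rules (a_lo, a_hi, d_lo, d_hi, e_lo, e_hi, val) values
  (null, 49.5, null, 23.5, null, null, 19.44),
  (null, 49.5, 23.5, 71.5, null, null, 24.25),
  (49.5, 64,   null, 71.5, null, null, 30.8),
  (null, 64,   71.5, null, null, null, 33.6),
  (64,   91,   null, 45,   null, 93.5, 31.9),
  (64,   91,   45,   83.5, null, 21.5, 44.1),
  (64,   77.5, 45,   83.5, 21.5, 93.5, 33.45),
  (77.5, 91,   45,   83.5, 21.5, 93.5, 39.46),
  (64,   91,   null, 83.5, 93.5, null, 45.97),
  (91,   null, null, 83.5, null, null, 42.26),
  (64,   null, 83.5, null, null, null, 47.1);
go

-- The leaves partition the factor space, so each row matches at most one rule.
update f set val = r.val
from facdata as f
join leaf_rules as r
  on  (r.a_lo is null or f.fac_a >= r.a_lo)
  and (r.a_hi is null or f.fac_a <  r.a_hi)
  and (r.d_lo is null or f.fac_d >= r.d_lo)
  and (r.d_hi is null or f.fac_d <  r.d_hi)
  and (r.e_lo is null or f.fac_e >= r.e_lo)
  and (r.e_hi is null or f.fac_e <  r.e_hi);
go
```

One difference from the function: a row with NULL in a factor cannot match any leaf that bounds that factor, so it may be left without a prediction. Retraining the tree then only means reloading leaf_rules, not regenerating queries.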


Source: https://stackoverflow.com/questions/21801068/design-pattern-to-convert-tree-rules-from-weka-into-sql-query
