How to summarize all possible combinations of variables?

北恋 2021-02-11 11:44

I am trying to summarize counts for all possible combinations of variables. Here is some example data:

3 Answers
  • 2021-02-11 12:05

    For this sort of query, using some of the built-in aggregate tools is quite straightforward.

    First, set up some sample data based on your sample image:

    declare @Table1 as table
        ([id] int, [a] int, [b] int, [c] int)
    ;
    
    INSERT INTO @Table1
        ([id], [a], [b], [c])
    VALUES
        (10001, 1, 3, 3),
        (10002, 0, 0, 0),
        (10003, 3, 6, 0),
        (10004, 7, 0, 0),
        (10005, 0, 0, 0)
    ;
    

    Since you want the count of IDs for each possible combination of non-zero attributes A, B, and C, the first step is to eliminate the zeros and convert the non-zero values to a single value we can summarize on; in this case I'll use the attribute's name. After that, it's a simple matter of performing the aggregate, using the CUBE clause in the GROUP BY to generate the combinations. Lastly, in the HAVING clause, prune out the unwanted summations. Mostly that's just ignoring the null values in the attributes, and optionally removing the grand summary (count of all rows).

    with t1 as (
    select case a when 0 then null else 'a' end a
         , case b when 0 then null else 'b' end b
         , case c when 0 then null else 'c' end c
         , id
      from @Table1
    )
    select a, b, c, count(id) cnt
      from t1
      group by cube(a,b,c)
      having (a is not null or grouping(a) = 1) -- For each attribute
         and (b is not null or grouping(b) = 1) -- only allow nulls as
         and (c is not null or grouping(c) = 1) -- a result of grouping.
         and grouping_id(a,b,c) <> 7  -- exclude the grand total
      order by grouping_id(a,b,c);
    

    Here are the results:

        a       b       c       cnt
    1   a       b       c       1
    2   a       b       NULL    2
    3   a       NULL    c       1
    4   a       NULL    NULL    3
    5   NULL    b       c       1
    6   NULL    b       NULL    2
    7   NULL    NULL    c       1
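
As a cross-check on those counts, here is a minimal Python sketch (an illustration only, not part of the SQL answer) that applies the same rule to the sample rows: for each non-empty subset of a, b, c, count the ids whose values are all non-zero. The names ROWS, ATTRS, and combo_counts are made up for this example.

```python
from itertools import combinations

# Sample rows copied from the @Table1 inserts above: (id, a, b, c).
ROWS = [
    (10001, 1, 3, 3),
    (10002, 0, 0, 0),
    (10003, 3, 6, 0),
    (10004, 7, 0, 0),
    (10005, 0, 0, 0),
]
ATTRS = ("a", "b", "c")


def combo_counts(rows, attrs):
    """Count ids whose values are non-zero for every attribute in each subset."""
    counts = {}
    for size in range(1, len(attrs) + 1):
        for combo in combinations(range(len(attrs)), size):
            key = ",".join(attrs[i] for i in combo)
            # row[0] is the id, so attribute i lives at row[i + 1].
            counts[key] = sum(all(row[i + 1] != 0 for i in combo) for row in rows)
    return counts


if __name__ == "__main__":
    for key, cnt in combo_counts(ROWS, ATTRS).items():
        print(key, cnt)
```

It yields the same seven counts as the result set above, keyed by a comma-separated attribute list instead of NULL-padded columns.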
    

    And finally my original rextester link: http://rextester.com/YRJ10544

    @lad2025 Here's a dynamic version (sorry, my SQL Server skills aren't as strong as my Oracle skills, but it works). Just set the correct values for @table and @col, and it should work as long as all other columns are numeric attributes:

    declare @sql varchar(max), @table varchar(30), @col varchar(30);
    set @table = 'Table1';
    set @col = 'id';
    with x(object_id, column_id, name, names, proj, pred, max_col, cnt) 
      as (
        select object_id, column_id, name, cast(name as varchar(max))
         , cast('case '+name+' when 0 then null else '''+name+''' end '+name as varchar(4000))
         , cast('('+name+' is not null or grouping('+name+') = 1)' as varchar(4000))
         , (select max(column_id) from sys.columns m where m.object_id = c.object_id and m.name <> @col)
         , 1
         from sys.columns c
        where object_id = OBJECT_ID(@Table)
          and column_id = (select min(column_id) from sys.columns m where m.object_id = c.object_id and m.name <> @col)
        union all
        select x.object_id, c.column_id, c.name, cast(x.names+', '+c.name as varchar(max))
         , cast(proj+char(13)+char(10)+'     , case '+c.name+' when 0 then null else '''+c.name+''' end '+c.name as varchar(4000))
         , cast(pred+char(13)+char(10)+'   and ('+c.name+' is not null or grouping('+c.name+') = 1)' as varchar(4000))
         , max_col
         , cnt+1
          from x join sys.columns c on c.object_id = x.object_id and c.column_id = x.column_id+1
    )
    select @sql='with t1 as (
    select '+proj+'
         , '+@col+'
      from '+@Table+'
    )
    select '+names+'
         , count('+@col+') cnt 
      from t1
     group by cube('+names+')
    having '+pred+'
       and grouping_id('+names+') <> '+cast(power(2,cnt)-1 as varchar(10))+'
     order by grouping_id('+names+');'
      from x where column_id = max_col;
    
    select @sql sql;
    exec (@sql);
    

    Rextester
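
To make the shape of the generated statement easier to see, here is a rough Python sketch that assembles the same kind of CUBE query text from a list of column names. It is only an illustration of the string-building idea, not a replacement for the recursive CTE above; build_cube_query and its parameters are hypothetical, and the names are interpolated as-is, so it assumes trusted identifiers.

```python
def build_cube_query(table, id_col, attr_cols):
    """Return T-SQL text that counts ids for every combination of attr_cols."""
    # One "zero -> NULL, non-zero -> column name" projection per attribute.
    projections = ",\n       ".join(
        f"case {c} when 0 then null else '{c}' end {c}" for c in attr_cols
    )
    # Only allow NULLs that come from grouping, not from the data.
    predicates = "\n   and ".join(
        f"({c} is not null or grouping({c}) = 1)" for c in attr_cols
    )
    names = ", ".join(attr_cols)
    # grouping_id is 2^n - 1 when every column is rolled up (the grand total).
    grand_total = 2 ** len(attr_cols) - 1
    return (
        f"with t1 as (\n"
        f"select {projections},\n       {id_col}\n  from {table}\n)\n"
        f"select {names}, count({id_col}) cnt\n"
        f"  from t1\n"
        f" group by cube({names})\n"
        f"having {predicates}\n"
        f"   and grouping_id({names}) <> {grand_total}\n"
        f" order by grouping_id({names});"
    )


if __name__ == "__main__":
    print(build_cube_query("Table1", "id", ["a", "b", "c"]))
```

For three attributes the excluded grouping_id is 7, which matches the hand-written query earlier in this answer.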

  • 2021-02-11 12:17

    A naive approach, SQL Server version (I've assumed that we always have 3 columns, so there will be 2^3 - 1 = 7 rows):

    SELECT 'A' AS combination, COUNT(DISTINCT CASE WHEN a > 0 THEN a ELSE NULL END) AS cnt FROM t
    UNION ALL 
    SELECT 'B', COUNT(DISTINCT CASE WHEN b > 0 THEN b ELSE NULL END) FROM t
    UNION ALL 
    SELECT 'C', COUNT(DISTINCT CASE WHEN c > 0 THEN c ELSE NULL END) FROM t
    UNION ALL
    SELECT 'A,B', COUNT(DISTINCT CASE WHEN a > 0 THEN CAST(a AS VARCHAR(10)) ELSE NULL END 
                         + ',' + CASE WHEN b > 0 THEN CAST(b AS VARCHAR(10)) ELSE NULL END) FROM t
    UNION ALL
    SELECT 'A,C', COUNT(DISTINCT CASE WHEN a > 0 THEN CAST(a AS VARCHAR(10)) ELSE NULL END 
                         + ',' + CASE WHEN c > 0 THEN CAST(c AS VARCHAR(10)) ELSE NULL END) FROM t
    UNION ALL
    SELECT 'B,C', COUNT(DISTINCT CASE WHEN b > 0 THEN CAST(b AS VARCHAR(10)) ELSE NULL END 
                         + ',' + CASE WHEN c > 0 THEN CAST(c AS VARCHAR(10)) ELSE NULL END) FROM t
    UNION ALL
    SELECT 'A,B,C', COUNT(DISTINCT CASE WHEN a > 0 THEN CAST(a AS VARCHAR(10)) ELSE NULL END 
                         + ',' + CASE WHEN b > 0 THEN CAST(b AS VARCHAR(10)) ELSE NULL END
                         + ',' + CASE WHEN c > 0 THEN CAST(c AS VARCHAR(10)) ELSE NULL END ) FROM t
    ORDER BY combination 
    
     
    

    Rextester Demo


    EDIT:

    Same as above but more concise:

    WITH cte AS (
        SELECT ID
              ,CAST(NULLIF(a,0) AS VARCHAR(10)) a
              ,CAST(NULLIF(b,0) AS VARCHAR(10)) b
              ,CAST(NULLIF(c,0) AS VARCHAR(10)) c 
        FROM t
    )
    SELECT 'A' AS combination, COUNT(DISTINCT a) AS cnt FROM cte UNION ALL 
    SELECT 'B', COUNT(DISTINCT b) FROM cte UNION ALL 
    SELECT 'C', COUNT(DISTINCT c) FROM cte UNION ALL
    SELECT 'A,B', COUNT(DISTINCT a + ',' + b) FROM cte UNION ALL
    SELECT 'A,C', COUNT(DISTINCT a + ',' + c) FROM cte UNION ALL
    SELECT 'B,C', COUNT(DISTINCT b + ',' + c) FROM cte UNION ALL
    SELECT 'A,B,C', COUNT(DISTINCT a + ',' + b + ',' + c ) FROM cte ;
    

    Rextester Demo


    EDIT 2

    Using UNPIVOT:

    WITH cte AS (SELECT ID
                   ,CAST(IIF(a!=0,1,NULL) AS VARCHAR(10)) a
                   ,CAST(IIF(b!=0,1,NULL) AS VARCHAR(10)) b
                   ,CAST(IIF(c!=0,1,NULL) AS VARCHAR(10)) c 
                FROM t)
    SELECT combination, [count]
    FROM (SELECT  a=COUNT(a), b=COUNT(b), c=COUNT(c)
               , ab=COUNT(a+b), ac=COUNT(a+c), bc=COUNT(b+c), abc=COUNT(a+b+c)
          FROM cte) s
    UNPIVOT ([count] FOR combination IN (a,b,c,ab,ac,bc,abc))AS unpvt
    

    Rextester Demo


    EDIT FINAL APPROACH

    I appreciate your approach. I have more than 3 variables in my actual dataset; do you think we can generate all possible combinations programmatically rather than hard-coding them? Maybe your second approach will cover that:

    SQL is a bit clumsy for this kind of operation, but I want to show that it is possible.

    CREATE TABLE t(id INT, a INT, b INT, c INT);
    
    INSERT INTO t
    SELECT 10001,1,3,3 UNION
    SELECT 10002,0,0,0 UNION
    SELECT 10003,3,6,0 UNION
    SELECT 10004,7,0,0 UNION
    SELECT 10005,0,0,0;
    
    DECLARE @Sample AS TABLE 
    (
        item_id     tinyint IDENTITY(1,1) PRIMARY KEY NONCLUSTERED,
        item        nvarchar(500) NOT NULL,
        bit_value   AS  CONVERT ( integer, POWER(2, item_id - 1) )
                    PERSISTED UNIQUE CLUSTERED
    );    
    
    INSERT INTO @Sample
    SELECT name
    FROM sys.columns
    WHERE object_id = OBJECT_ID('t')
      AND name != 'id';
    
    DECLARE @max integer = POWER(2, ( SELECT COUNT(*) FROM @Sample AS s)) - 1;
    DECLARE @cols NVARCHAR(MAX);
    DECLARE @cols_casted NVARCHAR(MAX);
    DECLARE @cols_count NVARCHAR(MAX);
    
    
    ;WITH
      Pass0 as (select 1 as C union all select 1), --2 rows
      Pass1 as (select 1 as C from Pass0 as A, Pass0 as B),--4 rows
      Pass2 as (select 1 as C from Pass1 as A, Pass1 as B),--16 rows
      Pass3 as (select 1 as C from Pass2 as A, Pass2 as B),--256 rows
      Pass4 as (select 1 as C from Pass3 as A, Pass3 as B),--65536 rows
      Tally as (select row_number() over(order by C) as n from Pass4)
    , cte AS (SELECT
        combination =
            STUFF
            (
                (
                    SELECT ',' + s.item 
                    FROM @Sample AS s
                    WHERE
                        n.n & s.bit_value = s.bit_value
                    ORDER BY
                        s.bit_value
                    FOR XML 
                        PATH (''),
                        TYPE                    
                ).value('(./text())[1]', 'varchar(8000)'), 1, 1, ''
            )
    FROM Tally AS N
    WHERE N.n BETWEEN 1 AND @max
    )
    SELECT @cols = STRING_AGG(QUOTENAME(combination),',')
          ,@cols_count = STRING_AGG(FORMATMESSAGE('[%s]=COUNT(DISTINCT %s)'
                        ,combination,REPLACE(combination, ',', ' + '','' +') ),',')
    FROM cte;
    
    SELECT 
      @cols_casted = STRING_AGG(FORMATMESSAGE('CAST(NULLIF(%s,0) AS VARCHAR(10)) %s'
                     ,name, name), ',')
    FROM sys.columns
    WHERE object_id = OBJECT_ID('t')
      AND name != 'id';
      
    DECLARE @sql NVARCHAR(MAX);
    
    SET @sql =
    'SELECT combination, [count]
    FROM (SELECT  <cols_count>
          FROM (SELECT ID, <cols_casted> FROM t )cte) s
    UNPIVOT ([count] FOR combination IN (<cols>))AS unpvt';
    
    SET @sql = REPLACE(@sql, '<cols_casted>', @cols_casted);
    SET @sql = REPLACE(@sql, '<cols_count>', @cols_count);
    SET @sql = REPLACE(@sql, '<cols>', @cols);
    
    SELECT @sql;
    EXEC (@sql);
    

    DBFiddle Demo

    DBFiddle Demo with 4 variables
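
The core trick in the script above is the bitmask test `n.n & s.bit_value = s.bit_value`: each column gets a power-of-two bit value, and every integer from 1 to 2^n - 1 selects one combination. A small Python sketch of just that enumeration step follows, assuming nothing beyond a list of column names; combinations_by_bitmask is a name invented for this example.

```python
def combinations_by_bitmask(items):
    """Return every non-empty combination of items, in bitmask order 1..2^n - 1."""
    # Item k (0-based) gets bit value 2**k, like the bit_value computed column.
    bit_values = [(item, 1 << k) for k, item in enumerate(items)]
    result = []
    for n in range(1, 2 ** len(items)):
        # Keep the items whose bit is set in n.
        members = [item for item, bit in bit_values if (n & bit) == bit]
        result.append(",".join(members))
    return result


if __name__ == "__main__":
    for combo in combinations_by_bitmask(["a", "b", "c"]):
        print(combo)
```

With four columns the same loop runs from 1 to 15, which is why the tally table only needs to cover 2^n - 1 rows.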

  • 2021-02-11 12:29

    Poshan:

    As Robert stated, SUMMARY can be used to count combinations. A second SUMMARY can count the computed types. One difficulty is ignoring the combinations that involve a zero value. If the zeros can be converted to missing values, the processing is much cleaner. Presuming the zeros have been converted to missing, this code would count distinct combinations:

    proc summary noprint data=have;
      class v2-v4 s1;
      output out=counts_eachCombo;
    run;
    
    proc summary noprint data=counts_eachCombo(rename=_type_=combo_type);
      class combo_type;
      output out=counts_eachClassType;
    run;
    

    You can see how the use of a CLASS variable in a combination determines the _TYPE_ value, and the class variables can be of mixed type (numeric, character).
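
For readers unfamiliar with _TYPE_: it behaves as a bitmask over the CLASS variables, with the last CLASS variable on the least significant bit. The Python sketch below decodes a _TYPE_ value under that assumption for the CLASS list used above (v2, v3, v4, s1); class_vars_for_type is a hypothetical helper written only to illustrate the numbering.

```python
def class_vars_for_type(type_value, class_vars):
    """Decode a _TYPE_ value into the CLASS variables that contribute to it."""
    n = len(class_vars)
    # The last CLASS variable maps to bit 1, the one before it to bit 2, and so on.
    return [
        name
        for position, name in enumerate(class_vars)
        if type_value & (1 << (n - 1 - position))
    ]


if __name__ == "__main__":
    class_vars = ["v2", "v3", "v4", "s1"]
    for type_value in range(2 ** len(class_vars)):
        print(type_value, class_vars_for_type(type_value, class_vars))
```

So with four CLASS variables there are 16 types, and type 0 is the overall total with no CLASS variable involved.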

    A different 'home-grown' approach that does not use SUMMARY can use a DATA step with LEXCOMB to compute each combination, and PROC SQL with INTO ... SEPARATED BY to generate a SQL statement that counts each combination distinctly.
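
The logic of that approach can be sketched in a few lines of Python, assuming the rows are available as dictionaries: for every combination of variables, keep only rows where each selected variable is non-missing, non-zero, and non-blank, then count the distinct value tuples. The SAMPLE rows and the function name are hypothetical and exist only for this illustration.

```python
from itertools import combinations

# Hypothetical rows; 0 and "" play the role of values to ignore.
SAMPLE = [
    {"v2": 1, "v3": 0, "s1": "A"},
    {"v2": 1, "v3": 2, "s1": "A"},
    {"v2": 3, "v3": 2, "s1": ""},
    {"v2": 1, "v3": 2, "s1": "B"},
]


def distinct_combo_counts(rows, names):
    """For each combination of names, count distinct usable value tuples."""
    counts = {}
    for k in range(1, len(names) + 1):
        for combo in combinations(names, k):
            seen = {
                tuple(row[name] for name in combo)
                for row in rows
                # Skip rows where any selected variable is missing, 0, or blank.
                if all(row[name] not in (None, 0, "") for name in combo)
            }
            counts[",".join(combo)] = len(seen)
    return counts


if __name__ == "__main__":
    for combo, ucount in distinct_combo_counts(SAMPLE, ["v2", "v3", "s1"]).items():
        print(combo, ucount)
```

Each dictionary entry corresponds to one SELECT in the generated UNION: the key is the combo label and the value is the uCount.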

    Note: The following code contains macro varListEval for resolving a SAS variable list to individual variable names.

    %macro makeHave(n=,m=,maxval=&m*4,prob0=0.25);
    
      data have;
        do id = 1 to &n;
          array v v1-v&m;
          do over v;
            if ranuni(123) < &prob0 then v = 0; else v = ceil(&maxval*ranuni(123));
          end;
          s1 = byte(65+5*ranuni(123));
          output;
        end;
      run;
    
    %mend;
    
    %makeHave (n=100,m=5,maxval=15)
    
    %macro varListEval (data=, var=);
      %* resolve a SAS variable list to individual variable names;
      %local dsid dsid2 i name num;
      %let dsid = %sysfunc(open(&data));
      %if &dsid %then %do;
        %let dsid2 = %sysfunc(open(&data(keep=&var)));
        %if &dsid2 %then %do;
          %do i = 1 %to %sysfunc(attrn(&dsid,nvar));
            %let name = %sysfunc(varname(&dsid,&i));
            %let num = %sysfunc(varnum(&dsid2,&name));
            %if &num %then "&NAME";
          %end;
          %let dsid2 = %sysfunc(close(&dsid2));
        %end;
        %let dsid = %sysfunc(close(&dsid));
      %end;
      %else
        %put %sysfunc(sysmsg());
    %mend;
    
    %macro combosUCounts(data=, var=);
      %local vars n;
      %let vars = %varListEval(data=&data, var=&var);
    
      %let n = %eval(1 + %sysfunc(count(&vars,%str(" "))));
    
      * compute combination selectors and criteria;
      data combos;
        array _names (&n) $32 (&vars);
        array _combos (&n) $32;
        array _comboCriterias (&n) $200;
    
        length _selector $32000;
        length _criteria $32000;
    
        if 0 then set &data; %* prep PDV for vname;
    
        do _k = 1 to &n;
          do _j = 1 to comb(&n,_k);
            _rc = lexcomb(_j,_k, of _names[*]);
            do _p = 1 to _k;
              _combos(_p) = _names(_p);
              if vtypex(_names(_p)) = 'C' 
                then _comboCriterias(_p) = trim(_names(_p)) || " is not null and " || trim(_names(_p)) || " ne ''";
                else _comboCriterias(_p) = trim(_names(_p)) || " is not null and " || trim(_names(_p)) || " ne 0";
            end;
            _selector = catx(",", of _combos:);
            _criteria = catx(" and ", of _comboCriterias:);
            output;
          end;
        end;
    
        stop;
      run;
    
      %local union;
    
      proc sql noprint;
        * generate SQL statement that uses combination selectors and criteria;
        select "select "
        || quote(trim(_selector))
        || " as combo" 
        || ", "
        || "count(*) as uCount from (select distinct "
        || trim(_selector)
        || " from &data where "
        || trim(_criteria)
        || ")"
        into :union separated by " UNION "
        from combos
        ;
    
        * perform the generated SQL statement;
        create table comboCounts as
        &union;
    
        /* %put union=%superq(union); */
      quit;
    %mend;
    
    options mprint nosymbolgen;
    %combosUCounts(data=have, var=v2-v4);
    %combosUCounts(data=have, var=v2-v4 s1);
    
    %put NOTE: Done;
    /*
    data _null_;
    put %varListEval(data=have, var=v2-v4) ;
    run;
    */
    