Is there a Linear Regression function in SQL Server?

遇见更好的自我 2020-12-12 17:19

Are there any linear regression functions in SQL Server 2005/2008, similar to the linear regression functions in Oracle?

8 Answers
  • 2020-12-12 17:52

    Here it is as a function that takes a table-valued parameter of a user-defined table type named XYDoubleType, defined as TABLE (Y FLOAT, X FLOAT), and assumes the linear function has the form AX + B. It returns A, B and R squared as the columns of a one-row table, in case you want to use the result in a join or something similar.

    CREATE FUNCTION FN_GetABForData(
     @XYData as XYDoubleType READONLY
     ) RETURNS  @ABData TABLE(
                A  FLOAT,
                B FLOAT, 
                Rsquare FLOAT )
     AS
     BEGIN
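        -- Running sums and the products built from them
        DECLARE @sx FLOAT, @sy FLOAT
        DECLARE @sxx FLOAT, @syy FLOAT, @sxy FLOAT, @sxsy FLOAT, @sxsx FLOAT, @sysy FLOAT
        DECLARE @n FLOAT, @A FLOAT, @B FLOAT, @Rsq FLOAT
    
        -- One pass over the input collects every sum the formulas need
        SELECT @sx = SUM(D.X), @sy = SUM(D.Y), @sxx = SUM(D.X*D.X), @syy = SUM(D.Y*D.Y),
            @sxy = SUM(D.X*D.Y), @n = COUNT(*)
        FROM @XYData D
        SET @sxsx = @sx*@sx
        SET @sxsy = @sx*@sy
        SET @sysy = @sy*@sy
    
        -- Slope A, intercept B and R squared (the squared correlation of X and Y).
        -- The denominators are zero when all X values (or all Y values) are equal.
        SET @A = (@n*@sxy - @sxsy)/(@n*@sxx - @sxsx)
        SET @B = @sy/@n - @A*@sx/@n
        SET @Rsq = POWER((@n*@sxy - @sxsy),2)/((@n*@sxx - @sxsx)*(@n*@syy - @sysy))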
    
        INSERT INTO @ABData (A,B,Rsquare) VALUES(@A,@B,@Rsq)
    
        RETURN 
     END
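
    As a cross-check outside the database, the same sums can be reproduced in a few lines of Python. This is only a sketch: the name fit_line and the choice to return None when every X is identical are assumptions made for illustration, not part of the T-SQL function above.

```python
def fit_line(points):
    """Return (A, B, Rsq) for y = A*x + B from an iterable of (x, y) pairs."""
    pts = list(points)
    n = len(pts)
    sx = sum(x for x, _ in pts)
    sy = sum(y for _, y in pts)
    sxx = sum(x * x for x, _ in pts)
    syy = sum(y * y for _, y in pts)
    sxy = sum(x * y for x, y in pts)

    x_spread = n * sxx - sx * sx  # zero when every x is identical
    y_spread = n * syy - sy * sy  # zero when every y is identical
    if x_spread == 0:
        # Assumption: report "undefined" instead of dividing by zero.
        return None, None, None

    a = (n * sxy - sx * sy) / x_spread
    b = sy / n - a * sx / n
    rsq = (n * sxy - sx * sy) ** 2 / (x_spread * y_spread) if y_spread != 0 else None
    return a, b, rsq


# Points taken from the exact line y = 2x + 1
a, b, rsq = fit_line([(0, 1), (1, 3), (2, 5), (3, 7)])
```

    Feeding it points that lie exactly on a known line is a quick way to confirm that the slope, intercept and R squared formulas agree with the SQL version.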
    
  • 2020-12-12 17:58

    To add to @icc97's answer, I have included the weighted versions of the slope and the intercept. If the x values are all constant, the slope will be NULL (with the appropriate settings SET ARITHABORT OFF; SET ANSI_WARNINGS OFF;) and will need to be replaced with 0 via coalesce().

    Here is a solution written in SQL:

    with d as (select segment,w,x,y from somedatasource)
    select segment,
    
    avg(y) - avg(x) *
    ((count(*) * sum(x*y)) - (sum(x)*sum(y)))/
    ((count(*) * sum(x*x)) - (Sum(x)*Sum(x)))   as intercept,
    
    ((count(*) * sum(x*y)) - (sum(x)*sum(y)))/
    ((count(*) * sum(x*x)) - (sum(x)*sum(x))) AS slope,
    
    avg(y) - ((avg(x*y) - avg(x)*avg(y))/var_samp(X)) * avg(x) as interceptUnstable,
    (avg(x*y) - avg(x)*avg(y))/var_samp(X) as slopeUnstable,
    (Avg(x * y) - Avg(x) * Avg(y)) / (stddev_pop(x) * stddev_pop(y)) as correlationUnstable,
    
    (sum(y*w)/sum(w)) - (sum(w*x)/sum(w)) *
    ((sum(w)*sum(x*y*w)) - (sum(x*w)*sum(y*w)))/
      ((sum(w)*sum(x*x*w)) - (sum(x*w)*sum(x*w)))   as wIntercept,
    
    ((sum(w)*sum(x*y*w)) - (sum(x*w)*sum(y*w)))/
      ((sum(w)*sum(x*x*w)) - (sum(x*w)*sum(x*w))) as wSlope,
    
    (count(*) * sum(x * y) - sum(x) * sum(y)) / (sqrt(count(*) * sum(x * x) - sum(x) * sum(x))
    * sqrt(count(*) * sum(y * y) - sum(y) * sum(y))) as correlation,
    
    (sum(w) * sum(x*y*w) - sum(x*w) * sum(y*w)) /
    (sqrt(sum(w) * sum(x*x*w) - sum(x*w) * sum(x*w)) * sqrt(sum(w) * sum(y*y*w)
    - sum(y*w) * sum(y*w))) as wCorrelation,
    
    count(*) as n
    
    from d where x is not null and y is not null group by segment
    

    Here w is the weight. I double-checked this against R to confirm the results. You may need to cast the data from somedatasource to floating point. Note that var_samp and stddev_pop are not SQL Server functions; the T-SQL equivalents are VAR and STDEVP. I included the unstable versions to warn you against them. (Special thanks go to Stephan in another answer.)
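
    One way to sanity-check the weighted formulas without R is to note that an integer weight of k should behave like k copies of the same point. The Python sketch below mirrors the wSlope and wIntercept expressions; the function name weighted_fit and the sample rows are made up for illustration.

```python
def weighted_fit(rows):
    """rows: iterable of (w, x, y). Returns (slope, intercept) of the weighted fit."""
    rows = list(rows)
    sw = sum(w for w, _, _ in rows)
    swx = sum(w * x for w, x, _ in rows)
    swy = sum(w * y for w, _, y in rows)
    swxx = sum(w * x * x for w, x, _ in rows)
    swxy = sum(w * x * y for w, x, y in rows)

    denom = sw * swxx - swx * swx
    if denom == 0:
        # All x equal: the case the SQL answer maps to NULL and then to 0.
        return None, None

    slope = (sw * swxy - swx * swy) / denom
    intercept = swy / sw - slope * swx / sw
    return slope, intercept


# The middle point carries weight 2 ...
weighted = [(1, 0.0, 1.0), (2, 1.0, 2.0), (1, 2.0, 4.0)]
# ... which should match listing it twice with weight 1.
expanded = [(1, 0.0, 1.0), (1, 1.0, 2.0), (1, 1.0, 2.0), (1, 2.0, 4.0)]

fit_weighted = weighted_fit(weighted)
fit_expanded = weighted_fit(expanded)
```

    Both calls produce the same sums, so the two fits coincide.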

    Update: added weighted correlation

  • 2020-12-12 18:05

    There are no built-in linear regression functions in SQL Server. To calculate a simple linear regression (Y' = bX + A) between pairs of data points x, y, including the correlation coefficient, the coefficient of determination (R^2) and the standard error of the estimate (standard deviation), do the following:

    For a table regression_data with numeric columns x and y:

    declare @total_points int 
    declare @intercept DECIMAL(38, 10)
    declare @slope DECIMAL(38, 10)
    declare @r_squared DECIMAL(38, 10)
    declare @standard_estimate_error DECIMAL(38, 10)
    declare @correlation_coefficient DECIMAL(38, 10)
    declare @average_x  DECIMAL(38, 10)
    declare @average_y  DECIMAL(38, 10)
    declare @sumX DECIMAL(38, 10)
    declare @sumY DECIMAL(38, 10)
    declare @sumXX DECIMAL(38, 10)
    declare @sumYY DECIMAL(38, 10)
    declare @sumXY DECIMAL(38, 10)
    declare @Sxx DECIMAL(38, 10)
    declare @Syy DECIMAL(38, 10)
    declare @Sxy DECIMAL(38, 10)
    
    Select 
    @total_points = count(*),
    @average_x = avg(x),
    @average_y = avg(y),
    @sumX = sum(x),
    @sumY = sum(y),
    @sumXX = sum(x*x),
    @sumYY = sum(y*y),
    @sumXY = sum(x*y)
    from regression_data
    
    set @Sxx = @sumXX - (@sumX * @sumX) / @total_points
    set @Syy = @sumYY - (@sumY * @sumY) / @total_points
    set @Sxy = @sumXY - (@sumX * @sumY) / @total_points
    
    set @correlation_coefficient = @Sxy / SQRT(@Sxx * @Syy) 
    set @slope = (@total_points * @sumXY - @sumX * @sumY) / (@total_points * @sumXX - power(@sumX,2))
    set @intercept = @average_y - (@total_points * @sumXY - @sumX * @sumY) / (@total_points * @sumXX - power(@sumX,2)) * @average_x
    set @r_squared = (@intercept * @sumY + @slope * @sumXY - power(@sumY,2) / @total_points) / (@sumYY - power(@sumY,2) / @total_points)
    
    -- calculate standard_estimate_error (standard deviation)
    Select
    @standard_estimate_error = sqrt(sum(power(y - (@slope * x + @intercept),2)) / @total_points)
    From regression_data
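
    The last query divides the sum of squared residuals by @total_points, which gives the root-mean-square residual. Many textbooks define the standard error of the estimate for a two-parameter line with n - 2 in the denominator instead. The Python sketch below computes both so the difference is visible; residual_spread is a hypothetical helper name, not something defined in the script above.

```python
import math


def residual_spread(points, slope, intercept):
    """Return (rmse, see) for the line y = slope*x + intercept over (x, y) pairs."""
    pts = list(points)
    n = len(pts)
    sse = sum((y - (slope * x + intercept)) ** 2 for x, y in pts)
    rmse = math.sqrt(sse / n)  # divisor n, as in the T-SQL above
    # Divisor n - 2: two parameters (slope, intercept) were estimated.
    see = math.sqrt(sse / (n - 2)) if n > 2 else None
    return rmse, see


# Residuals against y = x are 0, 0, -1, 0, so the squared residuals sum to 1.
rmse, see = residual_spread([(0, 0), (1, 1), (2, 1), (3, 3)], 1.0, 0.0)
```

    For large samples the two values are nearly identical; for a handful of points the n - 2 version is noticeably larger.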
    
  • 2020-12-12 18:07

    I've actually written an SQL routine using Gram-Schmidt orthogonalization. It, as well as other machine learning and forecasting routines, is available at sqldatamine.blogspot.com

    At the suggestion of Brad Larson I've added the code here rather than just directing users to my blog. This produces the same results as the linest function in Excel. My primary source is Elements of Statistical Learning (2008) by Hastie, Tibshirani and Friedman.

    --Create a table of data
    create table #rawdata (id int,area float, rooms float, odd float,  price float)
    
    insert into #rawdata select 1, 2201,3,1,400
    insert into #rawdata select 2, 1600,3,0,330
    insert into #rawdata select 3, 2400,3,1,369
    insert into #rawdata select 4, 1416,2,1,232
    insert into #rawdata select 5, 3000,4,0,540
    
    --Insert the data into x & y vectors
    select id xid, 0 xn,1 xv into #x from #rawdata
    union all
    select id, 1,rooms  from #rawdata
    union all
    select id, 2,area  from #rawdata
    union all
    select id, 3,odd  from #rawdata
    
    select id yid, 0 yn, price yv  into #y from #rawdata
    
    --create a residuals table and insert the intercept (1)
    create table #z (zid int, zn int, zv float)
    insert into #z select id , 0 zn,1 zv from #rawdata
    
    --create a table for the orthogonal (#c) & regression (#b) parameters
    create table #c(cxn int, czn int, cv float) 
    create table #b(bn int, bv float) 
    
    
    --@p is the index of the independent variable being processed; the intercept is variable 0, so start at 1
    declare @p int
    set @p = 1
    
    
    --Loop through each independent variable and estimate the orthogonal parameter (#c)
    -- then estimate the residuals and insert into the residuals table (#z)
    while @p <= (select max(xn) from #x)
    begin   
            insert into #c
        select  xn cxn,  zn czn, sum(xv*zv)/sum(zv*zv) cv 
            from #x join  #z on  xid = zid where zn = @p-1 and xn>zn group by xn, zn
    
        insert into #z
        select zid, xn,xv- sum(cv*zv) 
            from #x join #z on xid = zid   join  #c  on  czn = zn and cxn = xn  where xn = @p and zn<xn  group by zid, xn,xv
    
        set @p = @p +1
    end
    
    --Loop back through each independent variable and estimate the regression parameter by regressing
    -- the dependent variable y (less the terms already fitted) on the orthogonal residuals
    while @p>=0 
    begin
    
        insert into #b
        select zn, sum(yv*zv)/ sum(zv*zv) 
            from #z  join 
                (select yid, yv-isnull(sum(bv*xv),0) yv from #x join #y on xid = yid left join #b on  xn=bn group by yid, yv) y
            on zid = yid where zn = @p  group by zn
    
        set @p = @p-1
    end
    
    --The regression parameters
    select * from #b
    
    --Actual vs. fit with error
    select yid, yv, fit, yv-fit err from #y join 
        (select xid, sum(xv*bv) fit from #x join #b on xn = bn  group by xid) f
         on yid = xid
    
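    --R Squared (uncentered: the squared errors are compared with sum(yv^2), not with the sum of squares about the mean of yv)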
    select 1-sum(power(err,2))/sum(power(yv,2)) from 
    (select yid, yv, fit, yv-fit err from #y join 
        (select xid, sum(xv*bv) fit from #x join #b on xn = bn  group by xid) f
         on yid = xid) d
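
    For readers who want to follow the two loops outside SQL, here is a minimal Python sketch of the same idea, regression by successive orthogonalization. The helper names dot and gram_schmidt_ols and the small sample columns are assumptions for illustration; it is not a line-by-line port of the temp-table script.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def gram_schmidt_ols(columns, y):
    """Least-squares coefficients; columns[0] should be the all-ones intercept column."""
    # Forward pass (first loop in the SQL): make each column orthogonal
    # to the residual columns built before it.
    z = []
    for x in columns:
        resid = list(x)
        for zl in z:
            gamma = dot(zl, x) / dot(zl, zl)  # the #c parameter
            resid = [r - gamma * v for r, v in zip(resid, zl)]
        z.append(resid)

    # Backward pass (second loop in the SQL): fit the last variable first,
    # remove its contribution from y, then move to the previous variable.
    beta = [0.0] * len(columns)
    target = list(y)
    for j in range(len(columns) - 1, -1, -1):
        beta[j] = dot(z[j], target) / dot(z[j], z[j])
        target = [t - beta[j] * v for t, v in zip(target, columns[j])]
    return beta


ones = [1.0, 1.0, 1.0, 1.0]
x1 = [0.0, 1.0, 2.0, 3.0]
x2 = [1.0, 0.0, 2.0, 1.0]
y = [1 + 2 * a + 3 * b for a, b in zip(x1, x2)]  # exact plane, no noise

beta = gram_schmidt_ols([ones, x1, x2], y)
```

    Because y is built from a known plane with no noise, the returned coefficients should recover the intercept 1 and the slopes 2 and 3 up to rounding. The columns from #rawdata can be substituted in the same order the script uses: intercept, rooms, area, odd.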
    
  • 2020-12-12 18:10

    This is an alternative method, based on a blog post on Linear Regression in T-SQL, which uses the following equations:

    m = (n*Sum(xy) - Sum(x)*Sum(y)) / (n*Sum(x^2) - Sum(x)^2)
    b = Mean(y) - m*Mean(x)

    The SQL suggestion in the blog uses cursors though. Here's a prettified version of a forum answer that I used:

    table
    -----
    X (numeric)
    Y (numeric)
    
    /**
     * m = (nSxy - SxSy) / (nSxx - SxSx)
     * b = Ay - (Ax * m)
     * N.B. S = Sum, A = Mean
     */
    DECLARE @n INT
    -- "table" is a placeholder name; it is a reserved word, so it is bracketed here
    SELECT @n = COUNT(*) FROM [table]
    SELECT (@n * SUM(X*Y) - SUM(X) * SUM(Y)) / (@n * SUM(X*X) - SUM(X) * SUM(X)) AS M,
           AVG(Y) - AVG(X) *
           (@n * SUM(X*Y) - SUM(X) * SUM(Y)) / (@n * SUM(X*X) - SUM(X) * SUM(X)) AS B
    FROM [table]
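
    The same aggregate expressions are plain SQL, so they can be tried quickly against an in-memory SQLite database from Python. This is a sketch under assumptions: the table name points and the sample rows are invented, COUNT(*) is used inline in place of the @n variable, and the columns are REAL so that the division is not integer division.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE points (x REAL, y REAL)")  # hypothetical table
conn.executemany(
    "INSERT INTO points VALUES (?, ?)",
    [(1, 3), (2, 5), (3, 7), (4, 9)],  # points on y = 2x + 1
)

slope, intercept = conn.execute(
    """
    SELECT (COUNT(*) * SUM(x*y) - SUM(x) * SUM(y)) /
           (COUNT(*) * SUM(x*x) - SUM(x) * SUM(x)) AS m,
           AVG(y) - AVG(x) *
           (COUNT(*) * SUM(x*y) - SUM(x) * SUM(y)) /
           (COUNT(*) * SUM(x*x) - SUM(x) * SUM(x)) AS b
    FROM points
    """
).fetchone()
conn.close()
```

    With integer columns the same query can silently truncate, which is why the answers in this thread keep recommending a cast to floating point.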
    
  • 2020-12-12 18:13

    I hope the following answer helps one understand where some of the solutions come from. I am going to illustrate it with a simple example, but the generalization to many variables is theoretically straightforward as long as you know how to use index notation or matrices. To implement the solution for anything beyond 3 variables you'll need Gram-Schmidt (see Colin Campbell's answer above) or another matrix inversion algorithm.

    Since all the functions we need (variance, covariance, average, sum, etc.) are aggregation functions in SQL, one can easily implement the solution. I've done so in HIVE to do linear calibration of the scores of a logistic model. Among many advantages, one is that you can work entirely within HIVE without going out to some scripting language and back in.

    The model for your data (x_1, x_2, y) where your data points are indexed by i, is

    y(x_1, x_2) = m_1*x_1 + m_2*x_2 + c

    The model appears "linear", but needn't be. For example, x_2 can be any non-linear function of x_1, as long as it has no free parameters in it, e.g. x_2 = Sinh(3*(x_1)^2 + 42). Even if x_2 is "just" x_2 and the model is linear, the regression problem isn't. Only when you decide that the problem is to find the parameters m_1, m_2, c such that they minimize the L2 error do you have a Linear Regression problem.

    The L2 error is sum_i( (y[i] - f(x_1[i], x_2[i]))^2 ). Minimizing this w.r.t. the 3 parameters (set the partial derivatives w.r.t. each parameter = 0) yields 3 linear equations for 3 unknowns. These equations are LINEAR in the parameters (this is what makes it Linear Regression) and can be solved analytically. Doing this for a simple model (1 variable, linear model, hence two parameters) is straightforward and instructive. The generalization to a non-Euclidean metric norm on the error vector space is straightforward; the diagonal special case amounts to using "weights".
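
    For the one-variable case mentioned above, the steps can be written out as follows. This is a standard textbook derivation added for illustration, with the model y = m*x + c and n data points; the symbols m, c and n are chosen here and are not taken from the answer's missing images.

```latex
\begin{align*}
E(m, c) &= \sum_i (y_i - m x_i - c)^2 \\
\frac{\partial E}{\partial c} = -2 \sum_i (y_i - m x_i - c) = 0
  &\;\Rightarrow\; \sum_i y_i = m \sum_i x_i + n c \\
\frac{\partial E}{\partial m} = -2 \sum_i x_i (y_i - m x_i - c) = 0
  &\;\Rightarrow\; \sum_i x_i y_i = m \sum_i x_i^2 + c \sum_i x_i \\
m &= \frac{n \sum_i x_i y_i - \sum_i x_i \sum_i y_i}
          {n \sum_i x_i^2 - \left(\sum_i x_i\right)^2},
\qquad c = \bar{y} - m \bar{x}
\end{align*}
```

    The last line is the slope and intercept formula that the other answers in this thread implement with SUM, COUNT and AVG.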

    Back to our model in two variables:

    y = m_1*x_1 + m_2*x_2 + c

    Take the expectation value E[.] of both sides:

    E[y] = m_1*E[x_1] + m_2*E[x_2] + c (0)

    Now take the covariance w.r.t. x_1 and x_2, and use cov(x,x) = var(x):

    cov(y, x_1) = m_1*var(x_1) + m_2*covar(x_2, x_1) (1)

    cov(y, x_2) = m_1*covar(x_1, x_2) + m_2*var(x_2) (2)

    These are two equations in two unknowns, which you can solve by inverting the 2x2 matrix.

    In matrix form: ... which can be inverted to yield ... where

    det = var(x_1)*var(x_2) - covar(x_1, x_2)^2
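
    The matrix form elided above can be reconstructed from equations (1) and (2). What follows is that reconstruction, not the author's original images, with cov written uniformly for the covariance:

```latex
\begin{pmatrix} \operatorname{cov}(y, x_1) \\ \operatorname{cov}(y, x_2) \end{pmatrix}
=
\begin{pmatrix}
  \operatorname{var}(x_1) & \operatorname{cov}(x_1, x_2) \\
  \operatorname{cov}(x_1, x_2) & \operatorname{var}(x_2)
\end{pmatrix}
\begin{pmatrix} m_1 \\ m_2 \end{pmatrix}

\begin{pmatrix} m_1 \\ m_2 \end{pmatrix}
=
\frac{1}{\det}
\begin{pmatrix}
  \operatorname{var}(x_2) & -\operatorname{cov}(x_1, x_2) \\
  -\operatorname{cov}(x_1, x_2) & \operatorname{var}(x_1)
\end{pmatrix}
\begin{pmatrix} \operatorname{cov}(y, x_1) \\ \operatorname{cov}(y, x_2) \end{pmatrix}
```

    The inverse exists only when det is non-zero, that is, when x_1 and x_2 are not perfectly collinear and neither is constant.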

    (oh barf, what the heck are "reputation points"? Gimme some if you want to see the equations.)

    In any case, now that you have m_1 and m_2 in closed form, you can solve (0) for c.
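
    The closed form is short enough to check numerically. The Python sketch below solves equations (1) and (2) by inverting the 2x2 matrix and then recovers c from (0). The function names are hypothetical, and population covariances (divisor n) are used; sample covariances would give the same m_1 and m_2 because the divisor cancels.

```python
def mean(v):
    return sum(v) / len(v)


def cov(u, v):
    """Population covariance; cov(u, u) is the variance."""
    mu, mv = mean(u), mean(v)
    return sum((a - mu) * (b - mv) for a, b in zip(u, v)) / len(u)


def fit_two_variables(x1, x2, y):
    """Return (m_1, m_2, c) for y = m_1*x_1 + m_2*x_2 + c."""
    v1, v2, c12 = cov(x1, x1), cov(x2, x2), cov(x1, x2)
    det = v1 * v2 - c12 ** 2
    if det == 0:
        raise ValueError("x_1 and x_2 are collinear, or one of them is constant")
    cy1, cy2 = cov(y, x1), cov(y, x2)
    m1 = (v2 * cy1 - c12 * cy2) / det  # first row of the inverted matrix
    m2 = (v1 * cy2 - c12 * cy1) / det  # second row of the inverted matrix
    c = mean(y) - m1 * mean(x1) - m2 * mean(x2)  # equation (0)
    return m1, m2, c


x1 = [0.0, 1.0, 2.0, 3.0]
x2 = [1.0, 0.0, 2.0, 1.0]
y = [2 * a + 3 * b + 5 for a, b in zip(x1, x2)]  # exact plane, no noise

m1, m2, c = fit_two_variables(x1, x2, y)
```

    Since y is generated from a known plane, the fit should return the coefficients 2, 3 and 5 up to rounding. In SQL each cov and mean call becomes an aggregate over the table.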

    I checked the analytical solution above against Excel's Solver for a quadratic with Gaussian noise, and the residual errors agree to 6 significant digits.

    Contact me if you want to do Discrete Fourier Transform in SQL in about 20 lines.
