Linear regression:ValueError: all the input array dimensions except for the concatenation axis must match exactly

主宰稳场 提交于 2021-02-07 10:52:43

问题


I am looking for a solution for the following problem and it just won't work the way I want to.

So my goal is to calculate a regression analysis and get the slope, intercept, rvalue, pvalue and stderr for multiple rows (this could go up to 10000). In this example, I have a file with 15 rows. Here are the first two rows:

array([

[   1,    2,    3,    4,    5,    6,    7,    8,    9,   10,   11,
          12,   13,   14,   15,   16,   17,   18,   19,   20,   21,   22,
          23,   24],    

[ 100,   10,   61,   55,   29,   77,   61,   42,   70,   73,   98,
          62,   25,   86,   49,   68,   68,   26,   35,   62,  100,   56,
          10,   97]]
)

Full trial data set:

1   2   3   4   5   6   7   8   9   10  11  12  13  14  15  16  17  18  19  20  21  22  23  24

100 10  61  55  29  77  61  42  70  73  98  62  25  86  49  68  68  26  35  62  100 56  10  97

57  89  25  89  48  56  67  17  98  10  25  90  17  52  85  56  18  20  74  97  82  63  45  87

192 371 47  173 202 144 17  147 174 483 170 422 285 13  77  116 500 136 276 392 220 121 441 268

The first row is the x-variable and this is the independent variable. This has to be kept fixed while iterating over every following row.

For the following row, the y-variable and thus the dependent variable, I want to calculate the slope, intercept, rvalue, pvalue and stderr and have them in a dataframe (if possible added to the same dataframe, but this is not necessary).

I tried the following code:

import pandas as pd
import scipy.stats
import numpy as np
df = pd.read_excel("Directory\\file.xlsx")

def regr(row):
    r = scipy.stats.linregress(df.iloc[1:, :], row)
    return r

full_dataframe = None

for index,row in df.iterrows():
    x = regr(index)
   if full_dataframe is None: 
       full_dataframe = x.T
   else: 
       full_dataframe = full_dataframe.append([x.T])

full_dataframe.to_excel('Directory\\file.xlsx')

But this fails and gives the following error:

ValueError: all the input array dimensions except for the concatenation axis 
must match exactly

I'm really lost in here.

So, I want to achieve that I have the slope, intercept, pvalue, rvalue and stderr per row, starting from the second one, because the first row is the x-variable.

Anyone has an idea HOW to do this and tell me WHY mine isn't working and WHAT the code should look like?

Thanks!!


回答1:


Guessing the issue

Most likely, your problem is the format of your numbers, there are Unicode String dtype('<U21') instead of being Integer or Float.

Always check types:

df.dtypes

Cast your dataframe using:

df = df.astype(np.float64)

Below a small example showing the issue:

import numpy as np
import pandas as pd

# DataFrame without numbers (will not work for Math):
df = pd.DataFrame(['1', '2', '3'])
df.dtypes # object: placeholder for everything that is not number or timestamps (string, etc...)

# Casting DataFrame to make it suitable for Math Operations:
df = df.astype(np.float64) 
df.dtypes # float64

But it is difficult to be sure of this without having the original file or data you are working with.

Carefully read the Exception

This is coherent with the Exception you get:

TypeError: ufunc 'add' did not contain a loop with signature matching types 
dtype('<U21') dtype('<U21') dtype('<U21')

The method scipy.stats.linregress raises a TypeError (so it is about type) and is telling you than it cannot perform add operation because adding String dtype('<U21') does not make any sense in the context of a Linear Regression.

Understand the Design

Loading the data:

import io

fh = io.StringIO("""1   2   3   4   5   6   7   8   9   10  11  12  13  14  15  16  17  18  19  20  21  22  23  24
100 10  61  55  29  77  61  42  70  73  98  62  25  86  49  68  68  26  35  62  100 56  10  97
57  89  25  89  48  56  67  17  98  10  25  90  17  52  85  56  18  20  74  97  82  63  45  87
192 371 47  173 202 144 17  147 174 483 170 422 285 13  77  116 500 136 276 392 220 121 441 268""")

df = pd.read_fwf(fh).astype(np.float)

Then we can regress the second row vs the first:

scipy.stats.linregress(df.iloc[0,:].values, df.iloc[1,:].values)

It returns:

LinregressResult(slope=0.12419744768547877, intercept=49.60998434527584, rvalue=0.11461693561751324, pvalue=0.5938303095361301, stderr=0.22949908667668056)

Assembling all together:

result = pd.DataFrame(columns=["slope", "intercept", "rvalue"])
for i, row in df.iterrows():
    fit = scipy.stats.linregress(df.iloc[0,:], row)
    result.loc[i] = (fit.slope, fit.intercept, fit.rvalue)

Returns:

      slope   intercept    rvalue
0  1.000000    0.000000  1.000000
1  0.124197   49.609984  0.114617
2 -1.095801  289.293224 -0.205150

Which is, as far as I understand your question, what you expected.

The second exception you get comes because of this line:

x = regr(index)

You sent the index of the row instead of the row itself to the regression method.



来源:https://stackoverflow.com/questions/53788058/linear-regressionvalueerror-all-the-input-array-dimensions-except-for-the-conc

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!