Question
I am looking for a solution to the following problem, and it just won't work the way I want it to.
So my goal is to calculate a regression analysis and get the slope, intercept, rvalue, pvalue and stderr for multiple rows (this could go up to 10000). In this example, I have a file with 15 rows. Here are the first two rows:
array([
[ 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11,
12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22,
23, 24],
[ 100, 10, 61, 55, 29, 77, 61, 42, 70, 73, 98,
62, 25, 86, 49, 68, 68, 26, 35, 62, 100, 56,
10, 97]]
)
Full trial data set:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24
100 10 61 55 29 77 61 42 70 73 98 62 25 86 49 68 68 26 35 62 100 56 10 97
57 89 25 89 48 56 67 17 98 10 25 90 17 52 85 56 18 20 74 97 82 63 45 87
192 371 47 173 202 144 17 147 174 483 170 422 285 13 77 116 500 136 276 392 220 121 441 268
The first row is the x-variable and this is the independent variable. This has to be kept fixed while iterating over every following row.
For each following row (the y-variable, i.e. the dependent variable), I want to calculate the slope, intercept, rvalue, pvalue and stderr and have them in a dataframe (if possible added to the same dataframe, but this is not necessary).
I tried the following code:
import pandas as pd
import scipy.stats
import numpy as np
df = pd.read_excel("Directory\\file.xlsx")
def regr(row):
    r = scipy.stats.linregress(df.iloc[1:, :], row)
    return r
full_dataframe = None
for index,row in df.iterrows():
    x = regr(index)
    if full_dataframe is None:
        full_dataframe = x.T
    else:
        full_dataframe = full_dataframe.append([x.T])
full_dataframe.to_excel('Directory\\file.xlsx')
But this fails and gives the following error:
ValueError: all the input array dimensions except for the concatenation axis must match exactly
I'm really lost in here.
So, I want to achieve that I have the slope, intercept, pvalue, rvalue and stderr per row, starting from the second one, because the first row is the x-variable.
Does anyone have an idea HOW to do this, and can you tell me WHY mine isn't working and WHAT the code should look like?
Thanks!!
Answer 1:
Guessing the issue
Most likely, your problem is the format of your numbers: they are Unicode strings (dtype('<U21')) instead of integers or floats.
Always check types:
df.dtypes
Cast your dataframe using:
df = df.astype(np.float64)
Below is a small example showing the issue:
import numpy as np
import pandas as pd
# DataFrame without numbers (will not work for Math):
df = pd.DataFrame(['1', '2', '3'])
df.dtypes # object: placeholder for everything that is not number or timestamps (string, etc...)
# Casting DataFrame to make it suitable for Math Operations:
df = df.astype(np.float64)
df.dtypes # float64
But it is difficult to be sure of this without having the original file or data you are working with.
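If a plain astype fails because a few cells contain text, one possible way to locate them is pd.to_numeric with errors="coerce". The following is only a sketch on made-up data: the column names "a" and "b" and the "n/a" cell are hypothetical and do not come from the original file.

```python
import pandas as pd

# Hypothetical frame: numbers stored as strings, with one stray non-numeric cell.
df = pd.DataFrame({"a": ["1", "2", "3"], "b": ["4", "n/a", "6"]})

# Convert column by column; values that cannot be parsed become NaN.
numeric = df.apply(pd.to_numeric, errors="coerce")

# Cells that were present in the input but could not be parsed as numbers.
bad_cells = numeric.isna() & df.notna()

print(numeric.dtypes)
print(df[bad_cells.any(axis=1)])  # rows that contain at least one bad cell
```

Once the offending cells are known, they can be fixed in the source file or dropped before running the regression.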
Carefully read the Exception
This is consistent with the exception you get:
TypeError: ufunc 'add' did not contain a loop with signature matching types
dtype('<U21') dtype('<U21') dtype('<U21')
The method scipy.stats.linregress raises a TypeError (so the problem is about types): it is telling you that it cannot perform the add operation, because adding strings (dtype('<U21')) does not make any sense in the context of a linear regression.
Understand the Design
Loading the data:
import io
import scipy.stats
fh = io.StringIO("""1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24
100 10 61 55 29 77 61 42 70 73 98 62 25 86 49 68 68 26 35 62 100 56 10 97
57 89 25 89 48 56 67 17 98 10 25 90 17 52 85 56 18 20 74 97 82 63 45 87
192 371 47 173 202 144 17 147 174 483 170 422 285 13 77 116 500 136 276 392 220 121 441 268""")
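# The values are whitespace-separated; the first line is consumed as the header.
# np.float was removed in NumPy 1.24, so cast with np.float64 instead.
df = pd.read_csv(fh, sep=r"\s+").astype(np.float64)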
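Then we can regress the second row vs the first. Note that the first line of the data was read as the header, so df.iloc[0] here is the row starting with 100 and df.iloc[1] is the row starting with 57: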
scipy.stats.linregress(df.iloc[0,:].values, df.iloc[1,:].values)
It returns:
LinregressResult(slope=0.12419744768547877, intercept=49.60998434527584, rvalue=0.11461693561751324, pvalue=0.5938303095361301, stderr=0.22949908667668056)
Assembling all together:
result = pd.DataFrame(columns=["slope", "intercept", "rvalue"])
for i, row in df.iterrows():
    fit = scipy.stats.linregress(df.iloc[0,:], row)
    result.loc[i] = (fit.slope, fit.intercept, fit.rvalue)
Returns:
slope intercept rvalue
0 1.000000 0.000000 1.000000
1 0.124197 49.609984 0.114617
2 -1.095801 289.293224 -0.205150
Which is, as far as I understand your question, what you expected.
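The second exception you get is caused by this line:
x = regr(index)
You passed the index of the row instead of the row itself to the regression function. In addition, regr uses df.iloc[1:, :] (every row except the first) as x, whereas the fixed x-row is df.iloc[0, :].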
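Putting the pieces together, here is a minimal sketch of what the question seems to ask for: the first row is kept as the fixed x, and all five statistics are collected for each following row. It assumes the data is loaded with header=None so that the 1..24 row stays in the frame. The inline data stands in for the Excel file, and the names raw, fields and records are only illustrative.

```python
import io

import pandas as pd
import scipy.stats

# The four rows quoted in the question; the first row is x, the others are y.
raw = """1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24
100 10 61 55 29 77 61 42 70 73 98 62 25 86 49 68 68 26 35 62 100 56 10 97
57 89 25 89 48 56 67 17 98 10 25 90 17 52 85 56 18 20 74 97 82 63 45 87
192 371 47 173 202 144 17 147 174 483 170 422 285 13 77 116 500 136 276 392 220 121 441 268"""

# header=None keeps the 1..24 row as data instead of using it as column names.
df = pd.read_csv(io.StringIO(raw), sep=r"\s+", header=None).astype("float64")

x = df.iloc[0].to_numpy()  # fixed independent variable
fields = ["slope", "intercept", "rvalue", "pvalue", "stderr"]

records = {}
for label, row in df.iloc[1:].iterrows():  # skip the x row
    # Pass the row's values (not its index) as the dependent variable.
    fit = scipy.stats.linregress(x, row.to_numpy())
    records[label] = [getattr(fit, name) for name in fields]

# One line per y-row, one column per statistic.
result = pd.DataFrame.from_dict(records, orient="index", columns=fields)
print(result)
```

For the real file, pd.read_excel with header=None would presumably play the same role as read_csv does here, and result.to_excel could write the table out. With around 10000 rows the loop should still be workable, although it has not been timed here.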
Source: https://stackoverflow.com/questions/53788058/linear-regressionvalueerror-all-the-input-array-dimensions-except-for-the-conc