Question
Situation:
Let's consider a massive retail network (hundreds of products and thousands of stores), simplified as follows:
Store 1, Store 2
Product A, Product B, Product C
I am trying to identify anomalies in sales numbers to know which stores do very well and which do very badly.
My first idea was to calculate the mean and standard deviation of sales and classify as anomalies everything outside the bounds of 3 standard deviations (~0.3% of cases in a normal distribution).
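In code, that rule is just a z-score check per series; a minimal sketch (the dummy data stands in for one store/product combination's sales):

import numpy as np
import pandas as pd

# dummy data standing in for one store/product combination's sales
sales = pd.Series(np.random.default_rng(1).normal(50, 5, 100))
z = (sales - sales.mean()) / sales.std()
anomalies = sales[z.abs() > 3]  # flag everything outside 3 standard deviations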
However, when visually checking the distributions, I noticed they are not normally distributed. Considering the many possible store/product combinations, I cannot check and transform each of them manually.
Approach:
The procedure would be, in Python, to first test the sales distribution of each store/product combination for normality, using D'Agostino and Pearson's test. I read that the Shapiro-Wilk test works better on smaller datasets, so I ruled it out.
We then filter the dataframe to keep only the combinations that failed the normality test and transform them.
- Either we can use information such as skewness and kurtosis to automatically decide which transformation to apply (log, sqrt, etc.)
- Or apply different transformations and re-test for normality to see which ones worked (a sketch of this follows below).
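For the second option, I picture something like this minimal sketch (the candidate list and the assumption that sales are strictly positive are mine):

import numpy as np
from scipy.stats import normaltest

def best_transform(values, alpha=0.05):
    """Try candidate transformations and keep the first one that
    passes the D'Agostino-Pearson normality test."""
    candidates = {
        'identity': lambda x: x,
        'log': np.log,    # requires strictly positive values
        'sqrt': np.sqrt,  # requires non-negative values
    }
    for name, fn in candidates.items():
        stat, p = normaltest(fn(values))
        if p > alpha:  # fail to reject H0: looks Gaussian
            return name, fn(values)
    return None, values  # nothing worked; keep the raw values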
Once all the combinations are normally distributed, I can proceed with my analysis, classifying everything outside 3 standard deviations as anomalies.
At the moment, I have found some code to run the normality test:
# D'Agostino and Pearson's test
from numpy.random import seed
from numpy.random import randn
from scipy.stats import normaltest

# seed the random number generator for reproducibility
seed(1)
# generate univariate observations (mean 50, standard deviation 5)
data = 5 * randn(100) + 50
# run the normality test
stat, p = normaltest(data)
print('Statistics=%.3f, p=%.3f' % (stat, p))
# interpret the p-value
alpha = 0.05
if p > alpha:
    print('Sample looks Gaussian (fail to reject H0)')
else:
    print('Sample does not look Gaussian (reject H0)')
However, I don't know how to run it on a pandas dataframe, automatically covering the thousands of store/product combinations.
I have a hunch that I could create a dataframe of all unique combinations using itertools, then use this reference dataframe as a filter to run the test across the whole sales dataframe in a single function, but to be honest, this is way above my current skill.
from itertools import product
import pandas as pd
Store, Product = [1, 2], ['A', 'B', 'C']
pd.DataFrame(list(product(Store, Product)), columns=['Store', 'Product'])
Store Product
0 1 A
1 1 B
2 1 C
3 2 A
4 2 B
5 2 C
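If that idea is viable, I imagine the loop would look roughly like this (a sketch, assuming the sales dataframe df has columns 'Store', 'Product', and 'Sales'):

from itertools import product
import pandas as pd
from scipy.stats import normaltest

combos = pd.DataFrame(list(product([1, 2], ['A', 'B', 'C'])),
                      columns=['Store', 'Product'])
results = []
for _, row in combos.iterrows():
    # select the sales of one store/product combination
    sales = df.loc[(df['Store'] == row['Store']) &
                   (df['Product'] == row['Product']), 'Sales']
    stat, p = normaltest(sales)
    results.append({'Store': row['Store'], 'Product': row['Product'],
                    'stat': stat, 'p': p})
results = pd.DataFrame(results)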
Could I ask for ideas or leads to complete this task, please? Maybe I'm even trying to re-invent the wheel and something similar already exists, but I haven't found anything. What do you think about the overall approach?
In any case, I'd be glad to read your input.
Thank you.
UPDATE:
from scipy.stats import normaltest
import numpy as np
import pandas as pd

alpha = 0.05
# run the test once per store/product sales group
test = df.groupby(['Store', 'Product'])['Sales'].apply(normaltest)
# split the (statistic, p-value) results into 'stat' and 'p' columns
normality = test.apply(pd.Series, index=['stat', 'p'])
normality['normal'] = np.where(normality['p'] > alpha, 1, 0)
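From there, I assume the remaining steps would be to transform the groups that failed and flag anomalies within each group, along these lines (a sketch; log is just one possible transformation, and strictly positive sales are assumed):

import numpy as np

# attach the per-group normality flag back onto the sales rows
df = df.merge(normality['normal'].reset_index(), on=['Store', 'Product'])
# log-transform only the groups that failed the test (assumes Sales > 0)
df['value'] = np.where(df['normal'] == 1, df['Sales'], np.log(df['Sales']))
# classify everything outside 3 standard deviations within its own group
grouped = df.groupby(['Store', 'Product'])['value']
z = (df['value'] - grouped.transform('mean')) / grouped.transform('std')
df['anomaly'] = z.abs() > 3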
Source: https://stackoverflow.com/questions/58805644/multiple-distribution-normality-testing-and-transformation-in-pandas-dataframe