Question
What is the best way to get simple descriptive statistics for any column in a dataframe (or list or array), whether it is nested or not? I am looking for a sort of advanced df.describe() that also covers nested structures with numerical values.
In my case, I have a dataframe with many columns. Some columns hold a numerical list in each row (in my case a time series), which is a nested structure. It does not have to be a dataframe; other structures are also part of the question, since converting between them is fast.
I mean nested structures like
- list of arrays,
- array of arrays,
- series of lists,
- dataframe with nested lists of numerical values in some columns (my case)
of which you need to get simple descriptive statistics.
Asking for
df.describe()
will give me just the statistics of the numerical columns, but not those of the columns that include a list with numerical values. I cannot get the statistics just by applying
from scipy import stats
stats.describe(arr)
either, since that is the solution given in "How can I get descriptive statistics of a NumPy array?" for a non-nested array.
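To make the limitation concrete, here is a minimal sketch on made-up data. The dataframe, the column name "value", and the list contents are assumptions for illustration only; "nestedColumn" mirrors the column name used in the answer.

```python
import pandas as pd

# Hypothetical toy data: "value" is a plain numeric column,
# "nestedColumn" holds a list of numbers in every row.
df = pd.DataFrame({
    "value": [1.0, 2.0, 3.0],
    "nestedColumn": [[1.0, 2.0, 3.0], [4.0, 5.0], [6.0, 7.0, 8.0, 9.0]],
})

# With default arguments, describe() summarises only the numeric columns
# of a mixed-type dataframe, so the list-valued column is left out.
summary = df.describe()
print(summary.columns.tolist())
```

The list-valued column is stored with object dtype, which is why the default describe() call skips it.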
Answer 1:
My first approach is to compute the statistics of each numerical list first and then take the statistics of those results. The mean of the means or the mean of the variances, for example, then gives some information as well. Here I first extract a specific column that holds a nested list of numerical values as a series of lists. Nested arrays or lists might need a small adjustment (not tested).
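from scipy import stats
NESTEDSTRUCTURE = df['nestedColumn']
# positions 0..5 are the six DescribeResult fields: nobs, minmax, mean, variance, skewness, kurtosis
[stats.describe([a[x] for a in [stats.describe(x) for x in NESTEDSTRUCTURE]]) for x in range(6)]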
gives you the stats of the stats for a nested structure column. If you want the mean of all means of a column, you can use
stats.describe([a[2] for a in [stats.describe(x) for x in NESTEDSTRUCTURE]])
as position 2 stands for "mean" in
DescribeResult(nobs=, minmax=(, ), mean=, variance=, skewness=, kurtosis=)
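The same stats-of-stats idea can be sketched in a self-contained form that reads the DescribeResult fields by name instead of by position. The toy dataframe and the variable names per_row, inner, and mean_of_means are assumptions for illustration, not part of the original answer.

```python
import pandas as pd
from scipy import stats

# Hypothetical toy data: one list of numbers per row.
df = pd.DataFrame({
    "nestedColumn": [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0, 7.0], [8.0, 9.0, 10.0]],
})

# Inner pass: one DescribeResult per row's list.
per_row = [stats.describe(values) for values in df["nestedColumn"]]

# Collect the fields by name; minmax is a (min, max) tuple, so it is split.
inner = pd.DataFrame({
    "nobs": [r.nobs for r in per_row],
    "min": [r.minmax[0] for r in per_row],
    "max": [r.minmax[1] for r in per_row],
    "mean": [r.mean for r in per_row],
    "variance": [r.variance for r in per_row],
    "skewness": [r.skewness for r in per_row],
    "kurtosis": [r.kurtosis for r in per_row],
})

# Outer pass: statistics of the per-row means (the "mean of all means").
mean_of_means = stats.describe(inner["mean"].to_numpy()).mean
print(mean_of_means)
```

Putting the inner results into a dataframe also makes it possible to call inner.describe() for the outer pass, if pandas-style output is preferred.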
I expect that there is a better descriptive statistics approach that automatically understands nested structures with numerical values; this is just a workaround.
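One possible alternative, offered only as a sketch and not as the approach described above, is to pool all nested values of a column into one flat series and describe that. Note that this answers a different question: it weights every individual value equally, whereas the stats-of-stats approach weights every row equally. The toy data and the names pooled and pooled_summary are assumptions.

```python
import pandas as pd

# Hypothetical toy data: one list of numbers per row.
df = pd.DataFrame({
    "nestedColumn": [[1.0, 2.0, 3.0], [4.0, 5.0], [6.0, 7.0, 8.0, 9.0]],
})

# explode() turns each list element into its own row (object dtype),
# so the result is cast to float before describing it.
pooled = df["nestedColumn"].explode().astype(float)
pooled_summary = pooled.describe()
print(pooled_summary)
```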
Source: https://stackoverflow.com/questions/62385252/how-can-i-get-the-statistics-of-all-columns-including-those-with-a-nested-struct