How can I get the statistics of all columns including those with a nested structure of numerical values in a dataframe, list or array?

走远了吗. Submitted on 2020-07-10 10:23:26

Question


What is the best way to get simple descriptive statistics for any column in a dataframe (or a list or array), whether nested or not? I am looking for a sort of advanced df.describe() that also covers nested structures with numerical values.

In my case, I have a dataframe with many columns. Some columns hold a list of numbers in each row (in my case a time series), which is a nested structure. It does not have to be a dataframe; other structures are part of the question too, since converting between them is quick.

I mean nested structures like

  • list of arrays,
  • array of arrays,
  • series of lists,
  • dataframe with nested lists of numerical values in some columns (my case)

for which you need simple descriptive statistics.

Asking for

df.describe() 

gives me only the statistics of the numerical columns, not those of the columns that hold a list of numerical values. I cannot get the statistics simply by applying

from scipy import stats
stats.describe(arr)

either, since that is the solution from "How can I get descriptive statistics of a NumPy array?", which covers only a non-nested array.
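To make the limitation concrete, here is a minimal sketch with a small made-up dataframe. The column names `value` and `series` and the numbers are assumptions for illustration only, not data from the question. It shows that `df.describe()` summarizes the numeric column and skips the column holding lists.

```python
import pandas as pd

# Hypothetical example data: "value" is numeric, "series" holds one list per row.
df = pd.DataFrame({
    "value": [1.0, 2.0, 3.0],
    "series": [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0, 7.0], [8.0, 9.0, 10.0]],
})

# By default, describe() on a mixed dataframe reports only the numeric columns.
summary = df.describe()
print(list(summary.columns))
```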


Answer 1:


My first approach is to compute the statistics of each numerical list first, and then take the statistics of those results; the mean of the means or the mean of the variances, for example, also gives some information. Here I first extract a specific column that holds a nested list of numerical values as a series of lists. Nested arrays or lists might need a small adjustment; I have not tested that.

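NESTEDSTRUCTURE = df['nestedColumn']

# Describe each inner list once, then each of the 6 DescribeResult fields across rows.
per_row = [stats.describe(x) for x in NESTEDSTRUCTURE]
[stats.describe([a[i] for a in per_row]) for i in range(6)]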

gives you the stats of the stats for a nested structure column. If you want the mean of all means of a column, you can use

stats.describe([a[2] for a in [stats.describe(x) for x in NESTEDSTRUCTURE]])

as position 2 stands for "mean" in

DescribeResult(nobs=, minmax=(, ), mean=, variance=, skewness=, kurtosis=)
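The same stats-of-stats idea can be sketched with named fields instead of positional indices, which may be easier to read. The helper name `describe_nested` and the sample data are hypothetical; the sketch assumes every row holds at least three numbers, so that variance, skewness and kurtosis are defined.

```python
import numpy as np
import pandas as pd
from scipy import stats


def describe_nested(column):
    """Hypothetical helper: describe each inner sequence, then describe those results."""
    rows = []
    for seq in column:
        r = stats.describe(np.asarray(seq, dtype=float))
        rows.append({
            "nobs": r.nobs,
            "min": r.minmax[0],
            "max": r.minmax[1],
            "mean": r.mean,
            "variance": r.variance,
            "skewness": r.skewness,
            "kurtosis": r.kurtosis,
        })
    # One row per inner sequence; describe() then summarizes each statistic across rows.
    return pd.DataFrame(rows).describe()


# Made-up sample data standing in for df['nestedColumn'].
nested = pd.Series([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0, 7.0], [8.0, 9.0, 10.0]])
result = describe_nested(nested)
print(result.loc["mean", "mean"])  # mean of the per-row means
```

Here `result.loc["mean", "variance"]` would be the mean of the per-row variances, and so on for the other fields.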

I expect there is a better descriptive statistics approach that automatically understands nested structures with numerical values; this is just a workaround.
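One possible alternative, not part of the approach above, is to pool all nested values of a column and describe them together. This answers a different question (statistics over all values rather than statistics of per-row statistics), so it is only a sketch under that assumption. It relies on pandas' `Series.explode()`; the sample data is made up.

```python
import pandas as pd

# Made-up sample data standing in for a nested column.
nested = pd.Series([[1.0, 2.0, 3.0], [4.0, 5.0]])

# explode() puts every list element on its own row (object dtype), so cast before describing.
pooled = nested.explode().astype(float).describe()
print(pooled)
```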



Source: https://stackoverflow.com/questions/62385252/how-can-i-get-the-statistics-of-all-columns-including-those-with-a-nested-struct
