How can I get descriptive statistics of a NumPy array?

臣服心动 2021-01-01 09:50

I use the following code to create a NumPy ndarray. The file has 9 columns, and I explicitly specify a type for each column:

dataset = np.genfromtxt("data.csv", delimiter="

3 Answers
  • 2021-01-01 10:16

    The question of how to deal with mixed data from genfromtxt comes up often. People expect a 2D array and instead get a 1D array that they can't index by column. That's because genfromtxt returns a structured array, with a different dtype for each column.

    All the examples in the genfromtxt doc show this:

    >>> s = StringIO("1,1.3,abcde")
    >>> data = np.genfromtxt(s, dtype=[('myint','i8'),('myfloat','f8'),
    ... ('mystring','S5')], delimiter=",")
    >>> data
    array((1, 1.3, 'abcde'),
          dtype=[('myint', '<i8'), ('myfloat', '<f8'), ('mystring', '|S5')])
    

    But let me demonstrate how to access this kind of data:

    In [361]: txt=b"""A, 1,2,3
         ...: B,4,5,6
         ...: """
    In [362]: data=np.genfromtxt(txt.splitlines(),delimiter=',',dtype=('S1,int,float,int'))
    In [363]: data
    Out[363]: 
    array([(b'A', 1, 2.0, 3), (b'B', 4, 5.0, 6)], 
          dtype=[('f0', 'S1'), ('f1', '<i4'), ('f2', '<f8'), ('f3', '<i4')])
    

    So my array has 2 records (check the shape: it is (2,)), which are displayed as a list of tuples.
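As a minimal sketch of that shape check, the snippet below reloads the same two-line sample and prints the shape and field names. It assumes explicit 'i8'/'f8' codes in place of 'int'/'float' so the result does not depend on the platform's default integer size.

```python
import numpy as np

# Same two-record sample as in the session above.
txt = b"""A, 1,2,3
B,4,5,6
"""

# Explicit type codes (an assumption made here for portability).
data = np.genfromtxt(txt.splitlines(), delimiter=',', dtype='S1,i8,f8,i8')

print(data.shape)        # (2,)  -> one axis, two records
print(data.dtype.names)  # ('f0', 'f1', 'f2', 'f3')
print(data[0])           # the first record, printed like a tuple
```

Because the array has only one axis, an expression such as data[:, 1] raises an IndexError; columns have to be selected by field name.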

    You access fields by name, not by column number (do I need to add a structured array documentation link?)

    In [364]: data['f0']
    Out[364]: 
    array([b'A', b'B'], 
          dtype='|S1')
    In [365]: data['f1']
    Out[365]: array([1, 4])
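Since each field is an ordinary 1D array, one possible way to get per-column statistics is to loop over the field names and skip the non-numeric ones. This is only a sketch: the array is built by hand instead of being read from a file, and the summary dict is a name chosen here for illustration.

```python
import numpy as np

# Hand-built structured array with the same layout as the session above.
data = np.array(
    [(b'A', 1, 2.0, 3), (b'B', 4, 5.0, 6)],
    dtype=[('f0', 'S1'), ('f1', 'i8'), ('f2', 'f8'), ('f3', 'i8')],
)

summary = {}  # hypothetical container: field name -> (min, max, mean, sample std)
for name in data.dtype.names:
    col = data[name]
    # Only numeric fields can be summarised; the S1 label field is skipped.
    if np.issubdtype(col.dtype, np.number):
        summary[name] = (col.min(), col.max(), col.mean(), col.std(ddof=1))

for name, values in summary.items():
    print(name, values)
```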
    

    In a case like this it might be more useful to choose a dtype with 'subarrays'. This is a more advanced dtype topic:

    In [367]: data=np.genfromtxt(txt.splitlines(),delimiter=',',dtype=('S1,(3)float'))
    In [368]: data
    Out[368]: 
    array([(b'A', [1.0, 2.0, 3.0]), (b'B', [4.0, 5.0, 6.0])], 
          dtype=[('f0', 'S1'), ('f1', '<f8', (3,))])
    In [369]: data['f1']
    Out[369]: 
    array([[ 1.,  2.,  3.],
           [ 4.,  5.,  6.]])
    

    The character column is still loaded as S1, but the numbers are now in a 3-column array. Note that they must all share one dtype: all float (or all int).

    In [371]: from scipy import stats
    In [372]: stats.describe(data['f1'])
    Out[372]: DescribeResult(nobs=2, 
       minmax=(array([ 1.,  2.,  3.]), array([ 4.,  5.,  6.])),
       mean=array([ 2.5,  3.5,  4.5]), 
       variance=array([ 4.5,  4.5,  4.5]), 
       skewness=array([ 0.,  0.,  0.]), 
       kurtosis=array([-2., -2., -2.]))
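The figures reported by stats.describe can be cross-checked with plain NumPy. The sketch below rebuilds the 2x3 numeric block by hand and assumes that describe works column-wise (axis=0) with the sample variance (ddof=1), which matches the output shown above.

```python
import numpy as np

# The numeric subarray from the session above, rebuilt by hand.
arr = np.array([[1.0, 2.0, 3.0],
                [4.0, 5.0, 6.0]])

col_min = arr.min(axis=0)
col_max = arr.max(axis=0)
col_mean = arr.mean(axis=0)
col_var = arr.var(axis=0, ddof=1)  # sample variance, divides by n - 1

print(col_min, col_max)  # [1. 2. 3.] [4. 5. 6.]
print(col_mean)          # [2.5 3.5 4.5]
print(col_var)           # [4.5 4.5 4.5]
```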
    
  • 2021-01-01 10:28
    import pandas as pd
    import numpy as np
    
    df_describe = pd.DataFrame(dataset)
    df_describe.describe()
    

    Please note that dataset is the NumPy array you want to describe. In general:

    import pandas as pd
    import numpy as np
    
    # your_array is a placeholder for the NumPy array to describe
    df_describe = pd.DataFrame(your_array)
    df_describe.describe()
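As a hedged illustration of how this behaves with a structured array like the one in the question, the sketch below uses a small hand-built dataset (the values and field names are made up for the example). pandas turns each field into a column, and describe() summarises only the numeric columns by default.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the dataset loaded with genfromtxt.
dataset = np.array(
    [(b'A', 1, 2.0, 3), (b'B', 4, 5.0, 6)],
    dtype=[('f0', 'S1'), ('f1', 'i8'), ('f2', 'f8'), ('f3', 'i8')],
)

df = pd.DataFrame(dataset)  # one column per field: f0, f1, f2, f3
summary = df.describe()     # the non-numeric f0 column is left out by default

print(summary)
```

Passing include='all' to describe() would also report count, unique, top and freq for the label column.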
    
  • 2021-01-01 10:32

    This is not a pretty solution, but it gets the job done. The problem is that by specifying multiple dtypes, you are essentially making a 1D array of tuples (actually np.void records), which cannot be described by stats because it mixes several different types, including strings.

    This could be resolved by either reading it in two rounds, or using pandas with read_csv.

    If you decide to stick to numpy:

    import numpy as np
    from scipy import stats

    # Numeric columns 1-8 as floats; unpack=True returns one array per column
    a = np.genfromtxt('sample.txt', delimiter=",", unpack=True, usecols=range(1, 9))
    # Column 0 holds the single-character labels
    s = np.genfromtxt('sample.txt', delimiter=",", unpack=True, usecols=0, dtype='|S1')

    for arr in a:  # the loop is not strictly needed here, but the output looks prettier
        print(stats.describe(arr))
    # Output per print:
    # DescribeResult(nobs=6, minmax=(0.34999999999999998, 0.70999999999999996), mean=0.54500000000000004, variance=0.016599999999999997, skewness=-0.3049304880932534, kurtosis=-0.9943046886340534)
    

    Note that in this example the resulting arrays have dtype float, not int, but they can easily be converted to int (if necessary) using arr.astype(int).
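The pandas route mentioned above (read_csv) could look roughly like the sketch below. The inline sample data and the column names label, x and y are assumptions made for the example; with a real file, the path would be passed to read_csv in place of the StringIO object.

```python
import io
import pandas as pd

# Made-up sample standing in for the CSV file: one label column, two numeric columns.
sample = io.StringIO("A,1,2.5\nB,4,5.5\nC,7,8.5\n")

# header=None because the sample has no header row; the names are hypothetical.
df = pd.read_csv(sample, header=None, names=['label', 'x', 'y'])

numeric_summary = df.describe()           # numeric columns only
full_summary = df.describe(include='all') # also summarises the label column

print(numeric_summary)
print(full_summary)
```

read_csv infers a dtype per column, so the mixed string and numeric columns do not need to be read in two passes.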
