Pandas 'describe' is not returning summary of all columns

野的像风 2020-12-23 16:16

I am running 'describe()' on a DataFrame and getting summaries of only the int columns (pandas 0.14.0).

The documentation says that for object columns frequency of most

6 answers
  • 2020-12-23 16:49

    Run df_test.info() to list the data types your DataFrame contains. If it contains only numeric columns, then df_test.describe() works fine, because by default describe() summarizes numeric values. If you want a summary of your object (string) features, use df_test.describe(include=['O']).

    Or, in short, use df_test.describe(include='all') to get a summary of all the feature columns when your DataFrame has columns of various data types.
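
A minimal sketch of the three calls above, side by side. The frame and its column names are made up for illustration, and the string column is given an explicit object dtype so that include=['O'] matches it.

```python
import pandas as pd

# Hypothetical frame: one object (string) column and one integer column.
df_test = pd.DataFrame({
    "city": pd.Series(["Oslo", "Rome", "Oslo"], dtype=object),
    "visits": [3, 5, 8],
})

# Inspect the column dtypes first (df_test.info() prints a similar overview).
print(df_test.dtypes)

numeric_only = df_test.describe()                # default: numeric columns only
strings_only = df_test.describe(include=["O"])   # object (string) columns only
everything = df_test.describe(include="all")     # every column

print(list(numeric_only.columns))
print(list(strings_only.columns))
print(list(everything.columns))
```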

  • 2020-12-23 16:53

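    Setting pd.options.display.max_columns = DATA.shape[1] will work.

    Here DATA is the two-dimensional data (the DataFrame), so DATA.shape[1] is its number of columns. With this setting, the describe() output shows every column, with the statistics listed vertically.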
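
As a rough illustration of that setting, the sketch below builds a hypothetical 30-column frame (the name DATA follows the answer, the column names are invented) and checks that no column is elided from the printed summary.

```python
import numpy as np
import pandas as pd

# Hypothetical wide frame: 3 rows, 30 numeric columns named c0..c29.
DATA = pd.DataFrame(np.zeros((3, 30)), columns=[f"c{i}" for i in range(30)])

# Allow as many columns as the frame has, so none are replaced by "...".
pd.options.display.max_columns = DATA.shape[1]

# repr() is what the console prints; it honors the display options.
shown = repr(DATA.describe())
print(shown)
```

This changes the option for the whole session; pd.reset_option('display.max_columns') restores the default.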

  • 2020-12-23 16:57

    By default, 'describe()' on a DataFrame with mixed column types summarizes only the numeric columns. If you think you have a numeric variable and it doesn't show up in 'describe()', change the type with:

    df[['col1', 'col2']] = df[['col1', 'col2']].astype(float)
    

    You could also create new columns to hold the numeric part of a mixed-type column, or convert strings to numbers using a dictionary and the map() function.

    'describe()' on a non-numerical Series will give you some statistics (like count, unique and the most frequently occurring value).
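
The points in this answer can be sketched as follows. The data, the column names, and the low/high mapping are all hypothetical; they only illustrate astype(float), map() with a dictionary, and describe() on a non-numeric Series.

```python
import pandas as pd

# Hypothetical data: numbers stored as strings, plus a text-coded rating.
df = pd.DataFrame({
    "col1": ["1.5", "2.5", "4.0"],
    "rating": pd.Series(["low", "high", "low"], dtype=object),
})

# While stored as strings, col1 is skipped by the default numeric summary.
df["col1"] = df["col1"].astype(float)

# Map text codes to numbers with a dictionary (the mapping is made up).
df["rating_num"] = df["rating"].map({"low": 0, "high": 1})

numeric_summary = df.describe()          # now covers col1 and rating_num
text_summary = df["rating"].describe()   # count, unique, top, freq

print(numeric_summary)
print(text_summary)
```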

  • 2020-12-23 17:00

    As of pandas v0.15.0, pass include='all', as in DataFrame.describe(include='all'), to get a summary of all the columns when the DataFrame has mixed column types. The default behavior is to summarize only the numerical columns.

    Example:

    In[1]:
    
    import numpy as np
    import pandas as pd

    df = pd.DataFrame({'$a':['a', 'b', 'c', 'd', 'a'], '$b': np.arange(5)})
    df.describe(include = 'all')
    
    Out[1]:
    
            $a    $b
    count   5   5.000000
    unique  4   NaN
    top     a   NaN
    freq    2   NaN
    mean    NaN 2.000000
    std     NaN 1.581139
    min     NaN 0.000000
    25%     NaN 1.000000
    50%     NaN 2.000000
    75%     NaN 3.000000
    max     NaN 4.000000
    

    The numerical columns will have NaNs for summary statistics pertaining to objects (strings) and vice versa.

    Summarizing only numerical or object columns

    1. To call describe() on just the numerical columns, use describe(include = [np.number]).
    2. To call describe() on just the object (string) columns, use describe(include = ['O']).

      In[2]:
      
      df.describe(include = [np.number])
      
      Out[2]:
      
               $b
      count   5.000000
      mean    2.000000
      std     1.581139
      min     0.000000
      25%     1.000000
      50%     2.000000
      75%     3.000000
      max     4.000000
      
      In[3]:
      
      df.describe(include = ['O'])
      
      Out[3]:
      
          $a
      count   5
      unique  4
      top     a
      freq    2
      
  • 2020-12-23 17:03

    In addition to DataFrame.describe(include = 'all') one can also use Series.value_counts() for each categorical column:

    In[1]:
    
    import numpy as np
    import pandas as pd

    df = pd.DataFrame({'$a':['a', 'b', 'c', 'd', 'a'], '$b': np.arange(5)})
    df['$a'].value_counts()
    
    Out[1]:
    $a
    a    2
    d    1
    b    1
    c    1
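
To apply this to every categorical column rather than a single one, one possible sketch is to pick the object columns with select_dtypes and loop over them. The dictionary name counts is an arbitrary choice, and the string column is given an explicit object dtype here so the selection matches it.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "$a": pd.Series(["a", "b", "c", "d", "a"], dtype=object),
    "$b": np.arange(5),
})

# Collect the value counts of every object (string) column.
counts = {
    name: df[name].value_counts()
    for name in df.select_dtypes(include=["object"]).columns
}

for name, freq in counts.items():
    print(name)
    print(freq)
```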
    
  • 2020-12-23 17:04

    In addition to the data type issues discussed in the other answers, you might also have too many columns to display. If there are too many columns, the middle columns will be replaced with an ellipsis (...).

    Other answers have pointed out that the include='all' parameter of describe can help with the data type issue. Another question asked, "How do I expand the output display to see more columns?" The solution is to modify the display.max_columns setting, which can even be done temporarily. For example, to display up to 40 columns of output from a single describe statement:

    with pd.option_context('display.max_columns', 40):
        print(df.describe(include='all'))
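
A small check that the change really is temporary: the sketch below reads the option before, inside, and after the with block. The variable names are illustrative only.

```python
import pandas as pd

before = pd.get_option("display.max_columns")

with pd.option_context("display.max_columns", 40):
    # The override applies only inside this block.
    inside = pd.get_option("display.max_columns")

# On exit, the previous value is restored automatically.
after = pd.get_option("display.max_columns")

print(before, inside, after)
```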
    