How to get boxplot data for matplotlib boxplots

灰色年华 2021-02-05 16:48

I need to get the statistics that are computed to draw a box plot in pandas (the boxplots are created from a DataFrame), i.e. quartile 1, quartile 2 (the median), quartile 3, and the lower and upper whisker values.

2 Answers
  • 2021-02-05 17:25

    One option is to read the y data from the plotted artists, which is probably most useful for the outliers (fliers):

    # return_type='both' returns the Axes plus a dict of matplotlib line artists
    _, bp = df.boxplot(return_type='both')
    
    # each artist's y data holds the values it was drawn at
    outliers = [flier.get_ydata() for flier in bp["fliers"]]
    boxes = [box.get_ydata() for box in bp["boxes"]]
    medians = [median.get_ydata() for median in bp["medians"]]
    whiskers = [whisker.get_ydata() for whisker in bp["whiskers"]]
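
The raw y arrays above still need unpacking to get one number per statistic. A minimal sketch of that step follows. It assumes the default (non-notched) box, where each box line spans q1 to q3, and assumes matplotlib emits two whisker lines per column, lower first. The helper name summarize_artists and the sample data are hypothetical.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import numpy as np
import pandas as pd

# hypothetical sample data
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.random((100, 3)), columns=["A", "B", "C"])

_, bp = df.boxplot(return_type="both")

def summarize_artists(bp, columns):
    """Collapse the line artists' y data into one row of statistics per column."""
    rows = {}
    for i, col in enumerate(columns):
        box_y = bp["boxes"][i].get_ydata()             # spans q1..q3
        lower = bp["whiskers"][2 * i].get_ydata()      # runs from q1 down to the whisker end
        upper = bp["whiskers"][2 * i + 1].get_ydata()  # runs from q3 up to the whisker end
        rows[col] = {
            "q1": float(np.min(box_y)),
            "median": float(bp["medians"][i].get_ydata()[0]),
            "q3": float(np.max(box_y)),
            "lower_whisker": float(np.min(lower)),
            "upper_whisker": float(np.max(upper)),
        }
    return pd.DataFrame(rows).T

artist_stats = summarize_artists(bp, df.columns)
print(artist_stats)
```

This depends on how the artists happen to be laid out, so it is probably best treated as a cross-check rather than the primary source of the numbers.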
    

    But it's probably more straightforward to compute the other values (the quartiles, and from them the IQR) using either

    quantiles = df.quantile([0.01, 0.25, 0.5, 0.75, 0.99])
    

    or, as suggested by WoodChopper

    stats = df.describe()
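
Neither quantile nor describe returns the whisker positions directly. A rough sketch of deriving them from the quartiles is shown below, assuming the common 1.5 × IQR rule that matplotlib uses by default: each whisker ends at the most extreme data point still inside the fence. The helper name box_summary and the demo data are hypothetical.

```python
import pandas as pd

def box_summary(df, whis=1.5):
    """Quartiles, IQR and whisker ends per numeric column (1.5 * IQR rule by default)."""
    rows = {}
    for col in df.select_dtypes("number"):
        x = df[col].dropna()
        q1, med, q3 = x.quantile([0.25, 0.5, 0.75])
        iqr = q3 - q1
        lo_fence, hi_fence = q1 - whis * iqr, q3 + whis * iqr
        inside = x[(x >= lo_fence) & (x <= hi_fence)]  # points that are not outliers
        rows[col] = {
            "q1": q1, "median": med, "q3": q3, "iqr": iqr,
            "lower_whisker": inside.min(),  # lowest point within the fence
            "upper_whisker": inside.max(),  # highest point within the fence
            "n_outliers": len(x) - len(inside),
        }
    return pd.DataFrame(rows).T

# hypothetical data with one obvious outlier
demo = pd.DataFrame({"x": [1, 2, 3, 4, 5, 6, 7, 8, 9, 100]})
summary = box_summary(demo)
print(summary)
```

For the demo column, the value 100 lies above the upper fence, so the upper whisker stops at 9 and one outlier is counted.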
    
  • 2021-02-05 17:35
    • To get the boxplot data, use matplotlib.cbook.boxplot_stats, which returns a list of dictionaries holding the statistics used to draw a series of box-and-whisker plots with matplotlib.axes.Axes.bxp.
      • To get the boxplot statistics, pass an array to boxplot_stats.
        • This is not specific to pandas.
    • The default plotting backend for pandas is matplotlib, so boxplot_stats returns the same metrics that pandas.DataFrame.plot.box draws.
    • Pass the numeric columns of interest to boxplot_stats as an array, using df.values.
    import pandas as pd
    import matplotlib.pyplot as plt
    from matplotlib.cbook import boxplot_stats
    import numpy as np
    
    # test dataframe
    np.random.seed(346)
    df = pd.DataFrame(np.random.rand(100, 5), columns=['A', 'B', 'C', 'D', 'E'])
    
    # plot the dataframe as needed
    ax = df.plot.box(figsize=(8, 6), showmeans=True)
    ax.grid()
    

    • Extract the boxplot metrics by passing an array to boxplot_stats.
      • df.values is a numpy.ndarray.
    • The dicts are in the same order as the columns of df.
    • This data has no outliers (fliers) because it was drawn from a uniform distribution with numpy.random.rand, so no value falls more than 1.5 × IQR beyond the quartiles.
    # get stats
    stats = boxplot_stats(df.values)
    
    print(stats)
    [out]:
    [{'cihi': 0.6008396701195271,
      'cilo': 0.45316512285356997,
      'fliers': array([], dtype=float64),
      'iqr': 0.47030110594253877,
      'mean': 0.49412631128104645,
      'med': 0.5270023964865486,
      'q1': 0.2603486498337239,
      'q3': 0.7306497557762627,
      'whishi': 0.9941975539538199,
      'whislo': 0.00892072823759571},
     {'cihi': 0.5460977498205477,
      'cilo': 0.39283808760835964,
      'fliers': array([], dtype=float64),
      'iqr': 0.4880880962171596,
      'mean': 0.47578540593013985,
      'med': 0.4694679187144537,
      'q1': 0.2466015651284032,
      'q3': 0.7346896613455628,
      'whishi': 0.9906905357196321,
      'whislo': 0.002613905425137064},
     {'cihi': 0.6327876179340386,
      'cilo': 0.47317829117336885,
      'fliers': array([], dtype=float64),
      'iqr': 0.5083099578365278,
      'mean': 0.5202481643792808,
      'med': 0.5529829545537037,
      'q1': 0.24608370844800756,
      'q3': 0.7543936662845353,
      'whishi': 0.9968264819096214,
      'whislo': 0.008450848029956215},
     {'cihi': 0.5429786764060252,
      'cilo': 0.40089287519667627,
      'fliers': array([], dtype=float64),
      'iqr': 0.4525025516221303,
      'mean': 0.4948030963370377,
      'med': 0.4719357758013507,
      'q1': 0.279181107815125,
      'q3': 0.7316836594372553,
      'whishi': 0.9836196084903415,
      'whislo': 0.019864664399723786},
     {'cihi': 0.5413819754851169,
      'cilo': 0.3838462046931251,
      'fliers': array([], dtype=float64),
      'iqr': 0.5017062764076173,
      'mean': 0.4922357500877824,
      'med': 0.462614090089121,
      'q1': 0.2490034171367362,
      'q3': 0.7507096935443536,
      'whishi': 0.9984043081918205,
      'whislo': 0.0036707224412856343}]
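
The list of dicts is easier to read once it is tied back to the column names. One possible way to do that, sketched here, relies on the dicts being in column order (as noted above). The names scalar_keys, table and fliers are illustrative only.

```python
import numpy as np
import pandas as pd
from matplotlib.cbook import boxplot_stats

# same test dataframe as above
np.random.seed(346)
df = pd.DataFrame(np.random.rand(100, 5), columns=['A', 'B', 'C', 'D', 'E'])

stats = boxplot_stats(df.values)

# scalar statistics go into a table with one row per column of df
scalar_keys = ['whislo', 'q1', 'med', 'q3', 'whishi', 'iqr', 'mean']
table = pd.DataFrame(
    [{key: s[key] for key in scalar_keys} for s in stats],
    index=df.columns,
)

# fliers are arrays of varying length, so keep them in a separate dict
fliers = {col: s['fliers'] for col, s in zip(df.columns, stats)}

print(table)
```

If the columns may contain NaN, it is probably safer to call boxplot_stats on each column after dropna(), since pandas drops missing values before plotting and the plain array passed here does not.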
    