Plotting CDF of a pandas series in python

情话喂你 2020-12-23 09:25

Is there a way to do this? I cannot seem to find an easy way to plot a CDF directly from a pandas Series.

7 answers
  • 2020-12-23 09:34

    In case you are also interested in the values, not just the plot.

    import pandas as pd
    import numpy as np  # needed for the np.random examples further down
    
    # If you are in jupyter
    %matplotlib inline
    

    This will always work (for both discrete and continuous distributions):

    # Define your series
    s = pd.Series([9, 5, 3, 5, 5, 4, 6, 5, 5, 8, 7], name = 'value')
    df = pd.DataFrame(s)
    
    # Get the frequency, PDF and CDF for each value in the series
    
    # Frequency
    stats_df = df \
    .groupby('value') \
    ['value'] \
    .agg('count') \
    .pipe(pd.DataFrame) \
    .rename(columns = {'value': 'frequency'})
    
    # PDF
    stats_df['pdf'] = stats_df['frequency'] / stats_df['frequency'].sum()
    
    # CDF
    stats_df['cdf'] = stats_df['pdf'].cumsum()
    stats_df = stats_df.reset_index()
    stats_df
    

    # Plot the discrete Probability Mass Function and CDF.
    # Technically, the 'pdf' label in the legend and the table should be 'pmf'
    # (Probability Mass Function) since the distribution is discrete.
    
    # If you don't have too many values / usually discrete case
    stats_df.plot.bar(x = 'value', y = ['pdf', 'cdf'], grid = True)
    

    Alternative example, for a sample drawn from a continuous distribution or any series with a lot of distinct values:

    # Define your series
    s = pd.Series(np.random.normal(loc = 10, scale = 0.1, size = 1000), name = 'value')
    
    # ... all the same calculation stuff to get the frequency, PDF, CDF
    
    # Plot
    stats_df.plot(x = 'value', y = ['pdf', 'cdf'], grid = True)
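
    The "same calculation" elided above can be wrapped in a small helper so that it is reusable for both the discrete and the continuous series. This is only a sketch: the function name `frequency_table` is hypothetical, and it uses `value_counts()` in place of the `groupby` + `agg('count')` chain, which should give the same counts.

```python
import numpy as np
import pandas as pd

def frequency_table(s):
    """Return a DataFrame with the value, frequency, pdf and cdf columns.

    Hypothetical helper: mirrors the groupby/count/cumsum steps of the
    answer, but counts occurrences with Series.value_counts().
    """
    name = s.name if s.name is not None else "value"
    # Count occurrences of each distinct value, ordered by value
    freq = s.value_counts().sort_index()
    stats_df = pd.DataFrame({
        name: freq.index,
        "frequency": freq.to_numpy(),
    })
    # Relative frequency (a PMF for discrete data) and its running sum
    stats_df["pdf"] = stats_df["frequency"] / stats_df["frequency"].sum()
    stats_df["cdf"] = stats_df["pdf"].cumsum()
    return stats_df

s = pd.Series([9, 5, 3, 5, 5, 4, 6, 5, 5, 8, 7], name="value")
stats_df = frequency_table(s)
print(stats_df)
```

    The resulting frame can be passed to `stats_df.plot(...)` or `stats_df.plot.bar(...)` exactly as in the answer.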
    

    For continuous distributions only

    Please note that if it is reasonable to assume that each value occurs only once in the sample (typically the case for continuous distributions), then the groupby() + agg('count') step is not necessary, since the count is always 1.

    In this case, a percent rank can be used to get to the cdf directly.

    Use your best judgment when taking this kind of shortcut! :)

    # Define your series
    s = pd.Series(np.random.normal(loc = 10, scale = 0.1, size = 1000), name = 'value')
    df = pd.DataFrame(s)
    
    # Get to the CDF directly
    df['cdf'] = df['value'].rank(method = 'average', pct = True)
    
    # Sort and plot
    df.sort_values('value').plot(x = 'value', y = 'cdf', grid = True)
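
    To see why this shortcut depends on the values being unique, here is a small check on the discrete series from the start of this answer. It is a sketch, not part of the original answer: it compares the percent rank against the empirical CDF computed from its definition (the fraction of observations less than or equal to each value), and it additionally tries `method='max'`, which is my own substitution.

```python
import numpy as np
import pandas as pd

# Series with repeated values (the discrete example from above)
s = pd.Series([9, 5, 3, 5, 5, 4, 6, 5, 5, 8, 7], name="value")

# Empirical CDF from its definition: share of observations <= x
ecdf = s.apply(lambda x: (s <= x).mean())

# Percent ranks with two different tie-breaking rules
rank_avg = s.rank(method="average", pct=True)
rank_max = s.rank(method="max", pct=True)

# With ties, only the 'max' rule reproduces the empirical CDF
print("average matches ECDF:", np.allclose(rank_avg, ecdf))
print("max matches ECDF:    ", np.allclose(rank_max, ecdf))
```

    When every value is distinct the two ranking rules coincide, which is why the shortcut is safe in the continuous case.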
    

  • 2020-12-23 09:40

    I came here looking for a plot like this with bars and a CDF line:

    It can be achieved like this:

    import pandas as pd
    import numpy as np
    import matplotlib.pyplot as plt
    series = pd.Series(np.random.normal(size=10000))
    fig, ax = plt.subplots()
    ax2 = ax.twinx()
    # `normed` was removed from newer matplotlib releases; `density` replaces it
    n, bins, patches = ax.hist(series, bins=100, density=False)
    n, bins, patches = ax2.hist(
        series, cumulative=1, histtype='step', bins=100, color='tab:orange')
    plt.savefig('test.png')
    

    If you want to remove the vertical line, then it's explained how to accomplish that here. Or you could just do:

    ax.set_xlim((ax.get_xlim()[0], series.max()))
    

    I also saw an elegant solution here on how to do it with seaborn.

  • 2020-12-23 09:42

    I believe the functionality you're looking for is in the hist method of a Series object, which wraps the hist() function in matplotlib.

    Here's the relevant documentation

    In [10]: import matplotlib.pyplot as plt
    
    In [11]: plt.hist?
    ...
    Plot a histogram.
    
    Compute and draw the histogram of *x*. The return value is a
    tuple (*n*, *bins*, *patches*) or ([*n0*, *n1*, ...], *bins*,
    [*patches0*, *patches1*,...]) if the input contains multiple
    data.
    ...
    cumulative : boolean, optional, default : True
        If `True`, then a histogram is computed where each bin gives the
        counts in that bin plus all bins for smaller values. The last bin
        gives the total number of datapoints.  If `normed` is also `True`
        then the histogram is normalized such that the last bin equals 1.
        If `cumulative` evaluates to less than 0 (e.g., -1), the direction
        of accumulation is reversed.  In this case, if `normed` is also
        `True`, then the histogram is normalized such that the first bin
        equals 1.
    
    ...
    

    For example

    In [12]: import pandas as pd
    
    In [13]: import numpy as np
    
    In [14]: ser = pd.Series(np.random.normal(size=1000))
    
    In [15]: ser.hist(cumulative=True, density=1, bins=100)
    Out[15]: <matplotlib.axes.AxesSubplot at 0x11469a590>
    
    In [16]: plt.show()
    
  • 2020-12-23 09:43

    A CDF (cumulative distribution function) plot is basically a graph with the sorted values on the X-axis and the cumulative distribution on the Y-axis. So, I would create a new series with the sorted values as index and the cumulative distribution as values.

    First create an example series:

    import pandas as pd
    import numpy as np
    ser = pd.Series(np.random.normal(size=100))
    

    Sort the series:

    ser = ser.sort_values()
    

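    Now, before proceeding, append the last (and largest) value once more. With this extra point, the cumulative distribution built in the next step runs from 0 at the smallest value to 1 at the largest. This step is especially important for small sample sizes in order to get an unbiased CDF: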

    ser[len(ser)] = ser.iloc[-1]
    

    Create a new series with the sorted values as index and the cumulative distribution as values:

    cum_dist = np.linspace(0.,1.,len(ser))
    ser_cdf = pd.Series(cum_dist, index=ser)
    

    Finally, plot the function as steps:

    ser_cdf.plot(drawstyle='steps')
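
    For comparison, here is a sketch of the more common textbook convention, in which the i-th smallest of n values gets the cumulative probability i/n and no value has to be appended. This is an alternative to the steps above, not the same construction; the seed and the `steps-post` draw style are my own choices.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # seed chosen only for reproducibility
ser = pd.Series(rng.normal(size=100))

# Sorted sample values become the index of the CDF series
sorted_vals = np.sort(ser.to_numpy())
n = len(sorted_vals)

# i-th smallest value -> i / n, so the CDF runs from 1/n up to 1
cum_dist = np.arange(1, n + 1) / n
ser_cdf = pd.Series(cum_dist, index=sorted_vals)

# To draw it (requires matplotlib):
# ser_cdf.plot(drawstyle="steps-post")
```

    With `steps-post`, each level is held from one sample value until the next, which matches the right-continuous definition of the empirical CDF.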
    
  • 2020-12-23 09:45

    This is the easiest way.

    import pandas as pd
    ser = pd.Series(range(100))
    ser.hist(cumulative=True)
    

    Image of cumulative histogram

  • 2020-12-23 09:54

    To me, this seemed like a simple way to do it:

    import numpy as np
    import pandas as pd
    import matplotlib.pyplot as plt
    
    heights = pd.Series(np.random.normal(size=100))
    
    # empirical CDF
    def F(x,data):
        return float(len(data[data <= x]))/len(data)
    
    vF = np.vectorize(F, excluded=['data'])
    
    plt.plot(np.sort(heights),vF(x=np.sort(heights), data=heights))
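
    The function above scans the whole sample for every point, which can get slow for large series. A possible alternative, sketched here under the assumption that the data fits in memory as a NumPy array, is to sort once and use `np.searchsorted`; the name `ecdf_at` is hypothetical.

```python
import numpy as np
import pandas as pd

def ecdf_at(x, data):
    """Empirical CDF of `data` evaluated at the points `x` (hypothetical helper)."""
    sorted_data = np.sort(np.asarray(data))
    # side="right" counts how many observations are <= each x
    return np.searchsorted(sorted_data, x, side="right") / len(sorted_data)

# Tiny sample with a repeated value, to show how ties are handled
heights = pd.Series([1.0, 2.0, 2.0, 3.0])
print(ecdf_at([0.5, 2.0, 3.0], heights))
```

    It can replace `vF` in the plot call, for example `plt.plot(np.sort(heights), ecdf_at(np.sort(heights), heights))`.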
    