How to plot parallel coordinates on pandas DataFrame with some columns containing strings?

后端 未结 2 1478
小鲜肉
小鲜肉 2021-01-13 01:49

I would like to plot parallel coordinates for a pandas DataFrame containing columns with numbers and other columns containing strings as values.

相关标签:
2条回答
  • 2021-01-13 02:46

    It wasn't entirely clear to me what you wanted to do with the regime column.

    If the problem was just that its presence prevented the plot to show, then you could simply omit the offending columns from the plot:

    parallel_coordinates(df2, class_column='element', cols=['var 1', 'var 2', 'var 3'])
    

    looking at the example you provided, I then understood you want categorical variables to be somehow placed on a vertical lines, and each value of the category is represented by a different y-value. Am I getting this right?

    If I am, then you need to encore your categorical variables (here, regime) into a numerical value. To do this, I used this tip I found on this website.

    df2.regime = df2.regime.astype('category')
    df2['regime_encoded'] = df2.regime.cat.codes
    
    
    print(df2)
        element var 1   var 2   var 3   regime  regime_encoded
    0   line 1  20      30      100     N       0
    1   line 2  10      40      90      N       0
    2   line 3  10      35      120     N-1     1
    

    this code creates a new column (regime_encoded) where each value of the category regime is coded by an integer. You can then plot your new dataframe, including the newly created column:

    parallel_coordinates(df2[['element', 'var 1', 'var 2', 'var 3', 'regime_encoded']],"element")
    

    The problem is that the encoding values for the categorical variable (0, 1) have nothing to do with the range of your other variables, so all the lines seem to tend toward the same point. The answer is then to scale the encoding compared to the range of your data (here I did it very simply because your data was bounded between 0 and 120, you probably need to scale from the minimum value if that's not the case in your real dataframe).

    df2['regime_encoded'] = df2.regime.cat.codes * max(df2.max(axis=1, numeric_only=True))
    parallel_coordinates(df2[['element', 'var 1', 'var 2', 'var 3', 'regime_encoded']],"element")
    

    To fit with your example better, you can add annotations:

    df2['regime_encoded'] = df2.regime.cat.codes * max(df2.max(axis=1, numeric_only=True)
    parallel_coordinates(df2[['element', 'var 1', 'var 2', 'var 3', 'regime_encoded']],"element")
    ax = plt.gca()
    for i,(label,val) in df2.loc[:,['regime','regime_encoded']].drop_duplicates().iterrows():
        ax.annotate(label, xy=(3,val), ha='left', va='center')
    

    0 讨论(0)
  • 2021-01-13 02:53

    Based on @Diziet answer, to be able to get the desired graph under Python 2.5 we can use following code:

    import pandas as pd
    import matplotlib.pyplot as plt
    from pandas.tools.plotting import parallel_coordinates
    
    def format(input):
        if input == "N":
            output = 0
        elif input == "N-1":
            output = 1
        else:
            output = None
        return output
    
    df2 = pd.DataFrame([["line 1",20,30,100,"N"],\
        ["line 2",10,40,90,"N"],["line 3",10,35,120,"N-1"]],\
        columns=["element","var 1","var 2","var 3","regime"])
    df2["regime_encoded"] = df2["regime"].apply(format) * max(df2[["var 1","var 2","var 3"]].max(axis=1))
    
    parallel_coordinates(df2[['element', 'var 1', 'var 2', 'var 3', 'regime_encoded']],"element")
    ax = plt.gca()
    for i,(label,val) in df2.ix[:,['regime','regime_encoded']].drop_duplicates().iterrows():
        ax.annotate(label, xy=(3,val), ha='left', va='center')
    
    plt.show()
    

    This will end up showing following graph:

    0 讨论(0)
提交回复
热议问题