Plotting a 2D matrix in Python: code and most useful visualization

醉梦人生 2021-01-05 08:15

I have a very large matrix (10x55678) in "numpy" matrix format. The rows of this matrix correspond to some "topics" and the columns correspond to words (unique words from

2 answers
  • 2021-01-05 08:51

    You could certainly use matplotlib's imshow or pcolor method to display the data, but as comments have mentioned, it might be hard to interpret without zooming in on subsets of the data.

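    import numpy as np
    from matplotlib.pyplot import pcolor

    # Random stand-in data: 5000 words (rows) x 10 topics (columns)
    a = np.random.normal(0.0, 0.5, size=(5000, 10))**2
    a = a / np.sum(a, axis=1)[:, None]  # Normalize each row to sum to 1

    pcolor(a)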
    

    [Figure: unsorted random example]

    You could then sort the words by the cluster they most probably belong to:

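    maxvi = np.argsort(a, axis=1)  # per word: topic indices ordered by increasing probability
    ii = np.argsort(maxvi[:, -1])  # order words by the index of their most probable topic

    pcolor(a[ii, :])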
    

    Here the word index on the y-axis no longer equals the original ordering since things have been sorted.

    Another possibility is to use the networkx package to plot word clusters for each category. The words with the highest probability would be represented by nodes that are either larger or closer to the center of the graph, and words that have no membership in the category would be ignored. This might be easier since you have a large number of words and a small number of categories.
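As a rough sketch of the networkx idea, one could draw a single topic as a hub with its most probable words around it. The data below is random stand-in data in the same words-by-topics layout as above, and the cutoff `n_top`, the node labels, and the size scaling are illustrative assumptions, not part of the original suggestion. The spring layout tends to pull heavily weighted neighbours closer to the hub, but it does not guarantee it.

```python
import matplotlib
matplotlib.use("Agg")  # off-screen backend; remove for interactive use
import matplotlib.pyplot as plt
import networkx as nx
import numpy as np

# Random stand-in data: words (rows) x topics (columns), rows sum to 1
rng = np.random.default_rng(0)
n_words, n_topics = 200, 10
a = rng.normal(0.0, 0.5, size=(n_words, n_topics)) ** 2
a = a / a.sum(axis=1, keepdims=True)

topic = 0   # which topic to draw
n_top = 15  # assumed cutoff: keep only the most probable words
top = np.argsort(a[:, topic])[-n_top:]

# Star graph: the topic is the hub, each kept word is a neighbour,
# and the edge weight is the word's probability for this topic.
G = nx.Graph()
hub = f"topic {topic}"
G.add_node(hub)
for w in top:
    G.add_edge(hub, f"word {w}", weight=float(a[w, topic]))

# Edge weights act as attraction strength in the spring layout.
pos = nx.spring_layout(G, weight="weight", seed=0)

# Node area scales with probability; the hub gets a fixed size.
sizes = [300 if node == hub else 3000 * G[hub][node]["weight"] for node in G.nodes]

fig, ax = plt.subplots(figsize=(6, 6))
nx.draw_networkx(G, pos, node_size=sizes, font_size=7, ax=ax)
ax.set_axis_off()
```

With real data, the labels would be the actual words rather than `word <index>`, and one such graph could be drawn per topic.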

    Hopefully one of these suggestions is useful.

  • 2021-01-05 09:05

    The key thing to consider is whether you have important structure along both dimensions in the matrix. If you do, then it's worth trying a colored matrix plot (e.g., imshow), but if your ten topics are basically independent, you're probably better off doing ten individual line or histogram plots. Both kinds of plot have advantages and disadvantages.
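A minimal sketch of the "ten individual line plots" option, assuming the question's orientation (topics as rows, words as columns). The array here is random stand-in data with fewer columns than the real matrix, and the name `topics_by_words` is made up for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # off-screen backend; remove for interactive use
import matplotlib.pyplot as plt
import numpy as np

# Random stand-in for the real 10 x 55678 matrix (fewer columns to keep it light)
rng = np.random.default_rng(0)
topics_by_words = rng.random((10, 2000))

# One stacked panel per topic, sharing the word-index axis
fig, axes = plt.subplots(10, 1, sharex=True, figsize=(8, 12))
for topic, ax in enumerate(axes):
    ax.plot(topics_by_words[topic], linewidth=0.5)
    ax.set_ylabel(f"topic {topic}")
axes[-1].set_xlabel("word index")
```

Sharing the x-axis means that zooming in on one panel in an interactive window zooms all ten to the same range of words.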

    In particular, in full matrix plots the color values (the z-axis) are not very precise or quantitative, so it's difficult to see, for example, small ripples on a trend, or to assess rates of change quantitatively. That is a significant cost. They are also harder to pan and zoom, since one can get lost and fail to examine the entire plot, whereas panning along a 1D plot is trivial.

    Also, of course, as others have mentioned, 50K points is too many to actually visualize, so you'll need to sort or filter them in some way to reduce the number of values that you'll actually need to visually assess.
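One way to cut the values down, sketched here under the assumption that only the largest entries per topic matter, is to keep the top few words for each topic. The helper name `top_words_per_topic` and the cutoff of 20 are hypothetical choices, and the matrix is random stand-in data in the question's 10 x 55678 shape.

```python
import numpy as np

def top_words_per_topic(matrix, n_top=20):
    """For each row (topic), return the column indices of its n_top
    largest entries, ordered from largest to smallest."""
    ascending = np.argsort(matrix, axis=1)   # smallest to largest per row
    return ascending[:, -n_top:][:, ::-1]    # keep the last n_top, reversed

# Random stand-in for the real topics x words matrix
rng = np.random.default_rng(0)
topics_by_words = rng.random((10, 55678))

top_idx = top_words_per_topic(topics_by_words, n_top=20)
# Values matching those indices, one row per topic
top_vals = np.take_along_axis(topics_by_words, top_idx, axis=1)
```

The resulting 10 x 20 array `top_vals` is small enough to show with imshow or as ten short bar charts, with `top_idx` supplying the word for each tick label.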

    In practice though, finding a good visualizing technique for a given data set is not always trivial, and for large and complex data sets, people try everything that has a chance of being helpful, and then choose what actually helps.
