Scatter plot with a huge amount of data

后悔当初 2020-12-14 09:10

I would like to use Matplotlib to generate a scatter plot with a huge amount of data (about 3 million points). Actually I've 3 vectors with the same dimension and I use to

3 Answers
  • 2020-12-14 09:26

    What about trying pyplot.hexbin? It generates a sort of heatmap based on point density in a set number of bins.
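
    As a rough sketch of the hexbin idea, assuming synthetic normal data in place of the real vectors (the names delta and vf, the stand-in size N, the gridsize, and the output file name are illustrative choices, not from the question):

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; drop this line and use plt.show() interactively
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(0)
N = 200_000  # stand-in size; the question mentions about 3 million points
delta = rng.normal(size=N)
vf = rng.normal(size=N)

fig, ax = plt.subplots()
# Each hexagon is colored by the number of points that fall inside it;
# mincnt=1 leaves hexagons with no points blank.
hb = ax.hexbin(delta, vf, gridsize=50, cmap="viridis", mincnt=1)
fig.colorbar(hb, ax=ax, label="points per bin")
fig.savefig("hexbin_density.png")
```

    The drawing cost then depends on the number of hexagons rather than the number of points, which is probably why this scales better than a raw scatter.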

  • 2020-12-14 09:41

    Unless your graphic is huge, many of those 3 million points are going to overlap. (A 400x600 image only has 240K dots...)

    So the easiest thing to do would be to take a sample of, say, 1000 points from your data:

    import random
    # random.sample needs a sequence, so convert a NumPy array with list() first
    delta_sample = random.sample(list(delta), 1000)
    

    and just plot that.

    For example:

    import matplotlib.pyplot as plt
    import matplotlib.cm as cm
    import numpy as np
    import random
    
    fig = plt.figure()
    fig.subplots_adjust(bottom=0.2)
    ax = fig.add_subplot(111)
    
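    N = 3 * 10**6
    # Synthetic stand-ins for the three vectors
    delta = np.random.normal(size=N)
    vf = np.random.normal(size=N)
    dS = np.random.normal(size=N)
    
    # Sample indices rather than values so the three vectors stay aligned
    idx = random.sample(range(N), 1000)
    
    plt.scatter(delta[idx], vf[idx], c=dS[idx], alpha=0.7, cmap=cm.Paired)
    plt.show()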
    


    Or, if you need to pay more attention to outliers, then perhaps you could bin your data using np.histogram, and then compose a delta_sample which has representatives from each bin.

    Unfortunately, when using np.histogram I don't think there is any easy way to associate bins with individual data points. A simple but approximate solution is to plot one marker per non-empty bin, using the bin's edge coordinates as a proxy for the points that fall in it:

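    xedges = np.linspace(-10, 10, 100)
    yedges = np.linspace(-10, 10, 100)
    zedges = np.linspace(-10, 10, 10)
    # Count points in each 3-D bin, then keep only the occupied bins
    hist, edges = np.histogramdd((delta, vf, dS), (xedges, yedges, zedges))
    xidx, yidx, zidx = np.where(hist > 0)
    # Plot the lower edge of each occupied bin in place of its points
    plt.scatter(xedges[xidx], yedges[yidx], c=zedges[zidx], alpha=0.7, cmap=cm.Paired)
    plt.show()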
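
    If real data points are preferred over bin edges, one possible alternative (not part of the approach above) is np.digitize, which returns a bin index for every point. The sketch below assumes synthetic data, a smaller stand-in N, and 2-D bins over delta and vf only; it keeps the first point found in each occupied bin as that bin's representative:

```python
import numpy as np

rng = np.random.default_rng(0)
N = 100_000  # stand-in size, smaller than the 3 million points in the question
delta = rng.normal(size=N)
vf = rng.normal(size=N)
dS = rng.normal(size=N)

xedges = np.linspace(-10, 10, 100)
yedges = np.linspace(-10, 10, 100)

# Bin index of every point along each axis (values run from 0 to len(edges))
xbin = np.digitize(delta, xedges)
ybin = np.digitize(vf, yedges)

# Collapse the 2-D bin index into a single integer label per point
label = xbin * (len(yedges) + 1) + ybin

# return_index gives the first occurrence of each label,
# i.e. one actual data point per occupied bin
_, rep_idx = np.unique(label, return_index=True)

delta_sample = delta[rep_idx]
vf_sample = vf[rep_idx]
dS_sample = dS[rep_idx]
```

    The three sample arrays stay aligned, so they can be passed to plt.scatter in the same way as the index-sampled arrays in the earlier example. Outliers survive because every occupied bin contributes a point, however sparse it is.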
    


  • 2020-12-14 09:45

    You could take the heatmap approach shown here. In this example the color represents the quantity of data in the bin, not the median value of the dS array, but that should be easy to change. More later if you are interested.
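
    The linked example is not reproduced here. As one hedged way to color bins by the median of dS instead of the point count, the sketch below swaps in Matplotlib's hexbin with its C and reduce_C_function arguments; the synthetic data, the stand-in size N, the gridsize, and the output file name are all assumptions:

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; drop this line and use plt.show() interactively
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(0)
N = 200_000  # stand-in size; the question mentions about 3 million points
delta = rng.normal(size=N)
vf = rng.normal(size=N)
dS = rng.normal(size=N)

fig, ax = plt.subplots()
# With C given, each hexagon collects the dS values of its points and
# reduce_C_function turns them into one number (here the median).
hb = ax.hexbin(delta, vf, C=dS, reduce_C_function=np.median,
               gridsize=50, cmap="viridis")
fig.colorbar(hb, ax=ax, label="median dS per bin")
fig.savefig("hexbin_median.png")
```

    Passing np.mean or np.max as reduce_C_function gives other per-bin summaries with the same call.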
