Question
I am working on a fairly large dataset (~40m rows). I have found that if I call sns.countplot() directly then my visualisation plots really quickly:
%%time
ax = sns.countplot(x="age_band",data=acme)
However if I do the same visualisation using catplot(kind="count") then the speed of execution slows down dramatically:
%%time
g = sns.catplot(x="age_band",data=acme,kind="count")
Is there a reason for such a large performance difference? Is catplot() doing some sort of conversion on my data before it can plot it?
If there is a known reason for this, does it extend to all figure-level functions versus axes-level functions, e.g. is sns.scatterplot() faster than sns.relplot(kind="scatter"), etc.?
My preference would be to use catplot() as I like its flexibility and easy plotting on a FacetGrid but if it is going to take so much longer to achieve the same plot then I will just use the axis level functions directly.
Answer 1:
There is a lot of overhead in catplot, or for that matter in FacetGrid, that ensures the categories are synchronized along the grid. Consider, for example, a variable plotted along the columns of the grid for which not every age group occurs. You would still need to show that non-occurring age group and hold on to its color. Hence, two countplots next to each other do not necessarily make up one catplot.
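To make the synchronization point concrete, here is a minimal pandas-only sketch. It is not seaborn's internal code; the toy frame and the "region" column are made up for illustration. It shows that counting each facet on its own loses categories, while a grid needs every facet aligned to the same category set:

```python
import pandas as pd

# Hypothetical toy data: "south" has no rows for two of the age bands.
df = pd.DataFrame({
    "region":   ["north", "north", "north", "south", "south"],
    "age_band": ["18-29", "30-44", "65+",   "18-29", "18-29"],
})

# Counting each facet independently: the missing bands simply vanish.
per_facet = {
    region: sub["age_band"].value_counts().sort_index()
    for region, sub in df.groupby("region")
}

# What a grid needs instead: every facet aligned to the full category set,
# with zeros for the combinations that never occur.
synced = df.groupby(["region", "age_band"]).size().unstack(fill_value=0)

print(per_facet["south"])
print(synced)
```

In the synced table each region has one entry per age band, so bar positions and colors can stay consistent from one panel to the next.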
However, if you are only interested in a single countplot, a catplot is clearly overkill. On the other hand, even a single countplot is overkill compared to a barplot of the counts. That is,
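import numpy as np
import matplotlib.pyplot as plt
# Aggregate first, so only one value per category reaches the plotting call
counts = df["Category"].value_counts().sort_index()
colors = plt.cm.tab10(np.arange(len(counts)))
ax = counts.plot.bar(color=colors)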
will be twice as fast as
ax = sns.countplot(x="Category", data=df)
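If the FacetGrid layout is still wanted, one possible workaround is to aggregate first and hand only the small counts table to catplot(kind="bar"). This is a sketch under assumptions, not a benchmarked recommendation: the synthetic frame, the "region" column and the count column name "n" are invented for the example, and no timing is claimed for it.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display

import numpy as np
import pandas as pd
import seaborn as sns

# Synthetic stand-in for the large dataset (column names are assumptions).
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "age_band": rng.choice(["18-29", "30-44", "45-64", "65+"], size=100_000),
    "region": rng.choice(["north", "south"], size=100_000),
})

# Aggregate once: one row per (region, age_band) pair instead of one per record.
counts = (
    df.groupby(["region", "age_band"])
      .size()
      .reset_index(name="n")
)

# The grid now only has to handle a handful of pre-computed rows.
g = sns.catplot(data=counts, x="age_band", y="n", col="region", kind="bar")
```

Combinations that never occur are absent from counts; if empty bars matter, they would need to be filled in with zeros before plotting.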
Source: https://stackoverflow.com/questions/57990852/catplotkind-count-is-significantly-slower-than-countplot