Question
I have a very large PySpark DataFrame. I took a sample and converted it into a pandas DataFrame:
sample = heavy_pivot.sample(False, fraction = 0.2, seed = None)
sample_pd = sample.toPandas()
The dataframe looks like this:
sample_pd[['client_id', 'beer_freq']].head(10)
client_id beer_freq
0 1000839 0.000000
1 1002185 0.000000
2 1003366 1.000000
3 1005218 1.000000
4 1005483 1.000000
5 100964 0.434783
6 101272 0.166667
7 1017462 0.000000
8 1020561 0.000000
9 1023646 0.000000
I want to plot a histogram of the "beer_freq" column:
import matplotlib.pyplot as plt
plt.switch_backend('agg')
sample_pd.hist('beer_freq', bins = 100)
The plot does not show up. Instead, the call returns this:
>>>array([[<matplotlib.axes._subplots.AxesSubplot object at 0x7f60f6fd0750>]], dtype=object)
It seems that I cannot use ordinary Python code with matplotlib and a pandas DataFrame to plot figures in a PySpark environment.
If I call plt.show(), nothing happens.
Answer 1:
%matplotlib inline is not supported in Databricks. You can display matplotlib figures using display(). For an example, see https://docs.databricks.com/user-guide/visualizations/matplotlib-and-ggplot.html
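A minimal sketch of that approach, reusing the ten rows shown in the question as a stand-in for sample_pd. It assumes `display` is the built-in provided by the Databricks notebook; the `except NameError` branch that renders the figure to PNG bytes is a hypothetical fallback added here only so the snippet also runs outside a notebook.

```python
import io

import matplotlib
matplotlib.use("Agg")  # non-interactive backend: renders off-screen, opens no window
import matplotlib.pyplot as plt
import pandas as pd

# Stand-in for sample_pd, built from the ten rows printed in the question.
sample_pd = pd.DataFrame({
    "client_id": [1000839, 1002185, 1003366, 1005218, 1005483,
                  100964, 101272, 1017462, 1020561, 1023646],
    "beer_freq": [0.0, 0.0, 1.0, 1.0, 1.0,
                  0.434783, 0.166667, 0.0, 0.0, 0.0],
})

# Create the figure explicitly so there is a handle to pass to display().
fig, ax = plt.subplots()
sample_pd["beer_freq"].plot.hist(bins=100, ax=ax)
ax.set_xlabel("beer_freq")

try:
    display(fig)  # assumption: the Databricks notebook built-in
except NameError:
    # Hypothetical fallback outside a notebook: render the figure to PNG bytes.
    buf = io.BytesIO()
    fig.savefig(buf, format="png")
    png_bytes = buf.getvalue()
```

Keeping a reference to the figure matters here: sample_pd.hist(...) on its own only returns the array of axes seen in the question, and with a non-interactive backend plt.show() has no window to draw to.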
Answer 2:
Try the following:
import matplotlib.pyplot as plt
%matplotlib inline
Answer 3:
%matplotlib inline is not available in Databricks. As Gaurav mentioned, use display() as follows:
col_df = heavy_pivot.select('beer_freq')
display(col_df)
This way you don't need to convert it to a pandas DataFrame, and the final plot looks the same. After the output is displayed, use the plot button under it to choose a histogram.
Source: https://stackoverflow.com/questions/50240656/how-to-plot-using-matplotlib-and-pandas-in-pyspark-environment