How to plot using matplotlib and pandas in a PySpark environment?

Question

I have a very large PySpark DataFrame. I took a sample of it and converted the sample into a pandas DataFrame:

sample = heavy_pivot.sample(False, fraction = 0.2, seed = None)
sample_pd = sample.toPandas()

The dataframe looks like this:

sample_pd[['client_id', 'beer_freq']].head(10)


  client_id  beer_freq
0   1000839   0.000000
1   1002185   0.000000
2   1003366   1.000000
3   1005218   1.000000
4   1005483   1.000000
5    100964   0.434783
6    101272   0.166667
7   1017462   0.000000
8   1020561   0.000000
9   1023646   0.000000

I want to plot a histogram of the "beer_freq" column:

import matplotlib.pyplot as plt
plt.switch_backend('agg')

sample_pd.hist('beer_freq', bins = 100)

The plot does not show up. Instead, the call returns this:

 >>>array([[<matplotlib.axes._subplots.AxesSubplot object at 0x7f60f6fd0750>]], dtype=object)

It seems that I cannot use ordinary Python code with matplotlib and a pandas DataFrame to plot figures in a PySpark environment.

If I call plt.show(), nothing happens.

Answer 1:

%matplotlib inline is not supported in Databricks. You can display matplotlib figures using display(). For an example, see https://docs.databricks.com/user-guide/visualizations/matplotlib-and-ggplot.html
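As a rough illustration of the display() approach, the sketch below builds the histogram on an explicit Figure object and hands that object to display(). It assumes it runs in a Databricks notebook, where display() is provided as a built-in; the small sample_pd frame is a stand-in built from the ten rows printed in the question, and the fallback file name beer_freq_hist.png is hypothetical.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, as in the question; it never opens a window
import matplotlib.pyplot as plt
import pandas as pd

# Stand-in for sample_pd from the question (beer_freq values copied from the printed head).
sample_pd = pd.DataFrame({
    "beer_freq": [0.0, 0.0, 1.0, 1.0, 1.0, 0.434783, 0.166667, 0.0, 0.0, 0.0],
})

# Keep a handle on the Figure so it can be passed to display() explicitly.
fig, ax = plt.subplots()
counts, edges, _ = ax.hist(sample_pd["beer_freq"], bins=100)
ax.set_xlabel("beer_freq")
ax.set_ylabel("count")

try:
    # Assumption: display() is the notebook built-in (Databricks provides one).
    display(fig)
except NameError:
    # Outside a notebook there is no display(); write the figure to a file instead.
    # The file name is hypothetical.
    fig.savefig("beer_freq_hist.png")
```

With the Agg backend, plt.show() has nothing to open, which is consistent with the behaviour described in the question; passing the Figure to display() or saving it with savefig() avoids relying on an interactive window.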

Answer 2:

Try the following:

import matplotlib.pyplot as plt
%matplotlib inline

Answer 3:

%matplotlib inline is not available there. As Gaurav mentioned, use display() as follows:

col_df = heavy_pivot.select('beer_freq')
display(col_df)

This way you don't need to convert the data to a pandas DataFrame, and the final plot looks the same. After the output is displayed, use the plot button under it to choose a histogram.

source:

Source: https://stackoverflow.com/questions/50240656/how-to-plot-using-matplotlib-and-pandas-in-pyspark-environment

Tags

pandas

apache-spark

matplotlib

pyspark

pyspark-sql