How to plot using matplotlib and pandas in pyspark environment?

Submitted by ε祈祈猫儿з on 2019-12-11 06:46:14

Question


I have a very large pyspark dataframe. I took a sample of it and converted the sample into a pandas dataframe:

sample = heavy_pivot.sample(False, fraction = 0.2, seed = None)
sample_pd = sample.toPandas()

The dataframe looks like this:

sample_pd[['client_id', 'beer_freq']].head(10)


  client_id  beer_freq
0   1000839   0.000000
1   1002185   0.000000
2   1003366   1.000000
3   1005218   1.000000
4   1005483   1.000000
5    100964   0.434783
6    101272   0.166667
7   1017462   0.000000
8   1020561   0.000000
9   1023646   0.000000

I want to plot a histogram of the column "beer_freq":

import matplotlib.pyplot as plt
plt.switch_backend('agg')

sample_pd.hist('beer_freq', bins = 100)

The plot did not show up... It gives results like this:

 >>>array([[<matplotlib.axes._subplots.AxesSubplot object at 0x7f60f6fd0750>]], dtype=object)

It seems that I cannot use ordinary Python code with matplotlib and a pandas dataframe to plot figures in a pyspark environment.

If I call plt.show(), nothing happens.


Answer 1:


%matplotlib inline is not supported in Databricks. You can display matplotlib figures using display(). For an example, see https://docs.databricks.com/user-guide/visualizations/matplotlib-and-ggplot.html
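As a rough sketch of this approach (not taken from the linked documentation): draw the histogram on an explicit figure and pass that figure to display(). The small sample_pd frame below is a stand-in built from the ten rows shown in the question, and the fallback file name is hypothetical. display() is assumed to be the notebook built-in that Databricks provides. Where it is not defined, the sketch saves the figure to a PNG instead, since plt.show() opens no window with the non-interactive agg backend.

```python
import os
import tempfile

import matplotlib
matplotlib.use("Agg")  # non-interactive backend, as in the question
import matplotlib.pyplot as plt
import pandas as pd

# Stand-in for sample_pd, using the ten rows printed in the question.
sample_pd = pd.DataFrame({
    "client_id": [1000839, 1002185, 1003366, 1005218, 1005483,
                  100964, 101272, 1017462, 1020561, 1023646],
    "beer_freq": [0.0, 0.0, 1.0, 1.0, 1.0,
                  0.434783, 0.166667, 0.0, 0.0, 0.0],
})

# Draw on an explicit figure so there is an object to hand to display().
fig, ax = plt.subplots()
sample_pd["beer_freq"].plot.hist(bins=100, ax=ax)
ax.set_xlabel("beer_freq")

try:
    display(fig)  # assumed notebook built-in (e.g. Databricks)
except NameError:
    # Plain Python session: write the figure to a file instead.
    out_path = os.path.join(tempfile.gettempdir(), "beer_freq_hist.png")
    fig.savefig(out_path)
```

Keeping a reference to fig, rather than relying on the array of axes that sample_pd.hist() returns, is what makes it possible to render or save the plot explicitly.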




Answer 2:


Try the following:

import matplotlib.pyplot as plt
%matplotlib inline



Answer 3:


It is not accessible. As Gaurav mentioned, use display() as follows:

col_df = heavy_pivot.select('beer_freq')
display(col_df)

That way, you don't need to convert it to a pandas dataframe, and the final plot looks the same. After displaying, use the plot button under the output to choose a histogram.

source:



Source: https://stackoverflow.com/questions/50240656/how-to-plot-using-matplotlib-and-pandas-in-pyspark-environment
