%matplotlib inline magic command fails to read variables from previous cells in AWS-EMR Jupyterhub Notebook

Posted by 我是研究僧i on 2020-02-02 13:33:20

Question


I'm trying to plot a Spark dataset with matplotlib after converting it to a pandas DataFrame in AWS EMR JupyterHub.

I'm able to plot in a single cell using matplotlib like below:

%matplotlib inline
import matplotlib
import matplotlib.pyplot as plt

df = [1, 1.6, 3, 4.2, 5, 4, 2.5, 3, 1.5]
plt.plot(df)

The snippet above works fine for me.

After this sample, I moved on to plotting my pandas DataFrame across multiple cells in AWS EMR JupyterHub, like this:

-Cell 1-
sparkDS=spark.read.parquet('s3://bucket_name/path').cache()


-Cell 2-
from pyspark.sql.functions import *
sparkDS_groupBy=sparkDS.groupBy('col1').agg(count('*').alias('count')).orderBy('col1')
pandasDF=sparkDS_groupBy.toPandas()


-Cell 3-
%matplotlib inline
import matplotlib
import matplotlib.pyplot as plt

plt.plot(pandasDF)

My code fails in Cell 3 with the following error:

NameError: name 'pandasDF' is not defined

Does anyone have any idea what's wrong?

Why is the new cell in my JupyterHub notebook unable to recognize a variable from the previous cell?

Does it have something to do with the '%matplotlib inline' magic command? (I also tried '%matplotlib notebook', but that failed too.)

PS: I'm using an EMR 5.19 JupyterHub notebook setup for my plotting work.

This error is similar to, but not a duplicate of, this question: How do I make matplotlib work in AWS EMR Jupyter notebook?


Answer 1:


You'll want to look into the %%spark -o df_name and %%local magics; typing %%help in a cell lists the available magics.

Specifically, in your case try:

  1. Use %%spark -o sparkDS_groupBy at the start of -Cell 2-,
  2. Start -Cell 3- with %%local,
  3. And plot sparkDS_groupBy in -Cell 3- instead of pandasDF.
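
A likely explanation for the NameError, stated here as an assumption rather than something the question confirms: with the PySpark kernel, ordinary cells run in a Spark session on the cluster, while %%local cells (and, it appears, cells that begin with an IPython magic such as %matplotlib inline) run in the notebook's own Python process, which never saw pandasDF. The toy model below only illustrates that idea. Two plain dicts stand in for the two sessions, and run_remote / run_local are hypothetical helpers, not sparkmagic APIs.

```python
# Toy model only: two plain dicts stand in for the two Python sessions.
# This is NOT how sparkmagic is implemented; it just illustrates the idea.
remote_ns = {}  # stands in for the Spark session on the cluster
local_ns = {}   # stands in for the notebook's local Python process


def run_remote(code, output=None):
    """Run code in the 'remote' namespace. If output is given, copy that
    name to the 'local' side, loosely mimicking %%spark -o <name>."""
    exec(code, remote_ns)
    if output is not None:
        local_ns[output] = remote_ns[output]


def run_local(code):
    """Run code in the 'local' namespace, loosely mimicking %%local."""
    exec(code, local_ns)


# Without -o: the variable exists only on the 'remote' side.
run_remote("pandasDF = [3, 1, 2]")
try:
    run_local("total = sum(pandasDF)")
    error_message = None
except NameError as exc:
    error_message = str(exc)
print(error_message)

# With -o: the variable is copied to the 'local' side before it is used.
run_remote("pandasDF = [3, 1, 2]", output="pandasDF")
run_local("total = sum(pandasDF)")
print(local_ns["total"])
```

In the real notebook, %%spark -o also converts the Spark DataFrame to a pandas DataFrame on the way over, which is why the %%local cell can hand it straight to matplotlib.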

For those with less context: you can get plots by doing the following in an EMR Notebook with the PySpark kernel, attached to an EMR cluster running at least version 5.26.0 (the release that introduces notebook-scoped libraries).

(Each block below, separated by a blank line, is one cell.)

%%help

%%configure -f
{ "conf":{
"spark.pyspark.python": "python3",
"spark.pyspark.virtualenv.enabled": "true",
"spark.pyspark.virtualenv.type":"native",
"spark.pyspark.virtualenv.bin.path":"/usr/bin/virtualenv"
}}

sc.install_pypi_package("matplotlib")

%%spark -o my_df
# in this cell, my_df is a pyspark.sql.DataFrame
# (read through the SparkSession `spark`; the SparkContext `sc` has no `read`)
my_df = spark.read.text("s3://.../...")

%%local
%matplotlib inline
import matplotlib.pyplot as plt
# in this cell, my_df is a pandas.DataFrame
plt.plot(my_df)
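
Once the data is on the local side as a pandas DataFrame, plotting one column against another is usually clearer than passing the whole frame to plt.plot. The following is a minimal sketch, assuming the column names col1 and count from the question. plot_counts is a hypothetical helper and the sample data is made up. The Agg backend is selected only so the sketch runs outside a notebook; in a %%local cell you would use %matplotlib inline instead.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs outside a notebook
import matplotlib.pyplot as plt
import pandas as pd


def plot_counts(df, key_col="col1", count_col="count"):
    """Plot count_col against key_col as a line and return the Axes."""
    fig, ax = plt.subplots()
    ax.plot(df[key_col], df[count_col], marker="o")
    ax.set_xlabel(key_col)
    ax.set_ylabel(count_col)
    return ax


# Made-up sample data standing in for the aggregated result.
sample = pd.DataFrame({"col1": [1, 2, 3], "count": [10, 4, 7]})
ax = plot_counts(sample)
```

Applied to the question, a %%local cell that follows a %%spark -o sparkDS_groupBy cell could call plot_counts(sparkDS_groupBy).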


Source: https://stackoverflow.com/questions/56516346/matplotlib-inline-magic-command-fails-to-read-variables-from-previous-cells-in