Question
I'm trying to plot a Spark dataset with matplotlib, after converting it to a pandas DataFrame, in AWS EMR JupyterHub.
I'm able to plot in a single cell using matplotlib like below:
%matplotlib inline
import matplotlib
import matplotlib.pyplot as plt
df = [1, 1.6, 3, 4.2, 5, 4, 2.5, 3, 1.5]
plt.plot(df)
The above code snippet works neatly for me.
After this sample example, I moved on to plotting my pandas DataFrame across multiple cells in AWS EMR JupyterHub, like this:
-Cell 1-
sparkDS=spark.read.parquet('s3://bucket_name/path').cache()
-Cell 2-
from pyspark.sql.functions import *
sparkDS_groupBy=sparkDS.groupBy('col1').agg(count('*').alias('count')).orderBy('col1')
pandasDF=sparkDS_groupBy.toPandas()
-cell 3-
%matplotlib inline
import matplotlib
import matplotlib.pyplot as plt
plt.plot(pandasDF)
My code fails in cell 3 with the following error:
NameError: name 'pandasDF' is not defined
Does anyone have any idea what's wrong?
Why is the new cell in my JupyterHub notebook unable to recognize a variable from the previous cell?
Does it have something to do with the '%matplotlib inline' magic command (I also tried '%matplotlib notebook', without success)?
PS: I'm using an AWS EMR 5.19 JupyterHub notebook setup for my plotting work.
This error is similar to, but not a duplicate of: How do I make matplotlib work in AWS EMR Jupyter notebook?
Answer 1:
You'll want to look into the %%spark -o df_name and %%local magics; typing %%help in a cell lists them. With the sparkmagic PySpark kernel, ordinary cells run remotely on the cluster via Livy, while %%local cells run in the local notebook kernel where matplotlib can actually draw. %%spark -o copies the named Spark DataFrame into the local kernel as a pandas DataFrame, which is why your pandasDF (defined only on the remote side) appears undefined to the plotting cell.
Specifically, in your case try:
- Use %%spark -o sparkDS_groupBy at the start of Cell 2,
- Start Cell 3 with %%local,
- And plot sparkDS_groupBy in Cell 3 instead of pandasDF.
For those with less context, you can get plots by running the following in an EMR Notebook using the PySpark kernel, attached to an EMR cluster of at least version 5.26.0 (which introduced Notebook-Scoped Libraries).
Each code block below represents a cell:
%%help
%%configure -f
{ "conf":{
"spark.pyspark.python": "python3",
"spark.pyspark.virtualenv.enabled": "true",
"spark.pyspark.virtualenv.type":"native",
"spark.pyspark.virtualenv.bin.path":"/usr/bin/virtualenv"
}}
sc.install_pypi_package("matplotlib")
%%spark -o my_df
# in this cell, my_df is a pyspark.sql.DataFrame
my_df = spark.read.text("s3://.../...")  # read via the SparkSession (spark), not the SparkContext (sc)
%%local
%matplotlib inline
import matplotlib.pyplot as plt
# in this cell, my_df is a pandas.DataFrame
plt.plot(my_df)
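One caveat worth knowing: plt.plot(my_df) on a multi-column DataFrame draws every column as a separate series against the row index. For a grouped count like the one above, plotting the count against the grouping column is usually what's wanted. A minimal local sketch, with made-up data standing in for the DataFrame that sparkmagic would hand to the %%local cell:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; in a notebook you'd use %matplotlib inline
import matplotlib.pyplot as plt
import pandas as pd

# Stand-in for the pandas DataFrame produced by the groupBy/count on the cluster
my_df = pd.DataFrame({"col1": [1, 2, 3, 4], "count": [10, 25, 7, 18]})

# plt.plot(my_df) would draw both columns against the row index;
# plotting count vs. col1 gives one line with the intended x-axis instead.
lines = plt.plot(my_df["col1"], my_df["count"], marker="o")
plt.xlabel("col1")
plt.ylabel("count")
```

Here `lines` is the list of Line2D objects matplotlib returns; with one x/y pair it holds a single line.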
Source: https://stackoverflow.com/questions/56516346/matplotlib-inline-magic-command-fails-to-read-variables-from-previous-cells-in