Question
I'm trying to plot a Spark dataset with matplotlib, after converting it to a pandas DataFrame, in AWS EMR JupyterHub.
I'm able to plot in a single cell using matplotlib like below:
%matplotlib inline
import matplotlib
import matplotlib.pyplot as plt
df = [1, 1.6, 3, 4.2, 5, 4, 2.5, 3, 1.5]
plt.plot(df)
The above code snippet works neatly for me.
After this sample example, I moved on to plotting my pandas DataFrame across multiple cells in AWS EMR JupyterHub, like this:
-Cell 1-
sparkDS=spark.read.parquet('s3://bucket_name/path').cache()
-Cell 2-
from pyspark.sql.functions import *
sparkDS_groupBy=sparkDS.groupBy('col1').agg(count('*').alias('count')).orderBy('col1')
pandasDF=sparkDS_groupBy.toPandas()
-cell 3-
%matplotlib inline
import matplotlib
import matplotlib.pyplot as plt
plt.plot(pandasDF)
My code fails in cell 3 with the following error:
NameError: name 'pandasDF' is not defined
Does anyone have any idea what's wrong?
Why is the new cell in my JupyterHub notebook unable to recognize a variable from the previous cell?
Does it have something to do with the '%matplotlib inline' magic command (I also tried '%matplotlib notebook', without success)?
PS: I'm using an AWS EMR 5.19 JupyterHub notebook setup for my plotting work.
This error is similar to, but not a duplicate of: How do I make matplotlib work in AWS EMR Jupyter notebook?
Answer 1:
You'll want to look into the %%spark -o df_name and %%local magics; typing %%help in a cell lists them. With the sparkmagic PySpark kernel, ordinary cells run remotely on the cluster via Livy, while %%local cells run in the local notebook kernel where matplotlib can actually draw. %%spark -o copies the named Spark DataFrame into the local kernel as a pandas DataFrame, which is why your pandasDF (defined only on the remote side) appears undefined to the plotting cell.
Specifically, in your case try:
- Use %%spark -o sparkDS_groupBy at the start of Cell 2,
- Start Cell 3 with %%local,
- And plot sparkDS_groupBy in Cell 3 instead of pandasDF.
For those with less context, you can get plots by running the following in an EMR Notebook using the PySpark kernel, attached to an EMR cluster of at least version 5.26.0 (which introduced Notebook-Scoped Libraries).
Each code block below represents a cell:
%%help
%%configure -f
{ "conf":{
"spark.pyspark.python": "python3",
"spark.pyspark.virtualenv.enabled": "true",
"spark.pyspark.virtualenv.type":"native",
"spark.pyspark.virtualenv.bin.path":"/usr/bin/virtualenv"
}}
sc.install_pypi_package("matplotlib")
%%spark -o my_df
# in this cell, my_df is a pyspark.sql.DataFrame
my_df = spark.read.text("s3://.../...")  # read via the SparkSession (spark), not the SparkContext (sc)
%%local
%matplotlib inline
import matplotlib.pyplot as plt
# in this cell, my_df is a pandas.DataFrame
plt.plot(my_df)
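One caveat worth knowing: plt.plot(my_df) on a multi-column DataFrame draws every column as a separate series against the row index. For a grouped count like the one above, plotting the count against the grouping column is usually what's wanted. A minimal local sketch, with made-up data standing in for the DataFrame that sparkmagic would hand to the %%local cell:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; in a notebook you'd use %matplotlib inline
import matplotlib.pyplot as plt
import pandas as pd

# Stand-in for the pandas DataFrame produced by the groupBy/count on the cluster
my_df = pd.DataFrame({"col1": [1, 2, 3, 4], "count": [10, 25, 7, 18]})

# plt.plot(my_df) would draw both columns against the row index;
# plotting count vs. col1 gives one line with the intended x-axis instead.
lines = plt.plot(my_df["col1"], my_df["count"], marker="o")
plt.xlabel("col1")
plt.ylabel("count")
```

Here `lines` is the list of Line2D objects matplotlib returns; with one x/y pair it holds a single line.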
Source: https://stackoverflow.com/questions/56516346/matplotlib-inline-magic-command-fails-to-read-variables-from-previous-cells-in