Question
I am trying to convert a Spark DataFrame to a pandas DataFrame in a Jupyter notebook on EMR, and I am getting the following error.
The pandas library is installed on the master node under my user, and using the Spark shell (pyspark) on that master node I am able to convert the df to a pandas df.
The following command has been executed on all the master nodes:
pip --no-cache-dir install pandas --user
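Since pip --user installs into one user's home directory for one specific interpreter, a quick diagnostic (not from the original post, just a hedged sketch) is to compare the interpreter the notebook kernel actually runs against the one pip used:

import sys
import site

# The PySpark kernel on EMR may run as a different user (e.g. livy) and/or a
# different Python than the shell where `pip --user` was executed, so a
# --user install can be invisible to the notebook.
print(sys.executable)               # interpreter behind this kernel
print(site.getusersitepackages())   # where `pip --user` installs for this user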
The following works on the master node, but not from the PySpark notebook:
import Pandas as pd
Error
Traceback (most recent call last):
ModuleNotFoundError: No module named 'Pandas'
Update:
I can run the following code from a Python notebook:
import pandas as pd
pd.DataFrame(["a", "b"], columns=['q_data'])
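The question does not show the exact conversion code; a minimal reproduction of the failing step might look like the following (spark is the notebook's SparkSession, and the column name q_data is borrowed from the update above):

# Minimal sketch of the failing step: collect a small Spark DataFrame to
# the driver as a pandas DataFrame.
sdf = spark.createDataFrame([("a",), ("b",)], ["q_data"])
pdf = sdf.toPandas()  # raises if pandas is not importable on the driver
print(pdf)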
Answer 1:
You need pandas on the driver node: when converting to a pandas DataFrame, all of the data is collected to the driver and converted there.
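Since toPandas() executes on the driver, one way to make pandas visible there is to install it into the notebook session itself (a sketch; sc.install_pypi_package is an EMR Notebooks feature available from EMR release 5.26.0, which may not match your cluster):

# Install pandas into this notebook's Spark session (EMR Notebooks only).
sc.install_pypi_package("pandas")
sc.list_packages()  # confirm pandas now appears for this session

# A Spark DataFrame (here called `df`, a placeholder) can then be
# collected on the driver:
# pdf = df.toPandas()

Otherwise, installing pandas system-wide on the master node (e.g. sudo python3 -m pip install pandas) makes it visible regardless of which user the notebook kernel runs as.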
Source: https://stackoverflow.com/questions/62556754/converting-spark-dataframe-to-pandas-dataframe-importerror-pandas-0-19-2-m