I am trying to work with PySpark in IntelliJ but I cannot figure out how to correctly install it / set up the project. I can work with Python in IntelliJ and I can use the pyspark shell, but I cannot get IntelliJ to find pyspark in a project.
For example, something of this kind:
from pyspark import SparkContext, SparkConf

spark_conf = SparkConf().setAppName("scavenge some logs")
spark_context = SparkContext(conf=spark_conf)

address = "/path/to/the/log/on/hdfs/*.gz"
log = spark_context.textFile(address)

my_result = (log
             # ...here go your actions and transformations...
             )
my_result.saveAsTextFile('my_result')
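For concreteness, continuing from the log RDD above, a hypothetical set of actions and transformations (the ERROR filter and the per-field count are made up for illustration) might look like:

# hypothetical pipeline: count ERROR lines keyed by their first field
my_result = (log
             .filter(lambda line: "ERROR" in line)
             .map(lambda line: (line.split()[0], 1))
             .reduceByKey(lambda a, b: a + b))
my_result.saveAsTextFile('my_result')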
Set the environment variables SPARK_HOME and PYTHONPATH in your program's run/debug configuration.
For instance:
SPARK_HOME=/Users/<username>/javalibs/spark-1.5.0-bin-hadoop2.4
PYTHONPATH=/Users/<username>/javalibs/spark-1.5.0-bin-hadoop2.4/python

(Note that PYTHONPATH must point at the directory that contains the pyspark package, i.e. python/, not at the pyspark package itself, or import pyspark will fail.)
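If you would rather not touch the run/debug configuration, a minimal sketch that wires up the same paths in code (assuming the Spark install location above) looks like this:

import glob
import os
import sys

# assumption: Spark lives where the answer above installed it
spark_home = "/Users/<username>/javalibs/spark-1.5.0-bin-hadoop2.4"
os.environ.setdefault("SPARK_HOME", spark_home)

# make the pyspark package importable
sys.path.insert(0, os.path.join(spark_home, "python"))
# py4j ships inside Spark's python/lib; add whichever version is bundled
sys.path.extend(glob.glob(os.path.join(spark_home, "python", "lib", "py4j-*-src.zip")))

from pyspark import SparkContext, SparkConf  # should now resolve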
See the attached snapshot of the run/debug configuration in IntelliJ IDEA.
One problem I encountered was the space in 'Program Files\spark' when setting the SPARK_HOME and PYTHONPATH environment variables (as stated above), so I moved the Spark binaries to my user directory instead. Thanks to this answer. Also, make sure the packages are installed for the environment: you should see the pyspark package under Project Structure -> Platform Settings -> SDKs -> Python SDK (of choice) -> Packages.
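As a quick sanity check (a hypothetical snippet, not part of the original answers), run this with the interpreter configured in IntelliJ; if the printed path points inside your Spark install, the SDK and environment variables are wired up correctly:

# run with the Python SDK configured in IntelliJ
import pyspark
print(pyspark.__file__)  # expect a path inside <spark install>/python/pyspark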