Question
Is it possible to pipe a Spark RDD to Python?
I need a Python library to do some calculations on my data, but my main Spark project is based on Scala. Is there a way to mix the two, or to let Python access the same Spark context?
Answer 1:
You can indeed pipe out to a regular Python script from Scala using Spark's RDD.pipe.
test.py
#!/usr/bin/python
import sys

# Read one RDD element per line from stdin and emit a greeting on stdout.
for line in sys.stdin:
    print("hello " + line.rstrip("\n"))
spark-shell (Scala)
val data = List("john", "paul", "george", "ringo")
val dataRDD = sc.makeRDD(data)          // distribute the list as an RDD
val scriptPath = "./test.py"            // script must exist and be executable on every worker
val pipeRDD = dataRDD.pipe(scriptPath)  // pipe each element through the script, one line per element
pipeRDD.foreach(println)
Output (order may vary, since partitions are processed in parallel):
hello john
hello ringo
hello george
hello paul
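If the script lacks the executable bit (chmod +x test.py), one way around this is the Seq-based overload of RDD.pipe, which lets you invoke the interpreter explicitly. A minimal sketch, assuming python is on the PATH of every worker:

val pipeRDD = dataRDD.pipe(Seq("python", "./test.py"))  // run via the interpreter instead of relying on the shebang
pipeRDD.collect().foreach(println)                      // collect() brings the results to the driver for printing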
Answer 2:
You can run the Python code via Pipe in Spark.
With pipe(), you can write a transformation of an RDD that reads each element from standard input as a String, manipulates that String according to the script, and then writes the result to standard output as a String.
With SparkContext.addFile(path), we can register a list of files for every worker node to download when the Spark job starts. Each worker node then has its own copy of the script, so the pipe operation runs in parallel. Any libraries and dependencies the script needs must be installed beforehand on all worker and executor nodes.
Example:
Python file: converts each input line to uppercase
#!/usr/bin/python
import sys

# Uppercase each line read from stdin and write it to stdout.
for line in sys.stdin:
    print(line.rstrip("\n").upper())
Scala code: pipes the data through the script
import org.apache.spark.{SparkConf, SparkContext, SparkFiles}

val conf = new SparkConf().setAppName("Pipe")
val sc = new SparkContext(conf)
val distScript = "/path/on/driver/PipeScript.py"
val distScriptName = "PipeScript.py"
sc.addFile(distScript)                                   // ship the script to every worker node
val ipData = sc.parallelize(List("asd", "xyz", "zxcz", "sdfsfd", "Ssdfd", "Sdfsf"))
val opData = ipData.pipe(SparkFiles.get(distScriptName)) // resolve the script's local path on each worker
opData.foreach(println)
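Note that opData.foreach(println) runs on the executors, so in cluster mode the output appears in the executor logs rather than on the driver console. A short sketch for inspecting small results on the driver:

opData.collect().foreach(println)  // collect() pulls everything to the driver; only safe for small datasets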
Answer 3:
If I understand you correctly, as long as you take the data from Scala and convert it to an RDD or make it available through a SparkContext, you can use PySpark to manipulate the data with the Spark Python API.
There is also a programming guide you can follow to use the different languages within Spark.
Source: https://stackoverflow.com/questions/32975636/how-to-use-both-scala-and-python-in-a-same-spark-project