How to use both Scala and Python in the same Spark project?

Submitted by 那年仲夏 on 2019-12-17 10:30:51

Question


Is it possible to pipe a Spark RDD to Python?

I need a Python library to do some calculations on my data, but my main Spark project is written in Scala. Is there a way to mix the two, or to let Python access the same Spark context?


Answer 1:


You can indeed pipe out from Scala and Spark to a regular Python script.

test.py

#!/usr/bin/env python3
import sys

# Echo each RDD element received on standard input, prefixed with "hello".
for line in sys.stdin:
    print("hello " + line.rstrip("\n"))

spark-shell (Scala)

// Build an RDD and pipe each element through the Python script.
val data = List("john", "paul", "george", "ringo")
val dataRDD = sc.makeRDD(data)
val scriptPath = "./test.py"
val pipeRDD = dataRDD.pipe(scriptPath)
pipeRDD.foreach(println)

Output

hello john
hello ringo
hello george
hello paul
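
Note that pipe() launches the script as an external process on each worker, so test.py needs its shebang line and the executable bit (chmod +x test.py), and it must exist at that relative path on every node that runs a task. Also, foreach(println) prints on the executors; that shows up directly in the spark-shell in local mode, but on a cluster you would call collect() first to see the results on the driver.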




Answer 2:


You can run Python code via pipe() in Spark.

With pipe(), you write a transformation of an RDD whose script reads each RDD element from standard input as a String, manipulates it according to the script's logic, and writes the result back to standard output as a String.

With SparkContext.addFile(path), we can add a list of files for each worker node to download when the Spark job starts. Every worker node then gets its own copy of the script, so the pipe operation runs in parallel. All required libraries and dependencies must be installed beforehand on every worker and executor node.

Example:

Python file: converts its input data to uppercase

#!/usr/bin/env python3
import sys

# Uppercase each element received on standard input.
for line in sys.stdin:
    print(line.rstrip("\n").upper())

Spark code: pipes the data through the script

import org.apache.spark.{SparkConf, SparkContext, SparkFiles}

val conf = new SparkConf().setAppName("Pipe")
val sc = new SparkContext(conf)

// Ship the script from the driver to every worker node.
val distScript = "/path/on/driver/PipeScript.py"
val distScriptName = "PipeScript.py"
sc.addFile(distScript)

// Pipe each element through the worker-local copy of the script.
val ipData = sc.parallelize(List("asd", "xyz", "zxcz", "sdfsfd", "Ssdfd", "Sdfsf"))
val opData = ipData.pipe(SparkFiles.get(distScriptName))
opData.foreach(println)
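
Run on the sample input above, the pipe should emit the uppercased elements (ASD, XYZ, ZXCZ, SDFSFD, SSDFD, SDFSF), though the order is not guaranteed and, on a cluster, foreach(println) writes to the executor logs rather than the driver console.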



Answer 3:


If I understand you correctly, as long as you take the data from Scala and convert it into an RDD via a SparkContext, you'll be able to manipulate it with PySpark through the Spark Python API.

There's also a programming guide that you can follow to use the different languages within Spark.
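
For completeness, here is a minimal PySpark sketch of that approach. It assumes (hypothetically) that the Scala side has already saved its RDD as text files under /tmp/shared_data with saveAsTextFile; the app name PythonSide, the path, and the upper() step are placeholders for illustration, not anything from the original answer:

#!/usr/bin/env python3
# Hypothetical PySpark job that picks up data written by the Scala side.
# Assumes the Scala job ran rdd.saveAsTextFile("/tmp/shared_data") first.
from pyspark import SparkConf, SparkContext

conf = SparkConf().setAppName("PythonSide")
sc = SparkContext(conf=conf)

# Load the text output of the Scala job into a Python RDD.
data = sc.textFile("/tmp/shared_data")

# Stand-in for whatever Python-library calculation is actually needed.
result = data.map(lambda line: line.upper())

print(result.collect())
sc.stop()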



Source: https://stackoverflow.com/questions/32975636/how-to-use-both-scala-and-python-in-a-same-spark-project
