I can run this script to save the file in text format, but when I call saveAsSequenceFile it fails with an error. If anyone knows how to save an RDD as a sequence file, please let me know the process. I looked for a solution in "Learning Spark" as well as in the official Spark documentation.
This runs successfully:
dataRDD = sc.textFile("/user/cloudera/sqoop_import/departments")
dataRDD.saveAsTextFile("/user/cloudera/pyspark/departments")
This fails:
dataRDD = sc.textFile("/user/cloudera/sqoop_import/departments")
dataRDD.saveAsSequenceFile("/user/cloudera/pyspark/departmentsSeq")
Error: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.saveAsSequenceFile. : org.apache.spark.SparkException: RDD element of type java.lang.String cannot be used
Here is the data:
2,Fitness
3,Footwear
4,Apparel
5,Golf
6,Outdoors
7,Fan Shop
8,TESTING
8000,TESTING
Sequence files store key-value pairs, so you cannot simply save an RDD[String]. Given your data, I guess you're looking for something like this:
rdd = sc.parallelize([
"2,Fitness", "3,Footwear", "4,Apparel"
])
rdd.map(lambda x: tuple(x.split(",", 1))).saveAsSequenceFile("testSeq")
If you want to keep the whole strings, just use None keys:
rdd.map(lambda x: (None, x)).saveAsSequenceFile("testSeqNone")
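The split-based mapping above assumes every line contains a comma; a line without one would yield a one-element tuple, which is not a key-value pair. As a hedged sketch, the helper below (to_pair is a hypothetical name, not part of Spark) splits on the first comma only and raises a clear error for malformed lines. It is plain Python, so it can be checked without a cluster.

```python
def to_pair(line):
    """Split 'key,value' on the first comma and return a (key, value) tuple."""
    key, sep, value = line.partition(",")
    if not sep:
        # No comma at all: fail early instead of producing a non-pair element.
        raise ValueError("expected 'key,value', got: %r" % line)
    return (key, value)


# Sample lines; the last one shows that extra commas stay in the value.
sample = ["2,Fitness", "7,Fan Shop", "9,a,b"]
pairs = [to_pair(line) for line in sample]
print(pairs)
```

On the RDD from the question, this would be used as dataRDD.map(to_pair).saveAsSequenceFile("/user/cloudera/pyspark/departmentsSeq").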
To write a sequence file, the keys and values must map to Hadoop Writable types, for example:
String as Text
Int as IntWritable
In Python:
# path is a placeholder output directory; replace it with your own
path = "testSeqNewAPI"
data = [(1, ""), (1, "a"), (2, "bcdf")]
sc.parallelize(data).saveAsNewAPIHadoopFile(path, "org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat", "org.apache.hadoop.io.IntWritable", "org.apache.hadoop.io.Text")
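The example above uses integer keys, whereas lines read with sc.textFile are plain strings. A minimal sketch of the conversion, assuming each line has the form id,name as in the question's data (to_int_text_pair is a hypothetical helper name, not a Spark function):

```python
def to_int_text_pair(line):
    """Turn 'id,name' into (int, str), intended for an IntWritable key and a Text value."""
    key, value = line.split(",", 1)
    # int() raises ValueError if the id is not numeric (assumed to be desired here).
    return (int(key), value)


# A few lines taken from the data shown in the question.
lines = ["2,Fitness", "3,Footwear", "8000,TESTING"]
records = [to_int_text_pair(line) for line in lines]
print(records)
```

With Spark, dataRDD.map(to_int_text_pair) should then be usable with saveAsNewAPIHadoopFile and the same class names as above.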
Source: https://stackoverflow.com/questions/34491579/saving-rdd-as-sequence-file-in-pyspark