Reading Sequence File in PySpark 2.0

心在旅途 asked on 2021-01-16 11:43

I have a sequence file whose values look like

(string_value, json_value)

I don't care about the string value.

In Scala I can read the file, but how do I do the same in PySpark 2.0?

2 Answers

    广开言路 answered on 2021-01-16 12:18

    For Spark 2.4.x, you have to get the SparkContext object from the SparkSession (the spark object), which exposes the sequenceFile API for reading Sequence Files.

    # Read the SequenceFile as an RDD of (key, value) pairs,
    # then convert it to a DataFrame for display.
    (spark.sparkContext
        .sequenceFile('/user/sequencesample')
        .toDF()
        .show())
    

    The above works like a charm.
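
    Since the question only cares about the JSON values, a minimal follow-up sketch (same assumed path; this step is not from the original answer) could drop the string keys and parse the JSON directly:

    # Keep only the JSON strings and let Spark infer their schema.
    json_rdd = (spark.sparkContext
        .sequenceFile('/user/sequencesample')
        .map(lambda rec: rec[1]))           # discard the string key

    json_df = spark.read.json(json_rdd)     # parse the JSON values into a DataFrame
    json_df.show()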

    For writing (Parquet to SequenceFile):

    # Read a Parquet file, build (key, value) pairs, and save as a SequenceFile.
    from pyspark.sql import functions as F

    (spark.read
        .format('parquet')
        .load('/user/parquet_sample')
        .select('id', F.concat_ws('|', 'id', 'name'))
        .rdd
        .map(lambda rec: (rec[0], rec[1]))  # (key, value) tuples
        .saveAsSequenceFile('/user/output'))
    

    First convert the DataFrame to an RDD and create (key, value) tuples before saving as a SequenceFile.
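
    As a sanity check (the path and column names here are assumptions, not part of the original answer), the freshly written file can be read back the same way:

    # Round trip: read the (key, value) pairs back and name the columns.
    (spark.sparkContext
        .sequenceFile('/user/output')
        .toDF(['id', 'id_name'])            # hypothetical column names
        .show())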

    I hope this answer serves your purpose.
