How do I read a parquet in PySpark written from Spark?

无人及你 2021-01-31 03:36

I am using two Jupyter notebooks to do different things in an analysis. In my Scala notebook, I write some of my cleaned data to parquet:

partitionedDF.select(...
2 Answers
  • 2021-01-31 04:08

    You can use the parquet method of the SparkSession reader to read parquet files, like this:

    df = spark.read.parquet("swift2d://xxxx.keystone/commentClusters.parquet")
    

    That said, there is essentially no difference between the parquet and load functions: parquet is just a shortcut for format("parquet") followed by load. The issue may be that load cannot infer the schema of the data in the file (e.g., a data type that load cannot identify, or one that is specific to Parquet).
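    For reference, a minimal sketch of the equivalence and of an explicit-schema workaround (the path is the one from the question; the column names and types are assumptions for illustration):

    from pyspark.sql import SparkSession
    from pyspark.sql.types import StructType, StructField, StringType, DoubleType

    spark = SparkSession.builder.appName('readParquet').getOrCreate()

    # parquet() is shorthand for format('parquet') followed by load()
    df1 = spark.read.parquet("swift2d://xxxx.keystone/commentClusters.parquet")
    df2 = spark.read.format("parquet").load("swift2d://xxxx.keystone/commentClusters.parquet")

    # if schema inference fails, the schema can be supplied explicitly
    # (these column names and types are hypothetical)
    schema = StructType([
        StructField("lowerText", StringType(), True),
        StructField("prediction", DoubleType(), True),
    ])
    df3 = spark.read.schema(schema).parquet("swift2d://xxxx.keystone/commentClusters.parquet")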

  • 2021-01-31 04:11

    I read parquet files in the following way:

    from pyspark.sql import SparkSession
    # initialise a SparkSession (the SparkContext is obtained from it below)
    spark = SparkSession.builder \
        .master('local') \
        .appName('myAppName') \
        .config('spark.executor.memory', '5gb') \
        .config("spark.cores.max", "6") \
        .getOrCreate()
    
    sc = spark.sparkContext
    
    # using the legacy SQLContext to read the parquet file
    from pyspark.sql import SQLContext
    sqlContext = SQLContext(sc)
    
    # to read parquet file
    df = sqlContext.read.parquet('path-to-file/commentClusters.parquet')
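
    Note that on Spark 2.x and later the SQLContext step is not required: the SparkSession's own reader can be used directly. A minimal sketch, assuming the same local setup and path as above:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder \
        .master('local') \
        .appName('myAppName') \
        .getOrCreate()

    # SparkSession exposes the same DataFrameReader, so no SQLContext is needed
    df = spark.read.parquet('path-to-file/commentClusters.parquet')
    df.printSchema()  # inspect the columns that were read back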
    