I am using two Jupyter notebooks to do different things in an analysis. In my Scala notebook, I write some of my cleaned data to parquet:
partitionedDF.select(...
You can use the parquet method of the SparkSession reader to read parquet files, like this:

df = spark.read.parquet("swift2d://xxxx.keystone/commentClusters.parquet")

That said, there is no real difference between the parquet and load functions, since load falls back to parquet as Spark's default data source format. It might be the case that load is not able to infer the schema of the data in the file (e.g., some data type which is not identifiable by load or is specific to parquet).
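To make the equivalence concrete, here is a minimal sketch (assuming the notebook environment is already configured for the swift2d filesystem; the path is the one from the question) showing that the two reader calls should produce the same DataFrame, because load uses parquet as Spark's default data source:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()  # reuse the notebook's existing session if one is active

path = "swift2d://xxxx.keystone/commentClusters.parquet"  # path from the question
df_parquet = spark.read.parquet(path)               # dedicated parquet reader
df_load = spark.read.format("parquet").load(path)   # generic load with an explicit format

df_parquet.printSchema()  # both reads should report the same schema
df_load.printSchema()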
I read parquet files in the following way:

from pyspark.sql import SparkSession

# initialise a SparkSession (and get its SparkContext)
spark = SparkSession.builder \
    .master('local') \
    .appName('myAppName') \
    .config('spark.executor.memory', '5gb') \
    .config("spark.cores.max", "6") \
    .getOrCreate()

sc = spark.sparkContext

# use SQLContext to read the parquet file
from pyspark.sql import SQLContext
sqlContext = SQLContext(sc)

# read the parquet file
df = sqlContext.read.parquet('path-to-file/commentClusters.parquet')
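As a side note, on Spark 2.x and later the SQLContext detour is optional: the SparkSession can read parquet directly. A minimal sketch, reusing the session settings from this answer:

from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .master('local') \
    .appName('myAppName') \
    .getOrCreate()

# read the parquet file straight through the SparkSession reader
df = spark.read.parquet('path-to-file/commentClusters.parquet')
df.printSchema()  # inspect the inferred schema
df.show(5)        # preview a few rows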