I am trying to create an empty dataframe in Spark (Pyspark).
I am using similar approach to the one discussed here enter link description here, but it is not working.
This will work with spark version 2.0.0 or more
from pyspark.sql import SQLContext
sc = spark.sparkContext
schema = StructType([StructField('col1', StringType(), False),StructField('col2', IntegerType(), True)])
sqlContext.createDataFrame(sc.emptyRDD(), schema)
If you want an empty dataframe based on an existing one, simple limit rows to 0. In PySpark :
emptyDf = existingDf.limit(0)
import pyspark
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType,StructField, StringType
spark = SparkSession.builder.appName('SparkPractice').getOrCreate()
schema = StructType([
StructField('firstname', StringType(), True),
StructField('middlename', StringType(), True),
StructField('lastname', StringType(), True)
])
df = spark.createDataFrame(spark.sparkContext.emptyRDD(),schema)
df.printSchema()
spark.range(0).drop("id")
This creates a DataFrame with an "id" column and no rows then drops the "id" column, leaving you with a truly empty DataFrame.
At the time this answer was written it looks like you need some sort of schema
from pyspark.sql.types import *
field = [StructField("field1", StringType(), True)]
schema = StructType(field)
sc = spark.sparkContext
sqlContext.createDataFrame(sc.emptyRDD(), schema)
You can just use something like this:
pivot_table = sparkSession.createDataFrame([("99","99")], ["col1","col2"])