I am trying to create an empty DataFrame in Spark (PySpark).
I am using a similar approach to the one discussed here, but it is not working.
You can create an empty DataFrame with the following syntax in PySpark. Because there are no rows to infer types from, the schema has to be given explicitly, e.g. as a DDL string:

df = spark.createDataFrame([], "col1 STRING, col2 STRING, ...")

where [] is the empty list of rows and the string is the schema (column names and types). Then you can register it as a temp view for your SQL queries:

df.createOrReplaceTempView("artist")
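For context, a minimal end-to-end sketch (assuming an existing SparkSession named spark; the column names and the artist view name are just placeholders):

df = spark.createDataFrame([], "col1 STRING, col2 INT")  # explicit schema, zero rows
df.createOrReplaceTempView("artist")
spark.sql("SELECT * FROM artist").show()  # prints the column header and no rows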
import spark.implicits._
val df = Seq.empty[String].toDF()

This will create an empty DataFrame. Helpful for testing purposes and all. (Scala Spark)
This is a roundabout but simple way to create an empty Spark DataFrame with an inferred schema:

from pyspark.sql.functions import col

# Initialize a Spark DataFrame using one row of data with the desired schema
init_sdf = spark.createDataFrame([('a_string', 0, 0)], ['name', 'index', 'seq_#'])
# Remove the row with a filter that never matches; this leaves the schema intact
empty_sdf = init_sdf.where(col('name') == 'not_match')
empty_sdf.printSchema()
# Output
root
|-- name: string (nullable = true)
|-- index: long (nullable = true)
|-- seq_#: long (nullable = true)
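A quick check (still assuming the same spark session) confirms that the rows are gone but the schema survives:

empty_sdf.count()    # 0 -- the filter matches nothing
empty_sdf.columns    # ['name', 'index', 'seq_#'] -- columns and types are preserved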
You can do it by loading an empty file (parquet, json, etc.) like this:
df = sqlContext.read.json("my_empty_file.json")
Then when you try to check the schema you'll see:
>>> df.printSchema()
root
In Scala/Java, not passing a path should work too; in Python it throws an exception. Also, if you ever switch to Scala/Java, you can use this method to create one.
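The output above is just root because an empty JSON file carries no schema of its own. If you want named columns, you can pass the schema explicitly when reading; a small sketch using the newer SparkSession API (the file name and column names are placeholders):

from pyspark.sql.types import StructType, StructField, StringType, LongType

schema = StructType([
    StructField("name", StringType(), True),
    StructField("count", LongType(), True),
])
df = spark.read.schema(schema).json("my_empty_file.json")
df.printSchema()  # shows name and count, with zero rows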
Extending Joe Widen's answer, you can actually create a schema with no fields like so:

from pyspark.sql.types import StructType

schema = StructType([])

When you create the DataFrame using that as your schema, you'll end up with a DataFrame[].
>>> empty = sqlContext.createDataFrame(sc.emptyRDD(), schema)
>>> empty
DataFrame[]
>>> empty.schema
StructType(List())
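The same emptyRDD trick works with a populated schema if you want named, typed columns; a short sketch using the SparkSession API (the column names are just examples):

from pyspark.sql.types import StructType, StructField, StringType, LongType

schema = StructType([
    StructField("name", StringType(), True),
    StructField("index", LongType(), True),
])
# spark.sparkContext.emptyRDD() plays the role of sc.emptyRDD() above
empty = spark.createDataFrame(spark.sparkContext.emptyRDD(), schema)
empty.printSchema()  # name and index are present, but there are no rows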
In Scala, if you choose to use sqlContext.emptyDataFrame and check out the schema, it will return StructType().
scala> val empty = sqlContext.emptyDataFrame
empty: org.apache.spark.sql.DataFrame = []
scala> empty.schema
res2: org.apache.spark.sql.types.StructType = StructType()