How to get the schema definition from a dataframe in PySpark?

萌比男神i 2021-02-12 14:25

In PySpark you can define a schema and read data sources with this predefined schema, e.g.:

Schema = StructType([ Str         


        
4 Answers
  • 2021-02-12 14:29

    If you are looking for a DDL string from PySpark:

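    from pyspark.sql import DataFrame

    df: DataFrame = spark.read.load('LOCATION')
    # Serialize the schema to JSON, then let the JVM-side DataType render it as DDL
    schema_json = df.schema.json()
    ddl = spark.sparkContext._jvm.org.apache.spark.sql.types.DataType.fromJson(schema_json).toDDL()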
    
  • 2021-02-12 14:32

    The code below extracts the schema of an existing DataFrame as a list of StructField objects. This is useful when you have a very large number of columns and writing the schema by hand is cumbersome. You can then apply it to a new DataFrame and hand-edit any fields you want to change.

    from pyspark.sql.types import StructType
    
    schema = [i for i in df.schema] 
    

    And then from here, you have your new schema:

    NewSchema = StructType(schema)
    
  • 2021-02-12 14:37

    You can reuse the schema of an existing DataFrame:

    l = [('Ankita',25,'F'),('Jalfaizy',22,'M'),('saurabh',20,'M'),('Bala',26,None)]
    people_rdd=spark.sparkContext.parallelize(l)
    schemaPeople = people_rdd.toDF(['name','age','gender'])
    
    schemaPeople.show()
    
    +--------+---+------+
    |    name|age|gender|
    +--------+---+------+
    |  Ankita| 25|     F|
    |Jalfaizy| 22|     M|
    | saurabh| 20|     M|
    |    Bala| 26|  null|
    +--------+---+------+
    
    spark.createDataFrame(people_rdd,schemaPeople.schema).show()
    
    +--------+---+------+
    |    name|age|gender|
    +--------+---+------+
    |  Ankita| 25|     F|
    |Jalfaizy| 22|     M|
    | saurabh| 20|     M|
    |    Bala| 26|  null|
    +--------+---+------+
    

    Just use df.schema to get the underlying schema of the DataFrame:

    schemaPeople.schema
    
    StructType(List(StructField(name,StringType,true),StructField(age,LongType,true),StructField(gender,StringType,true)))
    
  • 2021-02-12 14:52

    Yes, it is possible. Use the DataFrame.schema property:

    schema

    Returns the schema of this DataFrame as a pyspark.sql.types.StructType.

    >>> df.schema
    StructType(List(StructField(age,IntegerType,true),StructField(name,StringType,true)))
    

    New in version 1.3.

    The schema can also be exported to JSON and imported back if needed.
