Question
The Spark SQL syntax for a CREATE query is as follows:
CREATE [TEMPORARY] TABLE [IF NOT EXISTS] [db_name.]table_name
[(col_name1 col_type1 [COMMENT col_comment1], ...)]
USING datasource
[OPTIONS (key1=val1, key2=val2, ...)]
[PARTITIONED BY (col_name1, col_name2, ...)]
[CLUSTERED BY (col_name3, col_name4, ...) INTO num_buckets BUCKETS]
[LOCATION path]
[COMMENT table_comment]
[TBLPROPERTIES (key1=val1, key2=val2, ...)]
[AS select_statement]
where [x] means x is optional. I want the output as a tuple in the following order when a CREATE SQL query is passed:
(db_name, table_name, [(col1 name, col1 type), (col2 name, col2 type), ...])
So is there any way to do that with PySpark SQL functions, or do I need regex? If regex, could anyone please help with the regular expression?
Answer 1:
It can be done by accessing the unofficial API through java_gateway:
import json
from pyspark.sql.types import StructType

# Parse the DDL statement into a logical plan via the internal SQL parser
plan = spark_session._jsparkSession.sessionState().sqlParser().parsePlan(
    "CREATE TABLE foobar.test (foo INT, bar STRING) USING json"
)
print(f"database: {plan.tableDesc().identifier().database().get()}")
print(f"table: {plan.tableDesc().identifier().table()}")
# perhaps there is a better way to convert the schema; using a JSON string hack here
print(f"schema: {StructType.fromJson(json.loads(plan.tableDesc().schema().json()))}")
Output:
database: foobar
table: test
schema: StructType(List(StructField(foo,IntegerType,true),StructField(bar,StringType,true)))
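To assemble the exact tuple the question asks for, the parsed schema can be flattened into (name, type) pairs. A minimal sketch reusing the plan object from above; rendering types with simpleString() is my own choice here, not something the question prescribes:

# Flatten the parsed schema into (column name, column type) pairs and
# build the (db_name, table_name, columns) tuple from the question.
schema = StructType.fromJson(json.loads(plan.tableDesc().schema().json()))
columns = [(f.name, f.dataType.simpleString()) for f in schema.fields]
result = (
    plan.tableDesc().identifier().database().get(),  # see the Option caveat below
    plan.tableDesc().identifier().table(),
    columns,
)
print(result)  # ('foobar', 'test', [('foo', 'int'), ('bar', 'string')])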
Note that database().get() would fail if the database is not defined, so the returned Scala Option should be handled properly (see the sketch after the links below). Also, if you use CREATE TEMPORARY VIEW, the resulting command has differently named accessors. The command classes can be found here:
https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/ddl.scala#L38
https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/ddl.scala#L58
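As a minimal sketch of handling that Scala Option from Python, assuming the same plan object as above, isDefined() can guard the get() call:

ident = plan.tableDesc().identifier()
# database() returns a Scala Option; calling get() on an empty Option
# raises an error, so check isDefined() first and fall back to None.
db_name = ident.database().get() if ident.database().isDefined() else None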
Source: https://stackoverflow.com/questions/56383966/how-to-extract-column-name-and-column-type-from-sql-in-pyspark