How to extract column name and column type from SQL in pyspark

心已入冬 提交于 2021-02-08 10:01:41

问题


The Spark SQL for Create query is like this -

CREATE [TEMPORARY] TABLE [IF NOT EXISTS] [db_name.]table_name
  [(col_name1 col_type1 [COMMENT col_comment1], ...)]
  USING datasource
  [OPTIONS (key1=val1, key2=val2, ...)]
  [PARTITIONED BY (col_name1, col_name2, ...)]
  [CLUSTERED BY (col_name3, col_name4, ...) INTO num_buckets BUCKETS]
  [LOCATION path]
  [COMMENT table_comment]
  [TBLPROPERTIES (key1=val1, key2=val2, ...)]
  [AS select_statement]

where [x] means x is optional. I want the output as a tuple of following order if a CREATE sql query is passed -

(db_name, table_name, [(col1 name, col1 type), (col2 name, col2 type), ...])

So is there any way to do that with pyspark sql functions or need help from regex?

If regex could anyone please help with the regular expression?


回答1:


It can be done by accessing the unofficial API through java_gateway:

plan = spark_session._jsparkSession.sessionState().sqlParser().parsePlan("CREATE TABLE foobar.test (foo INT, bar STRING) USING json")
print(f"database: {plan.tableDesc().identifier().database().get()}")
print(f"table: {plan.tableDesc().identifier().table()}")
# perhaps there is a better way to convert the schemas, using JSON string hack here
print(f"schema: {StructType.fromJson(json.loads(plan.tableDesc().schema().json()))}")

Output:

database: foobar
table: test
schema: StructType(List(StructField(foo,IntegerType,true),StructField(bar,StringType,true)))

Note that database().get() would fail if the database is not defined and the Scala option should be handled properly. Also, if you use CREATE TEMPORARY VIEW the accessors are named differently. The commands can be found here https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/ddl.scala#L38 https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/ddl.scala#L58



来源:https://stackoverflow.com/questions/56383966/how-to-extract-column-name-and-column-type-from-sql-in-pyspark

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!