Is there any python library that can be used to just get the schema of a parquet file?
Currently we are loading the parquet file into a dataframe in Spark and getting the schema
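There's now an easier way: the read_schema function in pyarrow.parquet, which reads only the schema and not the data. Note that it returns a pyarrow Schema object, not a dict. Its metadata attribute is a dict with bytes keys and bytes values, and Spark stores its own schema there as a JSON string, so you need an extra step to convert that into a proper Python dict.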
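from pyarrow.parquet import read_schema
import json

# source is the path to the parquet file (or a file-like object)
schema = read_schema(source)
# This metadata key is written by Spark, so it is only present in Spark-written files
schema_dict = json.loads(schema.metadata[b'org.apache.spark.sql.parquet.row.metadata'])['fields']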