Is there a Python library that can be used to get just the schema of a parquet file?
Currently we are loading the parquet file into a dataframe in Spark and getting the schema from it.
This function takes a local URI pointing to a parquet file and returns the file's schema as a pandas dataframe. It does not read the whole file, only the schema.
import pandas as pd
import pyarrow.parquet


def read_parquet_schema_df(uri: str) -> pd.DataFrame:
    """Return a Pandas dataframe corresponding to the schema of a local URI of a parquet file.

    The returned dataframe has the columns: column, pa_dtype
    """
    # Ref: https://stackoverflow.com/a/64288036/
    schema = pyarrow.parquet.read_schema(uri, memory_map=True)
    schema = pd.DataFrame(({"column": name, "pa_dtype": str(pa_dtype)} for name, pa_dtype in zip(schema.names, schema.types)))
    schema = schema.reindex(columns=["column", "pa_dtype"], fill_value=pd.NA)  # Ensures columns in case the parquet file has an empty dataframe.
    return schema
It was tested with the following versions of the used third-party packages:
$ pip list | egrep 'pandas|pyarrow'
pandas 1.1.3
pyarrow 1.0.1