Get schema of parquet file in Python

误落风尘 2021-02-10 07:23

Is there a Python library that can be used to get just the schema of a parquet file?

Currently we are loading the parquet file into dataframe in Spark and getting schem

5 answers
  •  一个人的身影
    2021-02-10 07:29

    This function takes the local URI of a parquet file and returns the file's schema as a Pandas dataframe. It reads only the schema, not the whole file.

    import pandas as pd
    import pyarrow.parquet
    
    
    def read_parquet_schema_df(uri: str) -> pd.DataFrame:
        """Return a Pandas dataframe corresponding to the schema of a local URI of a parquet file.
    
        The returned dataframe has the columns: column, pa_dtype
        """
        # Ref: https://stackoverflow.com/a/64288036/
        schema = pyarrow.parquet.read_schema(uri, memory_map=True)
        schema = pd.DataFrame(({"column": name, "pa_dtype": str(pa_dtype)} for name, pa_dtype in zip(schema.names, schema.types)))
        schema = schema.reindex(columns=["column", "pa_dtype"], fill_value=pd.NA)  # Ensures both columns exist even when the parquet file's schema has no fields.
        return schema
    

    It was tested with the following versions of the third-party packages used:

    $ pip list | egrep 'pandas|pyarrow'
    pandas             1.1.3
    pyarrow            1.0.1
    
