Get schema of parquet file in Python

误落风尘 2021-02-10 07:23

Is there any python library that can be used to just get the schema of a parquet file?

Currently we are loading the parquet file into a dataframe in Spark and getting the schema from the dataframe.

5 Answers
  • 2021-02-10 07:26

    There's now an easier way, with the read_schema method. Note that it actually returns a schema object whose metadata holds your schema as a bytes literal, so you need an extra step to convert it into a proper Python dict.

    from pyarrow.parquet import read_schema
    import json
    
    schema = read_schema(source)
    schema_dict = json.loads(schema.metadata[b'org.apache.spark.sql.parquet.row.metadata'])['fields']
    
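    One caveat worth guarding for: the b'org.apache.spark.sql.parquet.row.metadata' key only exists in files written by Spark, and schema.metadata can be None entirely for other writers. A minimal sketch with a fallback, using an illustrative file written by pyarrow itself (the filename is arbitrary):

    ```python
    import json
    import pyarrow as pa
    import pyarrow.parquet as pq
    from pyarrow.parquet import read_schema

    # Write a small parquet file with pyarrow (no Spark metadata key present).
    pq.write_table(pa.table({"id": [1, 2], "name": ["a", "b"]}), "example.parquet")

    schema = read_schema("example.parquet")
    meta = schema.metadata or {}  # metadata can be None
    spark_key = b"org.apache.spark.sql.parquet.row.metadata"
    if spark_key in meta:
        fields = json.loads(meta[spark_key])["fields"]
    else:
        # Fall back to the Arrow schema when the file wasn't written by Spark.
        fields = [{"name": n, "type": str(t)} for n, t in zip(schema.names, schema.types)]
    print(fields)
    ```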
  • 2021-02-10 07:29

    This function returns the schema of a local URI representing a parquet file. The schema is returned as a usable Pandas dataframe. The function does not read the whole file, just the schema.

    import pandas as pd
    import pyarrow.parquet
    
    
    def read_parquet_schema_df(uri: str) -> pd.DataFrame:
        """Return a Pandas dataframe corresponding to the schema of a local URI of a parquet file.
    
        The returned dataframe has the columns: column, pa_dtype
        """
        # Ref: https://stackoverflow.com/a/64288036/
        schema = pyarrow.parquet.read_schema(uri, memory_map=True)
        schema = pd.DataFrame(({"column": name, "pa_dtype": str(pa_dtype)} for name, pa_dtype in zip(schema.names, schema.types)))
        schema = schema.reindex(columns=["column", "pa_dtype"], fill_value=pd.NA)  # Ensures columns in case the parquet file has an empty dataframe.
        return schema
    

    It was tested with the following versions of the third-party packages used:

    $ pip list | egrep 'pandas|pyarrow'
    pandas             1.1.3
    pyarrow            1.0.1
    
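    The core idea of the function can be exercised without Spark by writing a tiny file with pyarrow and building the same column/pa_dtype frame (the filename below is arbitrary):

    ```python
    import pandas as pd
    import pyarrow as pa
    import pyarrow.parquet as pq

    # A small sample file to inspect (name is arbitrary).
    pq.write_table(pa.table({"id": [1, 2], "name": ["a", "b"]}), "sample.parquet")

    # Only the footer is read; the row data is never loaded.
    schema = pq.read_schema("sample.parquet", memory_map=True)
    df = pd.DataFrame(
        {"column": list(schema.names), "pa_dtype": [str(t) for t in schema.types]}
    )
    print(df)
    ```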
  • 2021-02-10 07:35

    This is supported by using pyarrow (https://github.com/apache/arrow/).

    from pyarrow.parquet import ParquetFile
    # Source is either the filename or an Arrow file handle (which could be on HDFS)
    ParquetFile(source).metadata
    

    Note: We merged the code for this only yesterday, so you need to build it from source, see https://github.com/apache/arrow/commit/f44b6a3b91a15461804dd7877840a557caa52e4e

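    In current pyarrow releases no source build is needed; ParquetFile exposes both the low-level Parquet footer metadata and the Arrow-level schema. A sketch (the filename is arbitrary):

    ```python
    import pyarrow as pa
    import pyarrow.parquet as pq

    pq.write_table(pa.table({"id": [1, 2, 3]}), "example.parquet")

    pf = pq.ParquetFile("example.parquet")
    print(pf.metadata.num_rows)        # row count recorded in the footer
    print(pf.metadata.num_row_groups)  # number of row groups
    print(pf.schema_arrow)             # the schema as a pyarrow.Schema
    ```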
  • 2021-02-10 07:48

    In addition to the answer by @mehdio, in case your parquet is a directory (e.g. a parquet dataset generated by Spark), you can read the schema / column names like this:

    import pyarrow.parquet as pq
    pfile = pq.read_table("file.parquet")
    print("Column names: {}".format(pfile.column_names))
    print("Schema: {}".format(pfile.schema))
    
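    One thing to be aware of: pq.read_table loads the actual row data just to get at the schema. For a directory of part files, the pyarrow.dataset API (available in recent pyarrow releases) can resolve the schema without reading the rows; a sketch with a hypothetical directory name:

    ```python
    import pyarrow as pa
    import pyarrow.dataset as ds
    import pyarrow.parquet as pq

    # Simulate a Spark-style directory of parquet part files.
    table = pa.table({"id": [1, 2], "val": ["a", "b"]})
    pq.write_to_dataset(table, root_path="parquet_dir")

    dataset = ds.dataset("parquet_dir", format="parquet")
    print(dataset.schema)        # resolved from file footers, rows are not loaded
    print(dataset.schema.names)
    ```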
  • 2021-02-10 07:54

    As other commenters have mentioned, PyArrow is the easiest way to grab the schema of a Parquet file with Python. My answer goes into more detail about the schema that's returned by PyArrow and the metadata that's stored in Parquet files.

    import pyarrow.parquet as pq
    
    table = pq.read_table(path)
    table.schema # returns the schema
    

    Here's how to create a PyArrow schema (this is the object that's returned by table.schema):

    import pyarrow as pa
    
    pa.schema([
        pa.field("id", pa.int64(), True),
        pa.field("last_name", pa.string(), True),
        pa.field("position", pa.string(), True)])
    

    Each PyArrow Field has name, type, nullable, and metadata properties. See here for more details on how to write custom file / column metadata to Parquet files with PyArrow.

    The type property holds a PyArrow DataType object. pa.int64() and pa.string() are examples of PyArrow DataTypes.
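    For instance, a single Field can be inspected directly (the metadata key here is just an illustration; note that pyarrow stores field metadata as bytes):

    ```python
    import pyarrow as pa

    field = pa.field("id", pa.int64(), nullable=True, metadata={"desc": "primary key"})
    print(field.name)      # id
    print(field.type)      # int64
    print(field.nullable)  # True
    print(field.metadata)  # keys and values come back as bytes
    ```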

    Make sure you understand column-level metadata like min / max. That'll help you understand some of the cool features like predicate pushdown filtering that Parquet files allow for in big data systems.
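    These min / max statistics live in the row-group metadata and can be read with pyarrow; a minimal sketch (filename arbitrary; pyarrow writes statistics by default):

    ```python
    import pyarrow as pa
    import pyarrow.parquet as pq

    pq.write_table(pa.table({"id": [3, 1, 7]}), "stats_example.parquet")

    meta = pq.ParquetFile("stats_example.parquet").metadata
    stats = meta.row_group(0).column(0).statistics
    print(stats.min, stats.max)  # per-column min/max enable predicate pushdown
    ```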
