Get schema of parquet file in Python


误落风尘 2021-02-10 07:23

Is there any python library that can be used to just get the schema of a parquet file?

Currently we are loading the parquet file into dataframe in Spark and getting schem

5 answers

清酒与你 (original poster)

2021-02-10 07:54
As other commenters have mentioned, PyArrow is the easiest way to grab the schema of a Parquet file with Python. My answer goes into more detail about the schema that's returned by PyArrow and the metadata that's stored in Parquet files.
```
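import pyarrow.parquet as pq

path = "example.parquet"  # hypothetical path; point this at your own file
table = pq.read_table(path)  # loads the whole file into memory
table.schema  # returns the schema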
```
Here's how to create a PyArrow schema (this is the object that's returned by table.schema):
```
import pyarrow as pa

pa.schema([
    pa.field("id", pa.int64(), True),
    pa.field("last_name", pa.string(), True),
    pa.field("position", pa.string(), True)])
```
Each PyArrow Field has name, type, nullable, and metadata properties. PyArrow can also write custom file-level and column-level metadata to Parquet files.

The type property is for PyArrow DataType objects. pa.int64() and pa.string() are examples of PyArrow DataTypes.

Make sure you understand column-level metadata like min / max statistics. That'll help you understand some of the cool features, like predicate pushdown filtering, that Parquet files allow for in big data systems.