Transferring and writing Parquet with Python and pandas gives a timestamp error

春和景丽 2021-02-19 00:44

I tried to concat() two Parquet files with pandas in Python.
It works, but when I try to write and save the DataFrame to a Parquet file, it displays the error:

5 answers

被撕碎了的回忆 (original poster)

2021-02-19 01:34

I experienced a related order-of-magnitude problem when writing dask DataFrames with datetime64[ns] columns to AWS S3 and crawling them into Athena tables.

The problem was that subsequent Athena queries showed the datetime fields with years above 57000 instead of 2020. The following fix worked for me:

df.to_parquet(path, times="int96")

This forwards the keyword argument **{"times": "int96"} to fastparquet.writer.write().

I checked the resulting Parquet file with the parquet-tools package. It does show the datetime columns in INT96 storage format. Athena (which is based on Presto) supports the INT96 format well and does not have the order-of-magnitude problem.
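
To illustrate the kind of order-of-magnitude problem described above, here is a minimal sketch using only the Python standard library. It assumes the cause is a unit mismatch of a factor of 1000 (a nanosecond count read as microseconds); which units the writer and the reader actually used in my case is an assumption, and the exact year you see depends on the original date and on the units involved, so the number will not match the one above exactly. The helper approx_year is hypothetical and only for illustration.

```python
from datetime import datetime, timezone

# Mean length of a Gregorian year in seconds (good enough for an estimate).
SECONDS_PER_YEAR = 365.2425 * 24 * 3600


def approx_year(epoch_value, units_per_second):
    """Approximate calendar year for an epoch count expressed in some unit.

    Hypothetical helper: it avoids datetime's year-9999 limit by doing
    plain arithmetic instead of building a datetime object.
    """
    return 1970 + (epoch_value / units_per_second) / SECONDS_PER_YEAR


# A timestamp in 2020, stored as nanoseconds since the Unix epoch,
# which is how a datetime64[ns] column holds it.
ts = datetime(2020, 1, 1, tzinfo=timezone.utc)
epoch_ns = int(ts.timestamp()) * 1_000_000_000

# Read back with the right unit (nanoseconds): roughly year 2020.
correct = approx_year(epoch_ns, 1_000_000_000)

# Read back as if the same integer were microseconds (assumed mismatch):
# the value is 1000x too large, so the year lands in the tens of thousands.
misread = approx_year(epoch_ns, 1_000_000)

print(f"read as ns: ~{correct:.0f}")
print(f"read as us: ~{misread:.0f}")
```

If the years in your query results are off by a similar factor, a unit mismatch between writer and reader is a likely suspect, and storing the timestamps as INT96 sidesteps the question of units.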

Reference: https://github.com/dask/fastparquet/blob/master/fastparquet/writer.py, function write(), kwarg times. (dask 2.30.0 ; fastparquet 0.4.1 ; pandas 1.1.4)
