Timestamp error when concatenating and writing Parquet files with Python and pandas

I tried to concat() two Parquet files with pandas in Python.
That works, but when I try to write the resulting DataFrame to a Parquet file, it displays the error:

5 Answers

盖世英雄少女心

2021-02-19 01:08
I think this is a bug and you should do what Wes says. However, if you need working code now, I have a workaround.

The solution that worked for me was to specify the timestamp columns to be millisecond precision. If you need nanosecond precision, this will ruin your data... but if that's the case, it may be the least of your problems.
```
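import pandas as pd

# Read the two input files.
table1 = pd.read_parquet('path1.parquet')
table2 = pd.read_parquet('path2.parquet')

# Reduce the timestamp columns to millisecond precision (this discards finer detail).
table1["Date"] = table1["Date"].astype("datetime64[ms]")
table2["Date"] = table2["Date"].astype("datetime64[ms]")

# Concatenate and write the result as a gzip-compressed Parquet file.
table = pd.concat([table1, table2], ignore_index=True)
table.to_parquet('./file.gzip', compression='gzip')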
```
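As far as I know, what astype("datetime64[ms]") does to a Series has differed between pandas versions, so here is a minimal sketch of the same idea using dt.floor("ms") instead. This is a swapped-in variant, not the code above; the sample value is made up, and only the "Date" column name comes from the answer. It shows that the digits below one millisecond are dropped, so a later cast to millisecond resolution should have nothing left to lose.

```python
import pandas as pd

# Hypothetical sample data with nanosecond detail; "Date" follows the answer above.
df = pd.DataFrame({"Date": pd.to_datetime(["2020-01-01 00:00:00.123456789"])})

# Floor every timestamp to a whole millisecond.
# Everything below one millisecond (here 456789 ns) is discarded.
df["Date"] = df["Date"].dt.floor("ms")

print(df["Date"].iloc[0])
```

If you need the sub-millisecond part, do not use this approach.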
有刺的猬

2021-02-19 01:17

Pandas already forwards unknown kwargs to the underlying parquet-engine since at least v0.22. As such, using table.to_parquet(allow_truncated_timestamps=True) should work - I verified it for pandas v0.25.0 and pyarrow 0.13.0. For more keywords see the pyarrow docs.

我在风中等你

2021-02-19 01:22
Thanks to @axel for the link to Apache Arrow documentation:

allow_truncated_timestamps (bool, default False) – Allow loss of data when coercing timestamps to a particular resolution. E.g. if microsecond or nanosecond data is lost when coercing to ‘ms’, do not raise an exception.

It seems like in modern Pandas versions we can pass parameters to ParquetWriter.

The following code worked properly for me (Pandas 1.1.1, PyArrow 1.0.1):
```
df.to_parquet(filename, use_deprecated_int96_timestamps=True)
```
庸人自扰

2021-02-19 01:29
I experienced a similar problem while using to_parquet. My final workaround was to use the argument engine='fastparquet', but I realize this doesn't help if you need to use PyArrow specifically.

Things I tried which did not work:
- @DrDeadKnee's workaround of manually casting columns .astype("datetime64[ms]") did not work for me (pandas v. 0.24.2)
- Passing coerce_timestamps='ms' as a kwarg to the underlying parquet operation did not change behaviour.
被撕碎了的回忆

2021-02-19 01:34

I experienced a related order-of-magnitude problem when writing dask DataFrames with datetime64[ns] columns to AWS S3 and crawling them into Athena tables.

The problem was that subsequent Athena queries showed the datetime fields as year >57000 instead of 2020. I managed to use the following fix:

df.to_parquet(path, times="int96")

This forwards the kwarg **{"times": "int96"} to fastparquet.writer.write().

I checked the resulting Parquet file using the parquet-tools package. It does show the datetime columns in the INT96 storage format. On Athena (which is based on Presto) the INT96 format is well supported and does not have the order-of-magnitude problem.

Reference: https://github.com/dask/fastparquet/blob/master/fastparquet/writer.py, function write(), kwarg times. (dask 2.30.0 ; fastparquet 0.4.1 ; pandas 1.1.4)
