Question
Is there a special pyarrow data type I should use for columns which have lists of dictionaries when I save to a parquet file?
If I save lists or lists of dictionaries as a string, I normally have to .apply(eval) to the field when I read it back into memory in order for pandas to recognize the data as a list (so I can normalize it with pd.json_normalize).
column_a:
[
{"id": "something", "value": "else"},
{"id": "something2", "value": "else2"},
]
column_b:
["test", "test2", "test3"]
Just wondering if I should save this data as something else besides a string.
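For reference, the string round-trip described above looks roughly like this (column names follow the example; ast.literal_eval is used as a safer stand-in for the eval call):

```python
import ast

import pandas as pd

# The list-of-dicts column comes back from parquet as a plain string ...
df = pd.DataFrame({
    "column_a": ['[{"id": "something", "value": "else"},'
                 ' {"id": "something2", "value": "else2"}]'],
})

# ... so it has to be eval'd back into Python objects before
# pd.json_normalize can do anything useful with it.
df["column_a"] = df["column_a"].apply(ast.literal_eval)
normalized = pd.json_normalize(df["column_a"].explode().tolist())
```

This is the workaround the question wants to avoid by picking a better on-disk type.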
Edit: pasting a snippet of some raw JSON from Zendesk. The audits field has a field called events which is a list of dictionaries. Inside that, there can be other lists of dictionaries as well (attachments, and inside that a list of dictionaries called thumbnails).
Can pa.map_ handle situations like this? I sometimes need to retrieve data from these nested fields which I don't even know exist initially. In my current parquet dataset, the events field is just a single string-typed column even though there are many nested fields within it.
udt = pa.map_(pa.string(), pa.string())
"audit": {
"id": ,
"ticket_id": ,
"created_at": "",
"author_id": ,
"events": [
{
"id": ,
"type": "",
"author_id": ,
"body": "" ,
"plain_body": "",
"public": false,
"attachments": [
{
"url": "",
"id": ,
"file_name": "",
"content_url": "",
"content_type": "image/png",
"size": 2888,
"width": 100,
"height": 30,
"inline": false,
"deleted": false,
"thumbnails": [
{
"url": "",
"id": ,
"file_name": "",
"content_url": "",
"mapped_content_url": "",
"content_type": "image/png",
"size": 2075,
"width": 80,
"height": 24,
"inline": false,
"deleted": false
}
]
},
Answer 1:
Assuming you have a df with "dictionary" and string columns, and the dictionaries all have the same keys (id, value in your case):
import pandas as pd
import pyarrow as pa

df = pd.DataFrame({
    'col1': pd.Series([
        {"id": "something", "value": "else"},
        {"id": "something2", "value": "else2"}
    ]),
    'col2': pd.Series(['foo', 'bar'])
})

udt = pa.struct([pa.field('id', pa.string()), pa.field('value', pa.string())])
schema = pa.schema([pa.field('col1', udt), pa.field('col2', pa.string())])
table = pa.Table.from_pandas(df, schema=schema)
df = table.to_pandas()
If your dictionaries don't have the same keys or you don't know the keys of the dictionaries in advance, you can do this:
df = pd.DataFrame({
    'col1': pd.Series([
        [('id', 'something'), ('value', 'else')],
        [('id', 'something2'), ('value', 'else2')],
    ]),
    'col2': pd.Series(['foo', 'bar'])
})

udt = pa.map_(pa.string(), pa.string())
schema = pa.schema([pa.field('col1', udt), pa.field('col2', pa.string())])
table = pa.Table.from_pandas(df, schema=schema)
Note that the format for col1 is different: it uses a list of key/value pairs instead of a dict. Also, you can't convert the table back to pandas, as that is not supported (yet):
table.to_pandas()
>>> ArrowNotImplementedError: No known equivalent Pandas block for Arrow data of type map<string, string> is known.
Source: https://stackoverflow.com/questions/63553715/pyarrow-data-types-for-columns-that-have-lists-of-dictionaries