How to handle metadata associated with a pandas DataFrame?

Submitted by 荒凉一梦 on 2019-12-10 17:49:25

Question


What is the best practice for attaching meta information to a DataFrame? I know of the following coding practice:

import pandas as pd
df = pd.DataFrame([])
df.currency = 'USD'
df.measure = 'Price'
df.frequency = 'daily'

But as stated in the post "Adding meta-information/metadata to pandas DataFrame", this carries the risk of losing the information when applying functions such as "groupby, pivot, join or loc", as they may return "a new DataFrame without the metadata attached".
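The risk described above can be reproduced with a small sketch. The frame contents and the variable names (`subset`, `has_on_original`, `has_on_subset`) are made up for illustration; the point is only that a selection returns a new object that does not carry a plain Python attribute along.

```python
import pandas as pd

df = pd.DataFrame({"price": [90, 85, 70]})
df.currency = "USD"  # plain Python attribute; pandas does not track it

# Boolean selection builds a new DataFrame object
subset = df[df["price"] > 80]

has_on_original = hasattr(df, "currency")
has_on_subset = hasattr(subset, "currency")
print(has_on_original, has_on_subset)
```

The attribute stays on the original object but is missing on the derived one, which is the loss the linked post warns about.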

Is this still valid or has there been an update to meta information processing in the meantime? What would be an alternative coding practice?
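For context on the "update" part of the question: pandas 1.0 and later expose a `DataFrame.attrs` dictionary for this kind of information. The pandas documentation marks it as experimental, and it should not be assumed that every operation propagates it, so the following is only a sketch of how the three values above could be stored there. The name `meta` is illustrative.

```python
import pandas as pd

df = pd.DataFrame({"price": [90, 85, 70]})

# Store metadata in the experimental attrs dict (pandas >= 1.0)
df.attrs["currency"] = "USD"
df.attrs["measure"] = "Price"
df.attrs["frequency"] = "daily"

# Take a plain-dict snapshot of the metadata
meta = dict(df.attrs)
print(meta)
```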

I do not think building a separate object is very suitable. Working with a MultiIndex does not convince me either. Let's say I want to divide a DataFrame with prices by a DataFrame with earnings. Working with MultiIndexes would be very involved.

# define price DataFrame
p_index = pd.MultiIndex.from_tuples([['Apple', 'price', 'daily'],['MSFT', 'price', 'daily']])
price = pd.DataFrame([[90, 20], [85, 30], [70, 25]], columns=p_index)

# define earnings dataframe
e_index = pd.MultiIndex.from_tuples(
    [['Apple', 'earnings', 'daily'], ['MSFT', 'earnings', 'daily']])
earnings=pd.DataFrame([[5000, 2000], [5800, 2200], [5100, 3000]], 
                columns=e_index)

price.divide(earnings.values, level=1, axis=0)

In the example above I do not even ensure that the company labels really match, because passing earnings.values discards them. I would probably need to invoke pd.DataFrame.reindex() or similar. In my view this cannot be good coding practice.
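One possible way to let pandas do the label matching, sketched here as an assumption rather than an established recommendation, is to drop the metadata levels so that both frames are keyed by company only and then divide the frames themselves instead of `.values`. The earnings columns are deliberately given in a different order to show that the division aligns on labels.

```python
import pandas as pd

p_index = pd.MultiIndex.from_tuples(
    [("Apple", "price", "daily"), ("MSFT", "price", "daily")])
price = pd.DataFrame([[90, 20], [85, 30], [70, 25]], columns=p_index)

# Earnings columns intentionally listed in a different company order
e_index = pd.MultiIndex.from_tuples(
    [("MSFT", "earnings", "daily"), ("Apple", "earnings", "daily")])
earnings = pd.DataFrame([[2000, 5000], [2200, 5800], [3000, 5100]],
                        columns=e_index)

# Keep only the company level (level 0) on each frame
price_by_company = price.droplevel([1, 2], axis=1)
earnings_by_company = earnings.droplevel([1, 2], axis=1)

# DataFrame / DataFrame aligns on column labels, not on position
ratio = price_by_company / earnings_by_company
print(ratio)
```

The trade-off is that the measure and frequency labels are gone from the result and would have to be re-attached if they are still needed.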

Is there a straightforward solution to the problem of handling meta information in that context that I don't see?

Thank you in advance


Answer 1:


I think a MultiIndex is the way to go, but built this way:

daily_price_data = pd.DataFrame({'Apple': [90, 85, 30], 'MSFT':[20, 30, 25]})
daily_earnings_data = pd.DataFrame({'Apple': [5000, 58000, 5100], 'MSFT':[2000, 2200, 3000]})
data = pd.concat({'price':daily_price_data, 'earnings': daily_earnings_data}, axis=1)
data


    earnings        price
    Apple   MSFT    Apple   MSFT
0   5000    2000    90      20
1   58000   2200    85      30
2   5100    3000    30      25

Then, to divide:

data['price'] / data['earnings']
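As a self-contained sketch of that division, the snippet below rebuilds `data` and additionally stores the result back under a new top-level key. The key name `price_to_earnings` and the variable `data_with_ratio` are hypothetical and chosen only for illustration.

```python
import pandas as pd

daily_price_data = pd.DataFrame({"Apple": [90, 85, 30], "MSFT": [20, 30, 25]})
daily_earnings_data = pd.DataFrame(
    {"Apple": [5000, 58000, 5100], "MSFT": [2000, 2200, 3000]})
data = pd.concat(
    {"price": daily_price_data, "earnings": daily_earnings_data}, axis=1)

# Selecting a top-level key returns a frame keyed by company,
# so the division aligns on company names
ratio = data["price"] / data["earnings"]

# Wrap the result under its own top-level key and append it
data_with_ratio = pd.concat(
    [data, pd.concat({"price_to_earnings": ratio}, axis=1)], axis=1)
print(data_with_ratio)
```

Keeping the measure name in the top column level means the "metadata" travels with the columns through ordinary pandas operations.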

If your workflow makes more sense with companies on the first level of the column index, then pandas.DataFrame.xs will be very helpful:

data2 = data.reorder_levels([1,0], axis=1).sort_index(axis=1)
data2.xs('price', axis=1, level=-1) / data2.xs('earnings', axis=1, level=-1)
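To check that the company-first layout gives the same numbers, here is a small sketch that computes the ratio both ways and compares them. The names `by_key`, `by_xs` and `apple_block` are illustrative.

```python
import pandas as pd

daily_price_data = pd.DataFrame({"Apple": [90, 85, 30], "MSFT": [20, 30, 25]})
daily_earnings_data = pd.DataFrame(
    {"Apple": [5000, 58000, 5100], "MSFT": [2000, 2200, 3000]})
data = pd.concat(
    {"price": daily_price_data, "earnings": daily_earnings_data}, axis=1)

# Measure-first layout: select by top-level key
by_key = data["price"] / data["earnings"]

# Company-first layout: swap the two column levels, then sort
data2 = data.reorder_levels([1, 0], axis=1).sort_index(axis=1)

# xs takes a cross-section on the last level and drops that level
by_xs = (data2.xs("price", axis=1, level=-1)
         / data2.xs("earnings", axis=1, level=-1))

# Raises an AssertionError if the two results differ
pd.testing.assert_frame_equal(by_xs, by_key)

# With companies on top, all measures for one company are one lookup away
apple_block = data2["Apple"]
print(apple_block)
```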


Source: https://stackoverflow.com/questions/39751807/how-to-handle-meta-data-associated-with-a-pandas-dataframe
