When to use multiindexing vs. xarray in pandas

北城余情 提交于 2019-12-03 06:16:39

问题


The pandas pivot tables documentation seems to recomend dealing with more than two dimensions of data by using multiindexing:

In [1]: import pandas as pd

In [2]: import numpy as np

In [3]: import pandas.util.testing as tm; tm.N = 3

In [4]: def unpivot(frame):
   ...:         N, K = frame.shape
   ...:         data = {'value' : frame.values.ravel('F'),
   ...:                 'variable' : np.asarray(frame.columns).repeat(N),
   ...:                 'date' : np.tile(np.asarray(frame.index), K)}
   ...:         return pd.DataFrame(data, columns=['date', 'variable', 'value'])
   ...: 

In [5]: df = unpivot(tm.makeTimeDataFrame())

In [6]: df
Out[6]: 
         date variable     value    value2
0  2000-01-03        A  0.462461  0.924921
1  2000-01-04        A -0.517911 -1.035823
2  2000-01-05        A  0.831014  1.662027
3  2000-01-03        B -0.492679 -0.985358
4  2000-01-04        B -1.234068 -2.468135
5  2000-01-05        B  1.725218  3.450437
6  2000-01-03        C  0.453859  0.907718
7  2000-01-04        C -0.763706 -1.527412
8  2000-01-05        C  0.839706  1.679413
9  2000-01-03        D -0.048108 -0.096216
10 2000-01-04        D  0.184461  0.368922
11 2000-01-05        D -0.349496 -0.698993

In [7]: df['value2'] = df['value'] * 2

In [8]: df.pivot('date', 'variable')
Out[8]: 
               value                                  value2            \
variable           A         B         C         D         A         B   
date                                                                     
2000-01-03 -1.558856 -1.144732 -0.234630 -1.252482 -3.117712 -2.289463   
2000-01-04 -1.351152 -0.173595  0.470253 -1.181006 -2.702304 -0.347191   
2000-01-05  0.151067 -0.402517 -2.625085  1.275430  0.302135 -0.805035   


variable           C         D  
date                            
2000-01-03 -0.469259 -2.504964  
2000-01-04  0.940506 -2.362012  
2000-01-05 -5.250171  2.550861  

I thought that xarray was made for handling multidimensional datasets like this:

In [9]: import xarray as xr

In [10]: xr.DataArray(dict([(var, df[df.variable==var].drop('variable', 1)) for var in np.unique(df.variable)]))
Out[10]: 
<xarray.DataArray ()>
array({'A':         date     value    value2
0 2000-01-03  0.462461  0.924921
1 2000-01-04 -0.517911 -1.035823
2 2000-01-05  0.831014  1.662027, 'C':         date     value    value2
6 2000-01-03  0.453859  0.907718
7 2000-01-04 -0.763706 -1.527412
8 2000-01-05  0.839706  1.679413, 'B':         date     value    value2
3 2000-01-03 -0.492679 -0.985358
4 2000-01-04 -1.234068 -2.468135
5 2000-01-05  1.725218  3.450437, 'D':          date     value    value2
9  2000-01-03 -0.048108 -0.096216
10 2000-01-04  0.184461  0.368922
11 2000-01-05 -0.349496 -0.698993}, dtype=object)

Is one of these approaches better than the other? Why hasn't xarray completely replaced multiindexing?


回答1:


There does seem to be a transition to xarray for doing work on multi-dimensional arrays. Pandas will be depreciating the support for the 3D Panels data structure and in the documentation even suggest using xarray for working with multidemensional arrays:

'Oftentimes, one can simply use a MultiIndex DataFrame for easily working with higher dimensional data.

In addition, the xarray package was built from the ground up, specifically in order to support the multi-dimensional analysis that is one of Panel s main use cases. Here is a link to the xarray panel-transition documentation.'

From the xarray documentation they state their aims and goals:

xarray aims to provide a data analysis toolkit as powerful as pandas but designed for working with homogeneous N-dimensional arrays instead of tabular data...

...Our target audience is anyone who needs N-dimensional labelled arrays, but we are particularly focused on the data analysis needs of physical scientists – especially geoscientists who already know and love netCDF

The main advantage of xarray over using straight numpy is that it makes use of labels in the same way pandas does over multiple dimensions. If you are working with 3-dimensional data using multi-indexing or xarray might be interchangeable. As the number of dimensions grows in your data set xarray becomes much more manageable. I cannot comment on how each performs in terms of efficiency or speed.



来源:https://stackoverflow.com/questions/42876278/when-to-use-multiindexing-vs-xarray-in-pandas

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!