I'm trying to transition from Pandas into Xarray for N-dimensional DataArrays to expand my repertoire.
Realistically, I'm going to have a bunch of different pd.DataFrames (in this case row=month, col=attribute) along a particular axis (patients in the mock example below) that I would like to merge (without using panels or a MultiIndex :), thank you). I want to convert them to xr.DataArrays so I can build dimensions upon them. I made a mock dataset to give the gist of what I'm talking about.
For this made-up dataset, imagine 100 patients, 12 months, 10000 attributes, and 3 replicates (per attribute), which would be a typical 4D dataset. Basically, I'm condensing the 3 replicates per attribute by the mean, so I end up with a 2D pd.DataFrame (row=months, col=attributes). This DataFrame is the value in my dictionary and the patient it came from is the key (i.e. {patient_x: DataFrame_X}).
I'm also going to include a roundabout way I did it with an np.ndarray placeholder, but it would be really convenient if I could generate an N-dimensional DataArray from a dictionary whose key is patient_x and whose value is DataFrame_X.
How can I create an N-dimensional DataArray using Xarray from a dictionary of Pandas DataFrames?
import xarray as xr
import numpy as np
import pandas as pd
np.random.seed(1618033)
#Set dimensions
a,b,c,d = 100,12,10000,3 #100 patients, 12 months, 10000 attributes, 3 replicates
#Create labels
patients = ["patient_%d" % i for i in range(a)]
months = [j for j in range(b)]
attributes = ["attr_%d" % k for k in range(c)]
replicates = [l for l in range(d)]
coords = [patients,months,attributes]
dims = ["Patients","Months","Attributes"]
#Dict of DataFrames
D_patient_DF = dict()
for i, patient in enumerate(patients):
    A_placeholder = np.zeros((b,c))
    for j, month in enumerate(months):
        #Attribute x Replicates
        A_attrReplicates = np.random.random((c,d))
        #Collapse into 1D Vector
        V_attrExp = A_attrReplicates.mean(axis=1)
        #Fill array with row
        A_placeholder[j,:] = V_attrExp
    #Assign dataframe for every patient
    DF_data = pd.DataFrame(A_placeholder, index = months, columns = attributes)
    D_patient_DF[patient] = DF_data
xr.DataArray(D_patient_DF).dims
#() : the dims tuple is empty
D_patient_DF
#{'patient_0': attr_0 attr_1 attr_2 attr_3 attr_4 attr_5 attr_6 \
# 0 0.445446 0.422018 0.343454 0.140700 0.567435 0.362194 0.563799
# 1 0.440010 0.548535 0.810903 0.482867 0.469542 0.591939 0.579344
# 2 0.645719 0.450773 0.386939 0.418496 0.508290 0.431033 0.622270
# 3 0.555855 0.633393 0.555197 0.556342 0.489865 0.204200 0.823043
# 4 0.916768 0.590534 0.597989 0.592359 0.484624 0.478347 0.507789
# 5 0.847069 0.634923 0.591008 0.249107 0.655182 0.394640 0.579700
# 6 0.700385 0.505331 0.377745 0.651936 0.334216 0.489728 0.282544
# 7 0.777810 0.423889 0.414316 0.389318 0.565144 0.394320 0.511034
# 8 0.440633 0.069643 0.675037 0.365963 0.647660 0.520047 0.539253
# 9 0.333213 0.328315 0.662203 0.594030 0.790758 0.754032 0.602375
# 10 0.470330 0.419496 0.171292 0.677439 0.683759 0.646363 0.465788
# 11 0.758556 0.674664 0.801860 0.612087 0.567770 0.801514 0.179939
From a dictionary of DataFrames, you can convert each value into a DataArray (adding dimension labels), load the results into a Dataset, and then convert that into a single DataArray:
variables = {k: xr.DataArray(v, dims=['month', 'attribute'])
             for k, v in D_patient_DF.items()}
combined = xr.Dataset(variables).to_array(dim='patient')
print(combined)
However, beware that the patients in the result will not necessarily appear in their original order: on Python versions before 3.7, a plain dict iterates in arbitrary order. If you need a specific order, build an OrderedDict instead (insert this after setting variables above):
import collections
variables = collections.OrderedDict((k, variables[k]) for k in patients)
This outputs:
<xarray.DataArray (patient: 100, month: 12, attribute: 10000)>
array([[[ 0.61176399, 0.26172557, 0.74657302, ..., 0.43742111,
0.47503291, 0.37263983],
[ 0.34970732, 0.81527751, 0.53612895, ..., 0.68971198,
0.68962168, 0.75103198],
[ 0.71282751, 0.23143891, 0.28481889, ..., 0.52612376,
0.56992843, 0.3483683 ],
...,
[ 0.84627257, 0.5033482 , 0.44116194, ..., 0.55020168,
0.48151353, 0.36374339],
[ 0.53336826, 0.59566147, 0.45269417, ..., 0.41951078,
0.46815364, 0.44630235],
[ 0.25720899, 0.18738289, 0.66639783, ..., 0.36149276,
0.58865823, 0.33918553]],
...,
[[ 0.42933273, 0.58642504, 0.38716496, ..., 0.45667285,
0.72684589, 0.52335464],
[ 0.34946576, 0.35821339, 0.33097093, ..., 0.59037927,
0.30233665, 0.6515749 ],
[ 0.63673498, 0.31022272, 0.65788374, ..., 0.47881873,
0.67825066, 0.58704331],
...,
[ 0.44822441, 0.502429 , 0.50677081, ..., 0.4843405 ,
0.84396521, 0.45460029],
[ 0.61336348, 0.46338301, 0.60715273, ..., 0.48322379,
0.66530209, 0.52204897],
[ 0.47520639, 0.43490559, 0.27309414, ..., 0.35280585,
0.30280485, 0.77537204]]])
Coordinates:
* month (month) int64 0 1 2 3 4 5 6 7 8 9 10 11
* patient (patient) <U10 'patient_80' 'patient_73' 'patient_79' ...
* attribute (attribute) object 'attr_0' 'attr_1' 'attr_2' 'attr_3' ...
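To check the Dataset approach on something small, here is a minimal, self-contained sketch. The sizes (4 patients, 3 months, 5 attributes), the random data, and the name frames are assumptions made for illustration; they are not part of the original dataset. It also shows one way to pin the patient order by selecting with the list of labels.

```python
import numpy as np
import pandas as pd
import xarray as xr

rng = np.random.default_rng(0)

# Small stand-in sizes (assumption): 4 patients, 3 months, 5 attributes
patients = ["patient_%d" % i for i in range(4)]
months = list(range(3))
attributes = ["attr_%d" % k for k in range(5)]

# Dict of 2D DataFrames keyed by patient, shaped like the one in the question
frames = {
    p: pd.DataFrame(rng.random((len(months), len(attributes))),
                    index=months, columns=attributes)
    for p in patients
}

# Wrap each DataFrame, naming its two dimensions
variables = {p: xr.DataArray(df, dims=["month", "attribute"])
             for p, df in frames.items()}

# Stack the Dataset's variables along a new "patient" dimension
combined = xr.Dataset(variables).to_array(dim="patient")

# Select by label to force the patient order explicitly
combined = combined.sel(patient=patients)

print(combined.dims)
print(combined.sel(patient="patient_2", month=1).values)
```

Selecting with the full label list is an alternative to the OrderedDict step when the dict order cannot be trusted.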
Alternatively, you could create a list of 2D DataArrays and then use concat:
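patient_list = []
for patient in patients:
    df = ...  # the 2D DataFrame (months x attributes) for this patient
    # Name the DataFrame's two dimensions
    array = xr.DataArray(df, dims=['month', 'attribute'])
    patient_list.append(array)
# The named Index supplies both the new dimension name and its labels
combined = xr.concat(patient_list, dim=pd.Index(patients, name='patient'))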
This would give the same result, and is probably the cleanest code.
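A runnable sketch of the concat route, under the same illustrative assumptions (small made-up sizes and random data; the name combined_concat is chosen here only for the example):

```python
import numpy as np
import pandas as pd
import xarray as xr

rng = np.random.default_rng(1)

# Small stand-in sizes (assumption): 4 patients, 3 months, 5 attributes
patients = ["patient_%d" % i for i in range(4)]
months = list(range(3))
attributes = ["attr_%d" % k for k in range(5)]

patient_list = []
for patient in patients:
    # Stand-in for the per-patient DataFrame built in the question
    df = pd.DataFrame(rng.random((len(months), len(attributes))),
                      index=months, columns=attributes)
    patient_list.append(xr.DataArray(df, dims=["month", "attribute"]))

# Concatenate along a new dimension; the named Index provides its labels,
# so the patient order follows the order of the list
combined_concat = xr.concat(patient_list, dim=pd.Index(patients, name="patient"))

print(combined_concat.dims)
print(combined_concat.sel(patient="patient_3").shape)
```

Because the list and the Index are built in the same order, this route avoids any dependence on dict iteration order.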
Source: https://stackoverflow.com/questions/36948476/create-dataarray-from-dict-of-2d-dataframes-arrays