Pandas HDFStore select from nested columns

Question

I have the following DataFrame, which is stored in an HDFStore object as a frame_table called data:

      shipmentid qty            
catid              1  2  3  4  5
0              0   0  0  0  0  0
1              1   0  0  0  2  0
2              2   2  0  0  0  0
3              3   0  4  0  0  0
0              0   0  0  0  0  0

I want to do store.select('data','shipmentid==2'), but I get the error that 'shipmentid' is not defined:

ValueError: The passed where expression: shipmentid==2
            contains an invalid variable reference
            all of the variable refrences must be a reference to
            an axis (e.g. 'index' or 'columns'), or a data_column
            The currently defined references are: columns,index

What's the proper way to write this select statement?

EDIT: adding sample code

import pandas as pd

def createFrame():
    data = {
             ('shipmentid',''):{1:1,2:2,3:3},
             ('qty',1):{1:5,2:5,3:5},
             ('qty',2):{1:6,2:6,3:6},
             ('qty',3):{1:7,2:7,3:7}
           }
    frame = pd.DataFrame(data)

    return frame

def createStore():
    # HDFStore itself takes no format argument; the format is chosen in put().
    store = pd.HDFStore('sample.h5')
    return store

frame = createFrame()
print(frame)
print('\n')
print(frame.info())

store = createStore()
store.put('data',frame,format='t')
print('\n')
print(store)

# This is the call that raises the ValueError shown above.
results = store.select('data','shipmentid == 2')

store.close()

Answer 1:

I'd bet you've used something like this to create your store:

In [207]:

import numpy as np
import pandas as pd

data = pd.DataFrame(np.random.randn(8,2), columns=['shipmentid', 'qty'])
store = pd.HDFStore('borrar')
store.put('data', data, format='t')

If you then try a select, you do indeed get the error you describe:

In [208]:

store.select('data', 'shipmentid>0')

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-211-5d0c4082cdcf> in <module>()
----> 1 store.select('data', 'shipmentid>0')

...

ValueError: The passed where expression: shipmentid>0
            contains an invalid variable reference
            all of the variable refrences must be a reference to

Instead, you can create it this way:

In [209]:

data = pd.DataFrame(np.random.randn(8,2), columns=['shipmentid', 'qty'])
data.to_hdf('borrar2', key='data', append=True, mode='w', data_columns=['shipmentid', 'qty'])

In [210]:

pd.read_hdf('borrar2', 'data', where='shipmentid>0')

Out[210]:
   shipmentid       qty
1    0.778225 -1.008529
5    0.264075 -0.651268
7    0.908880  0.153306

(Honestly, I don't know why it works one way and not the other. My guess is that the first version does not specify any data columns. It is one of those things that can drive you crazy...)
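The difference seems to come from data_columns rather than from put versus to_hdf, since the error message says a where expression may only reference an axis or a data_column. As a minimal sketch, assuming PyTables is installed and using a temporary file name made up for this example, put with data_columns lets the same kind of select run:

```python
import os
import tempfile

import numpy as np
import pandas as pd

np.random.seed(0)  # fixed seed so the sketch is repeatable

# Same kind of frame as above: flat (single-level) columns.
data = pd.DataFrame(np.random.randn(8, 2), columns=['shipmentid', 'qty'])

# Hypothetical temporary file, used only for this sketch.
path = os.path.join(tempfile.mkdtemp(), 'put_with_data_columns.h5')

with pd.HDFStore(path, mode='w') as store:
    # Declaring data_columns makes 'shipmentid' usable in a where expression.
    store.put('data', data, format='table', data_columns=['shipmentid'])
    result = store.select('data', 'shipmentid > 0')

# The same filter done in memory, for comparison.
expected = data[data['shipmentid'] > 0]
print(len(result), len(expected))
```

Only the columns listed in data_columns can be queried this way; 'qty' here is stored but cannot be used in the where expression.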

EDIT: After the update to the posted code, the DataFrame has MultiIndex columns. The analogous updated code would be something like:

In [273]:

import pandas as pd
from pandas import *
import random

def createFrame():
    data = {
             ('shipmentid',''):{1:1,2:2,3:3},
             ('qty',1):{1:5,2:5,3:5},
             ('qty',2):{1:6,2:6,3:6},
             ('qty',3):{1:7,2:7,3:7}
           }
    frame = pd.DataFrame(data)

    return frame 

frame = createFrame()
print(frame)
print('\n')
print(frame.info())

frame.to_hdf('sample.h5', 'data', append=True, mode='w', data_columns=['shipmentid'], format='table')
pd.read_hdf('sample.h5','data', 'shipmentid == 2')

But I get an error (I guess you get the same):

  qty       shipmentid
    1  2  3           
1   5  6  7          1
2   5  6  7          2
3   5  6  7          3


<class 'pandas.core.frame.DataFrame'>
Int64Index: 3 entries, 1 to 3
Data columns (total 4 columns):
(qty, 1)          3 non-null int64
(qty, 2)          3 non-null int64
(qty, 3)          3 non-null int64
(shipmentid, )    3 non-null int64
dtypes: int64(4)
memory usage: 120.0 bytes
None
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-273-e10e811fc7c0> in <module>()
     23 print(frame.info())
     24 
---> 25 frame.to_hdf('sample.h5', 'data', append=True, mode='w', data_columns=['shipmentid'], format='table')
     26 pd.read_hdf('sample.h5','data', 'shipmentid == 2')
.....
stack trace
.....
ValueError: cannot use a multi-index on axis [1] with data_columns ['shipmentid']

I've been browsing a bit and I cannot provide a solution for this. My impression, from looking at the code on GitHub, is that the data_columns option cannot be used in combination with MultiIndex columns. The only solution I can think of would be to write to the HDFStore as in your code, then read the full DataFrame with no conditions and do the search afterwards. That is:

new_frame = store.get('data')
print(new_frame[new_frame['shipmentid'] == 2])



<class 'pandas.io.pytables.HDFStore'>
File path: sample.h5
/data            frame_table  (typ->appendable,nrows->3,ncols->4,indexers->[index])
  qty       shipmentid
    1  2  3           
2   5  6  7          2

Source: https://stackoverflow.com/questions/29497694/pandas-hdfstore-select-from-nested-columns

Tags

python

pandas

hdfstore