Question
I have the following DataFrame, which is stored in an HDFStore object as a frame_table called data:
       shipmentid  qty
catid               1  2  3  4  5
0               0   0  0  0  0  0
1               1   0  0  0  2  0
2               2   2  0  0  0  0
3               3   0  4  0  0  0
0               0   0  0  0  0  0
I want to do store.select('data','shipmentid==2'), but I get an error saying that 'shipmentid' is not defined:
ValueError: The passed where expression: shipmentid==2
contains an invalid variable reference
all of the variable refrences must be a reference to
an axis (e.g. 'index' or 'columns'), or a data_column
The currently defined references are: columns,index
What's the proper way to write this select statement?
EDIT: adding sample code
import pandas as pd
from pandas import *
import random

def createFrame():
    data = {
        ('shipmentid',''):{1:1,2:2,3:3},
        ('qty',1):{1:5,2:5,3:5},
        ('qty',2):{1:6,2:6,3:6},
        ('qty',3):{1:7,2:7,3:7}
    }
    frame = pd.DataFrame(data)
    return frame

def createStore():
    store = pd.HDFStore('sample.h5',format='table')
    return store

frame = createFrame()
print(frame)
print('\n')
print(frame.info())
store = createStore()
store.put('data',frame,format='t')
print('\n')
print(store)
results = store.select('data','shipmentid == 2')
store.close()
Answer 1:
I'd bet you've used something like this to create your store,
In [207]:
import numpy as np
import pandas as pd
data = pd.DataFrame(np.random.randn(8,2), columns=['shipmentid', 'qty'])
store = pd.HDFStore('borrar')
store.put('data', data, format='t')
If you then try to do a select, you indeed get the error you describe:
In [208]:
store.select('data', 'shipmentid>0')
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-211-5d0c4082cdcf> in <module>()
----> 1 store.select('data', 'shipmentid>0')
...
ValueError: The passed where expression: shipmentid>0
contains an invalid variable reference
all of the variable refrences must be a reference to
Instead, you can create it this way:
In [209]:
data = pd.DataFrame(np.random.randn(8,2), columns=['shipmentid', 'qty'])
data.to_hdf('borrar2', 'data', append=True, mode='w', data_columns=['shipmentid', 'qty'])
In [210]:
pd.read_hdf('borrar2', 'data', where='shipmentid>0')
Out[210]:
   shipmentid       qty
1    0.778225 -1.008529
5    0.264075 -0.651268
7    0.908880  0.153306
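(The difference between the two is the data_columns argument: a where expression can only reference the index or a column that was declared as a data column when the table was written. The first version declares none, so 'shipmentid' is unknown to the query. It is one of those things that can drive you crazy...)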
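For comparison, here is a minimal sketch of the same idea written against the store object directly. It assumes a pandas version in which HDFStore.put accepts a data_columns argument, and that PyTables is installed. The function name, the sample values and the file path passed in are made up for illustration.

```python
import pandas as pd


def select_positive_shipments(path):
    """Write a small table with 'shipmentid' as a data column, then query it.

    Hypothetical helper for illustration; requires PyTables.
    """
    data = pd.DataFrame(
        {"shipmentid": [1, -2, 3, -4], "qty": [10.0, 20.0, 30.0, 40.0]}
    )
    with pd.HDFStore(path, mode="w") as store:
        # data_columns makes 'shipmentid' queryable on disk;
        # 'qty' is stored but cannot be used in a where expression.
        store.put("data", data, format="table", data_columns=["shipmentid"])
        # The where expression may now reference 'shipmentid' by name.
        return store.select("data", where="shipmentid > 0")
```

Calling select_positive_shipments('demo.h5') should return only the rows whose shipmentid is positive, without loading the rest of the table into memory first.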
EDIT:
After the update to the posted code, the DataFrame has a MultiIndex on its columns. The analogous updated code would be something like:
In [273]:
import pandas as pd
from pandas import *
import random

def createFrame():
    data = {
        ('shipmentid',''):{1:1,2:2,3:3},
        ('qty',1):{1:5,2:5,3:5},
        ('qty',2):{1:6,2:6,3:6},
        ('qty',3):{1:7,2:7,3:7}
    }
    frame = pd.DataFrame(data)
    return frame

frame = createFrame()
print(frame)
print('\n')
print(frame.info())

frame.to_hdf('sample.h5', 'data', append=True, mode='w', data_columns=['shipmentid'], format='table')
pd.read_hdf('sample.h5','data', 'shipmentid == 2')
But I get an error (I guess you get the same):
  qty       shipmentid
    1  2  3
1   5  6  7          1
2   5  6  7          2
3   5  6  7          3
<class 'pandas.core.frame.DataFrame'>
Int64Index: 3 entries, 1 to 3
Data columns (total 4 columns):
(qty, 1) 3 non-null int64
(qty, 2) 3 non-null int64
(qty, 3) 3 non-null int64
(shipmentid, ) 3 non-null int64
dtypes: int64(4)
memory usage: 120.0 bytes
None
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-273-e10e811fc7c0> in <module>()
23 print(frame.info())
24
---> 25 frame.to_hdf('sample.h5', 'data', append=True, mode='w', data_columns=['shipmentid'], format='table')
26 pd.read_hdf('sample.h5','data', 'shipmentid == 2')
.....
stack trace
.....
ValueError: cannot use a multi-index on axis [1] with data_columns ['shipmentid']
I've been browsing a bit and I cannot provide a solution for this. My impression, from looking at the code on GitHub, is that the data_columns option cannot be used in combination with a MultiIndex on the columns. The only solution I can think of is to write to the HDFStore as in your code, then read the full DataFrame with no conditions and do the search afterwards. That is:
new_frame = store.get('data')
print(new_frame[new_frame['shipmentid'] == 2])
<class 'pandas.io.pytables.HDFStore'>
File path: sample.h5
/data frame_table (typ->appendable,nrows->3,ncols->4,indexers->[index])
  qty       shipmentid
    1  2  3
2   5  6  7          2
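Another possible workaround, not part of the approach above, is to flatten the column MultiIndex into plain string names before writing, so that data_columns can be used and the filter runs on disk. The sketch below is only an illustration under that assumption: the helper names flatten_columns and query_shipment and the 'qty_1'-style naming scheme are hypothetical, and the on-disk part requires PyTables.

```python
import pandas as pd


def flatten_columns(frame):
    """Return a copy whose two-level column labels are joined into strings.

    ('qty', 1) becomes 'qty_1'; ('shipmentid', '') becomes 'shipmentid'.
    """
    flat = frame.copy()
    flat.columns = [
        "_".join(str(part) for part in col if str(part) != "")
        for col in frame.columns
    ]
    return flat


def query_shipment(flat, path, shipment_id):
    """Write the flattened frame as a table and filter it on disk (needs PyTables)."""
    flat.to_hdf(path, key="data", mode="w", format="table",
                data_columns=["shipmentid"])
    return pd.read_hdf(path, key="data", where=f"shipmentid == {shipment_id}")


# Same sample data as in the question.
frame = pd.DataFrame(
    {
        ("shipmentid", ""): {1: 1, 2: 2, 3: 3},
        ("qty", 1): {1: 5, 2: 5, 3: 5},
        ("qty", 2): {1: 6, 2: 6, 3: 6},
        ("qty", 3): {1: 7, 2: 7, 3: 7},
    }
)
flat = flatten_columns(frame)
```

With the flattened frame, query_shipment(flat, 'sample.h5', 2) would select the matching row without reading the whole table. The trade-off is that the nested column structure is lost on disk and has to be rebuilt after reading if it is still needed.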
Source: https://stackoverflow.com/questions/29497694/pandas-hdfstore-select-from-nested-columns