I have following data (18,619,211 rows) stored as a pandas dataframe object in hdf5 file:
date id2 w
id
100010 1980-03-31 10401 0.000839
100010 1980-03-31 10604 0.020140
100010 1980-03-31 12490 0.026149
100010 1980-03-31 13047 0.033560
100010 1980-03-31 13303 0.001657
where id
is index and others are columns. date
is np.datetime64
. I need to perform query like this (the code doesn't work of course):
db=pd.HDFStore('database.h5')
data=db.select('df', where='id==id_i & date>bgdt & date<endt')
Note id_i, bgdt, endt
are all variables, not actual values and need to be passed within a loop. for example:
dates
is a Pandas Period index or timestamps index, either way, I can convert to each other.
dates=['1990-01', 1990-04','1990-09',......]
id_list
is a list of IDs
id_list=[100010, 100011,1000012,.......]
The loop looks like this (the reason I am doing loop is because the data is huge, there are other datasets I have query in the same time and then perform some operations)
db=pd.HDFStore('database.h5')
for id_i in id_list:
for date in dates:
bgdt=date-1 (move to previous month)
endt=date-60 (previous 60 month)
data=db.select('df', where='index==id_i & date>bgdt & date<endt')
......
This problem have 2 parts:
- I don't know how to query index and columns in the same time. The doc in pandas showed how to query based on index conditions OR columns conditions, but no examples on how to query based on them in the SAME TIME.
- (BTW, This is very common in Pandas Documentation. The doc usually shows very simple thing like how to do 'A', OR how to do 'B', but not how to do BOTH 'A' and 'B'. A good example is use
query
on a MultiIndex pandas dataframe. The doc shows based on eitherlevel=0
ORlevel=1
, but no example on how to do BOTH in the SAME TIME.)
- (BTW, This is very common in Pandas Documentation. The doc usually shows very simple thing like how to do 'A', OR how to do 'B', but not how to do BOTH 'A' and 'B'. A good example is use
- I don't know how to pass three variables
id_i, bgdt, endt
to the query. I know how to pass only on by using%s
, but not all of them.- I am also a little confused with the datetime datatype. There seems to be quite a few of datetimes:
datetime.datetime
,numpy.datetime64
,pandas.Period
. I am mostly working on monthly data, sopandas.Period
is the most useful one. But I can't easily convert a column (not index) of timestamps (the default date type of Pandas when parsed from raw data). Is there any datatype that is simply a 'date', not timestamps, not period, but just a simple DATE with only year,month and day?
- I am also a little confused with the datetime datatype. There seems to be quite a few of datetimes:
A lot troubles, but I really LOVE python and pandas (I am trying to move my workflow from SAS to Python). Any help will be appreciated!
here are the docs for querying on non-index columns.
Create the test data. It is not clear how the original frame is constructed, e.g. whether its unique data and the ranges, so I have created a sample, with 10M rows, and a multi-level date range with the id column.
In [60]: np.random.seed(1234)
In [62]: pd.set_option('display.max_rows',20)
In [63]: index = pd.MultiIndex.from_product([np.arange(10000,11000),pd.date_range('19800101',periods=10000)],names=['id','date'])
In [67]: df = DataFrame(dict(id2=np.random.randint(0,1000,size=len(index)),w=np.random.randn(len(index))),index=index).reset_index().set_index(['id','date'])
In [68]: df
Out[68]:
id2 w
id date
10000 1980-01-01 712 0.371372
1980-01-02 718 -1.255708
1980-01-03 581 -1.182727
1980-01-04 202 -0.947432
1980-01-05 493 -0.125346
1980-01-06 752 0.380210
1980-01-07 435 -0.444139
1980-01-08 128 -1.885230
1980-01-09 425 1.603619
1980-01-10 449 0.103737
... ... ...
10999 2007-05-09 8 0.624532
2007-05-10 669 0.268340
2007-05-11 918 0.134816
2007-05-12 979 -0.769406
2007-05-13 969 -0.242123
2007-05-14 950 -0.347884
2007-05-15 49 -1.284825
2007-05-16 922 -1.313928
2007-05-17 347 -0.521352
2007-05-18 353 0.189717
[10000000 rows x 2 columns]
Write the data to disk, showing how to create a data column (note that the indexes are automatically queryable, this allows id2 to be queryable as well). This is de-facto equivalent to doing. This takes care of opening and closing the store (you can accomplish the same thing by opening a store, appending, and closing).
In order to query a column, it MUST BE A DATA COLUMN or an index of the frame.
In [70]: df.to_hdf('test.h5','df',mode='w',data_columns=['id2'],format='table')
In [71]: !ls -ltr test.h5
-rw-rw-r-- 1 jreback users 430540284 May 26 17:16 test.h5
Queries
In [80]: ids=[10101,10898]
In [81]: start_date='20010101'
In [82]: end_date='20010301'
You can specify dates as string (either in-line or as variables; you can also specify Timestamp like objects)
In [83]: pd.read_hdf('test.h5','df',where='date>start_date & date<end_date')
Out[83]:
id2 w
id date
10000 2001-01-02 972 -0.146107
2001-01-03 954 1.420412
2001-01-04 567 1.077633
2001-01-05 87 -0.042838
2001-01-06 79 -1.791228
2001-01-07 744 1.110478
2001-01-08 237 -0.846086
2001-01-09 998 -0.696369
2001-01-10 266 -0.595555
2001-01-11 206 -0.294633
... ... ...
10999 2001-02-19 616 -0.745068
2001-02-20 577 -1.474748
2001-02-21 990 -1.276891
2001-02-22 939 -1.369558
2001-02-23 621 -0.214365
2001-02-24 396 -0.142100
2001-02-25 492 -0.204930
2001-02-26 478 1.839291
2001-02-27 688 0.291504
2001-02-28 356 -1.987554
[58000 rows x 2 columns]
You can use in-line lists
In [84]: pd.read_hdf('test.h5','df',where='date>start_date & date<end_date & id=ids')
Out[84]:
id2 w
id date
10101 2001-01-02 722 1.620553
2001-01-03 849 -0.603468
2001-01-04 635 -1.419072
2001-01-05 331 0.521634
2001-01-06 730 0.008830
2001-01-07 706 -1.006412
2001-01-08 59 1.380005
2001-01-09 689 0.017830
2001-01-10 788 -3.090800
2001-01-11 704 -1.491824
... ... ...
10898 2001-02-19 530 -1.031167
2001-02-20 652 -0.019266
2001-02-21 472 0.638266
2001-02-22 540 -1.827251
2001-02-23 654 -1.020140
2001-02-24 328 -0.477425
2001-02-25 871 -0.892684
2001-02-26 166 0.894118
2001-02-27 806 0.648240
2001-02-28 824 -1.051539
[116 rows x 2 columns]
You can also specify boolean expressions
In [85]: pd.read_hdf('test.h5','df',where='date>start_date & date<end_date & id=ids & id2>500 & id2<600')
Out[85]:
id2 w
id date
10101 2001-01-12 534 -0.220692
2001-01-14 596 -2.225393
2001-01-16 596 0.956239
2001-01-30 513 -2.528996
2001-02-01 572 -1.877398
2001-02-13 569 -0.940748
2001-02-14 541 1.035619
2001-02-21 571 -0.116547
10898 2001-01-16 591 0.082564
2001-02-06 586 0.470872
2001-02-10 531 -0.536194
2001-02-16 586 0.949947
2001-02-19 530 -1.031167
2001-02-22 540 -1.827251
To answer your actual question I would do this (their is really not enough information, but I'll put some reasonable expectations):
- Do't loop over queries, unless you have a very small number of absolute queries
- Read the biggest chunk into memory that you can. Usually this is accomplished by selecting out the biggest ranges of data that you need, even if you select MORE data than you actually need.
- Then subselect using in-memory expressions, which will generally be orders of magnitude faster.
- List elements are limited to about 30 elements in total (this is current an implementation limit on the PyTables side). It will work if you specify more, but what will happen is that you will read in a lot of data, then it will be reindexed down (in-memory). So user needs to be aware of this.
So for example say that you have 1000 unique ids with 10000 dates per as my example demonstrates. You want to select say 200 of these, with a date range of 1000.
So in this case I would simply select on the dates then do the in-memory comparison, something like this:
df = pd.read_hdf('test.h5','df',where='date=>global_start_date & date<=global_end_date')
df[df.isin(list_of_ids)]
You also might have dates that change per ids. So chunk them, this time using a list of ids.
Something like this:
output = []
for i in len(list_of_ids) % 30:
ids = list_of_ids[i:(i+30)]
start_date = get_start_date_for_these_ids (global)
end_date = get_end_date_for_these_ids (global)
where = 'id=ids & start_date>=start_date & end_date<=end_date'
df = pd.read_hdf('test.h5','df',where=where)
output.append(df)
final_result = concat(output)
The basic idea then is to select a superset of the data using the criteria that you want, sub-selecting so it fits in memory, but you limit the number of queries you do (e.g. imagine that you end up selecting a single row with your query, if you have to query this 18M times that is bad).
来源:https://stackoverflow.com/questions/23863553/query-hdf5-in-pandas