What is the equivalent of “select max(column) from table” in PyTables?

Question

I have a table with a whole lot of numerical values in it. I know I could extract the column and call max() on it, but there is probably a way to do this with an in-kernel method. I just can't seem to find it.

Answer 1:

In my tests, iterating with iterrows was more than twice as fast as using where:

In [117]: timeit max(row['timestamp'] for row in table.iterrows(stop=1000000))
1 loops, best of 3: 1 s per loop

In [118]: timeit max(row['timestamp'] for row in table.where('(timestamp<=Tf)'))
1 loops, best of 3: 2.21 s per loop

In [120]: timeit max(frames.cols.timestamp[:1000000])
1 loops, best of 3: 974 ms per loop

In [121]: timeit np.max(frames.cols.timestamp[:1000000])
1 loops, best of 3: 876 ms per loop

Note that Tf above is the value of the 1,000,000th entry of that column (a Float64 column).

Since the question does not ask for a comparison check, the where test can be dropped. Note that the method proposed in the question (loading the data as a NumPy array) is still somewhat faster, though the difference is less than 3% and shrinks further for larger datasets (I did not test beyond 10^7 rows). The best results I found were with NumPy's max function (see above).
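If the column is too large to load in a single slice, the same slice-and-max approach can be applied block by block. The following is only a minimal sketch and not a PyTables API: chunked_max is a hypothetical helper that assumes nothing beyond len() and slicing on its argument, which a PyTables column such as table.cols.timestamp supports. A plain list stands in for the column here.

```python
def chunked_max(column, chunk_size=100_000):
    """Return the maximum of a sliceable column, reading chunk_size items at a time."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    n = len(column)
    if n == 0:
        raise ValueError("column is empty")
    best = None
    for start in range(0, n, chunk_size):
        # Only this block is held in memory at any one time.
        block = column[start:start + chunk_size]
        block_max = max(block)
        if best is None or block_max > best:
            best = block_max
    return best


# A plain list stands in for table.cols.timestamp in this sketch.
timestamps = [3.5, 9.25, 1.0, 7.75, 9.0]
print(chunked_max(timestamps, chunk_size=2))
```

With a real table, each slice comes back as a NumPy array, so block.max() would probably be the faster choice than the built-in max(block).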

I would also be happy to learn of a more efficient method!

Answer 2:

The fastest way I've found to do this is by indexing your table on the columns you are interested in:

table.cols.timestamp.createCSIndex()  # named create_csindex() in PyTables 3.x

Once indexed, getting a max is almost instant:

max_timestamp = table.cols.timestamp[table.colindexes['timestamp'][-1]]

This first gets the last row index from the timestamp column's Index object (table.colindexes['timestamp'][-1]). Because the index is completely sorted, that entry corresponds to the largest timestamp. It then fetches the value of the row it points to by indexing into the corresponding column reference (table.cols.timestamp).
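To make the idea concrete, here is a plain-Python analogy of what a completely sorted index provides. It does not use PyTables: build_sorted_index is a hypothetical helper, and the list values stands in for the timestamp column.

```python
def build_sorted_index(values):
    """Return row positions ordered by ascending value (an argsort)."""
    return sorted(range(len(values)), key=values.__getitem__)


values = [12.0, 47.5, 3.25, 30.0]   # stand-in for the timestamp column
index = build_sorted_index(values)  # built once, like creating the index

last_row = index[-1]          # position of the row holding the largest value
max_value = values[last_row]  # a single lookup instead of a full scan
print(last_row, max_value)
```

The cost of sorting is paid once when the index is built. After that, each max lookup touches only the last index entry and one row.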

Answer 3:

From High Performance Data Management with PyTables & Family (pdf), with the condition written as the string expression that where() expects:

e = sum(row['col1'] for row in table.where('(col2 > 3) & (col2 <= 20)'))

Modifying this to use max():

e = max(row['col1'] for row in table.where('(col2 > 3) & (col2 <= 20)'))
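One caveat with this pattern: max() raises ValueError when the generator yields nothing, that is, when no row matches the condition. Below is a minimal sketch of guarding against that with the default argument of max() (Python 3.4+). The list of dicts is a stand-in for the rows a where() query would yield, and conditional_max is a hypothetical helper.

```python
def conditional_max(rows, value_col, cond_col, low, high):
    """Max of value_col over rows with low < cond_col <= high, or None if none match."""
    return max(
        (row[value_col] for row in rows if low < row[cond_col] <= high),
        default=None,  # returned instead of raising ValueError on an empty selection
    )


# Stand-in rows; a PyTables row supports the same row['name'] access.
rows = [
    {"col1": 10, "col2": 2},
    {"col1": 42, "col2": 5},
    {"col1": 17, "col2": 20},
    {"col1": 99, "col2": 21},
]
print(conditional_max(rows, "col1", "col2", 3, 20))
print(conditional_max(rows, "col1", "col2", 100, 200))
```

Returning None keeps the "no matching rows" case explicit, so the caller can decide how to handle it.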

Source: https://stackoverflow.com/questions/9953174/what-is-the-equivalent-of-select-maxcolumn-from-table-in-pytables

Tags

python

sql

pytables