pytables

Using pytables, which is more efficient: scipy.sparse or numpy dense matrix?

柔情痞子 submitted on 2019-12-03 06:23:47
When using pytables, there's no support (as far as I can tell) for the scipy.sparse matrix formats, so to store a matrix I have to do some conversion, e.g.:

    def store_sparse_matrix(self):
        grp1 = self.getFileHandle().createGroup(self.getGroup(), 'M')
        self.getFileHandle().createArray(grp1, 'data', M.tocsr().data)
        self.getFileHandle().createArray(grp1, 'indptr', M.tocsr().indptr)
        self.getFileHandle().createArray(grp1, 'indices', M.tocsr().indices)

    def get_sparse_matrix(self):
        return sparse.csr_matrix((self.getGroup().M.data,
                                  self.getGroup().M.indices,
                                  self.getGroup().M.indptr))

The trouble is …
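The excerpt cuts off, but the pattern round-trips as follows. A minimal self-contained sketch (my own, using the modern lowercase PyTables API rather than the question's camelCase one; save_csr/load_csr and the extra shape array are my additions, since the three CSR arrays alone do not determine the matrix shape):

    import numpy as np
    import tables
    from scipy import sparse

    def save_csr(h5path, M):
        # Decompose the matrix into its CSR components and store each as an array.
        csr = M.tocsr()
        with tables.open_file(h5path, mode='w') as f:
            grp = f.create_group(f.root, 'M')
            for name in ('data', 'indices', 'indptr'):
                f.create_array(grp, name, getattr(csr, name))
            # The shape is not recoverable from the three arrays alone
            # (trailing all-zero columns would be lost), so store it too.
            f.create_array(grp, 'shape', np.array(csr.shape))

    def load_csr(h5path):
        with tables.open_file(h5path, mode='r') as f:
            grp = f.root.M
            return sparse.csr_matrix(
                (grp.data.read(), grp.indices.read(), grp.indptr.read()),
                shape=tuple(grp.shape.read()))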

Indexing and Data Columns in Pandas/PyTables

你。 submitted on 2019-12-03 02:47:36
http://pandas.pydata.org/pandas-docs/stable/io.html#indexing I'm really confused about the concept of data columns in pandas' HDF5 IO, and there is very little information about it to be found by googling, either. Since I'm diving into pandas for a large project that involves HDF5 storage, I'd like to be clear about such concepts. The docs say: "You can designate (and index) certain columns that you want to be able to perform queries on (other than the indexable columns, which you can always query)." For instance, say you want to perform this common operation, on disk, and return just the …
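A short sketch of the distinction the docs are describing (my own illustration, with made-up column names): only columns passed as data_columns are written as individually queryable on-disk columns, so only they may appear in a where clause.

    import numpy as np
    import pandas as pd

    df = pd.DataFrame({'A': np.random.randn(8),
                       'B': np.random.randn(8),
                       'string': ['foo'] * 4 + ['bar'] * 4})

    with pd.HDFStore('store.h5', mode='w') as store:
        # B and string become data columns; A stays inside the values block.
        store.append('df', df, data_columns=['B', 'string'])
        # The query runs on disk and only matching rows are read into memory.
        result = store.select('df', where='B > 0 & string == "bar"')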

Is it possible to reverse lookup the index position for itersorted in PyTables?

我与影子孤独终老i submitted on 2019-12-02 05:45:21
Context: a multi-GB database with a simple table that has a column with a completely sorted index (CSI). To iterate through the index without loading all the rows in a batch (as a where query would), we can do:

    for row in table.itersorted('indexed_column', step=1):
        print(row)

This works fine. The next step is to iterate from a certain index position:

    for row in table.itersorted('indexed_column', start=position, step=1):
        print(row)

Now the caveat: that position is the position in the index, not the row number! And it is very easy to find row numbers (Row.nrow) with where, get_where_list, etc. Problem: is it …
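The excerpt ends before the full question, but one hedged workaround (my own sketch, not from the post) is to reconstruct the sorted order in memory with NumPy and look up where a given row number lands in it. This assumes the column fits in RAM, and positions among duplicate values may not match the CSI's internal ordering exactly:

    import numpy as np
    import tables

    with tables.open_file('data.h5', mode='r') as f:
        table = f.root.mytable                   # hypothetical table path
        col = table.col('indexed_column')        # whole column into memory
        order = np.argsort(col, kind='stable')   # approximates the CSI order
        nrow = 12345                             # row number from where/get_where_list
        # Index position at which row `nrow` appears in the sorted view:
        position = int(np.nonzero(order == nrow)[0][0])
        for row in table.itersorted('indexed_column', start=position, step=1):
            print(row['indexed_column'])
            break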

PyTables + Pandas Select Problems

自闭症网瘾萝莉.ら submitted on 2019-12-02 03:57:39
Question: I have an HDF5 (PyTables) file structured as /<User>/<API Key>, e.g.:

    /Dan/A4N5
    /Dan/B8P0
    /Dave/D3Y7

Each table is structured like so, with a sessionID and a time stored in epoch seconds:

        sessionID        time
    0   3ODE3Nzll  1467590400
    1   lMGVkMDc4  1467590400
    2   jNzIzNmY1  1467590400
    ...

I want pandas to go through each table and get all the rows between a specified date and the day before it. Currently I have this code:

    scriptPath = os.path.dirname(os.path.abspath(__file__))
    argdate = "2016/07…
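A hedged sketch of the on-disk query the excerpt seems to be building toward (my own; it assumes the tables were appended with time as a data column, so it can appear in a where clause, and uses a made-up file name and date):

    import pandas as pd
    from datetime import datetime, timedelta

    end = datetime(2016, 7, 4)                   # hypothetical "specified date"
    start = end - timedelta(days=1)
    start_epoch, end_epoch = int(start.timestamp()), int(end.timestamp())

    with pd.HDFStore('sessions.h5', mode='r') as store:
        for key in store.keys():                 # e.g. '/Dan/A4N5'
            # Filtering happens on disk; only matching rows are materialized.
            rows = store.select(key, where=f'time >= {start_epoch} & time < {end_epoch}')
            print(key, len(rows))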

Concatenate two big pandas.HDFStore HDF5 files

五迷三道 submitted on 2019-12-01 15:51:40
This question is somewhat related to "Concatenate a large number of HDF5 files". I have several huge HDF5 files (~20 GB compressed) that do not fit in RAM. Each of them stores several pandas.DataFrames of identical format, with indexes that do not overlap. I'd like to concatenate them into a single HDF5 file with all the DataFrames properly concatenated. One way to do this is to read each of them chunk by chunk and then save to a single file, but that would take quite a lot of time. Are there any special tools or methods to do this without iterating through the files? See the docs here …
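For reference, the chunk-by-chunk approach the asker mentions can be written in a few lines (a minimal sketch under my own assumptions about file and key names; it requires the frames to be stored in table format, and select(chunksize=...) keeps memory bounded):

    import pandas as pd

    paths = ['part1.h5', 'part2.h5']             # hypothetical input files
    with pd.HDFStore('merged.h5', mode='w') as out:
        for path in paths:
            with pd.HDFStore(path, mode='r') as src:
                for key in src.keys():
                    for chunk in src.select(key, chunksize=500_000):
                        out.append(key, chunk)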

pandas pytables append: performance and increase in file size

天涯浪子 submitted on 2019-12-01 09:20:35
I have more than 500 PyTables stores that contain about 300 MB of data each. I would like to merge these files into one big store, using pandas append as in the code below:

    def merge_hdfs(file_list, merged_store):
        for file in file_list:
            store = HDFStore(file, mode='r')
            merged_store.append('data', store.data)
            store.close()

The append operation is very slow (it takes up to 10 minutes to append a single store to merged_store), and strangely the file size of merged_store seems to increase by 1 GB for each appended store. I have indicated the total number of expected rows, which according to …
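The excerpt cuts off before the resolution, but two knobs commonly tuned for this pattern (a hedged sketch of my own, not the poster's eventual fix) are enabling compression on the merged store and deferring index creation until all appends are done:

    import pandas as pd

    files = ['store_%03d.h5' % i for i in range(500)]    # hypothetical file names
    with pd.HDFStore('merged.h5', mode='w',
                     complib='blosc', complevel=9) as merged:
        for path in files:
            with pd.HDFStore(path, mode='r') as src:
                # index=False skips rebuilding the table index on every append,
                # a frequent cause of slow appends and file-size bloat.
                merged.append('data', src['data'], index=False)
        # Build the index once at the end instead.
        merged.create_table_index('data', optlevel=9, kind='full')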

What is the advantage of PyTables? [closed]

空扰寡人 submitted on 2019-12-01 00:32:42
I have recently started learning about PyTables and find it very interesting. My questions: What are the basic advantages of PyTables over databases when it comes to huge datasets? What is the basic purpose of this package (I can do the same sort of structuring in NumPy and pandas, so what's the big deal with PyTables)? Is it really helpful for the analysis of big datasets? Can anyone elaborate with the help of an example and comparisons? Thank you all.

abarnert answers: "What are the basic advantages of PyTables over database(s) when it comes to huge datasets?" Effectively, it is a database. Of course it's …