pytables

Using pytables, which is more efficient: scipy.sparse or numpy dense matrix?

柔情痞子 submitted on 2019-12-03 06:23:47
When using pytables, there's no support (as far as I can tell) for the scipy.sparse matrix formats, so to store a matrix I have to do some conversion, e.g.:

    def store_sparse_matrix(self):
        grp1 = self.getFileHandle().createGroup(self.getGroup(), 'M')
        self.getFileHandle().createArray(grp1, 'data', M.tocsr().data)
        self.getFileHandle().createArray(grp1, 'indptr', M.tocsr().indptr)
        self.getFileHandle().createArray(grp1, 'indices', M.tocsr().indices)

    def get_sparse_matrix(self):
        return sparse.csr_matrix((self.getGroup().M.data,
                                  self.getGroup().M.indices,
                                  self.getGroup().M.indptr))

The trouble is …
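The excerpt cuts off, but the pattern round-trips as follows. A minimal self-contained sketch (my own, using the modern lowercase PyTables API rather than the question's camelCase one; save_csr/load_csr and the extra shape array are my additions, since the three CSR arrays alone do not determine the matrix shape):

    import numpy as np
    import tables
    from scipy import sparse

    def save_csr(h5path, M):
        # Decompose the matrix into its CSR components and store each as an array.
        csr = M.tocsr()
        with tables.open_file(h5path, mode='w') as f:
            grp = f.create_group(f.root, 'M')
            for name in ('data', 'indices', 'indptr'):
                f.create_array(grp, name, getattr(csr, name))
            # The shape is not recoverable from the three arrays alone
            # (trailing all-zero columns would be lost), so store it too.
            f.create_array(grp, 'shape', np.array(csr.shape))

    def load_csr(h5path):
        with tables.open_file(h5path, mode='r') as f:
            grp = f.root.M
            return sparse.csr_matrix(
                (grp.data.read(), grp.indices.read(), grp.indptr.read()),
                shape=tuple(grp.shape.read()))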

Indexing and Data Columns in Pandas/PyTables

你。 submitted on 2019-12-03 02:47:36
http://pandas.pydata.org/pandas-docs/stable/io.html#indexing I'm really confused about the concept of data columns in pandas' HDF5 IO, and there is very little information about it to be found by googling, either. Since I'm diving into pandas for a large project that involves HDF5 storage, I'd like to be clear about such concepts. The docs say: "You can designate (and index) certain columns that you want to be able to perform queries on (other than the indexable columns, which you can always query)." For instance, say you want to perform this common operation, on disk, and return just the …
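A short sketch of the distinction the docs are describing (my own illustration, with made-up column names): only columns passed as data_columns are written as individually queryable on-disk columns, so only they may appear in a where clause.

    import numpy as np
    import pandas as pd

    df = pd.DataFrame({'A': np.random.randn(8),
                       'B': np.random.randn(8),
                       'string': ['foo'] * 4 + ['bar'] * 4})

    with pd.HDFStore('store.h5', mode='w') as store:
        # B and string become data columns; A stays inside the values block.
        store.append('df', df, data_columns=['B', 'string'])
        # The query runs on disk and only matching rows are read into memory.
        result = store.select('df', where='B > 0 & string == "bar"')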

Is it possible to reverse lookup the index position for itersorted in PyTables?

我与影子孤独终老i submitted on 2019-12-02 05:45:21
Context: a multi-GB database with a simple table that has a column with a completely sorted index (CSI). To iterate through the index without loading all the rows in a batch (as a where query would), we can do:

    for row in table.itersorted('indexed_column', step=1):
        print(row)

This works fine. The next step is to iterate from a certain index position:

    for row in table.itersorted('indexed_column', start=position, step=1):
        print(row)

Now the caveat: that position is the position in the index, not the row number! And it is very easy to find row numbers (Row.nrow) with where, get_where_list, etc. Problem: is it …
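The excerpt ends before the full question, but one hedged workaround (my own sketch, not from the post) is to reconstruct the sorted order in memory with NumPy and look up where a given row number lands in it. This assumes the column fits in RAM, and positions among duplicate values may not match the CSI's internal ordering exactly:

    import numpy as np
    import tables

    with tables.open_file('data.h5', mode='r') as f:
        table = f.root.mytable                   # hypothetical table path
        col = table.col('indexed_column')        # whole column into memory
        order = np.argsort(col, kind='stable')   # approximates the CSI order
        nrow = 12345                             # row number from where/get_where_list
        # Index position at which row `nrow` appears in the sorted view:
        position = int(np.nonzero(order == nrow)[0][0])
        for row in table.itersorted('indexed_column', start=position, step=1):
            print(row['indexed_column'])
            break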

PyTables + Pandas Select Problems

自闭症网瘾萝莉.ら submitted on 2019-12-02 03:57:39
Question: I have an HDF5 (PyTables) file structured as /<User>/<API Key>, e.g.:

    /Dan/A4N5
    /Dan/B8P0
    /Dave/D3Y7

Each table is structured like so, with a sessionID and a time stored in epoch seconds:

        sessionID        time
    0   3ODE3Nzll  1467590400
    1   lMGVkMDc4  1467590400
    2   jNzIzNmY1  1467590400
    ...

I want pandas to go through each table and get all the rows between a specified date and the day before it. Currently I have this code:

    scriptPath = os.path.dirname(os.path.abspath(__file__))
    argdate = "2016/07…
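A hedged sketch of the on-disk query the excerpt seems to be building toward (my own; it assumes the tables were appended with time as a data column, so it can appear in a where clause, and uses a made-up file name and date):

    import pandas as pd
    from datetime import datetime, timedelta

    end = datetime(2016, 7, 4)                   # hypothetical "specified date"
    start = end - timedelta(days=1)
    start_epoch, end_epoch = int(start.timestamp()), int(end.timestamp())

    with pd.HDFStore('sessions.h5', mode='r') as store:
        for key in store.keys():                 # e.g. '/Dan/A4N5'
            # Filtering happens on disk; only matching rows are materialized.
            rows = store.select(key, where=f'time >= {start_epoch} & time < {end_epoch}')
            print(key, len(rows))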

Concatenate two big pandas.HDFStore HDF5 files

五迷三道 submitted on 2019-12-01 15:51:40
This question is somewhat related to "Concatenate a large number of HDF5 files". I have several huge HDF5 files (~20 GB compressed) that do not fit in RAM. Each of them stores several pandas.DataFrames of identical format, with indexes that do not overlap. I'd like to concatenate them into a single HDF5 file with all the DataFrames properly concatenated. One way to do this is to read each of them chunk by chunk and then save to a single file, but that would take quite a lot of time. Are there any special tools or methods to do this without iterating through the files? See the docs here …
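For reference, the chunk-by-chunk approach the asker mentions can be written in a few lines (a minimal sketch under my own assumptions about file and key names; it requires the frames to be stored in table format, and select(chunksize=...) keeps memory bounded):

    import pandas as pd

    paths = ['part1.h5', 'part2.h5']             # hypothetical input files
    with pd.HDFStore('merged.h5', mode='w') as out:
        for path in paths:
            with pd.HDFStore(path, mode='r') as src:
                for key in src.keys():
                    for chunk in src.select(key, chunksize=500_000):
                        out.append(key, chunk)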

pandas pytables append: performance and increase in file size

天涯浪子 submitted on 2019-12-01 09:20:35
I have more than 500 PyTables stores that contain about 300 MB of data each. I would like to merge these files into one big store, using pandas append as in the code below:

    def merge_hdfs(file_list, merged_store):
        for file in file_list:
            store = HDFStore(file, mode='r')
            merged_store.append('data', store.data)
            store.close()

The append operation is very slow (it takes up to 10 minutes to append a single store to merged_store), and strangely the file size of merged_store seems to increase by 1 GB for each appended store. I have indicated the total number of expected rows, which according to …
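The excerpt cuts off before the resolution, but two knobs commonly tuned for this pattern (a hedged sketch of my own, not the poster's eventual fix) are enabling compression on the merged store and deferring index creation until all appends are done:

    import pandas as pd

    files = ['store_%03d.h5' % i for i in range(500)]    # hypothetical file names
    with pd.HDFStore('merged.h5', mode='w',
                     complib='blosc', complevel=9) as merged:
        for path in files:
            with pd.HDFStore(path, mode='r') as src:
                # index=False skips rebuilding the table index on every append,
                # a frequent cause of slow appends and file-size bloat.
                merged.append('data', src['data'], index=False)
        # Build the index once at the end instead.
        merged.create_table_index('data', optlevel=9, kind='full')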

What is the advantage of PyTables? [closed]

空扰寡人 submitted on 2019-12-01 00:32:42
I have recently started learning about PyTables and find it very interesting. My questions: What are the basic advantages of PyTables over databases when it comes to huge datasets? What is the basic purpose of this package (I can do the same sort of structuring in NumPy and pandas, so what's the big deal with PyTables)? Is it really helpful for the analysis of big datasets? Can anyone elaborate with the help of an example and comparisons? Thank you all.

abarnert answers: "What are the basic advantages of PyTables over database(s) when it comes to huge datasets?" Effectively, it is a database. Of course it's …