pytables

Database or Table Solution for Temporary Numpy Arrays

Posted by ぐ巨炮叔叔 on 2019-12-12 09:47:28
Question: I am creating a Python desktop application that allows users to select different distributional forms to model agricultural yield data. I have the time-series agricultural data - close to a million rows - saved in a SQLite database (although this is not set in stone if someone knows of a better choice). Once the user selects some data, say corn yields from 1990-2010 in Illinois, I want them to select a distributional form from a drop-down. Next, my function fits the distribution to the data…
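(A sketch of the fitting step, not part of the original question: it assumes a hypothetical SQLite table named yields with state/crop/year/value columns, and uses scipy.stats maximum-likelihood fitting for whichever form the drop-down picks.)

import sqlite3
import numpy as np
from scipy import stats

# Hypothetical schema: table 'yields' with columns state, crop, year, value
conn = sqlite3.connect('yields.db')
rows = conn.execute(
    "SELECT value FROM yields"
    " WHERE state = ? AND crop = ? AND year BETWEEN ? AND ?",
    ('Illinois', 'corn', 1990, 2010))
data = np.fromiter((r[0] for r in rows), dtype=np.float64)
conn.close()

# Map the drop-down choice to a scipy.stats distribution and fit by MLE
forms = {'normal': stats.norm, 'lognormal': stats.lognorm, 'weibull': stats.weibull_min}
dist = forms['normal']        # stand-in for the user's drop-down selection
params = dist.fit(data)       # shape/loc/scale parameters, ready for plotting
print(params)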

argsort on a PyTables' array

Posted by 二次信任 on 2019-12-12 06:12:14
Question: I have a problem with NumPy's argsort. It creates an int64 array the length of the input array, in memory. Since I'm working with very large arrays, this blows up the memory. I tested NumPy's argsort with a small PyTables CArray and it gives the correct output. Now, what I want is for the sorting algorithm to work with a PyTables array directly. Is there a way to do this with standard NumPy calls, or a simple hack into the NumPy internals? I'm also open to non-NumPy alternatives - I just want…
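(One out-of-core route, offered as an editor's sketch rather than a thread answer: store the values in a PyTables Table, build a completely sorted index (CSI) on the column, and walk the rows with Table.itersorted, so no in-memory int64 permutation array is ever built. Minimal sketch, assuming a single float64 column:)

import numpy as np
import tables as tb

with tb.open_file('sortdemo.h5', 'w') as f:
    tbl = f.create_table(f.root, 'data', {'value': tb.Float64Col()})
    for _ in range(10):                      # fill in chunks; the source data
        chunk = np.empty(100000, dtype=[('value', 'f8')])  # never has to fit
        chunk['value'] = np.random.rand(100000)            # in RAM at once
        tbl.append(chunk)
    tbl.flush()

    tbl.cols.value.create_csindex()          # completely sorted index, on disk

    # iterate rows in sorted order; no argsort result is materialized
    for i, row in enumerate(tbl.itersorted('value')):
        print(row['value'])
        if i == 4:
            break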

PyTables - big memory consumption using cols method

Posted by 时光总嘲笑我的痴心妄想 on 2019-12-11 18:48:28
Question: What is the purpose of the cols method in PyTables? I have a big dataset and I am interested in reading only one column from it. These two methods take the same amount of time, but have totally different memory consumption:

import tables
from sys import getsizeof

f = tables.open_file(myhdf5_path, 'r')

# These two methods take the same amount of time
x = f.root.set1[:500000]['param1']
y = f.root.set1.cols.param1[:500000]

# But totally different memory consumption:
print(getsizeof(x))
…
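(A measurement note, not from the original thread: the first form reads complete rows and then views a single field, so getsizeof on the view under-reports the buffer it keeps alive; checking nbytes on the view's base shows the real footprints. A small check, reusing the question's path and node names:)

import tables

with tables.open_file(myhdf5_path, 'r') as f:    # path as in the question
    x = f.root.set1[:500000]['param1']   # reads whole rows, then views a field
    y = f.root.set1.cols.param1[:500000] # reads only that column from disk

    # x is (typically) a view: getsizeof sees a small header, but the full
    # multi-column buffer stays alive through x.base
    print(x.nbytes, x.base.nbytes if x.base is not None else x.nbytes)
    print(y.nbytes)                      # y owns exactly one column's worth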

Error when trying to save hdf5 row where one column is a string and the other is an array of floats

Posted by 只谈情不闲聊 on 2019-12-11 15:10:10
Question: I have two columns: one is a string, and the other is a NumPy array of floats.

a = 'this is string'
b = np.array([-2.355, 1.957, 1.266, -6.913])

I would like to store them in a row as separate columns in an HDF5 file. For that I am using pandas:

hdf_key = 'hdf_key'
store5 = pd.HDFStore('file.h5')
z = pd.DataFrame({'string': [a], 'array': [b]})
store5.append(hdf_key, z, index=False)
store5.close()

However, I get this error:

TypeError: Cannot serialize the column [array] because its data contents…
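(One workaround, not from the original thread: the HDF5 table format wants fixed-width, typed columns, so declare the row layout with PyTables directly and give the float array a fixed shape. A sketch assuming the array always has length 4:)

import numpy as np
import tables as tb

a = 'this is string'
b = np.array([-2.355, 1.957, 1.266, -6.913])

class Record(tb.IsDescription):
    string = tb.StringCol(64)             # fixed-width byte string
    array = tb.Float64Col(shape=(4,))     # fixed-length float vector

with tb.open_file('file.h5', 'w') as f:
    t = f.create_table(f.root, 'rows', Record)
    row = t.row
    row['string'] = a.encode('utf-8')
    row['array'] = b
    row.append()
    t.flush()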

Pandas to_hdf succeeds but then read_hdf fails

Posted by 生来就可爱ヽ(ⅴ<●) on 2019-12-11 11:07:04
Question: Pandas to_hdf succeeds, but read_hdf then fails when I use custom objects as column headers (I use custom objects because I need to store other info in them). Is there some way to make this work? Or is this just a pandas bug or a PyTables bug? As an example, below I will first build a DataFrame foo that uses string column headers, and everything works fine with to_hdf / read_hdf; but after changing foo to use a custom Col class for column headers, to_hdf still works fine but read_hdf…
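(A practical workaround, added as an editor's sketch: flatten the custom header objects to plain strings before to_hdf and rebuild them after read_hdf from a side mapping kept in ordinary Python. The Col class here is a hypothetical stand-in:)

import pandas as pd

class Col:                                   # hypothetical header class
    def __init__(self, name, extra):
        self.name, self.extra = name, extra

cols = [Col('a', 1), Col('b', 2)]
foo = pd.DataFrame([[1, 2], [3, 4]], columns=[c.name for c in cols])

foo.to_hdf('foo.h5', key='foo')              # string headers round-trip fine
bar = pd.read_hdf('foo.h5', 'foo')

meta = {c.name: c for c in cols}             # rebuild the rich headers
bar.columns = [meta[name] for name in bar.columns]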

Pytables table dtype alignment

Posted by 别来无恙 on 2019-12-11 10:09:26
Question: If I create the following aligned NumPy array

import numpy as np
import tables as pt

numrows = 10
dt = np.dtype([('date', [('year', '<i4'), ('month', '<i4'), ('day', '<i4')]),
               ('apples', '<f8'),
               ('oranges', '|S7'),
               ('pears', '<i4')], align=True)
x = np.zeros(numrows, dtype=dt)
for d in x.dtype.descr:
    print(d)

and print the dtype.descr, I get the following:

('date', [('year', '<i4'), ('month', '<i4'), ('day', '<i4')])
('', '|V4')
('apples', '<f8')
('oranges', '|S7')
('', '|V1')
('pears', '<i4')
…
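(Context, not from the original thread: the unnamed '|V4' and '|V1' entries are padding bytes inserted by align=True. One way to avoid handing them to PyTables is to copy into the packed equivalent of the dtype first; a sketch continuing from the x above, assuming NumPy's field-wise structured casting:)

import numpy as np

# same fields as dt but without align=True, so no padding entries
packed_dt = np.dtype([('date', [('year', '<i4'), ('month', '<i4'), ('day', '<i4')]),
                      ('apples', '<f8'),
                      ('oranges', '|S7'),
                      ('pears', '<i4')])

x_packed = x.astype(packed_dt)   # field-by-field copy into the packed layout
for d in x_packed.dtype.descr:
    print(d)                     # no ('', '|V4') / ('', '|V1') lines now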

PyTables writing error

Posted by 隐身守侯 on 2019-12-11 07:57:56
Question: I am creating and filling a PyTables CArray in the following way:

import tables as tb
# a, b are scipy.sparse.csr_matrix instances

f = tb.open_file('../data/pickle/dot2.h5', 'w')
filters = tb.Filters(complevel=1, complib='blosc')
l = a.shape[0]
n = b.shape[1]
out = f.create_carray(f.root, 'out', tb.Atom.from_dtype(a.dtype),
                      shape=(l, n), filters=filters)
bl = 2048
for i in range(0, n, bl):
    out[:, i:min(i + bl, n)] = a.dot(b[:, i:min(i + bl, n)]).toarray()

The script was running fine for nearly two days (I estimated that it would need at least 4…
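(A hardening idea, not part of the original question: for a multi-day job, flushing periodically and printing progress means a late crash does not cost the whole run. The same loop with those additions, reusing the names above:)

for i in range(0, n, bl):
    j = min(i + bl, n)
    out[:, i:j] = a.dot(b[:, i:j]).toarray()
    if (i // bl) % 50 == 0:
        f.flush()                            # push buffered blocks to disk
        print('done through column', j)      # crude progress marker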

Pandas HDFStore: slow on query for non-matching string

Posted by 给你一囗甜甜゛ on 2019-12-11 03:57:45
Question: My issue is that when I look for a string that is NOT contained in the DataFrame (which is stored in an HDF5 file), the query takes a very long time to complete. For example: I have a df that contains 2*10^9 rows, stored in an HDF5 file. I have a string column named "code" that was marked as a "data_column" (therefore it is indexed). When I search for a code that exists in the dataset (store.select('df', 'code=valid_code')), it takes around 10 seconds to get 70K rows. However,…
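(A pragmatic workaround, not from the original thread: cache the set of distinct codes once via select_column and test membership before issuing the expensive select. This assumes the number of distinct codes is small enough to hold in memory; the store name is hypothetical:)

import pandas as pd

store = pd.HDFStore('data.h5')

# build once and cache: reading one data column beats a full table scan
known_codes = set(store.select_column('df', 'code').unique())

def lookup(code):
    if code not in known_codes:              # absent codes return instantly
        return pd.DataFrame()
    return store.select('df', where=f'code == "{code}"')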

PerformanceWarning - Pandas and Pytables, can I fix this?

Posted by 僤鯓⒐⒋嵵緔 on 2019-12-11 03:42:23
Question: I am getting the following PerformanceWarning:

PerformanceWarning: your performance may suffer as PyTables will pickle object types that it cannot map directly to c-types [inferred_type->mixed-integer,key->block0_values] [items-> ['File1', 'File2', 'File3', 'File4', 'File5']]
warnings.warn(ws, PerformanceWarning)

This is because of the "file_attributes" df (see code below), which contains a mix of several things. Typical output for store.file_attributes (basically a bunch of dictionary key…
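(One way out, offered as an editor's sketch: the warning fires because object-dtype cells cannot map to a native HDF5 type, so PyTables pickles them. Serializing those cells to plain strings before storing avoids the pickling path; the frame contents here are hypothetical:)

import pandas as pd

# hypothetical stand-in for the question's file_attributes frame
file_attributes = pd.DataFrame({
    'File1': [{'rows': 10}],
    'File2': [{'rows': 20}],
})

# serialize object cells to plain strings PyTables can store natively
clean = file_attributes.applymap(repr)       # or json.dumps, to round-trip

with pd.HDFStore('attrs.h5') as store:
    store.put('file_attributes', clean)      # no pickle-based warning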

Scala equivalent to pyTables?

Posted by 白昼怎懂夜的黑 on 2019-12-11 02:49:13
Question: I'm looking for a little assistance in Scala similar to that provided by PyTables. PyTables is a package for managing hierarchical datasets, designed to efficiently and easily cope with extremely large amounts of data. Any suggestions?

Answer 1: I had a quick look at PyTables, and I don't think there's anything remotely like it in Scala-land (or indeed Java-land), but we have a few of the ingredients necessary to make it a possibility, if you want to invest the time: scala.Dynamic to do idiomatic…