pytables | 易学教程

finding a duplicate in a hdf5 pytable with 500e6 rows

Problem: I have a large (> 500e6 rows) dataset that I've put into a PyTables database. Let's say the first column is ID and the second column is a counter for each ID. Each ID-counter combination has to be unique. I have one non-unique row among the 500e6 rows that I'm trying to find. As a starter I've done something like this:

    index1 = db.cols.id.create_index()
    index2 = db.cols.counts.create_index()
    for row in db:
        query = '(id == %d) & (counts == %d)' % (row['id'], row['counts'])
        result = th.readWhere(query)
        if len(result) > 1:
            print row

It's a brute-force method, I'll admit. Any suggestions on improvements? update
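
One possible alternative to issuing a query per row, sketched here under assumptions rather than taken from the original thread: partition the (id, counter) pairs by hash and check one partition at a time, so that only a fraction of the keys is held in memory at once. The names `find_duplicate_pairs`, `iter_rows`, and `n_buckets` are hypothetical, and how the rows are streamed out of the table is left to the caller. The trade-off is one full scan of the data per bucket.

```python
def find_duplicate_pairs(iter_rows, n_buckets=16):
    """Return the set of (id, counter) pairs that occur more than once.

    iter_rows is a zero-argument callable returning a fresh iterator over
    (id, counter) tuples; it is called once per bucket, so the data is
    scanned n_buckets times while only ~1/n_buckets of the keys is kept
    in memory during each scan.
    """
    duplicates = set()
    for bucket in range(n_buckets):
        seen = set()
        for id_, counter in iter_rows():
            key = (id_, counter)
            # Skip keys that belong to another bucket.
            if hash(key) % n_buckets != bucket:
                continue
            if key in seen:
                duplicates.add(key)
            else:
                seen.add(key)
    return duplicates


# Tiny in-memory stand-in for the table (made-up values).
rows = [(1, 0), (1, 1), (2, 0), (2, 1), (1, 1), (3, 0)]
dups = find_duplicate_pairs(lambda: iter(rows), n_buckets=4)
print(dups)
```

If the two columns fit in memory as arrays, sorting them and comparing adjacent rows would likely be faster than repeated scans; the bucketed version is only meant for the case where they do not.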

Why does pandas convert unsigned int greater than 2**63-1 to objects?

When I convert a NumPy array to a pandas DataFrame, pandas changes uint64 types to object types if the integer is greater than 2^63 - 1.

    import pandas as pd
    import numpy as np

    x = np.array([('foo', 2 ** 63)], dtype=np.dtype([('string', np.str_, 3), ('unsigned', np.uint64)]))
    y = np.array([('foo', 2 ** 63 - 1)], dtype=np.dtype([('string', np.str_, 3), ('unsigned', np.uint64)]))

    print pd.DataFrame(x).dtypes.unsigned
    dtype('O')
    print pd.DataFrame(y).dtypes.unsigned
    dtype('uint64')

This is annoying, as I can't write the data frame to an HDF file in the table format: pd.DataFrame(x).to_hdf('x.hdf

Unable to reinstall PyTables for Python 2.7

I am installing Python 2.7 in addition to 2.7. When installing PyTables again for 2.7, I get this error:

    Found numpy 1.5.1 package installed.
    .. ERROR:: Could not find a local HDF5 installation. You may need to explicitly state where your local HDF5 headers and library can be found by setting the HDF5_DIR environment variable or by using the --hdf5 command-line option.

I am not clear on the HDF installation. I downloaded it again and copied it into a /usr/local/hdf5 directory, and tried to set the environment variables as suggested in the PyTables install instructions. Has anyone else had this problem that

pd.read_hdf throws 'cannot set WRITABLE flag to True of this array'

Question: When running pd.read_hdf('myfile.h5') I get the following traceback error:

    [[...some longer traceback]]
    ~/.local/lib/python3.6/site-packages/pandas/io/pytables.py in read_array(self, key, start, stop)
       2487
       2488         if isinstance(node, tables.VLArray):
    -> 2489             ret = node[0][start:stop]
       2490         else:
       2491             dtype = getattr(attrs, 'value_type', None)
    ~/.local/lib/python3.6/site-packages/tables/vlarray.py in __getitem__(self, key)
    ~/.local/lib/python3.6/site-packages/tables/vlarray.py in read(self, start,

PyTables read random subset

Is it possible to read a random subset of rows from HDF5 (via PyTables or, preferably, pandas)? I have a very large dataset with millions of rows, but only need a sample of a few thousand for analysis. And what about reading from a compressed HDF file?

Using HDFStore: docs are here, compression docs are here. Random access via a constructed index is supported in 0.13.

    In [26]: df = DataFrame(np.random.randn(100,2),columns=['A','B'])
    In [27]: df.to_hdf('test.h5','df',mode='w',format='table')
    In [28]: store = pd.HDFStore('test.h5')
    In [29]: nrows = store.get_storer('df').nrows
    In [30]: nrows
    Out[30]:
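
The excerpt is cut off before the actual read, so the following is only a sketch of how the idea could continue, assuming a table-format node as created above. It draws distinct row numbers with the standard library and passes them to `select` as a `pd.Index` of row coordinates. The helper names `sample_row_indices` and `read_random_rows` are hypothetical; `get_storer(...).nrows` comes from the excerpt itself.

```python
import random


def sample_row_indices(nrows, k, seed=None):
    """Pick k distinct row numbers in [0, nrows), sorted ascending."""
    rng = random.Random(seed)
    return sorted(rng.sample(range(nrows), k))


def read_random_rows(path, key, k, seed=None):
    """Read k randomly chosen rows from a table-format HDFStore node."""
    # Imported lazily so the sampling helper works without pandas/PyTables.
    import pandas as pd

    with pd.HDFStore(path, mode="r") as store:
        nrows = store.get_storer(key).nrows
        coords = sample_row_indices(nrows, k, seed)
        # An integer Index passed as `where` is treated as row coordinates.
        return store.select(key, where=pd.Index(coords))


indices = sample_row_indices(100, 10, seed=0)
print(indices)
```

With the file from the excerpt, a call such as `read_random_rows('test.h5', 'df', 10)` would be the intended usage. Sorting the coordinates is a design choice meant to keep disk access roughly sequential.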

Indexing and Data Columns in Pandas/PyTables

Question: http://pandas.pydata.org/pandas-docs/stable/io.html#indexing I'm really confused about the concept of data columns in pandas HDF5 IO. There is also little to no information about it to be found by googling. Since I'm diving into pandas in a large project that involves HDF5 storage, I'd like to be clear about such concepts. The docs say: You can designate (and index) certain columns that you want to be able to perform queries (other than the indexable columns, which you can

Query HDF5 in Pandas

I have the following data (18,619,211 rows) stored as a pandas DataFrame object in an HDF5 file:

                  date    id2         w
    id
    100010  1980-03-31  10401  0.000839
    100010  1980-03-31  10604  0.020140
    100010  1980-03-31  12490  0.026149
    100010  1980-03-31  13047  0.033560
    100010  1980-03-31  13303  0.001657

where id is the index and the others are columns. date is np.datetime64. I need to perform a query like this (the code doesn't work, of course):

    db=pd.HDFStore('database.h5')
    data=db.select('df', where='id==id_i & date>bgdt & date<endt')

Note that id_i, bgdt, and endt are all variables, not actual values, and need to be passed within a loop. for
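
A hedged sketch of one way such a query could be written, not the answer from the original thread. It rests on three assumptions: the frame was stored with format='table', date was written as a data column (for example via data_columns=['date']), and the frame's index is addressed as `index` in the where string. The helpers `build_where` and `select_window` are hypothetical names; the where string is built explicitly so the loop variables end up as literal values.

```python
def build_where(id_i, bgdt, endt):
    """Build an HDFStore where string from concrete values.

    id_i is an integer index value; bgdt and endt are date strings
    such as '1980-01-01'.
    """
    return (
        f"index == {int(id_i)} & "
        f"date > Timestamp('{bgdt}') & "
        f"date < Timestamp('{endt}')"
    )


def select_window(path, key, id_i, bgdt, endt):
    """Select rows for one id within an open date interval."""
    # Imported lazily; needs pandas with PyTables installed.
    import pandas as pd

    with pd.HDFStore(path, mode="r") as store:
        return store.select(key, where=build_where(id_i, bgdt, endt))


where = build_where(100010, "1980-01-01", "1980-12-31")
print(where)
```

Inside the loop, `select_window('database.h5', 'df', id_i, bgdt, endt)` would then be called once per combination.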

HDFStore.append(string, DataFrame) fails when string column contents are longer than those already there

I have a pandas DataFrame stored via an HDFStore that essentially stores summary rows about test runs I am doing. Several of the fields in each row contain descriptive strings of variable length. When I do a test run, I create a new DataFrame with a single row in it:

    def export_as_df(self):
        return pd.DataFrame(data=[self._to_dict()], index=[datetime.datetime.now()])

and then call HDFStore.append(string, DataFrame) to add the new row to the existing DataFrame. This works fine, except when the contents of one of the string columns are longer than the longest instance already existing, whereupon I
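
A possible workaround, sketched under assumptions and not taken from the truncated question: string columns in a table-format store get a fixed width when the table is first created, so the width can be reserved up front with the `min_itemsize` argument of `HDFStore.append`. The helpers `required_itemsize` and `create_with_reserved_width` and the `headroom` factor are hypothetical. Columns named in `min_itemsize` are also created as data columns.

```python
def required_itemsize(values, encoding="utf-8", headroom=2.0):
    """Estimate a byte width for a string column, with spare room.

    Takes the longest encoded value seen so far and multiplies it by
    `headroom`; returns at least 1.
    """
    longest = max((len(str(v).encode(encoding)) for v in values), default=0)
    return max(1, int(longest * headroom))


def create_with_reserved_width(store, key, df, string_columns, itemsize):
    """First append to `key`, reserving `itemsize` bytes per string column.

    `store` is an open pandas.HDFStore. The reserved width only takes
    effect when the table is created, so this must be the first append.
    """
    store.append(key, df, min_itemsize={col: itemsize for col in string_columns})


width = required_itemsize(["short", "a much longer description"])
print(width)
```

For a table that already exists with a narrower column, the data would have to be rewritten into a new table with the larger width.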

What is a better approach of storing and querying a big dataset of meteorological data

I am looking for a convenient way to store and query a huge amount of meteorological data (a few TB). More information about the type of data is in the middle of the question. Previously I was looking in the direction of MongoDB (I have used it for many of my own previous projects and feel comfortable dealing with it), but recently I found out about the HDF5 data format. Reading about it, I found some similarities with Mongo: HDF5 simplifies the file structure to include only two major types of object: Datasets, which are multidimensional arrays of a homogeneous type Groups, which are container

HDFStore with string columns gives issues

I have a pandas DataFrame myDF with a few string columns (whose dtype is object) and many numeric columns. I tried the following:

    d=pandas.HDFStore("C:\\PF\\Temp.h5")
    d['test']=myDF

I got this result:

    C:\PF\WinPython-64bit-3.3.3.3\python-3.3.3.amd64\lib\site-packages\pandas\io\pytables.py:2446: PerformanceWarning: your performance may suffer as PyTables will pickle object types that it cannot map directly to c-types [inferred_type->mixed,key->block2_values] [items->[0, 1, 3, 4, 5, 6, 9, 10, 292, 411, 412, 477, 478, 479, 495, 572, 581, 590, 599, 608, 617, 626, 635]]
      warnings.warn(ws,