pytables

Finding a duplicate in an HDF5 PyTables table with 500e6 rows

Submitted on 2019-12-04 06:14:42
Problem: I have a large (> 500e6 rows) dataset that I've put into a PyTables database. Let's say the first column is an ID and the second column is a counter for each ID. Each ID-counter combination has to be unique, and there is one non-unique row among the 500e6 that I'm trying to find. As a starter I've done something like this:

    index1 = db.cols.id.create_index()
    index2 = db.cols.counts.create_index()

    for row in db:
        query = '(id == %d) & (counts == %d)' % (row['id'], row['counts'])
        result = db.readWhere(query)
        if len(result) > 1:
            print row

It's a brute-force method, I'll admit. Any suggestions on improvements? update…
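
Since that loop issues one indexed query per row, a single sort-based pass over the packed key pairs is typically far cheaper. A minimal sketch, with hypothetical file and node names, assuming both id and counts are non-negative integers that each fit in 32 bits:

    import numpy as np
    import tables

    with tables.open_file('data.h5', mode='r') as f:   # hypothetical filename
        tbl = f.root.mytable                           # hypothetical node name
        ids = tbl.col('id').astype(np.uint64)
        counts = tbl.col('counts').astype(np.uint64)

    keys = (ids << 32) | counts      # pack each (id, counts) pair into one key
    order = np.argsort(keys)         # one O(n log n) sort instead of n queries
    sorted_keys = keys[order]
    dups = np.nonzero(sorted_keys[1:] == sorted_keys[:-1])[0]
    for i in dups:
        print('duplicate rows:', order[i], order[i + 1])

This needs both columns in memory (roughly 8 GB for 500e6 rows of two uint64 arrays); if that's too much, the same adjacent-equality check can be run over sorted chunks.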

Why does pandas convert unsigned ints greater than 2**63 - 1 to objects?

Submitted on 2019-12-04 06:05:27
When I convert a numpy array to a pandas data frame, pandas changes uint64 types to object types if the integer is greater than 2^63 - 1:

    import pandas as pd
    import numpy as np

    x = np.array([('foo', 2 ** 63)],
                 dtype=np.dtype([('string', np.str_, 3), ('unsigned', np.uint64)]))
    y = np.array([('foo', 2 ** 63 - 1)],
                 dtype=np.dtype([('string', np.str_, 3), ('unsigned', np.uint64)]))

    print pd.DataFrame(x).dtypes.unsigned  # dtype('O')
    print pd.DataFrame(y).dtypes.unsigned  # dtype('uint64')

This is annoying, as I can't write the data frame to an HDF file in the table format: pd.DataFrame(x).to_hdf('x.hdf…
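
A possible workaround sketch: cast the column back to uint64 after construction, assuming every value genuinely fits in an unsigned 64-bit integer (and that your pandas/PyTables versions store uint64 natively in table format):

    import numpy as np
    import pandas as pd

    x = np.array([('foo', 2 ** 63)],
                 dtype=np.dtype([('string', np.str_, 3), ('unsigned', np.uint64)]))
    df = pd.DataFrame(x)
    df['unsigned'] = df['unsigned'].astype(np.uint64)  # object -> uint64
    print(df.dtypes.unsigned)                          # uint64
    df.to_hdf('x.hdf', 'test', format='table')         # no longer an object column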

Unable to reinstall PyTables for Python 2.7

Submitted on 2019-12-04 04:46:15
I am installing Python 2.7 alongside my existing installation. When installing PyTables again for 2.7, I get this error:

    Found numpy 1.5.1 package installed.
    .. ERROR:: Could not find a local HDF5 installation.
    You may need to explicitly state where your local HDF5 headers and
    library can be found by setting the HDF5_DIR environment variable or
    by using the --hdf5 command-line option.

I am not clear on the HDF5 installation. I downloaded it again and copied it into a /usr/local/hdf5 directory, and tried to set the environment variables as suggested in the PyTables install guide. Has anyone else had this problem that…
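
A hedged sketch of one way to script the fix, assuming the tree under /usr/local/hdf5 has headers in include/ and libraries in lib/ (the HDF5_DIR variable and --hdf5 option come straight from the error message above):

    import os
    import subprocess

    # point the PyTables build at the local HDF5 tree, then rebuild
    os.environ['HDF5_DIR'] = '/usr/local/hdf5'
    subprocess.check_call(['python2.7', 'setup.py', 'install'])
    # equivalently, pass it on the command line:
    # subprocess.check_call(['python2.7', 'setup.py', 'install',
    #                        '--hdf5=/usr/local/hdf5'])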

pd.read_hdf throws 'cannot set WRITABLE flag to True of this array'

Submitted on 2019-12-03 19:48:13
Question: When running pd.read_hdf('myfile.h5') I get the following traceback error:

    [[...some longer traceback]]

    ~/.local/lib/python3.6/site-packages/pandas/io/pytables.py in read_array(self, key, start, stop)
       2487
       2488         if isinstance(node, tables.VLArray):
    -> 2489             ret = node[0][start:stop]
       2490         else:
       2491             dtype = getattr(attrs, 'value_type', None)

    ~/.local/lib/python3.6/site-packages/tables/vlarray.py in __getitem__(self, key)
    ~/.local/lib/python3.6/site-packages/tables/vlarray.py in read(self, start,…
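
This traceback pattern is commonly reported when NumPy >= 1.16 is paired with an older PyTables build, so a reasonable first step is checking the installed versions (upgrading PyTables, or pinning NumPy below 1.16, is the usual remedy when they mismatch). A minimal check:

    import numpy as np
    import tables

    print('numpy  ', np.__version__)
    print('tables ', tables.__version__)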

PyTables read random subset

Submitted on 2019-12-03 13:25:51
Is it possible to read a random subset of rows from HDF5 (via PyTables or, preferably, pandas)? I have a very large dataset with millions of rows, but only need a sample of a few thousand for analysis. And what about reading from a compressed HDF file?

Using HDFStore (docs are here, compression docs are here), random access via a constructed index is supported in 0.13:

    In [26]: df = DataFrame(np.random.randn(100,2),columns=['A','B'])

    In [27]: df.to_hdf('test.h5','df',mode='w',format='table')

    In [28]: store = pd.HDFStore('test.h5')

    In [29]: nrows = store.get_storer('df').nrows

    In [30]: nrows
    Out[30]: 100
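
A sketch of the sampling step this sets up, under the assumption (pandas >= 0.13) that an integer array passed as where= is treated as a list of row coordinates:

    import numpy as np
    import pandas as pd

    store = pd.HDFStore('test.h5')
    nrows = store.get_storer('df').nrows
    rows = np.random.randint(0, nrows, size=10)   # 10 random row positions
    sample = store.select('df', where=rows)       # reads only those rows
    store.close()

Compression doesn't change this: the same select works on a store written with complevel/complib set.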

Indexing and Data Columns in Pandas/PyTables

Submitted on 2019-12-03 12:28:00
Question: http://pandas.pydata.org/pandas-docs/stable/io.html#indexing

I'm really confused by the concept of data columns in pandas HDF5 IO, and there is very little information about it to be found by googling, either. Since I'm diving into pandas in a large project which involves HDF5 storage, I'd like to be clear about such concepts. The docs say:

You can designate (and index) certain columns that you want to be able to perform queries on (other than the indexable columns, which you can…
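
A minimal sketch of the distinction, using a hypothetical demo frame: only columns declared in data_columns are written as separate, queryable columns on disk, so only they (plus the index) can appear in a where= expression:

    import numpy as np
    import pandas as pd

    df = pd.DataFrame({'A': np.random.randn(8),
                       'B': np.random.randn(8),
                       'C': list('aabbccdd')})
    df.to_hdf('demo.h5', 'df', mode='w', format='table', data_columns=['C'])

    # works: 'C' is a data column, so it is queryable on disk
    print(pd.read_hdf('demo.h5', 'df', where="C == 'a'"))
    # the same query on 'A' would fail, since 'A' was not made a data column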

Query HDF5 in Pandas

Submitted on 2019-12-03 12:26:08
I have the following data (18,619,211 rows) stored as a pandas dataframe object in an hdf5 file:

            date        id2    w
    id
    100010  1980-03-31  10401  0.000839
    100010  1980-03-31  10604  0.020140
    100010  1980-03-31  12490  0.026149
    100010  1980-03-31  13047  0.033560
    100010  1980-03-31  13303  0.001657

where id is the index and the others are columns. date is np.datetime64. I need to perform a query like this (the code doesn't work, of course):

    db = pd.HDFStore('database.h5')
    data = db.select('df', where='id==id_i & date>bgdt & date<endt')

Note that id_i, bgdt and endt are all variables, not actual values, and need to be passed within a loop. for…
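
where= expressions can reference variables defined in the local namespace directly, so the loop variables can be used as-is. A hedged sketch, assuming the table was written with format='table' and data_columns=['date'], and noting that since id is the index the query term is index == id_i:

    import pandas as pd

    db = pd.HDFStore('database.h5')
    # hypothetical loop values
    for id_i, bgdt, endt in [(100010, pd.Timestamp('1980-01-01'),
                              pd.Timestamp('1980-12-31'))]:
        data = db.select('df', where='index == id_i & date > bgdt & date < endt')
        print(data)
    db.close()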

HDFStore.append(string, DataFrame) fails when string column contents are longer than those already there

Submitted on 2019-12-03 11:55:39
I have a pandas DataFrame stored via an HDFStore that essentially stores summary rows about test runs I am doing. Several of the fields in each row contain descriptive strings of variable length. When I do a test run, I create a new DataFrame with a single row in it:

    def export_as_df(self):
        return pd.DataFrame(data=[self._to_dict()], index=[datetime.datetime.now()])

and then call HDFStore.append(string, DataFrame) to add the new row to the existing DataFrame. This works fine, apart from when one of the string columns' contents is longer than the longest instance already stored, whereupon I…
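
The usual fix is to reserve column widths up front with min_itemsize on the first append, so later, longer strings still fit. A hedged sketch with hypothetical column names and sizes:

    import datetime
    import pandas as pd

    store = pd.HDFStore('runs.h5')
    row = pd.DataFrame(data=[{'name': 'short', 'notes': 'ok'}],
                       index=[datetime.datetime.now()])
    # reserve 64/256 bytes per string column so longer values fit later
    store.append('runs', row, min_itemsize={'name': 64, 'notes': 256})
    store.append('runs', pd.DataFrame(
        data=[{'name': 'a much longer name than before', 'notes': 'still fits'}],
        index=[datetime.datetime.now()]))
    store.close()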

What is a better approach to storing and querying a big dataset of meteorological data

Submitted on 2019-12-03 08:40:12
I am looking for a convenient way to store and query a huge amount of meteorological data (a few TB). More information about the type of data is given in the middle of the question. Previously I was leaning towards MongoDB (I have used it for many of my own previous projects and feel comfortable dealing with it), but recently I found out about the HDF5 data format. Reading about it, I found some similarities with Mongo:

HDF5 simplifies the file structure to include only two major types of object: Datasets, which are multidimensional arrays of a homogeneous type, and Groups, which are container…
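
Those two object types map directly onto the PyTables API; a minimal sketch with hypothetical file, group and array names:

    import numpy as np
    import tables

    with tables.open_file('weather.h5', mode='w') as f:
        # a Group: a container, analogous to a directory
        grp = f.create_group('/', 'station_001', 'one weather station')
        # a Dataset: a typed, extendable n-dimensional array
        arr = f.create_earray(grp, 'temperature',
                              atom=tables.Float32Atom(), shape=(0, 24),
                              title='hourly temperatures, one row per day')
        arr.append(np.random.rand(2, 24).astype('float32'))  # two days of data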

HDFStore with string columns gives issues

Submitted on 2019-12-03 07:32:56
I have a pandas DataFrame myDF with a few string columns (whose dtype is object) and many numeric columns. I tried the following:

    d = pandas.HDFStore("C:\\PF\\Temp.h5")
    d['test'] = myDF

I got this result:

    C:\PF\WinPython-64bit-3.3.3.3\python-3.3.3.amd64\lib\site-packages\pandas\io\pytables.py:2446:
    PerformanceWarning: your performance may suffer as PyTables will pickle
    object types that it cannot map directly to c-types
    [inferred_type->mixed,key->block2_values]
    [items->[0, 1, 3, 4, 5, 6, 9, 10, 292, 411, 412, 477, 478, 479, 495, 572,
    581, 590, 599, 608, 617, 626, 635]]
      warnings.warn(ws,…
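
The warning means the flagged columns hold mixed Python objects that PyTables can only pickle. A hedged sketch of one common fix, coercing the object columns to uniform strings before storing (the myDF below is a toy stand-in):

    import pandas as pd

    myDF = pd.DataFrame({'a': [1, 2], 'b': ['x', None], 'c': [0.1, 0.2]})
    obj_cols = myDF.select_dtypes(include=['object']).columns
    myDF[obj_cols] = myDF[obj_cols].fillna('').astype(str)  # uniform strings

    with pd.HDFStore('Temp.h5') as d:
        d.put('test', myDF, format='table')  # strings now map to native c-types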