hdfstore

Pandas dataframe and speed

删除回忆录丶 submitted on 2019-12-25 04:27:05

Question: I have a pandas DataFrame object which I have preallocated with 400,000 entries and two columns: a timestamp of type datetime.datetime and a float. When I attempt to insert (overwrite) a row in the table it seems rather slow; depending on the size of the table it takes something like 0.044 seconds. I have created an integer index and I am using it to access the row. Here is how I am using it: maxsize = 400000; data = pd.DataFrame({'ts': date_list, 'val': zeros}, index=range(maxsize)) …
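A minimal sketch of the preallocate-and-overwrite pattern described above (the column names `ts` and `val` come from the question; the placeholder timestamps and the row label 1000 are illustrative). Per-cell assignment with `DataFrame.at` usually avoids much of the alignment overhead that makes row-wise `.loc` assignment slow:

```python
import datetime
import numpy as np
import pandas as pd

maxsize = 400_000
date_list = [datetime.datetime(2019, 1, 1)] * maxsize   # placeholder timestamps
zeros = np.zeros(maxsize)

data = pd.DataFrame({'ts': date_list, 'val': zeros}, index=range(maxsize))

# Overwriting a whole row through .loc triggers alignment and dtype checks:
data.loc[1000] = [datetime.datetime.now(), 3.14]

# Setting individual cells with .at is usually far cheaper:
data.at[1000, 'ts'] = datetime.datetime.now()
data.at[1000, 'val'] = 3.14
```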

HDFStore Term memory efficient way to check for membership in list

夙愿已清 submitted on 2019-12-24 03:51:32

Question: I have a pandas HDFStore that I am trying to select from. I would like to select data between two timestamps with an id contained in a large np.array. The following code works, but it takes up too much memory only when I query for membership in a list; if I use a DatetimeIndex and a range, the memory footprint is 95% less. #start_ts, end_ts are timestamps #instruments is an array of python objects not_memory_efficient = adj_data.select("US", [Term("date",">=", start_ts), Term("date", "<=", end_ts), Term("id" …
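A common workaround is to split the membership list into smaller batches, so PyTables never has to expand one enormous `in` condition at once. A hedged sketch under assumptions from the question (store key "US", data columns named `date` and `id`); the batch size is arbitrary, and it relies on the where string resolving the local variables, as recent pandas versions do:

```python
import numpy as np
import pandas as pd

def select_by_ids(store_path, key, start_ts, end_ts, ids, batch=1000):
    """Select rows with date in [start_ts, end_ts] and id in `ids`,
    querying the id list in small batches to limit memory use."""
    pieces = []
    with pd.HDFStore(store_path, mode='r') as store:
        for i in range(0, len(ids), batch):
            chunk_ids = list(ids[i:i + batch])
            where = '(date >= start_ts) & (date <= end_ts) & (id in chunk_ids)'
            pieces.append(store.select(key, where=where))
    return pd.concat(pieces, ignore_index=True)
```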

Dynamically appending to a Pandas Dataframe

自作多情 submitted on 2019-12-23 23:04:15

Question: I have been playing with Pandas to get HTTP logs into it for analysis, as they are a good source of large volumes of data and will allow me to learn Pandas. I get the logs streamed in one line at a time, so I cannot import from CSV and need to 'pump' them into a Pandas DataFrame, which I will then persist to an HDFStore file. The code I have written at the moment reads from a GZIP file just so I can get the process going, but once I have the Pandas bit done I will modify it to be …
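Appending to a DataFrame one row at a time grows quadratically expensive; a common pattern is to buffer parsed lines in a list, build a DataFrame per batch, and `store.append()` each batch into the HDF5 table. A minimal sketch, assuming a hypothetical `parse_line()` stub, a gzip log at `access.log.gz`, and the column names shown (all placeholders, not from the question):

```python
import gzip
import pandas as pd

def parse_line(line):
    # Hypothetical parser stub: replace with real access-log parsing.
    method, path, status = line.split()[:3]
    return {'method': method, 'path': path, 'status': int(status)}

BATCH = 10_000
buffer = []

with pd.HDFStore('logs.h5', mode='w') as store, gzip.open('access.log.gz', 'rt') as fh:
    for line in fh:
        buffer.append(parse_line(line))
        if len(buffer) >= BATCH:
            store.append('logs', pd.DataFrame(buffer), format='table', data_columns=True)
            buffer.clear()
    if buffer:  # flush the final partial batch
        store.append('logs', pd.DataFrame(buffer), format='table', data_columns=True)
```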

Pandas HDFStore select from nested columns

≯℡__Kan透↙ submitted on 2019-12-22 11:24:11

Question: I have the following DataFrame, which is stored in an HDFStore object as a frame_table called data: shipmentid qty catid 1 2 3 4 5 0 0 0 0 0 0 0 1 1 0 0 0 2 0 2 2 2 0 0 0 0 3 3 0 4 0 0 0 0 0 0 0 0 0 0 I want to do store.select('data', 'shipmentid==2'), but I get the error that 'shipmentid' is not defined: ValueError: The passed where expression: shipmentid==2 contains an invalid variable reference all of the variable references must be a reference to an axis (e.g. 'index' or 'columns'), or a …
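This error typically means `shipmentid` was never declared as a queryable data column when the table was written. A hedged sketch of one way out: flatten the nested (MultiIndex) columns into single-level names and rewrite the table with `data_columns=['shipmentid']` so the where clause can reference it. The sample values below are a rough, illustrative reconstruction of the frame in the question, not the exact data:

```python
import pandas as pd

# Illustrative reconstruction: 'shipmentid' plus one qty column per catid.
df = pd.DataFrame({
    ('shipmentid', ''): [0, 1, 2, 3],
    ('qty', 1): [0, 0, 2, 0],
    ('qty', 2): [0, 0, 0, 4],
    ('qty', 3): [0, 0, 0, 0],
    ('qty', 4): [0, 2, 0, 0],
    ('qty', 5): [0, 0, 0, 0],
})

# Flatten the nested columns so 'shipmentid' can be declared a data column.
flat = df.copy()
flat.columns = ['shipmentid'] + [f'qty_{cat}' for cat in range(1, 6)]

with pd.HDFStore('store.h5', mode='w') as store:
    store.append('data', flat, format='table', data_columns=['shipmentid'])
    result = store.select('data', 'shipmentid == 2')
```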

How to efficiently rebuild pandas hdfstore table when append fails

自作多情 submitted on 2019-12-22 08:53:16

Question: I am working on using the HDFStore in pandas to store data frames from an ongoing iterative process. At each iteration, I append to a table in the HDFStore. Here is a toy example: import pandas as pd from pandas import HDFStore import numpy as np from random import choice from string import ascii_letters alphanum=np.array(list(ascii_letters)+range(0,9)) def hdfstore_append(storefile,key,df,format="t",columns=None,data_columns=None): if df is None: return if key[0]!='/': key='/'+key with HDFStore …
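The toy helper is cut off above; below is a hedged reconstruction of the idea, under the assumption that the append fails because a later chunk contains longer strings than the width reserved on the first append (pandas raises a ValueError in that case), and that "rebuilding" means reading the stored table back, concatenating, and rewriting it in one go. The rebuild strategy and the random-string generator are illustrative, not the author's exact code:

```python
import numpy as np
import pandas as pd
from pandas import HDFStore
from random import choice
from string import ascii_letters

# Note: in Python 3 the original `list(ascii_letters) + range(0, 9)` fails;
# the range must be converted to a list of strings first.
alphanum = np.array(list(ascii_letters) + [str(d) for d in range(0, 9)])

def random_frame(nrows=100, strlen=8):
    """Toy data: random alphanumeric strings plus a float column."""
    codes = [''.join(choice(alphanum) for _ in range(strlen)) for _ in range(nrows)]
    return pd.DataFrame({'code': codes, 'value': np.random.randn(nrows)})

def hdfstore_append(storefile, key, df, format="t", data_columns=None):
    """Append df to the store; if the append fails (e.g. a string column grew
    past the width reserved on the first append), rebuild the table."""
    if df is None:
        return
    if key[0] != '/':
        key = '/' + key
    with HDFStore(storefile) as store:
        try:
            store.append(key, df, format=format, data_columns=data_columns)
        except ValueError:
            # Rebuild: combine what is already stored with the new chunk,
            # then rewrite the whole table in one go.
            existing = store.select(key) if key in store else None
            combined = pd.concat([existing, df]) if existing is not None else df
            store.put(key, combined, format=format, data_columns=data_columns)

hdfstore_append('store.h5', 'results', random_frame(strlen=8))
hdfstore_append('store.h5', 'results', random_frame(strlen=20))  # longer strings may trigger the rebuild path
```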

Pandas HDFStore: slow on query for non-matching string

给你一囗甜甜゛ submitted on 2019-12-11 03:57:45

Question: My issue is that when I try to look for a string that is NOT contained in the DataFrame (which is stored in an HDF5 file), it takes a very long time to complete the query. For example: I have a df that contains 2*10^9 rows, stored in an HDF5 file. I have a string column named "code" that was marked as a "data_column" (therefore it is indexed). When I search for a code that exists in the dataset ( store.select('df', 'code=valid_code') ) it takes around 10 seconds to get 70K rows. However, …
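One knob worth checking here is the kind of PyTables index built on the data column: pandas creates a medium-optimization index by default, and rebuilding it as a fully optimized CSI index can change how lookups behave for values outside the indexed range. A hedged sketch (the store key 'df' and column name 'code' come from the question; 'ZZZ' is a hypothetical non-matching code, and the optlevel/kind settings are simply the strongest available, not a guaranteed fix):

```python
import pandas as pd

with pd.HDFStore('data.h5') as store:
    # Rebuild a full, maximally optimized index on the 'code' data column.
    store.create_table_index('df', columns=['code'], optlevel=9, kind='full')

    # ptrepack (a PyTables command-line tool) can also rewrite the file with
    # propagated indexes and compression, e.g.:
    #   ptrepack --chunkshape=auto --propindexes --complevel=9 data.h5 out.h5

    result = store.select('df', 'code == "ZZZ"')  # query for a code that is not present
```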

Read the properties of HDF file in Python

谁说胖子不能爱 submitted on 2019-12-08 12:18:03

Question: I have a problem reading an HDF file in pandas. As of now, I don't know the keys of the file. How do I read the file [data.hdf] in such a case? Also, my file is .hdf, not .h5; does it make a difference in terms of data fetching? I see that you need a 'group identifier in the store': pandas.io.pytables.read_hdf(path_or_buf, key, **kwargs) I was able to get the metadata from PyTables: File(filename=data.hdf, title='', mode='a', root_uep='/', filters=Filters(complevel=0, shuffle=False, fletcher32=False, …
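The keys can be discovered by opening the file as an HDFStore; the extension does not matter to pandas, since the format inside is HDF5 either way. A minimal sketch, assuming the file was written by pandas so its nodes are readable as DataFrames ('data.hdf' is the filename from the question):

```python
import pandas as pd

with pd.HDFStore('data.hdf', mode='r') as store:
    print(store.keys())       # e.g. ['/some_key', ...]
    key = store.keys()[0]     # pick the first (or any known) key
    df = store[key]           # equivalent to pd.read_hdf('data.hdf', key)

# If the file contains only a single dataset, read_hdf can also be called without a key:
df = pd.read_hdf('data.hdf')
```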

How does one store a Pandas DataFrame as an HDF5 PyTables table (or CArray, EArray, etc.)?

北战南征 submitted on 2019-12-07 09:11:54

Question: I have the following pandas DataFrame: import pandas as pd df = pd.read_csv(filename.csv) Now, I can use HDFStore to write the df object to file (like adding key-value pairs to a Python dictionary): store = HDFStore('store.h5') store['df'] = df http://pandas.pydata.org/pandas-docs/stable/io.html When I look at the contents, this object is a frame. store outputs <class 'pandas.io.pytables.HDFStore'> File path: store.h5 /df frame (shape->[552,23252]) However, in order to use indexing, one …
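Dictionary-style assignment stores the DataFrame in the 'fixed' format, which cannot be queried. To get a PyTables table that supports where-based selection, the format has to be requested explicitly. A minimal sketch (store.h5 and the 'df' key come from the question; the small stand-in frame and the where clause are illustrative):

```python
import pandas as pd
from pandas import HDFStore

df = pd.DataFrame({'a': range(5), 'b': list('vwxyz')})  # stand-in for the CSV data

store = HDFStore('store.h5')
store.put('df', df, format='table', data_columns=True)  # 'table' -> queryable PyTables Table
# store.append('df2', df, format='table')               # append also creates a table

print(store)            # /df  frame_table  (typ->appendable, ...)
subset = store.select('df', 'a > 2')
store.close()
```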

Import huge data-set from SQL server to HDF5

好久不见. submitted on 2019-12-06 08:22:36

Question: I am trying to import ~12 million records with 8 columns into Python. Because of its huge size, my laptop memory would not be sufficient for this. Now I'm trying to import the SQL data into an HDF5 file. It would be very helpful if someone could share a snippet of code that queries data from SQL and saves it in the HDF5 format in chunks. I am open to using any other file format that would be easier to use. I plan to do some basic exploratory analysis and later on might create some decision …
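A common pattern is pd.read_sql with a chunksize, appending each chunk to an HDFStore table. A hedged sketch (the connection string, query, table key, chunk size, and string-column width are all placeholders; min_itemsize is reserved up front so a later chunk with longer strings does not break the append):

```python
import pandas as pd
from sqlalchemy import create_engine

# Placeholder connection string and query: adjust for the real SQL Server instance.
engine = create_engine("mssql+pyodbc://user:password@my_dsn")
query = "SELECT * FROM my_table"

with pd.HDFStore("data.h5", mode="w", complevel=9, complib="blosc") as store:
    for chunk in pd.read_sql(query, engine, chunksize=100_000):
        store.append(
            "records",
            chunk,
            format="table",
            data_columns=True,
            min_itemsize={"values": 64},  # assumed width reserved for string columns
        )
```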

Get inferred dataframe types iteratively using chunksize

人盡茶涼 submitted on 2019-12-05 23:21:26

Question: How can I use pd.read_csv() to iteratively chunk through a file and retain the dtype and other meta-information as if I had read in the entire dataset at once? I need to read in a dataset that is too large to fit into memory. I would like to import the file using pd.read_csv and then immediately append each chunk into an HDFStore. However, the data type inference knows nothing about subsequent chunks. If the first chunk stored in the table contains only int and a subsequent chunk contains a float, …
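One way around per-chunk inference is to fix the dtypes before chunking: sample the file once, widen any integer columns to float so a later decimal value cannot change the column type mid-stream, then pass the resulting dtype map to every chunked read. A hedged sketch (file names, sample size, and chunk size are placeholders):

```python
import pandas as pd

csv_path = "big.csv"

# Infer dtypes from a sample, then promote integer columns to float64 so a
# later chunk containing decimals cannot change the column type mid-stream.
sample = pd.read_csv(csv_path, nrows=10_000)
dtypes = {
    col: ("float64" if pd.api.types.is_integer_dtype(t) else t)
    for col, t in sample.dtypes.items()
}

with pd.HDFStore("big.h5", mode="w") as store:
    for chunk in pd.read_csv(csv_path, dtype=dtypes, chunksize=100_000):
        store.append("data", chunk, format="table", data_columns=True)
```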