hdfstore

How does one store a Pandas DataFrame as an HDF5 PyTables table (or CArray, EArray, etc.)?

懵懂的女人 submitted on 2019-12-05 16:13:19
I have the following pandas dataframe:

import pandas as pd
df = pd.read_csv("filename.csv")

Now, I can use HDFStore to write the df object to file (like adding key-value pairs to a Python dictionary):

store = HDFStore('store.h5')
store['df'] = df

(See http://pandas.pydata.org/pandas-docs/stable/io.html.) When I look at the contents, this object is stored as a frame; printing store outputs:

<class 'pandas.io.pytables.HDFStore'>
File path: store.h5
/df            frame        (shape->[552,23252])

However, in order to use indexing, one should store this as a table object. My approach was to try HDFStore.put(), i.e. HDFStore.put(key="store.h",
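A minimal sketch of the table-format route, reusing the store and df names from the question (the DataFrame contents below are placeholders, not the real CSV data): passing format='table' to put(), or using append(), writes a queryable PyTables table instead of the default fixed-format frame.

import pandas as pd
from pandas import HDFStore

df = pd.DataFrame({"a": range(5), "b": list("vwxyz")})  # stand-in for the real CSV data

store = HDFStore('store.h5')
# format='table' creates a PyTables table; data_columns=True makes the columns queryable
store.put('df', df, format='table', data_columns=True)
print(store)                                  # /df now shows up as frame_table
subset = store.select('df', where='a > 2')    # on-disk queries only work on tables
store.close()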

How to efficiently rebuild pandas hdfstore table when append fails

眉间皱痕 submitted on 2019-12-05 15:09:11
I am working on using the HDFStore in pandas to store data frames from an ongoing iterative process. At each iteration, I append to a table in the HDFStore. Here is a toy example:

import pandas as pd
from pandas import HDFStore
import numpy as np
from random import choice
from string import ascii_letters

alphanum = np.array(list(ascii_letters) + list(range(0, 9)))

def hdfstore_append(storefile, key, df, format="t", columns=None, data_columns=None):
    if df is None:
        return
    if key[0] != '/':
        key = '/' + key
    with HDFStore(storefile) as store:
        if key not in store.keys():
            store.put(key, df, format=format, columns=columns, data
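One common recovery pattern, shown here only as a hedged sketch (not necessarily the asker's eventual solution): if append() raises because the incoming frame's schema no longer matches the stored table, read the old rows back, concatenate with the new ones, and rewrite the key in a single put().

import pandas as pd
from pandas import HDFStore

def append_or_rebuild(storefile, key, df):
    # Append df to key; if the schema no longer matches, rebuild the table from scratch.
    with HDFStore(storefile) as store:
        try:
            store.append(key, df, format="table")
        except (ValueError, TypeError):
            # Schema drifted (new columns, dtype change, wider strings, ...):
            # pull the existing rows, combine with the new ones, and rewrite the key.
            old = store[key] if key in store else pd.DataFrame()
            combined = pd.concat([old, df], ignore_index=True)
            store.put(key, combined, format="table")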

Import huge data-set from SQL server to HDF5

左心房为你撑大大i submitted on 2019-12-04 12:39:23
I am trying to import ~12 million records with 8 columns into Python. Because of its huge size, my laptop memory would not be sufficient for this. Now I'm trying to import the SQL data into an HDF5 file format. It would be very helpful if someone can share a snippet of code that queries data from SQL and saves it in the HDF5 format in chunks. I am open to using any other file format that would be easier to use. I plan to do some basic exploratory analysis, and later on might create some decision tree/linear regression models using pandas.

import pyodbc
import numpy as np
import pandas as pd

con =
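A hedged sketch of the usual chunked pattern; the connection string, table name, and column names are placeholders, not anything from the question. read_sql with chunksize yields DataFrames one batch at a time, and each batch is appended to a single HDFStore table so the full result set never sits in memory.

import pyodbc
import pandas as pd

# Placeholder connection string -- driver, server, database, and auth are assumptions.
con = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};SERVER=myserver;DATABASE=mydb;Trusted_Connection=yes"
)
query = "SELECT col1, col2, col3 FROM my_table"  # hypothetical table and columns

with pd.HDFStore("data.h5", mode="w") as store:
    # chunksize turns read_sql into an iterator of DataFrames instead of one huge frame
    for chunk in pd.read_sql(query, con, chunksize=100000):
        store.append("my_table", chunk, format="table", data_columns=True)

con.close()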

Get inferred dataframe types iteratively using chunksize

回眸只為那壹抹淺笑 submitted on 2019-12-04 04:10:41
How can I use pd.read_csv() to iteratively chunk through a file and retain the dtype and other meta-information as if I read in the entire dataset at once? I need to read in a dataset that is too large to fit into memory. I would like to import the file using pd.read_csv and then immediately append the chunk into an HDFStore. However, the data type inference knows nothing about subsequent chunks. If the first chunk stored in the table contains only int and a subsequent chunk contains a float, an exception will be raised. So I need to first iterate through the dataframe using read_csv and
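One way to address this, sketched under the assumption of a two-pass approach (file name and key are hypothetical, and the second read of the file is the price paid): scan the CSV once in chunks to promote each column to a dtype wide enough for every chunk, then re-read with those dtypes fixed and append to the store.

import numpy as np
import pandas as pd

csv_path = "big_file.csv"  # hypothetical input file

# Pass 1: walk the file in chunks and promote each column's dtype so it can hold
# every chunk's values (e.g. int64 combined with float64 becomes float64).
dtypes = {}
for chunk in pd.read_csv(csv_path, chunksize=100000):
    for col, dt in chunk.dtypes.items():
        dtypes[col] = np.result_type(dtypes.get(col, dt), dt)

# Pass 2: re-read with the promoted dtypes declared up front and append to the store.
with pd.HDFStore("big_file.h5", mode="w") as store:
    for chunk in pd.read_csv(csv_path, chunksize=100000, dtype=dtypes):
        store.append("data", chunk, format="table")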

HDFStore: table.select and RAM usage

依然范特西╮ submitted on 2019-11-29 14:48:16
Question: I am trying to select random rows from an HDFStore table of about 1 GB. RAM usage explodes when I ask for about 50 random rows. I am using pandas 0.11-dev, Python 2.7, linux64.

In this first case, the RAM usage fits the size of the chunk:

with pd.get_store("train.h5",'r') as train:
    for chunk in train.select('train', chunksize=50):
        pass

In this second case, it seems like the whole table is loaded into RAM:

r = random.choice(400000, size=40, replace=False)
train.select('train', pd.Term("index", r))

In this
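One memory-friendly workaround, given only as a sketch (file and key names taken from the question, the rest assumed): pull each random row individually with select()'s start/stop arguments, which read just that slice from disk rather than materializing the whole table.

import numpy as np
import pandas as pd

rows = np.random.choice(400000, size=40, replace=False)  # random row positions

with pd.HDFStore("train.h5", mode="r") as store:
    # start/stop select by on-disk row position, so each call reads a single row
    sample = pd.concat(
        store.select("train", start=int(i), stop=int(i) + 1) for i in sorted(rows)
    )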

How does one append large amounts of data to a Pandas HDFStore and get a natural unique index?

冷暖自知 submitted on 2019-11-27 05:43:33
Question: I'm importing large amounts of HTTP logs (80GB+) into a Pandas HDFStore for statistical processing. Even within a single import file I need to batch the content as I load it. My tactic thus far has been to read the parsed lines into a DataFrame and then store the DataFrame into the HDFStore. My goal is to have the index key unique for a single key in the DataStore, but each DataFrame restarts its own index value again. I was anticipating HDFStore.append() would have some mechanism to tell it to
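A minimal sketch of one way to keep the on-disk index monotonically increasing (hedged; the helper, file, and key names are made up for illustration): read the current row count of the stored table and offset each incoming frame's index before appending.

import pandas as pd
from pandas import HDFStore

def append_with_running_index(store, key, df):
    # Append df so its index continues where the stored table left off.
    try:
        nrows = store.get_storer(key).nrows   # rows already on disk for this key
    except (KeyError, AttributeError):
        nrows = 0                             # key does not exist yet
    df = df.copy()
    df.index = pd.RangeIndex(start=nrows, stop=nrows + len(df))
    store.append(key, df, format="table")

store = HDFStore("logs.h5")
append_with_running_index(store, "http_logs", pd.DataFrame({"status": [200, 404]}))
append_with_running_index(store, "http_logs", pd.DataFrame({"status": [500]}))
print(store["http_logs"].index.tolist())      # [0, 1, 2] -- the index keeps counting up
store.close()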