HDFStore.append(string, DataFrame) fails when string column contents are longer than those already there

只愿长相守 posted on 2019-12-03 11:55:39

Here is the link to the new docs section about this: http://pandas.pydata.org/pandas-docs/stable/io.html#string-columns

The issue is that you are specifying a column in min_itemsize that is not a data_column. A simple workaround is to add data_columns=True to your append statement, but I have also updated the code to automatically create the data_columns if you pass a valid column name. I think this makes sense: you want a minimum column size, so let it happen.

I also created a new docs section, String Columns, to show a more complete example with an explanation (the docs will be updated soon).
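If you are on a version without the automatic behavior, the data_columns=True workaround might look roughly like the sketch below. The helper names `utf8_itemsizes` and `append_all` are made up for illustration and are not part of pandas. The sketch assumes you can see all the frames you intend to append up front, so that the string widths can be chosen when the table is first created, and it assumes the default UTF-8 encoding when measuring lengths.

```python
import pandas as pd


def utf8_itemsizes(*frames):
    """Return {column: longest UTF-8 byte length} over the string values in frames.

    Hypothetical helper: columns that contain no str values are left out.
    """
    sizes = {}
    for frame in frames:
        for name in frame.columns:
            lengths = [len(v.encode("utf-8")) for v in frame[name] if isinstance(v, str)]
            if lengths:
                sizes[name] = max(sizes.get(name, 0), max(lengths))
    return sizes


def append_all(path, key, frames, sizes):
    """Append every frame under one key, reserving room for the widest strings.

    Hypothetical helper: needs PyTables installed, and is only defined here, not called.
    """
    with pd.HDFStore(path) as store:
        for frame in frames:
            # data_columns=True makes every column a data column, so each
            # key in the min_itemsize dict refers to a valid data column.
            store.append(key, frame, min_itemsize=sizes, data_columns=True)


first = pd.DataFrame({"A": ["foo"] * 5, "B": ["bar"] * 5})
later = pd.DataFrame({"A": ["a much longer string"], "B": ["baz"]})

# Widths large enough for both frames, computed before anything is written.
sizes = utf8_itemsizes(first, later)
```

You would then call something like `append_all("store.h5", "dfs", [first, later], sizes)`. Sizing the columns from the longest value you expect, rather than from the first frame alone, is what keeps the later append from failing.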

# this is the new behavior (after code updates)
In [340]: dfs = DataFrame(dict(A = 'foo', B = 'bar'),index=range(5))

In [341]: dfs
     A    B
0  foo  bar
1  foo  bar
2  foo  bar
3  foo  bar
4  foo  bar

# A and B have a size of 30
In [342]: store.append('dfs', dfs, min_itemsize = 30)

In [343]: store.get_storer('dfs').table
/dfs/table (Table(5,)) ''
  description := {
  "index": Int64Col(shape=(), dflt=0, pos=0),
  "values_block_0": StringCol(itemsize=30, shape=(2,), dflt='', pos=1)}
  byteorder := 'little'
  chunkshape := (963,)
  autoIndex := True
  colindexes := {
    "index": Index(6, medium, shuffle, zlib(1)).is_CSI=False}

# A is created as a data_column with a size of 30
# B's size is calculated
In [344]: store.append('dfs2', dfs, min_itemsize = { 'A' : 30 })

In [345]: store.get_storer('dfs2').table
/dfs2/table (Table(5,)) ''
  description := {
  "index": Int64Col(shape=(), dflt=0, pos=0),
  "values_block_0": StringCol(itemsize=3, shape=(1,), dflt='', pos=1),
  "A": StringCol(itemsize=30, shape=(), dflt='', pos=2)}
  byteorder := 'little'
  chunkshape := (1598,)
  autoIndex := True
  colindexes := {
    "A": Index(6, medium, shuffle, zlib(1)).is_CSI=False,
    "index": Index(6, medium, shuffle, zlib(1)).is_CSI=False}