问题
I'm trying to follow the example on this notebook.
As suggested in this github thread:
- I've upped the ulimit to 9999.
- I've already converted the csv files to hdf5
My code fails when trying to open a single hdf5 file into a dataframe:
df = vaex.open('data/chat_history_00.hdf5')
Here's the rest of the code:
import re
import glob
import vaex
import numpy as np
def tryint(s):
try:
return int(s)
except:
return s
def alphanum_key(s):
""" Turn a string into a list of string and number chunks.
"z23a" -> ["z", 23, "a"]
"""
return [ tryint(c) for c in re.split('([0-9]+)', s) ]
hdf5_list = glob.glob('data/*.hdf5')
hdf5_list.sort(key=alphanum_key)
hdf5_list = np.array(hdf5_list)
assert len(hdf5_list) == 11, "Incorrect number of files"
# Check how the single file looks like:
df = vaex.open('data/chat_history_10.hdf5')
df
Error generated:
ERROR:MainThread:vaex:error opening 'data/chat_history_00.hdf5' --------------------------------------------------------------------------- ValueError Traceback (most recent call last) in 1 # Check how the single file looks like: ----> 2 df = vaex.open('data/chat_history_10.hdf5') 3 df
/usr/local/anaconda3/lib/python3.7/site-packages/vaex/init.py in open(path, convert, shuffle, copy_index, *args, **kwargs) 207 ds = from_csv(path, copy_index=copy_index, **kwargs) 208 else: --> 209 ds = vaex.file.open(path, *args, **kwargs) 210 if convert and ds: 211 ds.export_hdf5(filename_hdf5, shuffle=shuffle)
/usr/local/anaconda3/lib/python3.7/site-packages/vaex/file/init.py in open(path, *args, **kwargs) 39 break 40 if dataset_class: ---> 41 dataset = dataset_class(path, *args, **kwargs) 42 return dataset 43
/usr/local/anaconda3/lib/python3.7/site-packages/vaex/hdf5/dataset.py in init(self, filename, write) 84 self.h5table_root_name = None 85 self._version = 1 ---> 86 self._load() 87 88 def write_meta(self):
/usr/local/anaconda3/lib/python3.7/site-packages/vaex/hdf5/dataset.py in _load(self) 182 def _load(self): 183 if "data" in self.h5file: --> 184 self._load_columns(self.h5file["/data"]) 185 self.h5table_root_name = "/data" 186 if "table" in self.h5file:
/usr/local/anaconda3/lib/python3.7/site-packages/vaex/hdf5/dataset.py in _load_columns(self, h5data, first) 348 self.add_column(column_name, self._map_hdf5_array(data, column['mask'])) 349 else: --> 350 self.add_column(column_name, self._map_hdf5_array(data)) 351 else: 352 transposed = shape1 < shape[0]
/usr/local/anaconda3/lib/python3.7/site-packages/vaex/dataframe.py in add_column(self, name, f_or_array, dtype) 2929
if len(self) == len(ar): 2930 raise ValueError("Array is of length %s, while the length of the DataFrame is %s due to the filtering, the (unfiltered) length is %s." % (len(ar), len(self), self.length_unfiltered())) -> 2931 raise ValueError("array is of length %s, while the length of the DataFrame is %s" % (len(ar), self.length_original())) 2932 # assert self.length_unfiltered() == len(data), "columns should be of equal length, length should be %d, while it is %d" % ( self.length_unfiltered(), len(data)) 2933 valid_name = vaex.utils.find_valid_name(name)ValueError: array is of length 2578961, while the length of the DataFrame is 6
What does this mean and how do I troubleshoot it? All the files has 6 columns.
EDIT: Here's how I created the hdf5 file:
pd.read_csv(r'G:/path/to/file/data/chat_history-00.csv').to_hdf(r'data/chat_history_00.hdf5', key='data')
回答1:
The question has been answered by Jovan of vaex on Github:
You should not use pandas .to_hdf if you want to read the data with vaex in a memory-mapped way. Please see this link for more details.
I used this instead:
vdf = vaex.from_pandas(df, copy_index=False)
vdf.export_hdf5('chat_history_00.hdf5')
来源:https://stackoverflow.com/questions/59759479/how-do-i-troubleshoot-valueerror-array-is-of-length-s-while-the-length-of-the