Converting a pandas DataFrame to an H2O frame efficiently

时光毁灭记忆、已成空白 submitted on 2019-12-10 13:02:04

Question


I have a pandas DataFrame whose source data is latin-1 encoded and delimited by ;. The DataFrame is very large, roughly 350000 x 3800. I initially wanted to use sklearn, but my DataFrame has missing values (NaN), so I could not use sklearn's random forests or GBM. I therefore had to use H2O's Distributed Random Forest to train on the dataset. The main problem is that the DataFrame is not converted efficiently when I call h2o.H2OFrame(data). I looked for a way to specify the encoding, but found nothing in the documentation.

Does anyone have an idea about this? Any leads would help. I would also like to know whether there are other libraries like H2O that handle NaN values efficiently. I know that the columns could be imputed, but I should not do that with my dataset: the columns hold values from different sensors, and a missing value means that the sensor is not present. I can only use Python.


Answer 1:


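import h2o
import pandas as pd

h2o.init()  # start or connect to an H2O cluster before creating frames

# Small DataFrame with non-ASCII strings, converted directly to an H2OFrame
df = pd.DataFrame({'col1': [1, 1, 2], 'col2': ['César Chávez Day', 'César Chávez Day', 'César Chávez Day']})
hf = h2o.H2OFrame(df)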

Since the problem you are facing is due to the high number of NaNs in the dataset, that should be handled first. There are two ways to do so.

  1. Replace NaN with a single, obviously out-of-range value. For example, if a feature varies between 0 and 1, replace every NaN in that feature with -1.

  2. Use scikit-learn's Imputer class (SimpleImputer in recent versions) to handle NaN values. This replaces NaN with the mean, median, or mode of that feature.
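The two options above can be sketched with pandas alone. This is a minimal illustration, not the answerer's code: the helper names, the `margin` parameter, and the sensor columns are hypothetical, and the median fill is a pandas stand-in for what an imputer class would do.

```python
import numpy as np
import pandas as pd


def fill_with_sentinel(df, margin=1.0):
    """Replace NaN in each numeric column with a value below that column's minimum."""
    out = df.copy()
    for col in out.select_dtypes(include="number").columns:
        # Assumption: "minimum minus margin" is out of range for this column.
        sentinel = out[col].min() - margin
        out[col] = out[col].fillna(sentinel)
    return out


def fill_with_median(df):
    """Replace NaN in each numeric column with that column's median."""
    out = df.copy()
    num_cols = out.select_dtypes(include="number").columns
    out[num_cols] = out[num_cols].fillna(out[num_cols].median())
    return out


# Hypothetical sensor readings; NaN marks a sensor that is not present.
readings = pd.DataFrame({
    "sensor_a": [0.2, np.nan, 0.9],
    "sensor_b": [np.nan, 5.0, 7.0],
})
sentinel_filled = fill_with_sentinel(readings)
median_filled = fill_with_median(readings)
```

A sentinel keeps "sensor not present" distinguishable from real readings, which a median fill does not, so the first option is probably closer to what the question asks for.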




Answer 2:


If there is a large number of missing values in your data and you want to speed up the conversion, I would recommend explicitly specifying the column types and NA strings instead of letting H2O infer them. You can pass a list of strings to be interpreted as NAs and a dictionary specifying the column types to the H2OFrame() constructor.

It will also allow you to create custom labels for the sensors that are not present, instead of having a generic "not available" (impute NaN values with a custom string in pandas).

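import h2o

# Map each column name to an H2O type; 'numeric' and 'enum' are example values
col_dtypes = {'col1_name': 'numeric', 'col2_name': 'enum'}
# Strings that H2O should treat as missing values
na_list = ['NA', 'none', 'nan', 'etc']

# df is your existing pandas DataFrame; h2o.init() must have been called first
hf = h2o.H2OFrame(df, column_types=col_dtypes, na_strings=na_list)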
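With around 3800 columns, writing the column_types dictionary by hand is impractical, so it could be derived from the pandas dtypes. The sketch below is one possible way to do that. The helper names are hypothetical, and mapping every non-numeric, non-datetime column to 'enum' is an assumption that may not suit free-text columns (where 'string' would fit better).

```python
import pandas as pd
from pandas.api import types as pdt


def build_column_types(df):
    """Map each pandas column name to an H2O column type name."""
    col_types = {}
    for name, dtype in df.dtypes.items():
        if pdt.is_numeric_dtype(dtype):
            col_types[name] = "numeric"
        elif pdt.is_datetime64_any_dtype(dtype):
            col_types[name] = "time"
        else:
            # Assumption: remaining columns are categorical labels.
            col_types[name] = "enum"
    return col_types


def to_h2o_frame(df, na_strings=("NA", "none", "nan")):
    """Convert df to an H2OFrame with explicit types (needs h2o.init() beforehand)."""
    import h2o  # imported lazily so build_column_types works without H2O installed
    return h2o.H2OFrame(
        df,
        column_types=build_column_types(df),
        na_strings=list(na_strings),
    )


# Hypothetical two-column frame to show the mapping.
example = pd.DataFrame({"col1": [1.0, None, 2.0], "col2": ["a", "b", None]})
example_types = build_column_types(example)
```

Calling to_h2o_frame(example) would then pass the derived dictionary straight to H2OFrame().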

For more information - http://docs.h2o.ai/h2o/latest-stable/h2o-py/docs/_modules/h2o/frame.html#H2OFrame

Edit: @ErinLeDell's suggestion to use h2o.import_file() directly, specifying the column dtypes and NA strings, will give you the largest speed-up.
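A rough sketch of the import_file() route follows, assuming the data is not already available as a file that H2O can read. The helper names and the "NA" marker are hypothetical. Re-encoding to UTF-8 on the way out is an assumption about what H2O's parser expects, made because the question found no encoding option.

```python
import os
import tempfile

import numpy as np
import pandas as pd


def write_csv_for_h2o(df, path, na_rep="NA"):
    """Write df as a UTF-8, semicolon-delimited CSV with an explicit NA marker."""
    df.to_csv(path, sep=";", index=False, encoding="utf-8", na_rep=na_rep)
    return path


def import_with_h2o(path, col_types, na_rep="NA"):
    """Let H2O parse the CSV itself (needs h2o installed and h2o.init() called)."""
    import h2o  # imported lazily so the CSV helper works without H2O installed
    return h2o.import_file(path, sep=";", col_types=col_types, na_strings=[na_rep])


# Hypothetical frame with a missing reading and a non-ASCII label.
demo = pd.DataFrame({"sensor_a": [0.5, np.nan], "label": ["César", "x"]})
csv_path = write_csv_for_h2o(demo, os.path.join(tempfile.mkdtemp(), "demo.csv"))
with open(csv_path, encoding="utf-8") as fh:
    csv_lines = fh.read().splitlines()
```

With a running cluster, import_with_h2o(csv_path, {"sensor_a": "numeric", "label": "enum"}) would load the file. If the original ;-delimited file is still on disk, the pandas round trip may be unnecessary.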



Source: https://stackoverflow.com/questions/46971969/conversion-of-pandas-dataframe-to-h2o-frame-efficiently
