Chinese Text for H2O DataFrame in Python

只谈情不闲聊 提交于 2019-12-25 03:23:00

问题


I have a utf-8 encoded csv file with Chinese text. When I tried to import as an h2o dataframe, the data is improperly displayed as gibberish.

 dataframe = h2o.import_file('test.csv')

In the resulting dataframe, the column names are correct, but instead of Chinese text, it displays text like this:

 在ç�¡è¦ºäº†ä½ 知é�

I looked into h2o documentation and there doesn't seem to be any way to set an encoding option like in pandas when using import_file. Further, when running the following:

testing = ['你','好','嗎']
h2o.H2OFrame(testing)

it gives this error:

--------------------------------------------------------------------------
 UnicodeEncodeError                        Traceback (most recent call last)
<ipython-input-2-5f4b3eb49a84> in <module>
      1 testing = ['你','好','嗎']
----> 2 h2o.H2OFrame(testing)

~\AppData\Local\Continuum\anaconda3\lib\site-packages\h2o\frame.py in __init__(self, python_obj, destination_frame, header, separator, column_names, column_types, na_strings, skipped_columns)
    104         if python_obj is not None:
    105             self._upload_python_object(python_obj, 
destination_frame, header, separator,
--> 106                                        column_names, 
column_types, na_strings, skipped_columns)
    107 
    108     @staticmethod

~\AppData\Local\Continuum\anaconda3\lib\site-packages\h2o\frame.py in _upload_python_object(self, python_obj, destination_frame, header, separator, column_names, column_types, na_strings, skipped_columns)
    143             csv_writer.writerow([row.get(k, None) for k in col_header])
    144         else:
--> 145             csv_writer.writerows(data_to_write)
    146         tmp_file.close()  # close the streams
    147         self._upload_parse(tmp_path, destination_frame, 1, 
separator, column_names, column_types, na_strings, skipped_columns)

~\AppData\Local\Continuum\anaconda3\lib\encodings\cp1252.py in encode(self, input, final)
     17 class IncrementalEncoder(codecs.IncrementalEncoder):
     18     def encode(self, input, final=False):
---> 19         return codecs.charmap_encode(input,self.errors,encoding_table)[0]
     20 
     21 class IncrementalDecoder(codecs.IncrementalDecoder):

UnicodeEncodeError: 'charmap' codec can't encode character '\u4f60' in position 1: character maps to <undefined>

Based on this error, it seems that cp1252 encoding is being used by h2o. Can someone offer help to have h2o import the csv file with Chinese to be in utf-8 encoding? Thank you.


回答1:


The jira ticket in the comments has been resolved, and this parsing issue is no longer an issue with newer version of H2O. My recommendation would be to upgrade - for example if you upgrade to latest version of H2O you shouldn't have any issues.

I did a test with version 3.22.0.2 with your example and got:

In [6]: h2o.H2OFrame(testing)
Parse progress: |█████████████████████████████████████████████████████████████████████████████| 100%
Out[6]:
C1
----
你
好
嗎

[3 rows x 1 column]


来源:https://stackoverflow.com/questions/53863717/chinese-text-for-h2o-dataframe-in-python

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!