Question
I have a UTF-8 encoded CSV file containing Chinese text. When I import it as an h2o dataframe, the text is displayed as gibberish.
dataframe = h2o.import_file('test.csv')
In the resulting dataframe, the column names are correct, but instead of Chinese text, it displays text like this:
在ç�¡è¦ºäº†ä½ 知é�
I looked through the h2o documentation, and import_file does not appear to have an encoding option like the one pandas provides. Further, when I run the following:
testing = ['你','好','嗎']
h2o.H2OFrame(testing)
it gives this error:
--------------------------------------------------------------------------
UnicodeEncodeError Traceback (most recent call last)
<ipython-input-2-5f4b3eb49a84> in <module>
1 testing = ['你','好','嗎']
----> 2 h2o.H2OFrame(testing)
~\AppData\Local\Continuum\anaconda3\lib\site-packages\h2o\frame.py in __init__(self, python_obj, destination_frame, header, separator, column_names, column_types, na_strings, skipped_columns)
104 if python_obj is not None:
105 self._upload_python_object(python_obj,
destination_frame, header, separator,
--> 106 column_names,
column_types, na_strings, skipped_columns)
107
108 @staticmethod
~\AppData\Local\Continuum\anaconda3\lib\site-packages\h2o\frame.py in _upload_python_object(self, python_obj, destination_frame, header, separator, column_names, column_types, na_strings, skipped_columns)
143 csv_writer.writerow([row.get(k, None) for k in col_header])
144 else:
--> 145 csv_writer.writerows(data_to_write)
146 tmp_file.close() # close the streams
147 self._upload_parse(tmp_path, destination_frame, 1,
separator, column_names, column_types, na_strings, skipped_columns)
~\AppData\Local\Continuum\anaconda3\lib\encodings\cp1252.py in encode(self, input, final)
17 class IncrementalEncoder(codecs.IncrementalEncoder):
18 def encode(self, input, final=False):
---> 19 return codecs.charmap_encode(input,self.errors,encoding_table)[0]
20
21 class IncrementalDecoder(codecs.IncrementalDecoder):
UnicodeEncodeError: 'charmap' codec can't encode character '\u4f60' in position 1: character maps to <undefined>
Based on this error, it seems that h2o is using cp1252 encoding. How can I get h2o to import a CSV file containing Chinese text as UTF-8? Thank you.
Answer 1:
The Jira ticket mentioned in the comments has been resolved, and this parsing problem no longer occurs in newer versions of H2O. I recommend upgrading: with the latest version of H2O you shouldn't have any issues.
I tested your example with version 3.22.0.2 and got:
In [6]: h2o.H2OFrame(testing)
Parse progress: |█████████████████████████████████████████████████████████████████████████████| 100%
Out[6]:
C1
----
你
好
嗎
[3 rows x 1 column]
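For context, the two symptoms in the question can be reproduced with the Python standard library alone. The snippet below is only a sketch of the general mechanism, not of H2O's internals: it assumes the old failure came from text passing through the Windows default code page (cp1252, as named in the traceback) instead of UTF-8. The variable names are illustrative.

```python
# Sketch: why cp1252 cannot handle Chinese text, using only the standard library.
text = "你好嗎"

# 1) Encoding Chinese characters with cp1252 fails, as in the traceback above.
try:
    text.encode("cp1252")
    cp1252_ok = True
except UnicodeEncodeError:
    cp1252_ok = False

# 2) UTF-8 bytes read back with a single-byte code page turn into mojibake.
utf8_bytes = text.encode("utf-8")
mojibake = utf8_bytes.decode("cp1252", errors="replace")

# 3) Using UTF-8 on both sides round-trips the text unchanged.
roundtrip = utf8_bytes.decode("utf-8")

print(cp1252_ok)
print(mojibake)
print(roundtrip)
```

If an upgrade is not possible, this suggests checking which encoding is in effect wherever the text is written or read; whether that is enough for an older H2O release is not something I have verified.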
Source: https://stackoverflow.com/questions/53863717/chinese-text-for-h2o-dataframe-in-python