Converting from bytes to French text in Python

Submitted by 爷,独闯天下 on 2021-02-02 02:08:52

Question


I am cleaning the monolingual Europarl corpus for French (http://data.statmt.org/wmt19/translation-task/fr-de/monolingual/europarl-v7.fr.gz). The original raw data is a .gz file, which I downloaded using wget. I want to extract the text and see what it looks like so that I can process the corpus further.

Using the following code to extract the text from the gzip file, I obtained data of type bytes.

import gzip

# file_path is the path to the downloaded europarl-v7.fr.gz
with gzip.open(file_path, 'rb') as f_in:
    print('type(f_in)=', type(f_in))
    text = f_in.read()
    print('type(text)=', type(text))

The printed results for the first several lines are as follows:

type(f_in)= <class 'gzip.GzipFile'>

type(text)= <class 'bytes'>

b'Reprise de la session\nJe d\xc3\xa9clare reprise la session du Parlement europ\xc3\xa9en qui avait \xc3\xa9t\xc3\xa9 interrompue le vendredi 17 d\xc3\xa9cembre dernier et je vous renouvelle tous mes vux en esp\xc3\xa9rant que vous avez pass\xc3\xa9 de bonnes vacances.\nComme vous avez pu le constater, le grand "bogue de l\'an 2000" ne s\'est pas produit.\n

I tried decoding the binary data as utf8 and as ascii, using the following code:

with gzip.open(file_path, 'rb') as f_in:
    print('type(f_in)=', type(f_in))
    text = f_in.read().decode('utf8')
    print('type(text)=', type(text))

It raised errors like this:

UnicodeEncodeError: 'ascii' codec can't encode character '\xe9' in position 26: ordinal not in range(128)

I also tried using the codecs and unicodedata packages to open the file, but they raised encoding errors as well.

Could you please explain what I should do to get the French text in the correct format, like this for example?

Reprise de la session\nJe déclare reprise la session du Parlement européen qui avait été interrompue le vendredi 17 décembre dernier et je vous renouvelle tous mes vux en espérant que vous avez passé de bonnes vacances.\nComme vous avez pu le constater, le grand "bogue de l'an 2000" ne s'est pas produit.\n

Thank you a ton for your help!


Answer 1:


The UnicodeEncodeError occurs because, when printing, Python has to encode the string to bytes. In this case the encoding being used, ASCII, cannot represent 'é' ('\xe9'), so the error is raised. Note that this is an encode error, not a decode error: the decode call itself succeeded.
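The failure can be reproduced without the corpus or a terminal. The following is a minimal sketch, not the asker's actual program: the helper try_encode is a hypothetical name introduced here, and the sample bytes are a short fragment of the text quoted in the question.

```python
# Illustrative sketch: decode UTF-8 bytes, then try to encode the result
# with two different codecs, as print() must do before writing output.
raw = b'Je d\xc3\xa9clare reprise la session'
text = raw.decode('utf-8')  # decoding succeeds and yields a str


def try_encode(s, encoding):
    """Return the encoded bytes, or the error message if encoding fails."""
    try:
        return s.encode(encoding)
    except UnicodeEncodeError as exc:
        return str(exc)


ascii_result = try_encode(text, 'ascii')   # fails: ASCII cannot represent the accented e
utf8_result = try_encode(text, 'utf-8')    # succeeds and gives back the original bytes

print(ascii_result)
print(utf8_result)
```

The ASCII attempt produces the same kind of message as in the question, while the UTF-8 attempt round-trips cleanly.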

Setting the PYTHONIOENCODING environment variable forces Python to use a different encoding for its standard streams: the one named by the variable's value. The UTF-8 encoding can encode any character in this text, so calling the program like this solves the issue:

PYTHONIOENCODING=UTF-8 python3 europarl_extractor.py

assuming the code is something like this:

import gzip

if __name__ == '__main__':
    with gzip.open('europarl-v7.fr.gz', 'rb') as f_in:
        bs = f_in.read()
        txt = bs.decode('utf-8')
        print(txt[:100])

The environment variable may be set in other ways - via an export statement, in .bashrc, .profile etc.
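If changing the environment is not convenient, the encoding can also be chosen inside the program. The sketch below is an alternative to the environment variable, not something the original code does: it wraps a binary stream in an io.TextIOWrapper with an explicit UTF-8 encoding. The helper name make_utf8_writer is hypothetical, and an in-memory buffer stands in for the terminal.

```python
import io


def make_utf8_writer(binary_stream):
    """Wrap a binary stream so that text written to it is encoded as UTF-8."""
    # newline='\n' disables platform-specific newline translation on write.
    return io.TextIOWrapper(binary_stream, encoding='utf-8', newline='\n')


# Demonstration against an in-memory buffer instead of a real terminal.
buffer = io.BytesIO()
writer = make_utf8_writer(buffer)
writer.write('Je d\u00e9clare reprise la session\n')  # \u00e9 is the accented e
writer.flush()
encoded = buffer.getvalue()
print(encoded)
```

In a script, the same wrapper could be built around sys.stdout.buffer. On Python 3.7 and later, sys.stdout.reconfigure(encoding='utf-8') is another way to get the same effect.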

An interesting question is why Python is trying to encode output as ASCII. I had assumed that on *nix systems, Python essentially looked at the $LANG environment variable to determine the encoding to use. But in this case the value of $LANG is fr_FR.UTF-8, and yet Python is using ASCII as the output encoding.

From looking at the source for the locale module, and this FAQ, these environment variables are checked, in order:

'LC_ALL', 'LC_CTYPE', 'LANG', 'LANGUAGE'

So it may be that one of LC_ALL or LC_CTYPE has been set to a value that mandates ASCII encoding in your environment (you can check by running the locale command in your terminal; also running locale charmap will tell you the encoding itself).
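The same check can be made from inside Python. This is a small diagnostic sketch; the function name describe_output_encoding is made up for illustration, and the values it reports depend entirely on the environment it runs in.

```python
import locale
import os
import sys

# The locale-related variables listed above, in the order they are checked.
LOCALE_VARIABLES = ('LC_ALL', 'LC_CTYPE', 'LANG', 'LANGUAGE')


def describe_output_encoding():
    """Collect the settings that influence how print() encodes text."""
    info = {
        # Encoding actually used by print(); may be None for some stream objects.
        'stdout_encoding': getattr(sys.stdout, 'encoding', None),
        # Encoding Python derives from the locale settings.
        'preferred_encoding': locale.getpreferredencoding(False),
        # Explicit override, if one has been set.
        'PYTHONIOENCODING': os.environ.get('PYTHONIOENCODING'),
    }
    for name in LOCALE_VARIABLES:
        info[name] = os.environ.get(name)
    return info


for key, value in describe_output_encoding().items():
    print(f'{key}: {value}')
```

If stdout_encoding reports an ASCII-based codec even though LANG names a UTF-8 locale, one of the other variables is a likely culprit.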




Answer 2:


Many thanks for all your help! I found a simple workaround. I'm not sure why it works, but I think that maybe the .txt format is supported somehow? If you know the mechanism, it would be extremely helpful to know.

import gzip
import os

# file_path and out_dir are defined elsewhere
with gzip.open(file_path, 'rb') as f_in:
    text = f_in.read()

with open(os.path.join(out_dir, 'europarl.txt'), 'wb') as f_out:
    f_out.write(text)

When I print out the text file in the terminal, it looks like this:

Reprise de la session Je déclare reprise la session du Parlement européen qui avait été interrompue le vendredi 17 décembre dernier et je vous renouvelle tous mes vux en espérant que vous avez passé de bonnes vacances. Comme vous avez pu le constater, le grand "bogue de l'an 2000" ne s'est pas produit. En revanche, les citoyens d'un certain nombre de nos pays ont été victimes de catastrophes naturelles qui ont vraiment été terribles. Vous avez souhaité un débat à ce sujet dans les prochains jours, au cours de cette période de session.
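A likely explanation, offered as a sketch rather than a confirmed diagnosis: the .txt extension plays no role. Opening both files in binary mode ('rb' and 'wb') copies the UTF-8 bytes unchanged, so Python never has to encode anything, and the terminal then interprets those bytes itself. The example below uses a small temporary file in place of the corpus. The helper names copy_decompressed_bytes and read_gz_text are hypothetical, and read_gz_text shows the text-mode alternative, gzip.open with 'rt' and an explicit encoding.

```python
import gzip
import os
import shutil
import tempfile


def copy_decompressed_bytes(gz_path, out_path):
    """Decompress gz_path into out_path without decoding the bytes."""
    with gzip.open(gz_path, 'rb') as f_in, open(out_path, 'wb') as f_out:
        shutil.copyfileobj(f_in, f_out)


def read_gz_text(gz_path, encoding='utf-8'):
    """Decompress gz_path and decode it to str with an explicit encoding."""
    with gzip.open(gz_path, 'rt', encoding=encoding) as f_in:
        return f_in.read()


# Stand-in for the corpus: one short line containing an accented character.
sample = 'Je d\u00e9clare reprise la session\n'

with tempfile.TemporaryDirectory() as tmp_dir:
    gz_path = os.path.join(tmp_dir, 'sample.fr.gz')
    out_path = os.path.join(tmp_dir, 'sample.txt')

    with gzip.open(gz_path, 'wb') as f:
        f.write(sample.encode('utf-8'))

    copy_decompressed_bytes(gz_path, out_path)
    with open(out_path, 'rb') as f:
        copied_bytes = f.read()

    decoded_text = read_gz_text(gz_path)
```

Here copied_bytes holds exactly the UTF-8 bytes that were compressed, and decoded_text is the corresponding str. Printing decoded_text would still require an output encoding that can represent the accented characters, which is the issue discussed in Answer 1.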



Source: https://stackoverflow.com/questions/57197059/converting-from-bytes-to-french-text-in-python
