Writing Unicode text to a text file?

眼角桃花 2020-11-22 16:46

I'm pulling data out of a Google Doc, processing it, and writing it to a file (which I will eventually paste into a WordPress page).

It has some non-ASCII symbols. H

8 Answers
  • 2020-11-22 17:26

    In Python 2.6+, you can use io.open(), which is the same function as the builtin open() on Python 3:

    import io
    
    with io.open(filename, 'w', encoding=character_encoding) as file:
        file.write(unicode_text)
    

    This is more convenient if you need to write the text incrementally, because you don't have to call unicode_text.encode(character_encoding) for every chunk. Unlike the codecs module, the io module has proper universal newlines support.
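
    As a minimal sketch of the incremental case, the snippet below writes one line per write() call and reads the file back. The helper name write_lines, the temporary path, and the sample strings are made up for illustration; only io.open and its encoding and newline parameters come from the standard library.

```python
import io
import os
import tempfile


def write_lines(path, lines, encoding="utf-8"):
    """Write each text line with its own write() call (hypothetical helper)."""
    # newline="\n" keeps the line endings identical on every platform.
    with io.open(path, "w", encoding=encoding, newline="\n") as f:
        for line in lines:
            # No manual .encode() here: the file object encodes for us.
            f.write(line + u"\n")


path = os.path.join(tempfile.mkdtemp(), "out.txt")
write_lines(path, [u"Δ delta", u"葉 leaf"])

with io.open(path, "r", encoding="utf-8") as f:
    text = f.read()
print(text)
```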

  • 2020-11-22 17:28

    Deal exclusively with unicode objects as much as possible: decode input to unicode objects when you first receive it, and encode only as necessary on the way out.

    If your string is actually a unicode object, you'll need to encode it to a byte string (for example, UTF-8) before writing it to a file:

    foo = u'Δ, Й, ק, ‎ م, ๗, あ, 叶, 葉, and 말.'
    f = open('test', 'w')
    f.write(foo.encode('utf8'))
    f.close()
    

    When you read that file again, you'll get a UTF-8-encoded byte string that you can decode back to a unicode object:

    f = open('test', 'r')
    print f.read().decode('utf8')
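
    The snippets above are Python 2. A rough Python 3 version of the same "encode on the way out, decode on the way in" round trip might look like the sketch below; the temporary path and the sample string are placeholders, and binary mode is used so that the encoding step stays explicit.

```python
import os
import tempfile

foo = "Δ, Й, ק, ๗, あ, 叶, 葉, and 말."
path = os.path.join(tempfile.mkdtemp(), "test")

# Encode on the way out: text (str) -> bytes.
with open(path, "wb") as f:
    f.write(foo.encode("utf-8"))

# Decode on the way in: bytes -> text (str).
with open(path, "rb") as f:
    raw = f.read()
decoded = raw.decode("utf-8")
print(decoded)
```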
    
  • 2020-11-22 17:32

    When writing in Python 3:

    >>> a = u'bats\u00E0'
    >>> print(a)
    batsà
    >>> f = open("/tmp/test", "w")
    >>> f.write(a)
    5
    >>> f.close()
    >>> data = open("/tmp/test").read()
    >>> data
    'batsà'
    

    When writing in Python 2:

    >>> a = u'bats\u00E0'
    >>> f = open("/tmp/test", "w")
    >>> f.write(a)
    
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    UnicodeEncodeError: 'ascii' codec can't encode character u'\xe0' in position 4: ordinal not in range(128)
    

    To avoid this error, encode the text to bytes with the "utf-8" codec, like this:

    >>> f.write(a.encode("utf-8"))
    >>> f.close()
    

    and decode the data with the "utf-8" codec when reading:

    >>> data = open("/tmp/test").read()
    >>> data.decode("utf-8")
    u'bats\xe0'
    

    Also, if you print this unicode string, Python encodes it for you using the terminal's encoding (UTF-8 on most modern systems), like this:

    >>> print a
    batsà
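
    The Python 3 session above relies on the platform's default encoding, which may not be UTF-8 everywhere. A small sketch that names the encoding explicitly on both the write and the read (the temporary path is just a placeholder):

```python
import os
import tempfile

a = "bats\u00e0"
path = os.path.join(tempfile.mkdtemp(), "test")

# Name the encoding instead of relying on the locale default.
with open(path, "w", encoding="utf-8") as f:
    f.write(a)

with open(path, encoding="utf-8") as f:
    data = f.read()

# Reading in binary mode shows the bytes that were stored on disk.
with open(path, "rb") as f:
    raw = f.read()

print(data, raw)
```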
    
  • 2020-11-22 17:35

    The file opened by codecs.open takes unicode data, encodes it in iso-8859-1, and writes it to the file. However, what you are trying to write isn't unicode: you take unicode and encode it in iso-8859-1 yourself. That is what the unicode.encode method does, and the result of encoding a unicode string is a bytestring (a str in Python 2).

    You should either use normal open() and encode the unicode yourself, or (usually a better idea) use codecs.open() and not encode the data yourself.
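
    The two options can be sketched side by side as follows. The file names and the sample text are made up; the point is only that each approach encodes exactly once and that both produce the same bytes.

```python
import codecs
import os
import tempfile

text = u"caf\u00e9"
tmpdir = tempfile.mkdtemp()
path_manual = os.path.join(tmpdir, "manual.txt")
path_codecs = os.path.join(tmpdir, "codecs.txt")

# Option 1: plain binary file, encode the text yourself.
with open(path_manual, "wb") as f:
    f.write(text.encode("iso-8859-1"))

# Option 2: let codecs.open() encode; pass it unencoded text.
with codecs.open(path_codecs, "w", encoding="iso-8859-1") as f:
    f.write(text)

with open(path_manual, "rb") as f:
    manual_bytes = f.read()
with open(path_codecs, "rb") as f:
    codecs_bytes = f.read()

print(manual_bytes == codecs_bytes)
```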

  • 2020-11-22 17:36

    How to print Unicode characters into a file (Python 2):

    Save this as foo.py:

    #!/usr/bin/python -tt
    # -*- coding: utf-8 -*-
    import codecs
    import sys 
    UTF8Writer = codecs.getwriter('utf8')
    sys.stdout = UTF8Writer(sys.stdout)
    print(u'e with obfuscation: é')
    

    Run it and pipe output to file:

    python foo.py > tmp.txt
    

    Look inside tmp.txt and you will see this:

    el@apollo:~$ cat tmp.txt 
    e with obfuscation: é
    

    Thus you have saved a Unicode e with an obfuscation mark on it to a file.
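
    The codecs.getwriter wrapper above depends on Python 2's byte-oriented sys.stdout. On Python 3, one possible equivalent is to wrap the underlying binary stream in an io.TextIOWrapper. In this sketch the helper name utf8_writer is hypothetical, and an in-memory io.BytesIO stands in for the real stream so the resulting bytes can be inspected.

```python
import io


def utf8_writer(binary_stream):
    """Wrap a binary stream so that text written to it is encoded as UTF-8."""
    return io.TextIOWrapper(binary_stream, encoding="utf-8")


# For real output one would presumably pass sys.stdout.buffer instead:
#     import sys
#     out = utf8_writer(sys.stdout.buffer)
buf = io.BytesIO()
out = utf8_writer(buf)
out.write(u"e with obfuscation: \u00e9")
out.flush()  # push the encoded bytes through to the underlying stream

captured = buf.getvalue()
print(captured)
```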

  • 2020-11-22 17:39

    Preface: will your viewer work?

    Make sure your viewer, editor, or terminal (however you are interacting with your UTF-8 encoded file) can read the file. This is frequently an issue on Windows, for example with Notepad.

    Writing Unicode text to a text file?

    In Python 2, use open from the io module (this is the same as the builtin open in Python 3):

    import io
    

    As a general best practice, use UTF-8 for writing to files (with UTF-8 we don't even have to worry about byte order).

    encoding = 'utf-8'
    

    UTF-8 is the most modern and universally usable encoding: it works in all web browsers, most text editors (check your settings if you have issues), and most terminals and shells.

    On Windows, you might try utf-16le if you're limited to viewing output in Notepad (or another limited viewer).

    encoding = 'utf-16le' # sorry, Windows users... :(
    

    And just open it with the context manager and write your unicode characters out:

    with io.open(filename, 'w', encoding=encoding) as f:
        f.write(unicode_object)
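
    To see what the encodings discussed above actually put on disk, the sketch below encodes a single character three ways. The utf-8-sig codec is not mentioned in this answer; it is shown here only as an assumption about what might help editors that detect UTF-8 through a byte-order mark.

```python
text = u"\u00e9"  # é

utf8_bytes = text.encode("utf-8")          # two bytes, no byte-order issue
utf16le_bytes = text.encode("utf-16le")    # little-endian code unit, no BOM
utf8_sig_bytes = text.encode("utf-8-sig")  # UTF-8 preceded by a BOM

print(utf8_bytes, utf16le_bytes, utf8_sig_bytes)
```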
    

    Example using many Unicode characters

    Here's an example that attempts to map every possible character up to three bytes wide (four is the maximum, but that would be going a bit far) from its integer code point to encoded, printable output, along with its name where available (put this into a file called uni.py):

    from __future__ import print_function
    import io
    from unicodedata import name, category
    from curses.ascii import controlnames
    from collections import Counter
    
    try: # use these if Python 2
        unicode_chr, range = unichr, xrange
    except NameError: # Python 3
        unicode_chr = chr
    
    exclude_categories = set(('Co', 'Cn'))
    counts = Counter()
    control_names = dict(enumerate(controlnames))
    with io.open('unidata', 'w', encoding='utf-8') as f:
        for x in range((2**8)**3): 
            try:
                char = unicode_chr(x)
            except ValueError:
                continue # can't map to unicode, try next x
            cat = category(char)
            counts.update((cat,))
            if cat in exclude_categories:
                continue # get rid of noise & greatly shorten result file
            try:
                uname = name(char)
            except ValueError: # probably control character, don't use actual
                uname = control_names.get(x, '')
                f.write(u'{0:>6x} {1}    {2}\n'.format(x, cat, uname))
            else:
                f.write(u'{0:>6x} {1}  {2}  {3}\n'.format(x, cat, char, uname))
    # may as well describe the types we logged.
    for cat, count in counts.items():
        print('{0} chars of category, {1}'.format(count, cat))
    

    This should take on the order of a minute to run. You can then view the data file, and if your file viewer can display Unicode, you'll see the characters. Information about the categories can be found here. Based on the counts, we can probably improve our results by excluding the Cn and Co categories, which have no symbols associated with them.

    $ python uni.py
    

    The data file lists the hexadecimal code point, the category, the symbol (omitted when no name can be found, which probably means a control character), and the name of the symbol.

    To view it, I recommend less on Unix or Cygwin (don't print or cat the entire file to your output):

    $ less unidata
    

    It will display lines similar to the following, which I sampled from it using Python 2 (Unicode 5.2):

         0 Cc NUL
        20 Zs     SPACE
        21 Po  !  EXCLAMATION MARK
        b6 So  ¶  PILCROW SIGN
        d0 Lu  Ð  LATIN CAPITAL LETTER ETH
       e59 Nd  ๙  THAI DIGIT NINE
      2887 So  ⢇  BRAILLE PATTERN DOTS-1238
      bc13 Lo  밓  HANGUL SYLLABLE MIH
      ffeb Sm  →  HALFWIDTH RIGHTWARDS ARROW
    

    My Python 3.5 from Anaconda has Unicode 8.0; I would presume most Python 3 installations do too.
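
    Rather than presuming, you can ask the interpreter which Unicode database version it ships with. A minimal check using the standard unicodedata module:

```python
import unicodedata

# Version of the Unicode Character Database bundled with this interpreter.
version = unicodedata.unidata_version
print(version)
```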
