Python UTF-16 CSV reader

温柔的废话 2020-11-27 21:49

I have a UTF-16 CSV file which I have to read. The Python csv module does not seem to support UTF-16.

I am using Python 2.7.2. The CSV files I need to parse are huge, running into GBs of data.

4 Answers
  • 2020-11-27 22:12

    At the moment, the csv module does not support UTF-16.

    In Python 3.x, csv expects a text-mode file and you can simply use the encoding parameter of open to force another encoding:

    # Python 3.x only
    import csv
    with open('utf16.csv', 'r', encoding='utf16') as csvf:
        for line in csv.reader(csvf):
            print(line) # do something with the line
    

    In Python 2.x, you can recode the input:

    # Python 2.x only
    import codecs
    import csv
    
    class Recoder(object):
        """Wraps a byte stream, decoding from one encoding and re-encoding
        to another on the fly, so csv.reader can consume it line by line."""
        def __init__(self, stream, decoder, encoder, eol='\r\n'):
            self._stream = stream
            self._decoder = decoder if isinstance(decoder, codecs.IncrementalDecoder) else codecs.getincrementaldecoder(decoder)()
            self._encoder = encoder if isinstance(encoder, codecs.IncrementalEncoder) else codecs.getincrementalencoder(encoder)()
            self._buf = ''
            self._eol = eol
            self._reachedEof = False
    
        def read(self, size=None):
            # file.read(None) raises TypeError in Python 2, so only pass
            # size through when one was actually given
            r = self._stream.read() if size is None else self._stream.read(size)
            raw = self._decoder.decode(r, size is None)
            return self._encoder.encode(raw)
    
        def __iter__(self):
            return self
    
        def __next__(self):
            if self._reachedEof:
                raise StopIteration()
            while True:
                line, eol, rest = self._buf.partition(self._eol)
                if eol == self._eol:
                    # a complete line is buffered: hand it back re-encoded
                    self._buf = rest
                    return self._encoder.encode(line + eol)
                raw = self._stream.read(1024)
                if raw == '':
                    # end of input: flush the decoder and emit what's left
                    self._decoder.decode(b'', True)
                    self._reachedEof = True
                    return self._encoder.encode(self._buf)
                self._buf += self._decoder.decode(raw)
    
        next = __next__  # Python 2 calls next(), Python 3 calls __next__()
    
        def close(self):
            return self._stream.close()
    
    with open('test.csv', 'rb') as f:
        sr = Recoder(f, 'utf-16', 'utf-8')
    
        for row in csv.reader(sr):
            print(row)
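
    The point of the recoding dance is that the Python 2.x csv module is only reliable on byte strings, so the Recoder presents the UTF-16 input to csv.reader as an iterator of UTF-8 encoded lines.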
    

    open and codecs.open require the file to start with a BOM. If it doesn't (or you're on Python 2.x), you can still convert it in memory, like this:

    try:
        from io import BytesIO
    except ImportError: # Python < 2.6
        from StringIO import StringIO as BytesIO
    import csv
    with open('utf16.csv', 'rb') as binf:
        c = binf.read().decode('utf-16').encode('utf-8')
    for line in csv.reader(BytesIO(c)):
        print(line) # do something with the line
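
    Note that this variant slurps the whole file into memory; for the multi-gigabyte files described in the question, the streaming Recoder above is the better fit.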
    
  • 2020-11-27 22:13

    The examples section of the Python 2.x csv module documentation shows how to handle other encodings.
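
    For reference, the pattern in that documentation example wraps csv.reader in a re-encoding iterator. Here is a condensed sketch adapted from the docs example (the class names UTF8Recoder and UnicodeReader come from the docs, not from this thread):

    import csv, codecs

    class UTF8Recoder:
        """Iterator that reads an encoded stream and re-encodes the input to UTF-8."""
        def __init__(self, f, encoding):
            self.reader = codecs.getreader(encoding)(f)

        def __iter__(self):
            return self

        def next(self):
            return self.reader.next().encode('utf-8')

    class UnicodeReader:
        """CSV reader for a file f in the given encoding; yields rows of unicode strings."""
        def __init__(self, f, dialect=csv.excel, encoding='utf-8', **kwds):
            f = UTF8Recoder(f, encoding)
            self.reader = csv.reader(f, dialect=dialect, **kwds)

        def next(self):
            row = self.reader.next()
            return [unicode(s, 'utf-8') for s in row]

        def __iter__(self):
            return self

    # e.g. for a UTF-16 file (opened in binary mode):
    # for row in UnicodeReader(open('utf16.csv', 'rb'), encoding='utf-16'):
    #     print row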

  • 2020-11-27 22:16

    I would strongly suggest that you recode your file(s) to UTF-8. Under the very likely condition that you don't have any Unicode characters outside the BMP, you can take advantage of the fact that UTF-16 is then a fixed-length (two-bytes-per-character) encoding, and read fixed-length blocks from your input file without worrying about block boundaries straddling characters.

    Step 1: Determine what encoding you actually have. Examine the first few bytes of your file:

    print repr(open('thefile.csv', 'rb').read(100))

    Four possible ways of encoding u'abc':

    \xfe\xff\x00a\x00b\x00c -> utf_16 (big-endian, with BOM)
    \xff\xfea\x00b\x00c\x00 -> utf_16 (little-endian, with BOM)
    \x00a\x00b\x00c         -> utf_16_be (no BOM)
    a\x00b\x00c\x00         -> utf_16_le (no BOM)

    If you have any trouble with this step, edit your question to include the results of the above print repr() call.
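
    If you want to script that inspection, here's a minimal Python 2 sketch of the same check; the helper name sniff_utf16_flavour is made up for illustration, and the no-BOM branches assume the first character is ASCII, as in the table above:

    def sniff_utf16_flavour(path):
        # examine the first two bytes, per the table above
        head = open(path, 'rb').read(2)
        if head in ('\xff\xfe', '\xfe\xff'):
            return 'utf_16'      # BOM present; the codec works out endianness
        if head[:1] == '\x00':
            return 'utf_16_be'   # no BOM, high byte first
        if head[1:2] == '\x00':
            return 'utf_16_le'   # no BOM, low byte first
        return None              # probably not UTF-16 at all

    print sniff_utf16_flavour('thefile.csv')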

    Step 2: Here's a Python 2.x recode-UTF-16*-to-UTF-8 script:

    import sys
    infname, outfname, enc = sys.argv[1:4]
    fi = open(infname, 'rb')
    fo = open(outfname, 'wb')
    BUFSIZ = 64 * 1024 * 1024  # even size: block boundaries never split a 2-byte BMP character
    first = True
    while 1:
        buf = fi.read(BUFSIZ)
        if not buf: break
        if first and enc == 'utf_16':
            # strip the BOM and pin down the endianness explicitly, so that
            # subsequent blocks decode without needing a BOM of their own
            bom = buf[:2]
            buf = buf[2:]
            enc = {'\xfe\xff': 'utf_16_be', '\xff\xfe': 'utf_16_le'}[bom]
            # KeyError means the file doesn't start with a valid BOM
        first = False
        fo.write(buf.decode(enc).encode('utf8'))
    fi.close()
    fo.close()
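
    Save it as, say, recode.py (the name is arbitrary) and invoke it as python recode.py infile.csv outfile.csv utf_16, passing the codec name you determined in Step 1 as the third argument.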
    

    Other matters:

    You say that your files are too big to read the whole file, recode and rewrite, yet you can open it in vi. Please explain.

    The <85> being treated as end of record is a bit of a worry. Looks like 0x85 is being recognised as NEL (C1 control code, NEWLINE). There is a strong possibility that the data was originally encoded in some legacy single-byte encoding where 0x85 has a meaning but has been transcoded to UTF-16 under the false assumption that the original encoding was ISO-8859-1 aka latin1. Where did the file originate? An IBM mainframe? Windows/Unix/classic Mac? What country, locale, language? You obviously think that the <85> is not meant to be a newline; what do you think that it means?

    Please feel free to send a copy of a cut-down file (that includes some of the <85> stuff) to sjmachin at lexicon dot net

    Update based on the 1-line sample data provided.

    This confirms my suspicions. Read the Wikipedia article on C0 and C1 control codes. Here's a quote from it:

    ... the C1 control characters ... are rarely used directly, except on specific platforms such as OpenVMS. When they turn up in documents, Web pages, e-mail messages, etc., which are ostensibly in an ISO-8859-n encoding, their code positions generally refer instead to the characters at that position in a proprietary, system-specific encoding such as Windows-1252 or the Apple Macintosh ("MacRoman") character set that use the codes provided for representation of the C1 set with a single 8-bit byte to instead provide additional graphic characters

    This code:

    s1 = '\xff\xfe1\x00,\x002\x00,\x00G\x00,\x00S\x00,\x00H\x00 \x00f\x00\xfc\x00r\x00 \x00e\x00 \x00\x96\x00 \x00m\x00 \x00\x85\x00,\x00,\x00I\x00\r\x00\n\x00'
    s2 = s1.decode('utf16')
    print 's2 repr:', repr(s2)
    from unicodedata import name
    from collections import Counter
    non_ascii = Counter(c for c in s2 if c >= u'\x80')
    print 'non_ascii:', non_ascii
    for c in non_ascii:
        print "from: U+%04X %s" % (ord(c), name(c, "<no name>"))
        c2 = c.encode('latin1').decode('cp1252')
        print "to:   U+%04X %s" % (ord(c2), name(c2, "<no name>"))
    
    # re-map only the C1 range (U+0080..U+009F) through cp1252
    s3 = u''.join(
        c.encode('latin1').decode('cp1252') if u'\x80' <= c < u'\xA0' else c
        for c in s2
        )
    print 's3 repr:', repr(s3)
    print 's3:', s3
    

    produces the following (Python 2.7.2 IDLE, Windows 7):

    s2 repr: u'1,2,G,S,H f\xfcr e \x96 m \x85,,I\r\n'
    non_ascii: Counter({u'\x85': 1, u'\xfc': 1, u'\x96': 1})
    from: U+0085 <no name>
    to:   U+2026 HORIZONTAL ELLIPSIS
    from: U+00FC LATIN SMALL LETTER U WITH DIAERESIS
    to:   U+00FC LATIN SMALL LETTER U WITH DIAERESIS
    from: U+0096 <no name>
    to:   U+2013 EN DASH
    s3 repr: u'1,2,G,S,H f\xfcr e \u2013 m \u2026,,I\r\n'
    s3: 1,2,G,S,H für e – m …,,I
    

    Which do you think is a more reasonable interpretation of \x96: SPA, i.e. Start of Protected Area (used by block-oriented terminals), or EN DASH?

    Looks like a thorough analysis of a much larger data sample is warranted. Happy to help.

  • 2020-11-27 22:28

    Just open your file with codecs.open, as in:

    import codecs, csv
    
    stream = codecs.open("yourfile.csv", encoding="utf-16")
    reader = csv.reader(stream)
    

    And work through your program with unicode strings, as you should do anyway when processing text.
