Error when reading avro files in python

问题

I installed Apache Avro successfully in Python. Then I try to read Avro files into Python following the instruction below.

https://avro.apache.org/docs/1.8.1/gettingstartedpython.html

I have a bunch of Avros in a directory which has already been set as the right path in Python. Here is my code:

import avro.schema
from avro.datafile import DataFileReader, DataFileWriter
from avro.io import DatumReader, DatumWriter

reader = DataFileReader(open("part-00000-of-01733.avro", "r"), DatumReader())
for user in reader:
   print (user)
reader.close()

However it returns this error:

Traceback (most recent call last):
  File "I:\DJ data\read avro.py", line 5, in <module>
    reader = DataFileReader(open("part-00000-of-01733.avro", "r"), DatumReader())
  File "I:\Program Files\lib\site-packages\avro_python3-1.8.1-py3.5.egg\avro\datafile.py", line 349, in __init__
    self._read_header()
  File "I:\Program Files\lib\site-packages\avro_python3-1.8.1-py3.5.egg\avro\datafile.py", line 459, in _read_header
    META_SCHEMA, META_SCHEMA, self.raw_decoder)
  File "I:\Program Files\lib\site-packages\avro_python3-1.8.1-py3.5.egg\avro\io.py", line 525, in read_data
    return self.read_record(writer_schema, reader_schema, decoder)
  File "I:\Program Files\lib\site-packages\avro_python3-1.8.1-py3.5.egg \avro\io.py", line 725, in read_record
    field_val = self.read_data(field.type, readers_field.type, decoder)
  File "I:\Program Files\lib\site-packages\avro_python3-1.8.1-py3.5.egg\avro\io.py", line 515, in read_data
    return self.read_fixed(writer_schema, reader_schema, decoder)
  File "I:\Program Files\lib\site-packages\avro_python3-1.8.1-py3.5.egg\avro\io.py", line 568, in read_fixed
    return decoder.read(writer_schema.size)
  File "I:\Program Files\lib\site-packages\avro_python3-1.8.1-py3.5.egg\avro\io.py", line 170, in read
    input_bytes = self.reader.read(n)
  File "I:\Program Files\lib\encodings\cp1252.py", line 23, in decode
    return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x90 in position 863: character maps to <undefined>

I am indeed aware that in the example in the instruction, a schema is created first. But what is a avsc file? How shall I create it and the corresponding schema in my case? Ideally, I would like to read Avro files into Python and save it into csv format in the disk or dataframe/list type in Python for further analysis. I'm using Python 3 on Windows 7.

EDITED I tried Stephane's code, and it returns a new error

Traceback (most recent call last):
  File "I:\DJ data\read avro.py", line 5, in <module>
    reader = DataFileReader(open("part-00000-of-01733.avro", "rb"), DatumReader())
  File "I:\Program Files\lib\site-packages\avro_python3-1.8.1-py3.5.egg\avro\datafile.py", line 352, in __init__
    self.codec = self.GetMeta('avro.codec').decode('utf-8')
AttributeError: 'NoneType' object has no attribute 'decode'

EDITED2: Stephane's code works in most cases, but sometimes it incurs AssertionError like this

Traceback (most recent call last):
File "I:\DJ data\read avro.py", line 42, in <module>
for user in reader:
File "I:\Program Files\lib\site-packages\avro_python3-1.8.1-py3.5.egg\avro\datafile.py", line 522, in __next__
datum = self.datum_reader.read(self.datum_decoder)
File "I:\Program Files\lib\site-packages\avro_python3-1.8.1-py3.5.egg\avro\io.py", line 480, in read
return self.read_data(self.writer_schema, self.reader_schema, decoder)
File "I:\Program Files\lib\site-packages\avro_python3-1.8.1-py3.5.egg\avro\io.py", line 525, in read_data
return self.read_record(writer_schema, reader_schema, decoder)
File "I:\Program Files\lib\site-packages\avro_python3-1.8.1-py3.5.egg\avro\io.py", line 725, in read_record
field_val = self.read_data(field.type, readers_field.type, decoder)
File "I:\Program Files\lib\site-packages\avro_python3-1.8.1-py3.5.egg\avro\io.py", line 523, in read_data
return self.read_union(writer_schema, reader_schema, decoder)
File "I:\Program Files\lib\site-packages\avro_python3-1.8.1-py3.5.egg\avro\io.py", line 689, in read_union
return self.read_data(selected_writer_schema, reader_schema, decoder)
File "I:\Program Files\lib\site-packages\avro_python3-1.8.1-py3.5.egg\avro\io.py", line 493, in read_data
return self.read_data(writer_schema, s, decoder)
File "I:\Program Files\lib\site-packages\avro_python3-1.8.1-py3.5.egg\avro\io.py", line 503, in read_data
return decoder.read_utf8()
File "I:\Program Files\lib\site-packages\avro_python3-1.8.1-py3.5.egg\avro\io.py", line 248, in read_utf8
input_bytes = self.read_bytes()
File "I:\Program Files\lib\site-packages\avro_python3-1.8.1-py3.5.egg\avro\io.py", line 241, in read_bytes
return self.read(nbytes)
File "I:\Program Files\lib\site-packages\avro_python3-1.8.1-py3.5.egg\avro\io.py", line 171, in read
assert (len(input_bytes) == n), input_bytes
AssertionError: b'BlackRock Group\n\n17 December 2015\n\nFORM 8.3\n\nPUBLIC OPENING POSITION DISCLOSURE/DEALING DISCLOSURE BY\n\nA PERSON WITH INTERESTS IN RELEVANT SECURITIES REPRESENTING 1% OR MORE\n\nRule 8.3 of the Takeover Code (the "Code") \n\n\n   1.         KEY INFORMATION \n \n (a) Full name of discloser:                                                                        BlackRock, Inc. \n-------------------------------------------------------------------------------------------------  ----------------- \n (b) Owner or controller of interests and short positions disclosed, if diffe

回答1:

You're using windows and Python 3.

in Python 3 by default open opens files in text mode. It means that when further read operations happen, Python will try to decode the content of the file from some charset to unicode.
you did not specify a default charset, so Python tries to decode the content as if such content was encoded using charmap (by default on windows).
obviously your avro file is not encoded in charmap, and the decoded fails with an exception
as far as i remember, avro headers anyway are binary content... not textual (not sure about that). so maybe first you should try NOT to decode the file with open:

reader = DataFileReader(open("part-00000-of-01733.avro", 'rb'), DatumReader())

(notice 'rb', binary mode)

EDIT: For the next problem (AttributeError), you've been hit by a known bug that's not fixed in 1.8.1. Until next version is out, you could just do something like:

import avro.schema
from avro.datafile import DataFileReader, DataFileWriter, VALID_CODECS, SCHEMA_KEY
from avro.io import DatumReader, DatumWriter
from avro import io as avro_io


class MyDataFileReader(DataFileReader):
    def __init__(self, reader, datum_reader):
        """Initializes a new data file reader.

        Args:
          reader: Open file to read from.
          datum_reader: Avro datum reader.
        """
        self._reader = reader
        self._raw_decoder = avro_io.BinaryDecoder(reader)
        self._datum_decoder = None  # Maybe reset at every block.
        self._datum_reader = datum_reader

        # read the header: magic, meta, sync
        self._read_header()

        # ensure codec is valid
        avro_codec_raw = self.GetMeta('avro.codec')
        if avro_codec_raw is None:
            self.codec = "null"
        else:
            self.codec = avro_codec_raw.decode('utf-8')
        if self.codec not in VALID_CODECS:
            raise DataFileException('Unknown codec: %s.' % self.codec)

        self._file_length = self._GetInputFileLength()

        # get ready to read
        self._block_count = 0
        self.datum_reader.writer_schema = (
            schema.Parse(self.GetMeta(SCHEMA_KEY).decode('utf-8')))


reader = MyDataFileReader(open("part-00000-of-01733.avro", "r"), DatumReader())
for user in reader:
    print (user)
reader.close()

It is very strange that such stupid bug could go to releases though, and that's not a sign a code maturity!

来源：https://stackoverflow.com/questions/40744193/error-when-reading-avro-files-in-python

标签

python-3.x

avro