Error when reading avro files in python

夕颜 2021-01-20 17:49

I installed Apache Avro successfully in Python. Then I tried to read Avro files into Python, following the instructions below.

https://avro.apache.org/docs/1.8.1         


        
1 Answer
  • 2021-01-20 18:50

    You're using Windows and Python 3.

    • In Python 3, open opens files in text mode by default. This means that on subsequent reads, Python tries to decode the file's content from some charset to Unicode.

    • You did not specify a charset, so Python decodes the content as if it were encoded with the default one, which is charmap on Windows.

    • Obviously your Avro file is not encoded in charmap, so the decoding fails with an exception.

    • As far as I remember, Avro headers are binary content anyway, not text (not sure about that). So first, try NOT decoding the file when you open it:

    reader = DataFileReader(open("part-00000-of-01733.avro", 'rb'), DatumReader())

    (notice 'rb', binary mode)
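
    To see the difference between the two modes without Avro installed, here is a minimal sketch using only the standard library. The leading four bytes are the Avro container magic; the trailing bytes are made up for the demo, and cp1252 is passed explicitly as a stand-in for the Windows default so the failure reproduces on any platform.

```python
import os
import tempfile

# Avro container files start with the 4-byte magic b"Obj\x01".
# The trailing bytes are invented for this demo; 0x81 has no mapping in cp1252.
sample = b"Obj\x01\x81\x8d\x8f"

with tempfile.NamedTemporaryFile(suffix=".avro", delete=False) as tmp:
    tmp.write(sample)
    path = tmp.name

try:
    # Text mode: Python tries to decode the bytes and fails on 0x81.
    try:
        with open(path, "r", encoding="cp1252") as f:
            f.read()
        text_mode_error = None
    except UnicodeDecodeError as exc:
        text_mode_error = exc

    # Binary mode: the bytes come back untouched.
    with open(path, "rb") as f:
        raw = f.read()
finally:
    os.remove(path)

print("text mode failed:", text_mode_error is not None)
print("binary mode read:", raw)
```

    The text-mode read raises a 'charmap' decode error, while the binary read returns the exact bytes, which is what a binary format reader needs.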

    EDIT: For the next problem (AttributeError), you've been hit by a known bug that's not fixed in 1.8.1. Until the next version is out, you could do something like:

    import avro.schema
    from avro.datafile import DataFileReader, DataFileWriter, DataFileException, VALID_CODECS, SCHEMA_KEY
    from avro.io import DatumReader, DatumWriter
    from avro import io as avro_io
    
    
    class MyDataFileReader(DataFileReader):
        def __init__(self, reader, datum_reader):
            """Initializes a new data file reader.
    
            Args:
              reader: Open file to read from.
              datum_reader: Avro datum reader.
            """
            self._reader = reader
            self._raw_decoder = avro_io.BinaryDecoder(reader)
            self._datum_decoder = None  # Maybe reset at every block.
            self._datum_reader = datum_reader
    
            # read the header: magic, meta, sync
            self._read_header()
    
            # ensure codec is valid
            avro_codec_raw = self.GetMeta('avro.codec')
            if avro_codec_raw is None:
                self.codec = "null"
            else:
                self.codec = avro_codec_raw.decode('utf-8')
            if self.codec not in VALID_CODECS:
                raise DataFileException('Unknown codec: %s.' % self.codec)
    
            self._file_length = self._GetInputFileLength()
    
            # get ready to read
            self._block_count = 0
            # avro.schema is the module imported at the top of this snippet.
            self.datum_reader.writer_schema = (
                avro.schema.Parse(self.GetMeta(SCHEMA_KEY).decode('utf-8')))
    
    
    # Open in binary mode ('rb'); text mode would try to decode the Avro bytes.
    reader = MyDataFileReader(open("part-00000-of-01733.avro", "rb"), DatumReader())
    for user in reader:
        print(user)
    reader.close()
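
    The override above decodes metadata values as UTF-8 before using them. Here is a rough sketch of why that step matters in Python 3, assuming the metadata value arrives as bytes. KNOWN_CODECS and normalize_codec are hypothetical names for illustration only, not part of the avro package.

```python
# Hypothetical stand-in for the codec names a reader accepts; not avro's own constant.
KNOWN_CODECS = ("null", "deflate")


def normalize_codec(raw_value):
    """Return the codec name as str, defaulting to "null" when the metadata is absent."""
    if raw_value is None:
        return "null"
    if isinstance(raw_value, bytes):
        return raw_value.decode("utf-8")
    return raw_value


raw_codec = b"deflate"
# In Python 3, bytes never compare equal to str, so the raw value is not found.
print(raw_codec in KNOWN_CODECS)
# After decoding, the membership test behaves as intended.
print(normalize_codec(raw_codec) in KNOWN_CODECS)
```

    The same reasoning applies to the schema entry, which is decoded before being parsed.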
    

    It is very strange that such a stupid bug could make it into a release, though, and that's not a sign of code maturity!
