Question
I have an Avro file (created by Java), and it seems to be some kind of compressed file for Hadoop/MapReduce. I want to 'unzip' (deserialize) it to a flat file, one record per row.
I learned that there is an Avro package for Python, and I installed it correctly. I then ran the example to read the Avro file. However, it failed with the error below, and I am wondering what goes wrong when reading even the simplest example. Can anyone help me interpret the error below?
>>> reader = DataFileReader(open("/tmp/Stock_20130812104524.avro", "r"), DatumReader())
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/.../python2.7/site-packages/avro/datafile.py", line 240, in __init__
    raise DataFileException('Unknown codec: %s.' % self.codec)
avro.datafile.DataFileException: Unknown codec: snappy.
By the way, if I run 'head' on the file, or open the first few lines of the Avro file in vi, I can see the schema definition together with some unreadable characters, probably the compressed content. The start of the raw Avro file looks like this:
bj^A^D^Tavro.codec^Lsnappy^Vavro.schemaØ${"type":"record","name":"Stoc...
I don't know whether the schema is necessary to read the Avro file, something like the following:
schema = avro.schema.parse(open("schema").read())
# include schema to do sth...
reader = DataFileReader(open("Stock_20130812104524.avro", "r"), DatumReader())
Thanks in advance.
Answer 1:
On macOS, snappy will not work unless the Xcode command line tools are installed. You can check by typing gcc at the command prompt to see whether they are installed. If not, type xcode-select --install to install them. Then installing python-snappy should work. Thanks Bin!
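Before installing anything, it can help to confirm which codec the file actually declares. The raw bytes quoted in the question are the Avro container header: the magic bytes `Obj\x01`, followed by a metadata map holding `avro.codec` and `avro.schema`. Because the schema is embedded there, `DatumReader()` does not need a separate schema file. The following is a minimal sketch in Python 3 using only the standard library; the function names (`read_long`, `read_header_metadata`, `avro_codec`) are made up for this illustration and are not part of the avro package.

```python
MAGIC = b"Obj\x01"  # first four bytes of every Avro container file


def read_long(stream):
    """Decode one Avro long (variable-length, zigzag encoded)."""
    shift = 0
    raw = 0
    while True:
        chunk = stream.read(1)
        if not chunk:
            raise EOFError("unexpected end of Avro header")
        byte = chunk[0]
        raw |= (byte & 0x7F) << shift
        if not byte & 0x80:  # high bit clear: last byte of this number
            break
        shift += 7
    return (raw >> 1) ^ -(raw & 1)  # undo zigzag encoding


def read_bytes(stream):
    """Read a length-prefixed byte string."""
    length = read_long(stream)
    return stream.read(length)


def read_header_metadata(stream):
    """Return the header metadata map as {str: bytes}."""
    if stream.read(4) != MAGIC:
        raise ValueError("not an Avro container file")
    meta = {}
    while True:
        count = read_long(stream)
        if count == 0:  # a zero count ends the map
            break
        if count < 0:  # negative count: a byte size follows, unused here
            count = -count
            read_long(stream)
        for _ in range(count):
            key = read_bytes(stream).decode("utf-8")
            meta[key] = read_bytes(stream)
    return meta


def avro_codec(path):
    """Return the codec name declared in the file header."""
    with open(path, "rb") as f:
        meta = read_header_metadata(f)
    # When the key is absent, the file is uncompressed ("null" codec).
    return meta.get("avro.codec", b"null").decode("utf-8")
```

For the file in the question, `avro_codec("/tmp/Stock_20130812104524.avro")` should report `snappy`, which matches the codec named in the exception.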
Answer 2:
Try pip install python-snappy, and make sure the snappy library itself is installed first.
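Once python-snappy imports cleanly, the reader from the question should be able to decode the file. Below is a minimal sketch of the "one record per row" step, assuming the avro package yields each record as a dict and that the file is snappy-compressed like the one in the question. The helper names (`snappy_available`, `write_records_as_lines`, `dump_avro_to_lines`) and the JSON-lines output format are choices made for this illustration. Note that the file is opened in binary mode ("rb") rather than "r".

```python
import importlib.util
import json


def snappy_available():
    """True if the python-snappy module ('snappy') can be imported."""
    return importlib.util.find_spec("snappy") is not None


def write_records_as_lines(records, out):
    """Write each record as one JSON line; return the number written."""
    count = 0
    for record in records:
        # default=str is a fallback for values json cannot encode (e.g. bytes)
        out.write(json.dumps(record, default=str) + "\n")
        count += 1
    return count


def dump_avro_to_lines(avro_path, out_path):
    """Deserialize an Avro file to a flat file, one record per row."""
    # Assumption: the file uses the snappy codec, as in the question.
    if not snappy_available():
        raise RuntimeError("python-snappy is not importable; install it first")
    # Imported here so the helpers above work without avro installed.
    from avro.datafile import DataFileReader
    from avro.io import DatumReader

    with open(avro_path, "rb") as source, open(out_path, "w") as out:
        # No schema argument: the writer schema is read from the file header.
        reader = DataFileReader(source, DatumReader())
        try:
            return write_records_as_lines(reader, out)
        finally:
            reader.close()
```

A call such as `dump_avro_to_lines("/tmp/Stock_20130812104524.avro", "stock.jsonl")` would then write the flat file; the output name `stock.jsonl` is only a placeholder.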
Answer 3:
wget http://www.us.apache.org/dist/avro/avro-1.7.5/java/avro-tools-1.7.5.jar
java -jar avro-tools-1.7.5.jar tojson input.avro > input
Source: https://stackoverflow.com/questions/18453026/read-avro-file-using-python