Is there any way to compare two Avro files to see what differences exist in the data?

廉价感情. Submitted on 2020-01-23 03:55:26

Question


Ideally, I'd like something packaged like SAS proc compare that can give me:

  • The count of rows for each dataset

  • The count of rows that exist in one dataset, but not the other

  • Variables that exist in one dataset, but not the other

  • Variables that do not have the same format in the two files (I realize this would be rare for AVRO files, but would be helpful to know quickly without deciphering errors)

  • The total number of mismatching rows for each column, and a presentation of all the mismatches for a column or any 20 mismatches (whichever is smaller)

I've worked out one way to make sure the datasets are equivalent, but it is pretty inefficient. Let's assume we have two Avro files with 100 rows and 5 columns (one key and four float features). If we join the tables on the key and create new variables holding the difference between the matching features from the two datasets, then any non-zero difference indicates a mismatch in the data. From there it would be fairly easy to produce everything in the list of requirements above, but it seems like there may be more efficient ways.
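A minimal sketch of the join-and-difference idea described above, in plain Python. It assumes both files have already been decoded into lists of dicts (reading the Avro files themselves is not shown) and that the key column is called `id`. The function name `column_mismatches` and the sample rows are hypothetical, chosen only for illustration.

```python
def column_mismatches(rows1, rows2, key="id", max_examples=20):
    """Join two row lists on `key` and report per-column mismatches."""
    by_key1 = {row[key]: row for row in rows1}
    by_key2 = {row[key]: row for row in rows2}
    shared_keys = sorted(by_key1.keys() & by_key2.keys())

    report = {}
    for k in shared_keys:
        r1, r2 = by_key1[k], by_key2[k]
        # Only compare columns present in both rows, skipping the key itself.
        for col in sorted((r1.keys() & r2.keys()) - {key}):
            # Exact equality; for floats a tolerance (math.isclose) may fit better.
            if r1[col] != r2[col]:
                entry = report.setdefault(col, {"count": 0, "examples": []})
                entry["count"] += 1
                if len(entry["examples"]) < max_examples:
                    entry["examples"].append((k, r1[col], r2[col]))
    return report


# Hypothetical sample data: one key column and two float features.
rows1 = [{"id": 1, "f1": 0.5, "f2": 1.0}, {"id": 2, "f1": 0.7, "f2": 2.0}]
rows2 = [{"id": 1, "f1": 0.5, "f2": 1.5}, {"id": 2, "f1": 0.7, "f2": 2.0}]
report = column_mismatches(rows1, rows2)
print(report)
```

Each entry in the report holds the total mismatch count for a column plus at most `max_examples` (key, value1, value2) triples, which roughly mirrors the last requirement in the list.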


Answer 1:


An Avro schema is defined separately from the data: besides the Avro file with the data you should have a schema file, usually named something like *.avsc. (Avro container files also carry a copy of the writer's schema in their header.) This way your task can be split into 3 parts:

  1. Compare the schemas. This tells you which fields have different data types in the two files, which fields exist in only one of them, and so on. This task is very easy and can be done outside of Hadoop, for instance in Python:

    import json

    # Load both Avro schemas (they are plain JSON files).
    with open('schema1.avsc') as f:
        schema1 = json.load(f)
    with open('schema2.avsc') as f:
        schema2 = json.load(f)

    def print_cross(s1set, s2set, message):
        # Report every name present in s1set but missing from s2set.
        for s in s1set:
            if s not in s2set:
                print(message % s)

    s1names = {field['name'] for field in schema1['fields']}
    s2names = {field['name'] for field in schema2['fields']}
    print_cross(s1names, s2names, 'Field "%s" exists in table1 and does not exist in table2')
    print_cross(s2names, s1names, 'Field "%s" exists in table2 and does not exist in table1')

    def print_cross2(s1dict, s2dict, message):
        # Report every shared field whose type differs between the schemas.
        for s in s1dict:
            if s in s2dict and s1dict[s] != s2dict[s]:
                print(message % (s, s1dict[s], s2dict[s]))

    s1types = {field['name']: str(field['type']) for field in schema1['fields']}
    s2types = {field['name']: str(field['type']) for field in schema2['fields']}
    print_cross2(s1types, s2types, 'Field "%s" has type "%s" in table1 and type "%s" in table2')

Here's an example of the schemas:

{"namespace": "example.avro",
 "type": "record",
 "name": "User",
 "fields": [
     {"name": "name", "type": "string"},
     {"name": "favorite_number",  "type": ["int", "null"]},
     {"name": "favorite_color", "type": ["string", "null"]}
 ]
}

{"namespace": "example.avro",
 "type": "record",
 "name": "User",
 "fields": [
     {"name": "name", "type": "string"},
     {"name": "favorite_number",  "type": ["int"]},
     {"name": "favorite_color", "type": ["string", "null"]},
     {"name": "test", "type": "int"}
 ]
}

Here's the output (as printed by Python 3):

[localhost:temp]$ python compare.py 
Field "test" exists in table2 and does not exist in table1
Field "favorite_number" has type "['int', 'null']" in table1 and type "['int']" in table2

  2. If the schemas are equal (and you probably don't need to compare the data if the schemas are not equal), you can compare the data as follows. An easy way that covers every case: calculate an md5 hash for each row and join the two tables on the value of this hash. This gives you the number of rows that are the same in both tables, the number of rows specific to table1, and the number of rows specific to table2. This is easily done in Hive; here's the code of the MD5 UDF: https://gist.github.com/dataminelab/1050002

  3. To compare field by field, you have to know the primary key of the table; join the two tables on the primary key and compare the fields.
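The row-hash comparison described above can also be sketched outside Hive in plain Python. This is only an illustration under assumptions: the rows are taken to be already decoded into dicts, each row is hashed via a canonical JSON encoding (a choice made here, not something the Hive UDF prescribes), and the names `row_hash` and `compare_by_hash` are hypothetical.

```python
import hashlib
import json
from collections import Counter


def row_hash(row):
    """MD5 of a canonical JSON encoding of the row (keys sorted)."""
    encoded = json.dumps(row, sort_keys=True, default=str).encode("utf-8")
    return hashlib.md5(encoded).hexdigest()


def compare_by_hash(rows1, rows2):
    """Count rows common to both tables and rows specific to each one."""
    # Counters (multisets) keep duplicate rows from being collapsed.
    h1 = Counter(row_hash(r) for r in rows1)
    h2 = Counter(row_hash(r) for r in rows2)
    return {
        "in_both": sum((h1 & h2).values()),
        "only_in_table1": sum((h1 - h2).values()),
        "only_in_table2": sum((h2 - h1).values()),
    }


# Hypothetical sample rows shaped like the User schema above.
table1 = [{"name": "Ann", "favorite_number": 7}, {"name": "Bob", "favorite_number": 3}]
table2 = [{"name": "Ann", "favorite_number": 7}, {"name": "Bob", "favorite_number": 4}]
counts = compare_by_hash(table1, table2)
print(counts)
```

Using a multiset intersection rather than a plain set means that a row appearing twice in one table and once in the other is counted as one match and one extra row.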

Previously I've developed comparison functions for tables, and they usually looked like this:

  1. Check that both tables exist and are available
  2. Compare their schemas. If there are any mismatches in the schemas, stop
  3. If the primary key is specified:
    1. Join both tables on the primary key using a full outer join
    2. Calculate an md5 hash for each row
    3. Output the primary keys with a diagnosis (PK exists only in table1, PK exists only in table2, PK exists in both tables but the data does not match)
    4. Take a sample of 100 rows from each problematic class, join it with both tables and output it into a "mismatch example" table
  4. If the primary key is not specified:
    1. Calculate an md5 hash for each row
    2. Full outer join table1 with table2 on the md5 hash value
    3. Count the number of matching rows, the number of rows that exist in table1 only, and the number of rows that exist in table2 only
    4. Take a sample of 100 rows of each mismatch type and output it to the "mismatch example" table
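The primary-key branch of the outline above might look roughly like the following in plain Python. This is a sketch under assumptions, not the original implementation: rows are assumed to be dicts already loaded in memory, direct row equality stands in for the md5 comparison, and `diagnose_by_key` with its parameters is a hypothetical name.

```python
import random


def diagnose_by_key(rows1, rows2, key, sample_size=100, seed=0):
    """Classify primary keys the way a full outer join on `key` would."""
    t1 = {row[key]: row for row in rows1}
    t2 = {row[key]: row for row in rows2}
    classes = {
        "only_in_table1": sorted(t1.keys() - t2.keys()),
        "only_in_table2": sorted(t2.keys() - t1.keys()),
        "data_mismatch": sorted(k for k in t1.keys() & t2.keys() if t1[k] != t2[k]),
    }
    rng = random.Random(seed)
    samples = {}
    for label, keys in classes.items():
        picked = rng.sample(keys, min(sample_size, len(keys)))
        # Pair each sampled key with its row from each table (None if absent).
        samples[label] = [(k, t1.get(k), t2.get(k)) for k in sorted(picked)]
    counts = {label: len(keys) for label, keys in classes.items()}
    return counts, samples


# Hypothetical sample tables keyed on "id".
old = [{"id": 1, "v": 1.0}, {"id": 2, "v": 2.0}, {"id": 3, "v": 3.0}]
new = [{"id": 2, "v": 2.5}, {"id": 3, "v": 3.0}, {"id": 4, "v": 4.0}]
counts, samples = diagnose_by_key(old, new, key="id")
print(counts)
print(samples["data_mismatch"])
```

The `samples` dict plays the role of the "mismatch example" table: for each problem class it holds up to `sample_size` keys together with the corresponding row from each side.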

Developing and debugging such a function usually takes 4-5 business days.



Source: https://stackoverflow.com/questions/26351017/is-there-anyway-to-compare-two-avro-files-to-see-what-differences-exist-in-the-d
