Question
I have IIS log files stored in HDFS, but due to web server configuration some of the logs do not have all the columns, or the columns appear in a different order. I want to generate files with a common schema so I can define a Hive table over them.
Example good log:
#Fields: date time s-ip cs-method cs-uri-stem useragent
2013-07-16 00:00:00 10.1.15.8 GET /common/viewFile/1232 Mozilla/5.0+Chrome/27.0.1453.116
Example log with missing columns (cs-method and useragent missing):
#Fields: date time s-ip cs-uri-stem
2013-07-16 00:00:00 10.1.15.8 /common/viewFile/1232
The log with missing columns needs to be mapped to the full schema like this:
#Fields: date time s-ip cs-method cs-uri-stem useragent
2013-07-16 00:00:00 10.1.15.8 null /common/viewFile/1232 null
The bad logs can have any combination of columns enabled and in different order.
How can I map the available columns to the full schema according to the Fields row within the log file?
Edit: Normally I would approach this by defining my column schema as a dict mapping column name to index, i.e. col['date']=0, col['time']=1, etc. Then I would read the #Fields row from the file, parse out the enabled columns, and build a header dict mapping header name to column index in the file. For each remaining data row I know each value's header by index, map it to my column schema by header = column name, and generate a new row in the correct order, inserting missing columns with null data. My issue is that I do not understand how to do this within Hadoop, since each map executes alone, so how can I share the #Fields information with each map?
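The remapping the edit describes can be sketched in plain Python. This is a minimal illustration, not from the post: `FULL_SCHEMA`, `parse_header`, and `remap_line` are hypothetical names, and missing columns are emitted as the literal string "null" (which matches the desired output shown above).

```python
# Parse the #Fields header into a name -> index map, then emit every
# data row in the order of the full schema, filling missing columns
# with "null".
FULL_SCHEMA = ['date', 'time', 's-ip', 'cs-method', 'cs-uri-stem', 'useragent']

def parse_header(line):
    # "#Fields: date time s-ip cs-uri-stem" -> {'date': 0, 'time': 1, ...}
    names = line[len('#Fields:'):].split()
    return {name: i for i, name in enumerate(names)}

def remap_line(header, line):
    # Reorder the row's values into the full schema, inserting "null"
    # for any column this log file did not record.
    values = line.split(' ')
    return ' '.join(values[header[col]] if col in header else 'null'
                    for col in FULL_SCHEMA)

header = parse_header('#Fields: date time s-ip cs-uri-stem')
print(remap_line(header, '2013-07-16 00:00:00 10.1.15.8 /common/viewFile/1232'))
# -> 2013-07-16 00:00:00 10.1.15.8 null /common/viewFile/1232 null
```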
Answer 1:
You can use a CSV loader such as the ExampleCSVLoader in the script below to apply the header row to the columns, producing a map. From there you can use a UDF like:
myudf.py
#!/usr/bin/python
# Needed for CPython (streaming_python) UDFs; Jython injects
# outputSchema automatically, in which case this import can be dropped.
from pig_util import outputSchema

@outputSchema('newM:map[]')
def completemap(M):
    if M is None:
        return None
    to_add = ['A', 'D', 'F']
    for item in to_add:
        if item not in M:
            M[item] = None
    return M

@outputSchema('A:chararray, B:chararray, C:chararray, D:chararray, E:chararray, F:chararray')
def completemap_v2(M):
    if M is None:
        return (None, None, None, None, None, None)
    return (M.get('A', None),
            M.get('B', None),
            M.get('C', None),
            M.get('D', None),
            M.get('E', None),
            M.get('F', None))
These UDFs add the missing keys: completemap fills them into the map in place, while completemap_v2 emits a fixed six-column tuple.
Sample Input:
csv1.in csv2.in
------- ---------
A|B|C D|E|F
Hello|This|is PLEASE|WORK|FOO
FOO|BAR|BING OR|EVERYTHING|WILL
BANG|BOSH BE|FOR|NAUGHT
Sample Script:
A = LOAD 'tests/csv' USING myudfs.ExampleCSVLoader('\\|') AS (M:map[]);
B = FOREACH A GENERATE FLATTEN(myudf.completemap_v2(M));
Output:
B: {null::A: chararray,null::B: chararray,null::C: chararray,null::D: chararray,null::E: chararray,null::F: chararray}
(,,,,,)
(,,,PLEASE,WORK,FOO)
(,,,OR,EVERYTHING,WILL)
(,,,BE,FOR,NAUGHT)
(,,,,,)
(Hello,This,is,,,)
(FOO,BAR,BING,,,)
(BANG,BOSH,,,,)
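As an alternative to the Pig approach, the asker's "how do I share #Fields with each map" concern can be handled directly in a Hadoop Streaming mapper: since the #Fields line appears at the top of each log file, a mapper that sees whole files (e.g. files small enough not to be split, or a non-splittable input format) can capture the header in a variable and apply it to the rows that follow. This is a hedged sketch under that assumption; `mapper` and `FULL_SCHEMA` are illustrative names, not from the answer.

```python
import sys

FULL_SCHEMA = ['date', 'time', 's-ip', 'cs-method', 'cs-uri-stem', 'useragent']

def mapper(lines):
    # Remember the most recent #Fields header and use it to normalize
    # every subsequent data row to the full schema.
    header = None
    for line in lines:
        line = line.rstrip('\n')
        if line.startswith('#Fields:'):
            names = line[len('#Fields:'):].split()
            header = {name: i for i, name in enumerate(names)}
            continue
        if header is None or not line:
            continue  # skip blank lines and rows seen before any header
        values = line.split(' ')
        yield ' '.join(values[header[c]] if c in header else 'null'
                       for c in FULL_SCHEMA)

if __name__ == '__main__':
    for row in mapper(sys.stdin):
        print(row)
```

If the files can be split mid-file, the header will not reach every mapper this way; it would then have to be distributed separately, for example via the job configuration or the distributed cache.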
Source: https://stackoverflow.com/questions/18343215/manipulate-row-data-in-hadoop-to-add-missing-columns