Regex for access log in Hive SerDe

生来不讨喜 2020-12-06 19:16

I want to extract (ip, requestUrl, timeStamp) from access logs and load them into a Hive database. One line from the access log looks like this:


66.249.68.6 - -          


        
3 Answers
  • 2020-12-06 19:50

    Double every backslash ('\\' instead of '\') and end the pattern with '.*' (this is important!):

    CREATE EXTERNAL TABLE access_log (
        `ip`                STRING,
        `time_local`        STRING,
        `method`            STRING,
        `uri`               STRING,
        `protocol`          STRING,
        `status`            STRING,
        `bytes_sent`        STRING,
        `referer`           STRING,
        `useragent`         STRING
    )
    ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
    WITH SERDEPROPERTIES (
        'input.regex'='^(\\S+) \\S+ \\S+ \\[([^\\[]+)\\] "(\\w+) (\\S+) (\\S+)" (\\d+) (\\d+) "([^"]+)" "([^"]+)".*'
    )
    STORED AS TEXTFILE
    LOCATION '/tmp/access_logs/';
    

    P.S. Hive 0.7.1
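
To see why the trailing '.*' matters: as far as I can tell, RegexSerDe only accepts a line when the pattern matches it in full, and a line that does not match comes back as NULL columns. The Python sketch below imitates that with `re.fullmatch`. It is only an approximation, since Hive uses Java regular expressions, and the sample line is hypothetical (the byte count, referer, user agent and the extra trailing `0.003` field are made up for illustration).

```python
import re

# The pattern from the answer as Hive sees it after unescaping '\\' to '\'.
PATTERN = r'^(\S+) \S+ \S+ \[([^\[]+)\] "(\w+) (\S+) (\S+)" (\d+) (\d+) "([^"]+)" "([^"]+)".*'
# Same pattern with the trailing '.*' removed, for comparison.
PATTERN_NO_TAIL = PATTERN[:-2]

# Column names taken from the CREATE TABLE statement in the answer.
COLUMNS = ["ip", "time_local", "method", "uri", "protocol",
           "status", "bytes_sent", "referer", "useragent"]

# Hypothetical log line; everything after the status code is invented.
SAMPLE = ('66.249.68.6 - - [14/Jan/2012:06:25:03 -0800] '
          '"GET /example.com HTTP/1.1" 200 708 "-" '
          '"Mozilla/5.0 (compatible; Googlebot/2.1)" 0.003')


def parse(line, pattern=PATTERN):
    """Return a column dict if the whole line matches, else None."""
    m = re.fullmatch(pattern, line)
    return dict(zip(COLUMNS, m.groups())) if m else None


row = parse(SAMPLE)                      # matches: '.*' absorbs the extra field
strict = parse(SAMPLE, PATTERN_NO_TAIL)  # no match: the trailing field is left over

print(row)
print(strict)
```

Without the '.*', any log line carrying extra fields after the user agent fails the full-line match, which in Hive would show up as a row of NULLs rather than an error.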

  • 2020-12-06 20:04

    I use Rubular to test my regexes. You can also use this expression:

    ([^ ]*) ([^ ]*) ([^ ]*) (?:-|\[([^\]]*)\]) ([^ \"]*|\"[^\"]*\") (-|[0-9]*)
    

    It produces the following capture groups:

    1.  66.249.68.6
    2.  -
    3.  -
    4.  14/Jan/2012:06:25:03 -0800
    5.  "GET /example.com HTTP/1.1"
    6.  200
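
Note that group 5 is the whole quoted request, not just the URL the question asks for. A minimal Python sketch of one way to get from these groups to (ip, requestUrl, timeStamp): the log line is reconstructed from the output listed above, the `extract` helper is hypothetical, and splitting the request on spaces is an assumption that only holds for well-formed requests.

```python
import re

# The expression from the answer, unchanged.
RUBULAR_PATTERN = re.compile(
    r'([^ ]*) ([^ ]*) ([^ ]*) (?:-|\[([^\]]*)\]) ([^ \"]*|\"[^\"]*\") (-|[0-9]*)'
)

# Log line reconstructed from the capture groups listed in the answer.
LINE = '66.249.68.6 - - [14/Jan/2012:06:25:03 -0800] "GET /example.com HTTP/1.1" 200'


def extract(line):
    """Return (ip, request_url, timestamp), or None if the line does not match."""
    m = RUBULAR_PATTERN.match(line)
    if m is None:
        return None
    ip, _ident, _user, timestamp, request, _status = m.groups()
    # The request group still carries its quotes: "METHOD URL PROTOCOL".
    parts = request.strip('"').split(" ")
    request_url = parts[1] if len(parts) >= 2 else None
    return ip, request_url, timestamp


result = extract(LINE)
print(result)
```

In Hive itself the same effect would need either extra capture groups inside the quoted request (as in the first answer) or a string function applied to the column afterwards.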
    
  • 2020-12-06 20:12

    This is not foolproof, but since the log file has a known format, the following should work. It is untested in Hive, but it works with grep -E, and with http://www.regexplanet.com/simple/index.html if you replace [^[] with [^\[] and [^]] with [^\]]. It assumes you only want the three values you specifically mentioned.

    "input.regex" = "([0-9]+\.[0-9]+\.[0-9]+\.[0-9]+)[^[]+\[([^]]+)\][^/]+([^ ]+).+"
    "output.format.string" = "%1$s %2$s %3$s"
    
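
Since this last pattern is untested in Hive, it may help to check it elsewhere first and then apply the first answer's advice about doubling backslashes. The Python sketch below does both, under a few assumptions: it uses the escaped bracket forms ([^\[] and [^\]]) that the answer mentions, the log line is reconstructed from the second answer's output, `to_hive_literal` is a hypothetical helper, and Python's `%s` formatting is only a rough analogue of the `%1$s %2$s %3$s` output format.

```python
import re

# The third answer's pattern with '[' and ']' escaped inside the character classes.
PATTERN = r'([0-9]+\.[0-9]+\.[0-9]+\.[0-9]+)[^\[]+\[([^\]]+)\][^/]+([^ ]+).+'

# Log line reconstructed from the second answer's listed output.
LINE = '66.249.68.6 - - [14/Jan/2012:06:25:03 -0800] "GET /example.com HTTP/1.1" 200'


def to_hive_literal(pattern):
    """Double every backslash so the pattern survives string-literal unescaping."""
    return pattern.replace("\\", "\\\\")


match = re.fullmatch(PATTERN, LINE)

# Rough analogue of "output.format.string" = "%1$s %2$s %3$s".
formatted = "%s %s %s" % match.groups()

# Text to paste into SERDEPROPERTIES, with backslashes doubled.
hive_property = '"input.regex" = "%s"' % to_hive_literal(PATTERN)

print(formatted)
print(hive_property)
```

The pattern still assumes the request URL is the first thing after the timestamp that starts with '/', so an absolute URL in the request line would break it.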