Hive RegexSerDe Multiline Log matching

末鹿安然 提交于 2021-02-07 05:52:12

问题


I am looking for a regex that can be fed to a "create external table" statement of Hive QL in the form of

"input.regex"="the regex goes here"

The condition is that the logs in the files that the RegexSerDe must be reading are of the following form:

2013-02-12 12:03:22,323 [DEBUG] 2636hd3e-432g-dfg3-dwq3-y4dsfq3ew91b Some message that can contain any special character, including linebreaks. This one does not have a linebreak. It just has spaces on the same line.
2013-02-12 12:03:24,527 [DEBUG] 265y7d3e-432g-dfg3-dwq3-y4dsfq3ew91b Some other message that can contain any special character, including linebreaks. This one does not have one either. It just has spaces on the same line.
2013-02-12 12:03:24,946 [ERROR] 261rtd3e-432g-dfg3-dwq3-y4dsfq3ew91b Some message that can contain any special character, including linebreaks.
 This is a special one.
 This has a message that is multi-lined.
 This is line number 4 of the same log.
 Line 5.
2013-02-12 12:03:24,988 [INFO] 2632323e-432g-dfg3-dwq3-y4dsfq3ew91b Another 1-line log
2013-02-12 12:03:25,121 [DEBUG] 263tgd3e-432g-dfg3-dwq3-y4dsfq3ew91b Yet another one line log.

I am using the following create external table code:

CREATE EXTERNAL TABLE applogs (logdatetime STRING, logtype STRING, requestid STRING, verbosedata STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
WITH SERDEPROPERTIES
(
"input.regex" = "(\\A[[0-9:-] ]{19},[0-9]{3}) (\\[[A-Z]*\\]) ([0-9a-z-]*) (.*)?(?=(?:\\A[[0-9:-] ]{19},[0-9]|\\z))",
"output.format.string" = "%1$s \\[%2$s\\] %3$s %4$s"
)
STORED AS TEXTFILE
LOCATION 'hdfs:///logs-application';

Here's the thing:

It is able to pull all the FIRST LINES of each log. But not the other lines of logs that have more than one lines. I tried all links, replaced \z with \Z at the end, replaced \A with ^ and \Z or \z with $, nothing worked. Am I missing something in the output.format.string's %4$s? or am I not using the regex properly?

What the regex does:

It matches the timestamp first, followed by the log type (DEBUG or INFO or whatever), then the ID (mix of lower case alphabets, numbers and hyphens) followed by ANYTHING, till the next timestamp is found, or till the end of input is found to match the last log entry. I also tried adding the /m at the end, in which case, the table generated has all NULL values.


回答1:


There seem to be a number of issues with your regex.

First, remove your double square brackets.

Second, \A and \Z/\z are to match the beginning and end of the input, not just a line. Change \A to ^ to match start-of-line but don't change \z to $ as you do actually want to match end-of-input in this case.

Third, you want to match (.*?), not (.*)?. The first pattern is ungreedy, whereas the second pattern is greedy but optional. It should have matched your entire input to the end as you allowed it to be followed by end-of-input.

Fourth, . does not match newlines. You can use (\s|\S) instead, or ([x]|[^x]), etc., any pair of complimentary matches.

Fifth, if it was giving you single line matches with \A and \Z/\z then the input was single lines also as you were anchoring the whole string.

I would suggest trying to match just \n, if nothing matches then newlines are not included.

You can't add /m to the end as the regex does not include delimiters. It will try to match the literal characters /m instead which is why you got no match.

If it was going to work the regex you want would be:

"^([0-9:- ]{19},[0-9]{3}) (\\[[A-Z]*\\]) ([0-9a-z-]*) ([\\s\\S]*?)(?=\\r?\\n([0-9:-] ){19},[0-9]|\\r?\\z)"

Breakdown:

^([0-9:- ]{19},[0-9]{3}) 

Match start of newline, and 19 characters that are digits, :, - or plus a comma, three digits and a space. Capture all but the final space (the timestamp).

(\\[[A-Z]*\\]) 

Match a literal [, any number of UPPERCASE letters, even none, a literal ] and a space. Capture all but the final space (the error level).

([0-9a-z-]*) 

Match any number of digits, lowercase letters or - and a space. Capture all but the final space (the message id).

([\\s\\S]*?)(?=\\r?\\n([0-9:-] ){19},[0-9]|\\r?\\Z)

Match any whitespace or non-whitespace character (any character) but match ungreedy *?. Stop matching when a new record or end of input (\Z) is immediately ahead. In this case you don't want to match end of line as once again, you will only get one line in your output. Capture all but the final (the message text). The \r?\n is to skip the final newline at the end of your message, as is the \r?\Z. You could also write \r?\n\z Note: capital \Z includes the final newline at the end of the input if there is one. Lowercase \z matches at end of input only, not newline before end of input. I have added \z? just in case you have to deal with Windows line endings, however, I don't believe this should be necessary.

However, I suspect that unless you can feed the whole file in at once instead of line-by-line that this will not work either.

Another simple test you can try is:

"^([\\s\\S]+)^\\d"

If it works it will match any full line followed by a line digit on the next line (the first digit of your timestamp).




回答2:


Following Java regex may help:

(\d{4}-\d{1,2}-\d{1,2}\s+\d{1,2}:\d{1,2}:\d{1,2},\d{1,3})\s+(\[.+?\])\s+(.+?)\s+([\s\S\s]+?)(?=\d{4}-\d{1,2}-\d{1,2}|\Z)

Breakdown:

  • 1st Capturing group (\d{4}-\d{1,2}-\d{1,2}\s+\d{1,2}:\d{1,2}:\d{1,2},\d{1,3})
  • 2nd Capturing group (\[.+?\])
  • 3rd Capturing group (.+?)
  • 4th Capturing group ([\s\S]+?).

(?=\d{4}-\d{1,2}-\d{1,2}|\Z) Positive Lookahead - Assert that the regex below can be matched.1st Alternative: \d{4}-\d{1,2}-\d{1,2}.2nd Alternative: \Z assert position at end of the string.

Reference http://regex101.com/




回答3:


I don't know much about Hive, but the following regex, or a variation formatted for Java strings, might work:

(\d{4}-\d\d-\d\d \d\d:\d\d:\d\d,\d+) \[([a-zA-Z_-]+)\] ([\w-]+) ((?:[^\n\r]+)(?:[\n\r]{1,2}\s[^\n\r]+)*)

This can be seen matching your sample data here:

http://rubular.com/r/tQp9iBp4JI

A breakdown:

  • (\d{4}-\d\d-\d\d \d\d:\d\d:\d\d,\d+) The date and time (capture group 1)
  • \[([a-zA-Z_-]+)\] The log level (capture group 2)
  • ([\w-]+) The request id (capture group 3)
  • ((?:[^\n\r]+)(?:[\n\r]{1,2}\s[^\n\r]+)*) The potentially multi-line message (capture group 4)

The first three capture groups are pretty simple.

The last one might is a little odd, but it's working on rubular. A breakdown:

(                       Capture it as one group
    (?:[^\n\r]+)        Match to the end of the line, dont capture
    (?:                 Match line by line, after the first, but dont capture
        [\n\r]{1,2}     Match the new-line
        \s              Only lines starting with a space (this prevents new log-entries from matching)
        [^\n\r]+        Match to the end of the line            
    )*                  Match zero or more of these extra lines
)

I used [^\n\r] instead of the . because it looks like RegexSerDe lets the . match new lines (link):

// Excerpt from https://github.com/apache/hive/blob/trunk/contrib/src/java/org/apache/hadoop/hive/contrib/serde2/RegexSerDe.java#L101
if (inputRegex != null) {
  inputPattern = Pattern.compile(inputRegex, Pattern.DOTALL
      + (inputRegexIgnoreCase ? Pattern.CASE_INSENSITIVE : 0));
} else {
  inputPattern = null;
}

Hope this helps.



来源:https://stackoverflow.com/questions/17935200/hive-regexserde-multiline-log-matching

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!