Parsing Snort Logs with PyParsing

别那么骄傲 2021-02-04 15:52

I'm having a problem parsing Snort logs with the pyparsing module.

The problem is with separating the Snort log (which has multiline entries, separated by a blank line

3 answers
  • 2021-02-04 16:37

    You have some regex unlearning to do, but hopefully this won't be too painful. The biggest culprit in your thinking is the use of this construct:

    (some_stuff + Regex(".*")
     + Suppress(string_representing_where_you_want_the_regex_to_stop))
    

    Each subparser within a pyparsing parser is pretty much standalone, and works sequentially through the incoming text. So the Regex term has no way to look ahead to the next expression to see where the '*' repetition should stop. In other words, the expression Regex(".*") is just going to read to the end of the line, since that is where ".*" stops unless the regex is compiled with re.DOTALL, and the Suppress that follows then finds nothing left to match.

    In pyparsing, the idea of "read everything up to the next expression" is implemented using SkipTo. Here is how your header line is currently written:

    header = (Suppress("[**] [") + Combine(integer + ":" + integer + ":" + integer)
              + Suppress("]") + Regex(".*") + Suppress("[**]"))
    

    Your ".*" problem gets resolved by changing it to:

    header = (Suppress("[**] [") + Combine(integer + ":" + integer + ":" + integer)
              + Suppress("]") + SkipTo("[**]") + Suppress("[**]"))
    

    Same thing for cls.
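
    To see the difference on a single line, here is a minimal sketch. The sample string is made up for illustration and is not real Snort output:

```python
from pyparsing import ParseException, Regex, SkipTo, Suppress

# Illustrative text only, not real Snort output.
line = "some alert text [**]"

# Regex(".*") consumes the rest of the line, terminator included,
# so the Suppress("[**]") that follows has nothing left to match.
greedy = Regex(".*") + Suppress("[**]")
try:
    greedy.parseString(line)
    greedy_matched = True
except ParseException:
    greedy_matched = False

# SkipTo stops just before the terminator and leaves it for the next expression.
bounded = SkipTo("[**]") + Suppress("[**]")
message = bounded.parseString(line)[0]

print(greedy_matched)
print(repr(message.strip()))
```

    The skipped text may carry surrounding whitespace, so stripping it before use is a reasonable precaution.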

    One last bug: your definition of date is short by one ':' + integer:

    date = (integer + "/" + integer + "-" + integer + ":" + integer + "."
            + Suppress(integer))
    

    should be:

    date = (integer + "/" + integer + "-" + integer + ":" + integer + ":"
            + integer + "." + Suppress(integer))
    

    I think those changes will be sufficient to start parsing your log data.
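
    As a quick check of the date fix, this sketch runs both definitions against the timestamp that appears in the sample output quoted elsewhere in this thread. The names short_date and full_date are mine, chosen only to keep the two versions side by side:

```python
from pyparsing import ParseException, Suppress, Word, nums

integer = Word(nums)

# Timestamp taken from the sample output quoted in this thread.
stamp = "08/03-07:30:02.233350"

# Original definition: only two ':'-separated fields before the ".".
short_date = (integer + "/" + integer + "-" + integer + ":" + integer + "."
              + Suppress(integer))

# Corrected definition: adds the missing ':' + integer for the seconds field.
full_date = (integer + "/" + integer + "-" + integer + ":" + integer + ":"
             + integer + "." + Suppress(integer))

try:
    short_date.parseString(stamp)
    short_ok = True
except ParseException:
    short_ok = False

tokens = full_date.parseString(stamp).asList()
print(short_ok)
print(tokens)
```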

    Here are some other style suggestions:

    You have a lot of repeated Suppress("]") expressions. I've started defining all my suppressable punctuation in a very compact and easy to maintain statement like this:

    LBRACK,RBRACK,LBRACE,RBRACE = map(Suppress,"[]{}")
    

    (expand to add whatever other punctuation characters you like). Now I can use these characters by their symbolic names, and I find the resulting code a little easier to read.

    You start off header with header = Suppress("[**] [") + .... I never like seeing spaces embedded in literals this way, as it bypasses some of the parsing robustness pyparsing gives you with its automatic whitespace skipping. If for some reason the space between "[**]" and "[" was changed to use 2 or 3 spaces, or a tab, then your suppressed literal would fail. Combine this with the previous suggestion, and header would begin with

    header = Suppress("[**]") + LBRACK + ...
    

    I know this is generated text, so variation in this format is unlikely, but it plays better to pyparsing's strengths.
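
    A small sketch of both suggestions together. The input with extra spaces is a hypothetical variant, not something Snort is known to emit:

```python
from pyparsing import ParseException, Suppress, Word, nums

LBRACK, RBRACK = map(Suppress, "[]")
integer = Word(nums)

# Hypothetical header start with three spaces between "[**]" and "[".
text = "[**]   [1:486:4]"

# The embedded space must match exactly, so this literal fails here.
rigid = Suppress("[**] [") + integer

# Separate tokens let pyparsing skip any whitespace between them.
flexible = Suppress("[**]") + LBRACK + integer

try:
    rigid.parseString(text)
    rigid_ok = True
except ParseException:
    rigid_ok = False

flexible_tokens = flexible.parseString(text).asList()
print(rigid_ok)
print(flexible_tokens)
```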

    Once you have your fields parsed out, start assigning results names to different elements within your parser. This will make it a lot easier to get the data out afterward. For instance, change cls to:

    cls = Optional(Suppress("[Classification:") + 
                 SkipTo(RBRACK)("classification") + RBRACK) 
    

    This will allow you to access the classification data using fields.classification.
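
    For example, here is a sketch that names two fields and reads them back. The input line is assembled from the field values shown in this thread's sample output, and the "priority" results name is my own addition:

```python
from pyparsing import Optional, SkipTo, Suppress, Word, nums

LBRACK, RBRACK = map(Suppress, "[]")
integer = Word(nums)

cls = Optional(Suppress("[Classification:")
               + SkipTo(RBRACK)("classification") + RBRACK)
# "priority" is an illustrative results name, not from the original post.
pri = Suppress("[Priority:") + integer("priority") + RBRACK

fields = (cls + pri).parseString("[Classification: Misc activity] [Priority: 3]")

# Named results are available as attributes (or via fields["classification"]).
classification = fields.classification.strip()
priority = fields.priority
print(classification)
print(priority)
```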

  • 2021-02-04 16:42

    Well, I don't know Snort or pyparsing, so apologies in advance if I say something stupid. I'm unclear as to whether the problem is with pyparsing being unable to handle the entries, or with you being unable to send them to pyparsing in the right format. If the latter, why not do something like this?

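    def logreader(path_to_file):
        """Yield each blank-line-separated log entry as a single string."""
        chunk = []
        with open(path_to_file) as the_file:
            for line in the_file:
                if line.strip():
                    # Non-blank line: still inside the current entry.
                    chunk.append(line)
                elif chunk:
                    # Blank line: the current entry is complete.
                    yield "".join(chunk)
                    chunk = []
        # The last entry may not be followed by a blank line.
        if chunk:
            yield "".join(chunk)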
    

    Of course, if you need to modify each chunk before sending it to pyparsing, you can do so before yielding it.
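
    A variant of the same idea that accepts any iterable of lines, so it can be tried without a file on disk. The name iter_entries and the placeholder text are assumptions for illustration only:

```python
import io


def iter_entries(lines):
    """Yield blank-line-separated entries from any iterable of lines."""
    chunk = []
    for line in lines:
        if line.strip():
            chunk.append(line)
        elif chunk:
            yield "".join(chunk)
            chunk = []
    # Flush the final entry if the input does not end with a blank line.
    if chunk:
        yield "".join(chunk)


# Placeholder text, not real Snort output.
sample = io.StringIO(
    "entry one, line 1\nentry one, line 2\n\nentry two, line 1\n"
)

# Each chunk can be adjusted (here: trailing newline removed) before parsing.
entries = [entry.rstrip("\n") for entry in iter_entries(sample)]
print(entries)
```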

  • 2021-02-04 16:52
    import pyparsing as pyp
    import itertools
    
    integer = pyp.Word(pyp.nums)
    ip_addr = pyp.Combine(integer+'.'+integer+'.'+integer+'.'+integer)
    
    def snort_parse(logfile):
        header = (pyp.Suppress("[**] [")
                  + pyp.Combine(integer + ":" + integer + ":" + integer)
                  + pyp.Suppress(pyp.SkipTo("[**]", include = True)))
        cls = (
            pyp.Suppress(pyp.Optional(pyp.Literal("[Classification:")))
            + pyp.Regex("[^]]*") + pyp.Suppress(']'))
    
        pri = pyp.Suppress("[Priority:") + integer + pyp.Suppress("]")
        date = pyp.Combine(
            integer+"/"+integer+'-'+integer+':'+integer+':'+integer+'.'+integer)
        src_ip = ip_addr + pyp.Suppress("->")
        dest_ip = ip_addr
    
        bnf = header+cls+pri+date+src_ip+dest_ip
    
        with open(logfile) as snort_logfile:
            for has_content, grp in itertools.groupby(
                    snort_logfile, key = lambda x: bool(x.strip())):
                if has_content:
                    tmpStr = ''.join(grp)
                    fields = bnf.searchString(tmpStr)
                    print(fields)
    
    snort_parse('snort_file')
    

    yields

    [['1:486:4', 'Misc activity', '3', '08/03-07:30:02.233350', '172.143.241.86', '63.44.2.33']]
    