Parsing Apache log files

你的背包 2020-12-01 01:39

I just started learning Python and would like to read an Apache log file and put parts of each line into different lists.

line from the file

6 Answers
  • 2020-12-01 02:05
    import re
    
    
    HOST = r'^(?P<host>.*?)'
    SPACE = r'\s'
    IDENTITY = r'\S+'
    USER = r'\S+'
    TIME = r'(?P<time>\[.*?\])'
    REQUEST = r'\"(?P<request>.*?)\"'
    STATUS = r'(?P<status>\d{3})'
    SIZE = r'(?P<size>\S+)'
    
    REGEX = HOST+SPACE+IDENTITY+SPACE+USER+SPACE+TIME+SPACE+REQUEST+SPACE+STATUS+SPACE+SIZE+SPACE
    
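    def parser(log_line):
        """Return (host, time, request, status, size), or None if the line does not match."""
        match = re.search(REGEX, log_line)
        if match is None:
            return None
        return (match.group('host'),
                match.group('time'),
                match.group('request'),
                match.group('status'),
                match.group('size'))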
    
    
    logLine = """180.76.15.30 - - [24/Mar/2017:19:37:57 +0000] "GET /shop/page/32/?count=15&orderby=title&add_to_wishlist=4846 HTTP/1.1" 404 10202 "-" "Mozilla/5.0 (compatible; Baiduspider/2.0; +http://www.baidu.com/search/spider.html)"""
    result = parser(logLine)
    print(result)
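
The question asks for parts of each line to end up in different lists. A minimal sketch of that step follows, assuming a close variant of the named-group pattern above (the host group uses `\S+` instead of `.*?`); the names `LOG_PATTERN`, `collect_fields`, and `sample_lines` are made up for illustration and the sample lines are shortened.

```python
import re

# Close variant of the named-group pattern above, compiled once.
LOG_PATTERN = re.compile(
    r'^(?P<host>\S+)\s\S+\s\S+\s'
    r'(?P<time>\[.*?\])\s'
    r'"(?P<request>.*?)"\s'
    r'(?P<status>\d{3})\s'
    r'(?P<size>\S+)'
)

FIELDS = ('host', 'time', 'request', 'status', 'size')


def collect_fields(lines):
    """Return a dict with one list per field, skipping lines that do not match."""
    columns = {name: [] for name in FIELDS}
    for line in lines:
        match = LOG_PATTERN.search(line)
        if match is None:
            continue  # malformed or blank line
        for name in FIELDS:
            columns[name].append(match.group(name))
    return columns


# Hypothetical sample input; in practice this would be an open file object.
sample_lines = [
    '180.76.15.30 - - [24/Mar/2017:19:37:57 +0000] "GET /shop/ HTTP/1.1" 404 10202 "-" "Baiduspider/2.0"',
    'not a log line',
    '172.16.0.3 - - [25/Sep/2002:14:04:19 +0200] "GET / HTTP/1.1" 401 - "" "Mozilla/5.0"',
]
columns = collect_fields(sample_lines)
print(columns['host'])
print(columns['status'])
```

Because an open text file yields its lines when iterated, `collect_fields(open(path))` should work the same way on a real log.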
    
  • 2020-12-01 02:05

    Add this to httpd.conf to write the Apache access log as JSON.

    LogFormat "{\"time\":\"%t\", \"remoteIP\" :\"%a\", \"host\": \"%V\", \"request_id\": \"%L\", \"request\":\"%U\", \"query\" : \"%q\", \"method\":\"%m\", \"status\":\"%>s\", \"userAgent\":\"%{User-agent}i\", \"referer\":\"%{Referer}i\" }" json_log
    
    CustomLog /var/log/apache_access_log json_log
    CustomLog "|/usr/bin/python -u apacheLogHandler.py" json_log
    

    The access log is now written in JSON format. Use the Python code below to follow the log as new lines are appended.

    apacheLogHandler.py

    import time

    # Path taken from the CustomLog directive above.
    f = open('/var/log/apache_access_log', 'r')
    for line in f:  # read all lines already in the file
        print(line.strip())

    # keep waiting forever for more lines
    while True:
        line = f.readline()  # just read more
        if line:  # if you got something...
            print('got data:', line.strip())
        else:
            time.sleep(1)  # sleep only when there is no new data
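
The script above only prints each line. A minimal sketch of the decoding step follows, assuming each line is one JSON object shaped like the LogFormat above; `parse_json_log_line` and the sample values are made up for illustration. One caveat to keep in mind: Apache may escape non-printable characters in logged values as `\xhh`, which is not a valid JSON escape, so some lines can fail to decode. The sketch returns None for those instead of raising.

```python
import json


def parse_json_log_line(line):
    """Decode one JSON access-log line; return None if it is blank or not valid JSON."""
    line = line.strip()
    if not line:
        return None
    try:
        return json.loads(line)
    except json.JSONDecodeError:
        return None


# Hypothetical line in the shape produced by the LogFormat above.
sample = (
    '{"time":"[24/Mar/2017:19:37:57 +0000]", "remoteIP" :"180.76.15.30", '
    '"host": "example.com", "request":"/shop/page/32/", "query" : "?count=15", '
    '"method":"GET", "status":"404", "userAgent":"Baiduspider/2.0", "referer":"-" }'
)
entry = parse_json_log_line(sample)
print(entry['remoteIP'], entry['method'], entry['request'], entry['status'])
```

Every value arrives as a string because the LogFormat quotes each field, so a status comparison needs `int(entry['status'])` or a string literal.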
    
  • 2020-12-01 02:08

    Use a regular expression to split a row into separate "tokens":

    >>> row = """172.16.0.3 - - [25/Sep/2002:14:04:19 +0200] "GET / HTTP/1.1" 401 - "" "Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.1) Gecko/20020827" """
    >>> import re
    >>> list(map(''.join, re.findall(r'\"(.*?)\"|\[(.*?)\]|(\S+)', row)))
    ['172.16.0.3', '-', '-', '25/Sep/2002:14:04:19 +0200', 'GET / HTTP/1.1', '401', '-', '', 'Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.1) Gecko/20020827']
    

    Another solution is to use a dedicated tool, e.g. http://pypi.python.org/pypi/pylogsparser/0.4
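
Once a row is split into tokens as in the `re.findall` example, the timestamp token is still a plain string. A small sketch of converting it with the standard library follows; `APACHE_TIME_FORMAT` and `parse_apache_time` are illustrative names, and the format string assumes the default English month abbreviations (`%b` depends on the locale).

```python
from datetime import datetime

# Layout of the default Apache %t timestamp, without the surrounding brackets.
APACHE_TIME_FORMAT = '%d/%b/%Y:%H:%M:%S %z'


def parse_apache_time(token):
    """Convert a token such as '25/Sep/2002:14:04:19 +0200' to an aware datetime."""
    # Brackets are stripped in case the caller kept them on the token.
    return datetime.strptime(token.strip('[]'), APACHE_TIME_FORMAT)


stamp = parse_apache_time('25/Sep/2002:14:04:19 +0200')
print(stamp.isoformat())  # 2002-09-25T14:04:19+02:00
```

The result carries the UTC offset from the log line, so timestamps from servers in different zones can be compared directly.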

  • 2020-12-01 02:14

    I have created a Python library which does just that: apache-log-parser.

    >>> from pprint import pprint
    >>> import apache_log_parser
    >>> line_parser = apache_log_parser.make_parser("%h <<%P>> %t %Dus \"%r\" %>s %b  \"%{Referer}i\" \"%{User-Agent}i\" %l %u")
    >>> log_line_data = line_parser('127.0.0.1 <<6113>> [16/Aug/2013:15:45:34 +0000] 1966093us "GET / HTTP/1.1" 200 3478  "https://example.com/" "Mozilla/5.0 (X11; U; Linux x86_64; en-US; rv:1.9.2.18)" - -')
    >>> pprint(log_line_data)
    {'pid': '6113',
     'remote_host': '127.0.0.1',
     'remote_logname': '-',
     'remote_user': '',
     'request_first_line': 'GET / HTTP/1.1',
     'request_header_referer': 'https://example.com/',
     'request_header_user_agent': 'Mozilla/5.0 (X11; U; Linux x86_64; en-US; rv:1.9.2.18)',
     'response_bytes_clf': '3478',
     'status': '200',
     'time_received': '[16/Aug/2013:15:45:34 +0000]',
     'time_us': '1966093'}
    
  • 2020-12-01 02:19

    This is a job for regular expressions.

    For example:

    line = '172.16.0.3 - - [25/Sep/2002:14:04:19 +0200] "GET / HTTP/1.1" 401 - "" "Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.1) Gecko/20020827"'
    regex = r'([\d.]+) - - \[(.*?)\] "(.*?)" (\d+) - "(.*?)" "(.*?)"'
    
    import re
    print(re.match(regex, line).groups())
    

    The output would be a tuple with 6 pieces of information from the line (specifically, the groups within parentheses in that pattern):

    ('172.16.0.3', '25/Sep/2002:14:04:19 +0200', 'GET / HTTP/1.1', '401', '', 'Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.1) Gecko/20020827')
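
The pattern above hardcodes `- -` for the identity and user fields and `-` for the size, so it will not match a line that carries a user name or a byte count. A hedged sketch of a more tolerant variant follows, assuming the common combined log layout; `COMBINED`, `parse_combined`, and both sample lines are illustrative. It still does not handle escaped quotes inside the request or user agent.

```python
import re

# host ident user [time] "request" status size "referer" "user agent"
COMBINED = re.compile(
    r'(\S+) (\S+) (\S+) \[(.*?)\] "(.*?)" (\d{3}) (\S+) "(.*?)" "(.*?)"'
)


def parse_combined(line):
    """Return a 9-tuple of fields, or None if the line does not match."""
    match = COMBINED.match(line)
    return match.groups() if match else None


with_dash = '172.16.0.3 - - [25/Sep/2002:14:04:19 +0200] "GET / HTTP/1.1" 401 - "" "Mozilla/5.0"'
with_size = ('127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] "GET /a.gif HTTP/1.0" '
             '200 2326 "http://www.example.com/start.html" "Mozilla/4.08"')

print(parse_combined(with_dash))
print(parse_combined(with_size))
```

Returning None for a non-matching line lets the caller count or log bad rows instead of hitting an AttributeError on `.groups()`.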
    
  • 2020-12-01 02:19

    RegEx seemed extreme and problematic considering the simplicity of the format, so I wrote this little splitter which others may find useful as well:

    def apache2_logrow(s):
        ''' Fast split on Apache2 log lines
    
        http://httpd.apache.org/docs/trunk/logs.html
        '''
        row = [ ]
        qe = qp = None # quote end character (qe) and quote parts (qp)
        for s in s.replace('\r','').replace('\n','').split(' '):
            if qp:
                qp.append(s)
            elif '' == s: # blanks
                row.append('')
            elif '"' == s[0]: # begin " quote "
                qp = [ s ]
                qe = '"'
            elif '[' == s[0]: # begin [ quote ]
                qp = [ s ]
                qe = ']'
            else:
                row.append(s)
    
            l = len(s)
            if l and qe == s[-1]: # end quote
                if l == 1 or s[-2] != '\\': # don't end on escaped quotes
                    row.append(' '.join(qp)[1:-1].replace('\\'+qe, qe))
                    qp = qe = None
        return row
    