Monitoring nginx (500's) with telegraf

Question

I'd like to understand what my nginx instance is returning and who it's asking to handle requests. What fraction of my queries get handled by rails, what fraction are handled directly by nginx, what fraction are heading off to nginx_status, etc.

Similarly, I'd also like to understand things like how many of which HTTP result codes I'm returning. If there's a peak in 500's, I'd like to know.

The telegraf nginx plugin provides some very basic stats on nginx but no more. I've seen some vaguely complicated solutions for result codes that basically involve setting up log monitoring infrastructure. This data seems so fundamental I feel I must be missing something.

I've seen nothing that will help me understand who is actually handling queries (i.e., which handler).

All of this is interesting because (1) secular growth in handler dispatches can indicate scaling issues in clearer ways than simple load on the handler machines, and (2) peaks in anything can alert to problems.

Any pointers?

Answer 1:

You can let telegraf collect your nginx access logs. Then you can analyze how many requests had which HTTP status code (1xx, 2xx, etc).

Add this to your /etc/telegraf/telegraf.conf, and make sure telegraf has read access to the log file; it will not tell you if it does not:

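[[inputs.logparser]]
   # nginx access log to read; telegraf needs read permission on it
   files = ["/var/log/nginx/access.log"]
   # also parse what is already in the file, not only newly appended lines
   from_beginning = true
   name_override = "nginx_access_log"

   [inputs.logparser.grok]
     # built-in pattern for the combined log format
     patterns = ["%{COMBINED_LOG_FORMAT}"]
     measurement = "nginx_access_log"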
Answer 2:

If this is still relevant, here is my own config. It is based on the telegraf tail plugin.

Add a special log format to the http section of the nginx config:

    log_format codes_combined 'code=$status ts=$time_iso8601';

Use this format in the server section:

    access_log /var/log/nginx/codes.log codes_combined;

Edit /etc/telegraf/telegraf.conf:

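[[inputs.tail]]
    # the dedicated status-code log written by nginx
    files = ["/var/log/nginx/codes.log"]
    # each line consists of key=value pairs (code and ts)
    data_format = "logfmt"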

After restarting nginx and telegraf, the data should be available in Grafana. I configured a new graph with these queries:

SELECT count("code") AS code_2xx FROM "tail" WHERE $timeFilter AND code >= 200 AND code < 300 AND code <> 204 GROUP BY time($__interval)
SELECT count("code") AS code_3xx FROM "tail" WHERE $timeFilter AND code >= 300 AND code < 400 GROUP BY time($__interval)
SELECT count("code") AS code_4xx FROM "tail" WHERE $timeFilter AND code >= 400 AND code < 500 GROUP BY time($__interval)
SELECT count("code") AS code_5xx FROM "tail" WHERE $timeFilter AND code >= 500 GROUP BY time($__interval)
SELECT count("code") AS code_204 FROM "tail" WHERE $timeFilter AND code = 204 GROUP BY time($__interval)

Don't forget to check /etc/logrotate.d/nginx. The permissions should look like this, so that newly created log files stay readable by telegraf after rotation:

create 0644 www-data adm

Source: https://stackoverflow.com/questions/49450336/monitoring-nginx-500s-with-telegraf

Tags

nginx

nginx-reverse-proxy

Telegraf

telegraf-inputs-plugin

nginx-status