Monitoring nginx (500's) with telegraf

Submitted by 梦想的初衷 on 2019-12-08 06:19:39

Question


I'd like to understand what my nginx instance is returning and who it's asking to handle requests. What fraction of my requests are handled by Rails, what fraction are handled directly by nginx, what fraction go to nginx_status, etc.?

Similarly, I'd like to know how many responses I'm returning with each HTTP status code. If there's a spike in 500s, I want to know about it.

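The telegraf nginx plugin provides only very basic stats (the stub_status counters: active connections, accepted and handled connections, total requests, and reading/writing/waiting connections) and nothing more. I've seen some vaguely complicated solutions for status codes that basically involve setting up log-monitoring infrastructure. This data seems so fundamental that I feel I must be missing something.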

I've seen nothing that will help me understand who is actually handling queries (i.e., which handler).

All of this is interesting because (1) secular (long-term) growth in handler dispatches can reveal scaling issues more clearly than raw load on the handler machines, and (2) spikes in any of these numbers can alert me to problems.

Any pointers?


Answer 1:


You can have telegraf collect your nginx access logs and then analyze how many requests returned each class of HTTP status code (1xx, 2xx, etc.).

Add this to your /etc/telegraf/telegraf.conf, and make sure telegraf has read access to the log file (it won't warn you if it doesn't):

[[inputs.logparser]]
   files = ["/var/log/nginx/access.log"]
   from_beginning = true
   name_override = "nginx_access_log"

   [inputs.logparser.grok]
     patterns = ["%{COMBINED_LOG_FORMAT}"]
     measurement = "nginx_access_log"
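To see roughly what the grok pattern pulls out of each line, here is a minimal Python sketch (not telegraf's implementation) that parses nginx's default combined log format with a hand-written regex and tallies status classes. The regex, the helper name status_classes and the sample lines are my own illustration; the sample lines are made up, and real lines with escaped quotes in the request or user agent would need a more careful pattern. As far as I know, telegraf's built-in COMBINED_LOG_FORMAT pattern stores the status as a resp_code tag, but check the logparser docs for your telegraf version before writing queries against it.

```python
import re
from collections import Counter

# Assumption: hand-written approximation of nginx's default "combined" format,
#   $remote_addr - $remote_user [$time_local] "$request" $status
#   $body_bytes_sent "$http_referer" "$http_user_agent"
# This is NOT telegraf's grok pattern, just an illustration of the same fields.
COMBINED = re.compile(
    r'(?P<client>\S+) \S+ (?P<user>\S+) \[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<bytes>\d+|-) '
    r'"(?P<referrer>[^"]*)" "(?P<agent>[^"]*)"'
)


def status_classes(lines):
    """Count requests per status class (2xx, 3xx, ...) in combined-format lines."""
    counts = Counter()
    for line in lines:
        m = COMBINED.match(line)
        if m is None:
            counts["unparsed"] += 1  # keep track of lines the regex could not read
            continue
        counts[m.group("status")[0] + "xx"] += 1  # e.g. "502" -> "5xx"
    return counts


# Made-up sample lines for illustration only.
sample = [
    '203.0.113.5 - - [08/Dec/2019:06:19:39 +0000] "GET / HTTP/1.1" 200 612 "-" "curl/7.58.0"',
    '203.0.113.5 - - [08/Dec/2019:06:19:40 +0000] "GET /missing HTTP/1.1" 404 153 "-" "curl/7.58.0"',
    '203.0.113.7 - - [08/Dec/2019:06:19:41 +0000] "POST /orders HTTP/1.1" 502 157 "-" "Mozilla/5.0"',
]
print(status_classes(sample))
```

Counting an "unparsed" bucket is a deliberate choice: if your log_format differs from the default, that number grows instead of lines silently disappearing, which is the same failure mode to watch for in telegraf.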



Answer 2:


In case this is still relevant, I'd like to share my own config. It's based on the telegraf tail plugin.

  1. Add a custom log format to the http section:
    log_format codes_combined 'code=$status ts=$time_iso8601';
  2. Use this format in the server section:
    access_log /var/log/nginx/codes.log codes_combined;
  3. Edit /etc/telegraf/telegraf.conf:
    [[inputs.tail]]
      files = ["/var/log/nginx/codes.log"]
      data_format = "logfmt"
  4. After restarting nginx and telegraf, the data should be available in Grafana. I configured a new graph with one query per series:
    SELECT count("code") AS code_2xx FROM "tail" WHERE $timeFilter AND code >= 200 AND code < 300 AND code <> 204 GROUP BY time($__interval)
    SELECT count("code") AS code_3xx FROM "tail" WHERE $timeFilter AND code >= 300 AND code < 400 GROUP BY time($__interval)
    SELECT count("code") AS code_4xx FROM "tail" WHERE $timeFilter AND code >= 400 AND code < 500 GROUP BY time($__interval)
    SELECT count("code") AS code_5xx FROM "tail" WHERE $timeFilter AND code >= 500 GROUP BY time($__interval)
    SELECT count("code") AS code_204 FROM "tail" WHERE $timeFilter AND code = 204 GROUP BY time($__interval)
  5. Don't forget to check /etc/logrotate.d/nginx. The permissions on the recreated log should look like this, so that telegraf can still read the file after rotation:
    create 0644 www-data adm
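As a sanity check of the steps above, here is a minimal Python sketch, assuming the codes_combined format: it splits each logfmt line into key/value pairs and buckets the codes the same way the five Grafana queries do, counting 204 on its own rather than in 2xx. Treating digit-only values as integers is my assumption about how telegraf's logfmt parser handles numbers (the numeric comparisons in the queries rely on it); the helper names parse_logfmt and bucket and the sample lines are hypothetical.

```python
from collections import Counter


def parse_logfmt(line):
    """Split a line like 'code=200 ts=2019-...' into a dict of fields.

    Simplified logfmt: assumes values contain no spaces or quotes, which
    holds for the codes_combined format. Digit-only values become ints,
    roughly what (I assume) telegraf's logfmt parser does with numbers.
    """
    fields = {}
    for pair in line.split():
        key, sep, value = pair.partition("=")
        if not sep:
            continue  # skip tokens that are not key=value
        fields[key] = int(value) if value.isdigit() else value
    return fields


def bucket(code):
    """Mirror the Grafana queries: 204 is its own series, excluded from 2xx."""
    if code == 204:
        return "code_204"
    if 200 <= code < 300:
        return "code_2xx"
    if 300 <= code < 400:
        return "code_3xx"
    if 400 <= code < 500:
        return "code_4xx"
    if code >= 500:
        return "code_5xx"
    return "other"  # e.g. 1xx, which none of the queries count


# Made-up sample lines in the codes_combined format.
lines = [
    "code=200 ts=2019-12-08T06:19:39+00:00",
    "code=204 ts=2019-12-08T06:19:39+00:00",
    "code=301 ts=2019-12-08T06:19:40+00:00",
    "code=500 ts=2019-12-08T06:19:41+00:00",
    "code=503 ts=2019-12-08T06:19:41+00:00",
]
counts = Counter(bucket(parse_logfmt(line)["code"]) for line in lines)
print(counts)
```

Keeping the format down to two plain key=value pairs is what makes the logfmt route attractive here: there is nothing to quote or escape, so no grok pattern is needed.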


Source: https://stackoverflow.com/questions/49450336/monitoring-nginx-500s-with-telegraf
