Question
I'd like to understand what my nginx instance is returning and who it's asking to handle requests. What fraction of my queries get handled by rails, what fraction are handled directly by nginx, what fraction are heading off to nginx_status, etc.
Similarly, I'd also like to understand things like how many of which HTTP result codes I'm returning. If there's a peak in 500's, I'd like to know.
The telegraf nginx plugin provides some very basic stats on nginx but no more. I've seen some vaguely complicated solutions for result codes that basically involve setting up log monitoring infrastructure. This data seems so fundamental I feel I must be missing something.
I've seen nothing that will help me understand who is actually handling queries (i.e., which handler).
All of this is interesting because (1) secular growth in handler dispatches can indicate scaling issues in clearer ways than simple load on the handler machines, and (2) peaks in anything can alert to problems.
Any pointers?
Answer 1:
You can let telegraf collect your nginx access logs. Then you can analyze how many requests had which HTTP status code (1xx, 2xx, etc).
Add this to your /etc/telegraf/telegraf.conf
(and make sure telegraf has read access to the log file; it won't tell you if it doesn't):
[[inputs.logparser]]
  files = ["/var/log/nginx/access.log"]
  from_beginning = true
  name_override = "nginx_access_log"
  [inputs.logparser.grok]
    patterns = ["%{COMBINED_LOG_FORMAT}"]
    measurement = "nginx_access_log"
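To sanity-check what such a parser would pull out of the access log, here is a minimal Python sketch. It assumes nginx is writing its default "combined" format; the regex is only an approximation of what the grok pattern matches, and the group names and sample lines are made up for illustration.

```python
import re
from collections import Counter

# Approximation of nginx's default "combined" log format.
# Group names are illustrative, not the field names telegraf emits.
COMBINED = re.compile(
    r'(?P<client>\S+) \S+ (?P<user>\S+) \[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<bytes>\d+|-) '
    r'"(?P<referrer>[^"]*)" "(?P<agent>[^"]*)"'
)


def status_classes(lines):
    """Count responses per status class ("2xx", "5xx", ...)."""
    counts = Counter()
    for line in lines:
        m = COMBINED.match(line)
        if m is None:
            continue  # skip lines that do not match the format
        counts[m.group("status")[0] + "xx"] += 1
    return counts


# Hypothetical sample lines, not taken from a real server.
sample = [
    '127.0.0.1 - - [23/Mar/2018:12:00:00 +0000] "GET / HTTP/1.1" 200 612 "-" "curl/7.58.0"',
    '127.0.0.1 - - [23/Mar/2018:12:00:01 +0000] "GET /boom HTTP/1.1" 500 0 "-" "curl/7.58.0"',
]
print(status_classes(sample))
```

If your access_log uses a custom log_format, the regex (and the grok pattern above) would need to be adjusted to match it.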
Answer 2:
If this is still relevant, here is my own config. It is based on the telegraf tail plugin.
- Add a special log format to the http section
log_format codes_combined 'code=$status ts=$time_iso8601';
- Use this format in the server section
access_log /var/log/nginx/codes.log codes_combined;
- Edit /etc/telegraf/telegraf.conf:
[[inputs.tail]]
  files = ["/var/log/nginx/codes.log"]
  data_format = "logfmt"
- After restarting nginx and telegraf, the data should be available in Grafana. I configured a new graph with these queries:
SELECT count("code") as code_2xx FROM "tail" WHERE $timeFilter AND code >= 200 AND code < 300 AND code <> 204 GROUP BY time($__interval)
SELECT count("code") as code_3xx FROM "tail" WHERE $timeFilter AND code >= 300 AND code < 400 GROUP BY time($__interval)
SELECT count("code") as code_4xx FROM "tail" WHERE $timeFilter AND code >= 400 AND code < 500 GROUP BY time($__interval)
SELECT count("code") as code_5xx FROM "tail" WHERE $timeFilter AND code >= 500 GROUP BY time($__interval)
SELECT count("code") as code_204 FROM "tail" WHERE $timeFilter AND code = 204 GROUP BY time($__interval)
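The bucketing these five queries perform can be reproduced offline as a quick check against the raw codes.log file. This is only a sketch: it assumes lines in the `code=... ts=...` format defined above, with no quoted values, and the function names are invented for this example.

```python
from collections import Counter


def parse_logfmt(line):
    """Parse a simple logfmt line (no quoted values) into a dict."""
    return dict(pair.split("=", 1) for pair in line.split() if "=" in pair)


def bucket(code):
    """Mirror the five queries: 204 is counted separately from other 2xx."""
    if code == 204:
        return "code_204"
    if 200 <= code < 300:
        return "code_2xx"
    if 300 <= code < 400:
        return "code_3xx"
    if 400 <= code < 500:
        return "code_4xx"
    if code >= 500:
        return "code_5xx"
    return "other"  # 1xx responses are not covered by any of the queries


def count_codes(lines):
    """Count log lines per bucket, ignoring lines without a code field."""
    counts = Counter()
    for line in lines:
        fields = parse_logfmt(line)
        if "code" in fields:
            counts[bucket(int(fields["code"]))] += 1
    return counts


# Hypothetical sample lines in the codes_combined format.
sample = [
    "code=200 ts=2018-03-23T12:00:00+00:00",
    "code=204 ts=2018-03-23T12:00:01+00:00",
    "code=502 ts=2018-03-23T12:00:02+00:00",
]
print(count_codes(sample))
```

Comparing these totals with the Grafana panels over the same time range is one way to confirm that telegraf is reading every line.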
- Don't forget to check /etc/logrotate.d/nginx. Permissions should be like this:
create 0644 www-data adm
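Since telegraf stays silent when it cannot read the file, it can help to check the mode bits after a rotation. The sketch below assumes telegraf runs as a user that is neither the file owner nor in its group, so it relies on the "others" read bit that 0644 grants. The helper names are hypothetical; the path is the one used above.

```python
import os
import stat


def others_can_read(mode):
    """True if the permission bits grant read access to 'others'."""
    return bool(stat.S_IMODE(mode) & stat.S_IROTH)


def world_readable(path):
    """Check the 'others' read bit on an existing file."""
    return others_can_read(os.stat(path).st_mode)


if __name__ == "__main__":
    log_path = "/var/log/nginx/codes.log"  # path from the config above
    if os.path.exists(log_path):
        print(log_path, "world-readable:", world_readable(log_path))
    else:
        print(log_path, "does not exist on this machine")
```

If telegraf is instead a member of the adm group, the group read bit (stat.S_IRGRP) would be the one to test.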
Source: https://stackoverflow.com/questions/49450336/monitoring-nginx-500s-with-telegraf