monitoring applications, uptime, log files, etc [closed]

Question

How do you monitor your application in production (logs, uptime, etc.)? I would prefer an external application that is free and open source.

For example, I would like

the ability to send an alert if the application goes down
send an alert if CPU usage exceeds a set threshold
send an alert if memory usage exceeds a set threshold
send an alert for error messages
alerts must be configurable, e.g. some errors should only send an alert if they occur X times in a Y time period
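To make the last requirement concrete, here is a minimal sketch of an "X times in Y time period" rule, written in Python with a sliding window. The class name `ErrorRateAlert` and its parameters are made up for illustration; they are not taken from Nagios or any other tool mentioned in this thread.

```python
from collections import deque


class ErrorRateAlert:
    """Hypothetical rule: fire when at least `max_count` errors
    occur within the last `window_seconds` seconds."""

    def __init__(self, max_count, window_seconds):
        self.max_count = max_count
        self.window_seconds = window_seconds
        self._timestamps = deque()

    def record(self, timestamp):
        """Record one error at `timestamp` (in seconds).

        Returns True if the alert should fire for this error.
        """
        self._timestamps.append(timestamp)
        # Drop errors that have fallen out of the sliding window.
        cutoff = timestamp - self.window_seconds
        while self._timestamps and self._timestamps[0] <= cutoff:
            self._timestamps.popleft()
        return len(self._timestamps) >= self.max_count
```

For example, `ErrorRateAlert(max_count=3, window_seconds=60)` stays quiet for the first two errors in a minute and fires on the third. A real setup would feed it timestamps parsed from the log and send mail or a page when `record` returns True.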

Answer 1:

What kind of application?

I've used Nagios in the past. It's free and open source. It allows you to set up alerts, monitor event logs, monitor application-specific logs, and monitor the server infrastructure and the network itself.

http://www.nagios.org/

Answer 2:

Many people are moving to data-oriented solutions. Most monitoring tools (Nagios etc.) provide static charts, which give siloed, conventional views. Today's apps are highly distributed, transactions span multiple servers, and things can get crazy. For more advanced functionality that goes beyond KPIs and simple APIs, you need to look at machine-data solutions like Logscape or Splunk. They let you create flexible dashboards that can be drilled down interactively to provide very rich root cause analysis. Look at some of the apps on this page: LogscapeApps

Answer 3:

Nagios is the way to go -- a bit of a learning curve, but customizable and powerful. Also has a server-side daemon which can monitor files, disk space etc.

Answer 4:

We have a custom piece of monitoring software that we built in house.

It monitors the event logs on our various live machines (and test machines) for errors produced by our web applications. All our web applications write any exceptions to the error log. It also pings the servers and monitors drive space.
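The drive space part of a check like this is easy to sketch. The following is only an illustration in Python using the standard library's `shutil.disk_usage`; the function names and the threshold parameter are hypothetical, and this is not the in-house tool's actual code.

```python
import shutil


def disk_usage_percent(path="/"):
    """Return the used share of the disk holding `path`, as a percentage."""
    usage = shutil.disk_usage(path)  # named tuple: total, used, free (bytes)
    return 100.0 * usage.used / usage.total


def low_disk_space(path, max_used_percent):
    """Hypothetical check: True if usage is above the allowed percentage."""
    return disk_usage_percent(path) > max_used_percent
```

A poller could call `low_disk_space("/", 90.0)` every few minutes for each monitored drive and raise an alert when it returns True.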

There is a client application on every dev machine that polls the server app that monitors all the servers we have defined. This client app runs in the task tray and pops up messages when anything is out of the norm so a dev sees it instantly. We can also see when testers come across errors and usually have a fix or at least a fix in progress by the time the tester even reports the error.

The server also emails out to a distribution group so that we can see important errors while not at work if we need.

It also has the ability to suppress predefined exceptions and errors.
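Suppression of this kind can be as simple as a list of patterns checked before an alert goes out. A rough Python sketch, assuming errors arrive as plain text messages; `SuppressionFilter` and the example patterns are invented for illustration:

```python
import re


class SuppressionFilter:
    """Hypothetical filter: skip alerts for errors matching known patterns."""

    def __init__(self, patterns):
        # Each pattern is a regular expression for an error to ignore.
        self._patterns = [re.compile(p) for p in patterns]

    def should_alert(self, message):
        """Return False if `message` matches any suppressed pattern."""
        return not any(p.search(message) for p in self._patterns)
```

With `SuppressionFilter([r"TimeoutError"])`, a message containing `TimeoutError` is dropped silently while everything else still triggers the popup or email.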

Answer 5:

Google Analytics???

Source: https://stackoverflow.com/questions/1015768/monitoring-applications-uptime-log-files-etc

Tags

monitor