Website Monitoring Strategy
As corporations and small businesses rely on the web more and more heavily, ensuring that a website is up at all times becomes increasingly critical. There are a number of services (Wikipedia has a great list) that offer to monitor website availability using various traffic types (ping, HTTP, etc.) to confirm the site is available to the public. At what point should you use one of these services, and what kind of monitoring should you also perform yourself (assuming you have the appropriate network infrastructure)? If you are a company with some internal hosting capability, or with a number of servers hosted in a data centre somewhere, then I think implementing a solution as illustrated below is wise.
Monitoring Architecture
The design of this architecture is focused on getting the best “bang for your buck” from monitoring. That is, any service that you can monitor from your own internal network, you should; but you also want something keeping an eye on the monitoring service itself (who watches the watchmen?). This is where the services listed in the Wikipedia article should be investigated. In addition to using the external monitors to ensure the internal monitoring services are operating, it is probably worthwhile having the external service monitor some of the key services directly as well (you usually get more than one monitored service with hosted monitoring solutions).
In terms of how you monitor the internal solution, there are a number of options, and which one suits depends on how you can set things up. A simple and fairly effective strategy is to have the internal monitoring server write a small file (deleting any existing file first) to one of the monitored webservers as one of the actions it completes during its “rounds”. That webserver should then have a very lightweight script deployed to it that checks that the creation date of the file is within a certain tolerance (say, no more than 5 minutes old). The external monitoring service polls this script, so a stale file tells it that the internal monitor has stopped doing its rounds. This script is an example of what I call a “full stack health check”, which I cover in the next section.
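The freshness check described above could look something like the following. This is only a minimal sketch: the function name `heartbeat_status` and the constant `MAX_AGE_SECONDS` are made up for illustration, and it reads the file's modification time rather than a true creation time (which not every filesystem exposes). Because the monitoring server deletes and rewrites the file on each round, the two should be equivalent here.

```python
import os
import time

MAX_AGE_SECONDS = 5 * 60  # tolerance from the text: up to 5 minutes


def heartbeat_status(path, max_age=MAX_AGE_SECONDS, now=None):
    """Return "OK" if the heartbeat file is fresh, else a short failure message."""
    # `now` can be passed in explicitly, which makes the check easy to test.
    now = time.time() if now is None else now
    try:
        # Modification time stands in for "creation date" (see note above).
        age = now - os.path.getmtime(path)
    except OSError as exc:
        return "heartbeat file unreadable: " + (exc.strerror or "unknown error")
    if age > max_age:
        return "heartbeat file is %d seconds old (limit %d)" % (age, max_age)
    return "OK"
```

A script on the webserver would call `heartbeat_status` with the path the monitoring server writes to and return the result as the response body for the external monitor to read.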
Implementing “Full Stack Health Checks”
In my opinion, the most efficient and effective way to implement health checking on web and application servers is to write lightweight scripts that do what your application is going to do. For instance, a health check script that monitors the health and availability of a web application backed by a database (fairly common) should at least make a connection to the database and perform a simple select statement, to ensure that the database is available as well as the webserver. In many instances it is possible to point a monitoring service at the main page of a website to achieve a similar result; however, high-traffic webservers commonly use caching to improve response times, and in those cases the monitor could keep receiving a cached page and miss a problem that will show itself to the public soon enough. Given this example, implementing health check scripts and deploying them in their own directory on the webserver (and setting up some caching exceptions where appropriate) might give you a heads-up, letting you fix a problem before the general public even notices.
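As a rough illustration of the database part of such a check, here is a small sketch. It uses Python's built-in `sqlite3` module purely as a stand-in for whatever database driver the real application uses, and the function name `database_status` is hypothetical.

```python
import sqlite3


def database_status(db_path):
    """Connect and run a trivial SELECT; return "OK" or a short error message."""
    try:
        # sqlite3 stands in here for whatever driver the real application uses.
        conn = sqlite3.connect(db_path, timeout=2)
        try:
            # The simplest possible query: it proves the database answers.
            conn.execute("SELECT 1").fetchone()
        finally:
            conn.close()
    except sqlite3.Error as exc:
        return "database %s check failed: %s" % (db_path, exc)
    return "OK"
```

The point of going through a real connection and query is that the check exercises the same path the application does, so a cached front page cannot hide a dead database.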
When writing a “full stack” check, my recommendation would be to simply return the text “OK” when every service in the stack validated by the call is operational. When a particular component of the stack is not operational, however, rather than simply returning “Broken” (or something similar) you should return something helpful. This could be the text of an exception message, or whatever you have determined to be the cause of the error in that component. This failure message (if kept short) can be emailed and sent by SMS to the on-call staff, and it tells them which other key members of staff they might need to involve to fix the problem. For example, a message such as “database blah could not accept connection” tells a web support programmer that they will probably need to get a database administrator on the phone to help resolve the problem.
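One way to sketch this “OK or a helpful message” convention is a small runner that walks through the components in order and reports the first one that fails. The name `full_stack_status`, the `(name, check)` pair format, and the 160-character cap (roughly one SMS) are all assumptions made for this example, not part of any particular monitoring product.

```python
def full_stack_status(checks):
    """Run named checks in order; return "OK" or the first failure, kept short."""
    for name, check in checks:
        try:
            result = check()
        except Exception as exc:  # a check that crashes is itself a failure
            result = str(exc) or exc.__class__.__name__
        if result != "OK":
            # Prefix the component name so on-call staff know whom to involve.
            return ("%s: %s" % (name, result))[:160]  # short enough for an SMS
    return "OK"
```

Each check is just a callable that returns “OK” or a failure message, so the webserver, database, and any other component the application depends on can be added to the list without changing the runner.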
Build a Monitoring Catalogue
Still on my to-do list is to build a catalogue of monitoring end-points that relates back to the critical websites and applications I am responsible for maintaining at the company where I work. This catalogue would then be used to ensure that key services are suitably covered by whatever monitoring strategy is put in place.
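A catalogue like this does not need to be elaborate. The sketch below shows one possible shape; the site names, end-point paths, and field names are all invented for illustration, and a spreadsheet would serve just as well.

```python
# Hypothetical catalogue: each critical site maps to its monitoring end-points
# and which monitors (internal, external) currently poll each one.
CATALOGUE = {
    "www.example.com": [
        {"endpoint": "/health/db", "monitors": ["internal", "external"]},
        {"endpoint": "/health/heartbeat", "monitors": ["external"]},
    ],
    "intranet.example.com": [
        {"endpoint": "/health/db", "monitors": []},
    ],
}


def uncovered_endpoints(catalogue):
    """Return (site, endpoint) pairs that no monitor is polling."""
    return [
        (site, entry["endpoint"])
        for site, entries in sorted(catalogue.items())
        for entry in entries
        if not entry["monitors"]
    ]
```

Running `uncovered_endpoints(CATALOGUE)` lists the end-points that exist but that nothing is watching, which is exactly the coverage gap the catalogue is meant to expose.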


