It’s 4PM on a Friday.
Everyone is either wrapping things up or on their way out the door.
Unnoticed by any of the development team, a tiny bit flipped and the web application their business runs stops working. Blissfully unaware, the team heads home without a care in the world. Monday morning doesn’t go so well. The e-commerce site they run has just missed out on the $30k worth of orders they usually get over the weekend. Directors are furious; demand answers. Customers are complaining. People are scrambling. This is not a good place to be. All of this could have been avoided with some basic uptime monitoring.
Why use Uptime Monitoring?
What’s the point of uptime monitoring? You find out when your web application fails, so you can fix the darn thing before bad things happen. Like general chaos on a Monday morning. Or customers clogging up support. Or your boss firing you. And it can happen to anyone. Even companies like Twitter, Google and Amazon have all recently had major outages (article links: Twitter, Google, Amazon). Whether you’ve got a little blog on the side, a big e-commerce site or a mission-critical business application, you need appropriate uptime monitoring.
What should I monitor?
What to monitor depends on the complexity of your app. A static file server is monitored differently from a dynamic web application. Think of the adage ‘If it’s not covered by tests, then it’s broken’. You should think of your monitors as unit tests for your production environment. Monitors constantly check that everything is working correctly (and tell you when it isn’t). So let’s get into specifics.
Monitor the Network
Let’s start at the base level. Can I reach the server and have a process accept the HTTP connection? This is the basis of any HTTP service. Whether you’re running a static site or the most complex application, you need a connection between your process and the outside world. Without it, you have nothing. (Think of all those times you spent playing the jumping dinosaur game on Chrome.) With a network monitor we’re watching a few things. Is the network path to the server up? Is my server on and responsive? Is my HTTP process running?
Where should I be using a network monitor? Monitor each process listening on the HTTP ports in your application. If you have a static service, that might only be one NGINX process. If you have a load balancer and/or micro-services, you’re going to set up monitors for each process that you can. With auto-scaling you can’t really re-configure your monitoring each time you scale. In that case, hit the load balancer more frequently. The increased frequency of the checks will cycle through the processes behind the load balancer.
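To make the network check concrete, here is a minimal sketch in Python. It only answers the most basic question: does something accept a TCP connection on the port? The function name `can_connect` and the default timeout are assumptions for illustration, not part of any monitoring product.

```python
import socket


def can_connect(host: str, port: int, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        # create_connection resolves the host and completes the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers DNS failures, refused connections, and timeouts.
        return False
```

You would call something like `can_connect("app.sample.com", 443)` once per listening process. Note that a successful connection only shows that a process is accepting connections, not that it answers HTTP requests correctly.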
Monitor Critical Application Functions
Let’s move up to the application level. The HTTP process is replying, but that doesn’t mean things are working correctly. Anyone who’s seen a fail whale on Twitter can tell you that. There are a bunch of fundamental things to check here. Pick what applies to you. Can my server process access the filesystem correctly? Does my server process talk to the database(s) correctly? Can it talk to the upstream micro-service(s)? If I rely on an external service (like S3 or SendGrid), is my server process configured correctly to talk to it? Are those external services up?
Health Check Endpoint
Health check endpoints are HTTP endpoints created for uptime monitoring (e.g. app.myservice.com/health). In a health check endpoint you run a series of basic tests. Are my components (database, file system…) working? If every test passes, the endpoint returns an HTTP 200. It’s a good idea to also include a breakdown of the test results in the response. That makes debugging things easier down the road.
Health check endpoints are helpful, but they aren’t a perfect test. Your database may be responding to pings, but failing on inserts. The file system may show that you have permission, but it may be out of disk space.
Think of a health check like checking your temperature. If you have a fever, something is definitely wrong. But, if you don’t have a fever, that doesn’t mean everything is ok. Health checks are helpful, but real data checks are where it’s at.
One last thing on health checks. If you have the engineering resources, this is something you should set up. It’s pretty minimal effort compared to a real data check. And it’s much more thorough than a smoke check.
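Here is a rough sketch of what a health check endpoint could look like, using only Python’s standard library. The two checks are stand-ins: the in-memory SQLite query takes the place of whatever database you actually run, and returning 503 on failure is my assumption (the only firm rule above is that a passing check returns 200). The `/health` path and all names are illustrative.

```python
import json
import sqlite3
import tempfile
from http.server import BaseHTTPRequestHandler, HTTPServer


def check_filesystem() -> bool:
    """Write and read back a small temp file to confirm the disk is usable."""
    try:
        with tempfile.TemporaryFile() as f:
            f.write(b"ok")
            f.seek(0)
            return f.read() == b"ok"
    except OSError:
        return False


def check_database() -> bool:
    """Stand-in database check: run a trivial query (in-memory SQLite here)."""
    try:
        conn = sqlite3.connect(":memory:")
        try:
            return conn.execute("SELECT 1").fetchone() == (1,)
        finally:
            conn.close()
    except sqlite3.Error:
        return False


def run_checks() -> tuple:
    """Run every component check and pick the HTTP status to return."""
    results = {"filesystem": check_filesystem(), "database": check_database()}
    status = 200 if all(results.values()) else 503  # 503 is an assumption
    return status, results


class HealthHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path != "/health":
            self.send_error(404)
            return
        status, results = run_checks()
        # Include the per-component breakdown to make debugging easier.
        body = json.dumps({"ok": status == 200, "checks": results}).encode()
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)
```

To try it locally, run `HTTPServer(("127.0.0.1", 8000), HealthHandler).serve_forever()` and request `/health`. In a real app you would hang the same logic off your existing web framework instead of a separate server.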
Real Data Check
In a real data check, we run real requests to test our application. The idea is to test functions that users are actually using. This test is perfect for things like login, register and e-commerce cart functions. In a register test, we create a dummy account every time we run the check.
Real data checks are ideal because they test that the function is working. The only downside is that setting up test scenarios for each function may be difficult. Some uptime monitoring services like Status List make this easier for you. Status List allows you to specify custom authentication headers and request parameters.
You can really go deep with lots of real data checks. I would recommend that at the very least you set up a test for one mission-critical function (e.g. login, add to cart, load analytics). If you have the resources, test each major function in your service.
Smoke Check
A smoke check is the same idea as a smoke test in software testing. Send a request to a simple endpoint and see if it works. On an e-commerce site, check that the home page is up. On a business application, check that the dashboard doesn’t time out.
Smoke checks are helpful because they catch the system-wide problems. Things like your database going down. Or network connectivity to your upstream microservice failing. Smoke checks are also great because they’re so stinking easy to set up! Throw in an HTTP GET request to your favorite data-pulling endpoint. That will test your database and server for critical failures.
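As an illustration of the register test described above, here is a small sketch. The `/register` path, the JSON field names, and the assumption that any 2xx response means success are all hypothetical; adapt them to whatever your application really expects.

```python
import json
import uuid
import urllib.request


def build_register_request(base_url: str) -> urllib.request.Request:
    """Build a POST that registers a unique, clearly fake account."""
    payload = {
        # A fresh address per run so the check never collides with itself.
        "email": f"uptime-check+{uuid.uuid4().hex}@example.com",
        "password": uuid.uuid4().hex,
    }
    return urllib.request.Request(
        url=f"{base_url}/register",  # hypothetical endpoint
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def register_check(base_url: str, timeout: float = 10.0) -> bool:
    """Return True if the registration request comes back with a 2xx status."""
    request = build_register_request(base_url)
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return 200 <= response.status < 300
    except OSError:
        # Connection errors and HTTP error statuses (4xx/5xx) land here.
        return False
```

Since every run creates a new dummy account, you will probably want a cleanup job or a recognizable prefix (like `uptime-check+` here) so the test accounts are easy to purge.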
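A smoke check really can be this small. The sketch below sends one GET and reports whether it came back with a 200 inside the timeout. The function name and the 10 second default are my own choices for illustration.

```python
import urllib.request


def smoke_check(url: str, timeout: float = 10.0) -> bool:
    """Return True if a GET to url answers with HTTP 200 before the timeout."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except OSError:
        # Timeouts, refused connections, and 4xx/5xx responses all count as failures.
        return False
```

Point it at a page that pulls from the database, for example `smoke_check("https://sample.com/")`, and run it on a schedule.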
Monitor DNS Changes
DNS can be confusing. Mistakes happen. People won’t be able to access your site. Monitor your DNS.
Uptime monitors like Status List can check this for you if you set it up. It’s pretty easy actually. Create a monitor for each A/AAAA/CNAME record on your system, e.g. sample.com, app.sample.com, us-east.sample.com, upstream-video-transcoding.sample.com. Then, when the DNS points to the wrong place, you’ll know about it.
At the very least you should have a monitor that checks your primary domain. If your primary domain is down, it’s going to be real bad if no one notices, right? Monitor it.
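If you are curious what a DNS check boils down to, here is a minimal sketch: resolve the hostname and compare the answer with the addresses you expect. The function names are hypothetical, and the IP addresses in the usage note are documentation-range placeholders, not real records.

```python
import socket


def resolve_addresses(hostname: str) -> set:
    """Return the set of IP addresses (A and AAAA) the hostname resolves to."""
    infos = socket.getaddrinfo(hostname, None, proto=socket.IPPROTO_TCP)
    # Each entry is (family, type, proto, canonname, sockaddr); sockaddr[0] is the IP.
    return {info[4][0] for info in infos}


def dns_matches(hostname: str, expected: set, resolver=resolve_addresses) -> bool:
    """Return True if the hostname resolves only to addresses we expect."""
    try:
        actual = resolver(hostname)
    except OSError:
        # socket.gaierror (resolution failure) is a subclass of OSError.
        return False
    # An empty answer or any unexpected address counts as a failure.
    return bool(actual) and actual <= expected
```

A call might look like `dns_matches("app.sample.com", {"203.0.113.10", "203.0.113.11"})`. The `resolver` parameter exists only so the check can be exercised without touching real DNS.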
Monitor HTTPS Certificates
You wouldn’t believe how many HTTPS certs aren’t renewed on time! This affects everyone. Even Microsoft misses this sometimes. But it’s so easy to throw an uptime monitor on that.
On uptime monitoring services like Status List, each request will check that the certificate is valid. Create a monitor for each HTTPS certificate and HTTPS termination point you have. One termination point, one domain – create one monitor for that domain. One termination point, multiple domains on a wildcard certificate – create one monitor for one domain on the certificate. Multiple termination points, wildcard certificate – create a monitor that will hit each termination point.
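For a feel of what a certificate check involves, here is a sketch using Python’s `ssl` module. It goes one step beyond a plain validity check and computes how many days remain before expiry, which is my own addition so you could warn ahead of time. The function names are hypothetical.

```python
import socket
import ssl
import time


def days_until_expiry(not_after: str, now: float = None) -> float:
    """Convert a certificate's notAfter string into days remaining."""
    expires_at = ssl.cert_time_to_seconds(not_after)
    current = time.time() if now is None else now
    return (expires_at - current) / 86400


def fetch_not_after(hostname: str, port: int = 443, timeout: float = 10.0) -> str:
    """Complete a verified TLS handshake and return the cert's notAfter field."""
    context = ssl.create_default_context()  # verifies chain and hostname
    with socket.create_connection((hostname, port), timeout=timeout) as sock:
        with context.wrap_socket(sock, server_hostname=hostname) as tls:
            return tls.getpeercert()["notAfter"]
```

Something like `days_until_expiry(fetch_not_after("sample.com")) < 14` could trigger a warning. If the certificate is already invalid, the handshake itself raises `ssl.SSLError`, which is the failure you want to alert on. Run it against each termination point you identified above.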
Monitor Performance
So our system is up and functioning. The next thing to look out for is performance bottlenecks. Performance problems can come from many sources. A database table may have grown and introduced slowness on certain queries. Customer load might spike on a CPU intensive task like image processing. Those problems aren’t always easy to diagnose, so we want to know about them early!
What to Monitor for Performance
There are a few endpoints we should always be monitoring for performance. Marketing landing pages, dashboards and frequently used business logic are a great place to start. Let’s go through why. Studies show that on marketing sites the conversion rate drops with every increase in response time. Let me say that another way. As the response time slows on a marketing site, your company loses money!
OK, so what else? We should monitor our dashboards. If there’s a database bottleneck, that’s probably where it’s going to show first. And lastly, our frequently used business logic should be monitored. Basically we’re checking that the critical functions in our service are running well.
If you have the resources it can also be helpful to monitor the login and health check endpoints for performance. Typically, I wouldn’t include them because they don’t have the same performance requirements that other parts of the app do. But if you have the time, wire them up too.
Setting up Performance Monitoring
How do I set it up? Most uptime monitoring services make it quite easy to check performance. Services like Status List allow you to specify a response time threshold on your monitors. This means that if several requests to a monitor take longer than X ms, we’ll get notified.
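The threshold rule is simple enough to sketch. The version below alerts only when the last few samples are all slower than the threshold, so a single slow request doesn’t page anyone. Treating “several” as three consecutive requests is my assumption; services will differ.

```python
import time


def timed_ms(action) -> float:
    """Run action() and return how long it took in milliseconds."""
    start = time.perf_counter()
    action()
    return (time.perf_counter() - start) * 1000.0


def should_alert(response_times_ms: list, threshold_ms: float, consecutive: int = 3) -> bool:
    """Return True when the latest `consecutive` samples all exceed the threshold."""
    if consecutive <= 0 or len(response_times_ms) < consecutive:
        return False
    recent = response_times_ms[-consecutive:]
    return all(sample > threshold_ms for sample in recent)
```

For example, with a threshold of 800 ms, the samples `[120, 900, 950, 1200]` would alert, while `[900, 950, 300]` would not because the most recent request recovered.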
Multi-Region Monitoring
Your site needs to be running at the same high level regardless of where it’s being viewed. An outage only affecting Australia could go undetected by your US-based team for days. If your reach is global and your clientele is global, your service needs to be up globally.
How do I choose which regions to target? This really depends on where your target market is. If you’re selling American tax software, you may only want to check the US. Running an email service? Your clients may live in the US. But they’re going to travel to places like Europe, Australia, etc. and want access to their email.
Some services will offer uptime checks in every little country on the globe. You don’t need to monitor them all if you don’t have a presence there. Just choose regions that represent your user base. If you get the odd person in Indonesia, maybe you just monitor from one nearby location like Papua New Guinea, not from every island in Indonesia.
How do you set it up? Many uptime services will have a series of checkboxes in their monitor configuration. Select all the countries you’d like to monitor. You’re off to the races!
June 29th – Add HTTPS and DNS
July 7th – Include performance monitoring
July 15th – Multi-Region Monitoring