How to provide reliable SAAS - External Services

If you work in a SAAS company, your company most likely relies on external services. Think of things like transactional email, credit card payments and compute resources (e.g. sendgrid.net, stripe.com). We rely on these providers because they do something that we can’t or don’t want to do ourselves. Service providers are essential to our SAAS businesses. Without them, we’d have to spend a lot of time on things that aren’t special to our business. But our service providers don’t always provide 100% perfect service. Sometimes we know about these problems and other times they happen silently. Either way, our customers are affected. Our customers are critical: they are the sole reason our company exists. We need to find ways to provide top-level service to our customers, even when our external services fail us.

Who are your external services?

Your external service providers or vendors are also SAAS companies. Think of companies like twilio.com or squareup.com. They’re SAAS companies with other SAAS companies as their customers. They’re just like us. They are ever-changing, with new customers, government regulation and personnel turnover. But service providers are also quite different from us. Their services aren’t consumed by people. Their services get consumed by software, and consumed in bulk. That means that when a couple of requests fail, it’s hardly a pin drop next to the rush of successful requests going by. It’s hard to be a good service provider. They need to tune into their metrics and always look for patterns in the errors.

What to do when failure happens

We know that no SAAS company is perfect and some sort of outage is going to happen at some point. So let’s prepare for the inevitable. How you deal with the outage is critical. Dropping the ball will push your customers away to your competition. But handling things well can make your customers even more loyal to you. Your customers need to know that you’ll take good care of them. If you can do that when the going is tough, it will strengthen your relationships. To weather an outage and come out ahead, you need to do two things: communicate with your customers and fix the problem. You can’t have one without the other.

Communicate with your customers

Communication is number one. As soon as you know that something is going sideways, communicate that to the people affected. Tell them that there’s a problem and that your team is working on it. That was a couple of things at once, so let’s unpack it a bit.

Communicate with the people affected. If possible, only communicate with the people who are being annoyed by the problem. We don’t want to bother people who don’t care about the outage. A great way to do this is to place notification banners in your SAAS application that are easy to update. That way the people affected (the ones using your app) will get notified. Customers who aren’t online right now aren’t bothered with an unnecessary notification. If your outage originates with a background task that processes payments, look up the logs from the outage window and only send apologies to the people who had their payments delayed. Again, you’re notifying the people affected.
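As a rough illustration of that last point, here is a minimal Python sketch of picking out the customers whose payments were delayed during an outage. The PaymentJob record, its field names and the five-minute threshold are all assumptions made for the example; your own job logs will look different.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class PaymentJob:
    """One background payment job, as reconstructed from your logs (hypothetical shape)."""
    customer_id: str
    scheduled_at: datetime
    completed_at: datetime


def affected_customers(jobs, outage_start, outage_end,
                       max_delay=timedelta(minutes=5)):
    """Return the ids of customers whose payments were delayed during the outage."""
    affected = set()
    for job in jobs:
        # Only consider jobs that were scheduled inside the outage window.
        in_window = outage_start <= job.scheduled_at <= outage_end
        # A job counts as delayed when it finished later than we normally tolerate.
        delayed = (job.completed_at - job.scheduled_at) > max_delay
        if in_window and delayed:
            affected.add(job.customer_id)
    return affected
```

The resulting set is the list of people who should get an apology. Everyone else is left alone.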

Tell them that there’s a problem. You don’t need to over-complicate this. Talk in terms that your customers can understand. Let them know that something’s not working right. It could be as simple as “email notifications from our billing module are delayed”. You might not know exactly what’s going wrong. But that’s ok, use the information that you have and be honest with your customers: “We’re experiencing some delay in our payment processing”. Often business owners or managers will stall at this point. They’ll say they need to wait for more information. You need to be smart about this. You don’t want to communicate something incorrect because you jumped the gun. But as long as your customers are being affected by whatever it is, tell them, quickly. The longer you wait, the worse it looks for you.

And then, of course, you tell them that your team is working on it. Your team is all over it. With their skills, you’ll get to the bottom of this in no time and everyone can go back to normal.

Part of good communication is keeping your entire team on the same page. Synchronize the knowledge between ops, development, support and even sales and marketing. If someone calls in from a free trial to talk to their salesperson, don’t leave that salesperson flat-footed. Tell them what’s going on. The support team can head off a lot of unnecessary investigation tickets when they’re in the loop.

But even good communication will only get you so far. You can do an excellent job communicating. But if you aren’t fixing the problem, people are going to get tired of waiting for you.

Fix the problem

You’re coordinating customer communication. But of course, your development team is working on the problem at the same time. You’re going to make sure they have the resources they need to rectify the situation. You can be liaising with vendors and service providers. Or you can work on creating space so the dev team can focus on this problem without interruption.

Staying ahead

There’s a lot to do when there’s an incident. There are a lot of unknowns. There is communication and investigation that all needs to happen. All the while, your team is in the hot seat. Your customers grow impatient and start thinking about moving to your competitors. You do not want to be in this situation, and if you find yourself there, you want it resolved ASAP. Before you find yourself in an incident, there are a few things you can do to stay ahead of the problem.

There are service provider status pages, but you already knew that. Status pages can give you information on what’s happening with your service provider. The problem with status pages is that they’re often out of date. There may have been an issue 6 hrs ago that still hasn’t made it onto the page. And there’s another problem. If there’s an issue with your specific account, it usually won’t be up on the status page. Status pages are great when providers do a good job of updating them, but often they don’t.

You can bake some monitoring into your application. The monitoring will let you know when there are problems with your service provider. This can be very helpful because you get notified right away when there’s a problem. Track things like response time and HTTP statuses. Set up customized alerts so you know what’s going on as soon as it happens. Then your dev team can start investigating well before the first customer notices. During this time you can get your other teams up to speed. You update support, sales and marketing teams before any customer reaches out! You’ve given yourself something like a 20 min head start on resolving the issue! Who doesn’t want a 20 min head start?
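To make the idea concrete, here is a minimal Python sketch of a single check that tracks response time and HTTP status. It uses only the standard library. The names probe and find_problems, and the 2 second threshold, are assumptions for illustration, not a recommended standard.

```python
import time
import urllib.error
import urllib.request
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ProbeResult:
    url: str
    status: Optional[int]      # None when no HTTP response arrived at all
    elapsed_seconds: float
    error: Optional[str] = None


def probe(url: str, timeout: float = 10.0) -> ProbeResult:
    """Send one GET request and record the status code and response time."""
    started = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            status, error = response.status, None
    except urllib.error.HTTPError as exc:
        # 4xx/5xx responses raise HTTPError but still carry a status code.
        status, error = exc.code, str(exc)
    except OSError as exc:
        # DNS failures, refused connections and timeouts: no status at all.
        status, error = None, str(exc)
    return ProbeResult(url, status, time.monotonic() - started, error)


def find_problems(result: ProbeResult, max_seconds: float = 2.0) -> List[str]:
    """Turn one probe result into a list of alert messages (empty means healthy)."""
    problems = []
    if result.status is None:
        problems.append(f"no response: {result.error}")
    elif result.status >= 400:
        problems.append(f"unexpected HTTP status {result.status}")
    if result.elapsed_seconds > max_seconds:
        problems.append(f"slow response: {result.elapsed_seconds:.1f}s")
    return problems
```

You would run probe on a schedule against an endpoint you actually depend on, and send whatever find_problems returns to the alerting channel your team already watches.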

You can also bake some debug information into your application. Set up your application to log interactions that end in unexpected ways. Then when you have an incident with your service provider, you’re ready. You can pull the debug logs and send your findings to their support team. Giving detailed debug information will help your service provider resolve the issue faster. Fast issue resolution is in everyone’s best interest, so any help we can give to our service provider is a win for us.
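Here is one way that logging could look, as a small Python sketch. The record fields, the list of sensitive header names and the set of expected statuses are assumptions for the example. The one design choice worth keeping is redacting credentials before anything is written, since these logs are meant to be shared with a vendor.

```python
import json
import logging
from datetime import datetime, timezone

logger = logging.getLogger("provider_debug")

# Header names we assume carry credentials; adjust to what your provider uses.
SENSITIVE_HEADERS = {"authorization", "x-api-key"}


def build_debug_record(provider, method, url, request_headers,
                       status, response_body, elapsed_seconds):
    """Collect the details a provider's support team is likely to ask for."""
    safe_headers = {
        name: "[redacted]" if name.lower() in SENSITIVE_HEADERS else value
        for name, value in request_headers.items()
    }
    return {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "provider": provider,
        "method": method,
        "url": url,
        "request_headers": safe_headers,
        "status": status,
        "response_body": response_body,
        "elapsed_seconds": elapsed_seconds,
    }


def log_if_unexpected(record, expected_statuses=frozenset({200, 201, 202, 204})):
    """Write the record as one JSON line only when the interaction went wrong."""
    if record["status"] in expected_statuses:
        return False
    logger.warning(json.dumps(record))
    return True
```

Because each record is a single JSON line, pulling everything for one provider during an incident is a simple filter over your log storage.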

Building monitoring, alerting and log storage takes engineering effort. It takes away resources you would otherwise spend on things like customer feature requests. That’s why I created https://statuslist.app. Status List uses your account-specific credentials to poll your service providers’ endpoints for you. You get alerts right away when incidents happen, so you can have that head start. Status List monitors HTTP status codes, uptime and response time. When there’s an alert, we give you all the detail you need. The alert has full HTTP transcripts of the interaction, including event timestamps. You can use that to help your service provider get back up and running. Best of all, you can set up your monitors in 5 mins. Instead of spending weeks of your precious engineering resources, you can start today.

Providing reliable SAAS when you rely on external services can be tough. But if we leverage some of these smart strategies we can stay ahead of the problems. And who knows, you may even win the trust of your customers by the way you handle the incidents that come your way.

Hi, I'm the founder of Status List. My goal is to help SAAS companies spend less time dealing with incidents and more time building awesome software. If you like what you've seen here, you can find me at hello@statuslist.app or on Twitter @natebosscher.