Note: This article was written in collaboration with Rootly and was originally featured on the Rootly Blog.
Mature start-ups and scale-ups create wonderful and challenging environments for Engineers. As the product matures and the brand becomes successful, the user base generally starts growing, and, for some companies, in places they might not have expected it to. As that happens, new challenges arise for Engineers.
One of these challenges is pretty straightforward to guess: keeping a particular product available across different regions of the world. It's a problem one doesn't really start noticing until, one day:
- The product begins to have a huge number of customers/users in parts of the world that the company wasn’t really expecting.
- The Engineering team believes the system is up and running, only to learn minutes (or hours) later that it's not actually available to users elsewhere. One simple reason this can happen is that the Production Monitoring systems live in the same datacenter / same network as the systems they are monitoring.
- The customers affected by this simply stop using the product.
In this case, SREs or Platform Engineers, whether part of start-ups that want to grow and scale or of more established companies that want to expand to new markets, need to prepare themselves not just to react better in this situation, but to proactively tackle problems before they occur. In this context, there are many improvements they can bring to the table, and in this article, if you're one of these Engineers, I'd like to help you figure out one of them: monitoring your platform from multiple locations.
Where to start?
There are many tools out there that offer effortless (or almost effortless) solutions to this problem, and most of them deliver. Obviously, each one comes with its own set of features that makes it unique: fancy integrations, different sets of locations, pricing, etc. Speaking strictly from a technical perspective, there is indeed very little complexity in spinning any of these solutions up. And I believe this is the beauty of technology: letting Engineers focus on what is most meaningful in their activity, solving problems.
“So, if the challenge is not setting up the tool, where is it?” one might ask. Well, the challenge is actually about everything but the tool. As with everything in Engineering: understanding what needs to be built and how it's going to be used, before laying a finger on the implementation. There are multiple aspects to keep in mind while ramping up a system that can monitor your platform from different locations. Getting a good grasp of the most important ones can help you (yes, you out there reading this!) speed up the roll-out in your organization.
1. Defining the “where”
Or, basically, compiling the list of locations. I believe this is one of the most challenging parts of configuring a worldwide monitoring system. In the end, out of all the locations in the world, which ones should you care about most? Obviously, if you're not a telecom company (or something similar), it's not really feasible to monitor every tiny location on the globe.
To ease your job, start by understanding what matters to your business: things like top locations by user base, top locations by number of visits, and top locations of customers with premium / enterprise accounts can help you understand where to put your focus. Also, looking at SLAs and legal agreements: does your company need to provide its services in a specific location? These will help you create the right mix of locations that the organization you're building this for cares about.
Secondly, figure out what a location means in your particular context: is it a city, a country, or a region? I'm stressing the context here because it can be any of these options, or any combination of them, depending mostly on how your business operates. Let's take the example of a food delivery company with a strong presence in metropolises / big cities. Monitoring those particular cities might make more sense than monitoring regions or countries, as downtime of the platform in one of these bigger cities will impact the revenue much more than, let's say, downtime in a couple of villages in a remote area of one region.
Last, but not least, now that the locations are chosen, should they all be treated in the same way? Are they equally important? That is, should alerts be triggered in the same way for all of those locations? In an ideal scenario, yes: the worldwide monitoring system would alert in the same way whether the platform is down in, let's say, Barcelona or San Francisco. This is important so Engineering can record whether there's actually a problem operating the platform in certain key locations for the company. However, this doesn't mean that an Engineer should wake up at 4AM for downtime in every location. Setting up a series of priorities for your set of locations, reusing the already existing priority system for your production incidents, can help you focus on what's important first, as sketched below.
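As a minimal sketch of what that could look like, here is a hypothetical mapping of locations to the priority levels of an existing incident process. The location names, priority labels, and notification channels are all illustrative assumptions, not a prescription:

```python
# Hypothetical mapping of monitored locations to the priorities of an
# existing incident process: P1 pages on-call, P3 just opens a ticket.
MONITORED_LOCATIONS = {
    "san-francisco": {"priority": "P1", "notify": "page-oncall"},
    "barcelona":     {"priority": "P1", "notify": "page-oncall"},
    "singapore":     {"priority": "P2", "notify": "chat-alerts"},
    "sao-paulo":     {"priority": "P3", "notify": "ticket"},
}

def route_alert(location: str) -> str:
    """Return the notification channel for a failing location."""
    config = MONITORED_LOCATIONS.get(location)
    # Unknown locations default to the least noisy channel.
    return config["notify"] if config else "ticket"
```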
While doing this, remember there are multiple departments that can assist in selecting the locations: some can provide you with data (Customer Success, Sales, and obviously Data), some with regulatory expectations (Security, and obviously Legal), and some with technical information from the trenches (fellow SREs, Platform Engineers, and Product Engineers who have been on call).
2. Figuring out the “how”
Once the location part is completely figured out, the Engineer inside you will feel happy and ready to roll up their sleeves and implement the system. Along the way there will be some technical decisions to take, and having access to data about your systems will help you take them.
One of the things you'll need to figure out is how frequently these monitors should execute against the production instance of the platform. Looking at traffic is the best way to tackle this. A couple of users per minute might not give us enough reason to monitor often; however, the chosen locations usually carry high / intense traffic, and generally at least once per minute should be enough to start with. A granularity coarser than the minute might leave Engineers unaware for much longer that their systems are down in a particular area, and this is not convenient.
Another critical aspect to keep in mind is setting up the alerting based on the monitors, and, in particular, whether the monitoring system should create an alert each time a failure is generated. I'd say not really, and this is for a couple of reasons. One reason is a brief unavailability of the chosen tool in that particular location, making it unable to connect to your platform even though your users in that region aren't actually facing any trouble. This shouldn't create an alert, as it would spam the on-call engineers and, with other situations like this, they'll start challenging the robustness of the tool. A good practice is to trigger an alert only if a number of consecutive failures is identified for that location, or to set up a retry, as in the sketch below.
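Here is a minimal sketch of the consecutive-failures approach, assuming a one-minute check interval and a threshold of three failures in a row (both numbers are illustrative, not recommendations, and the check / alert callables are left abstract). Note that the worst-case detection time is roughly the interval times the threshold, so about three minutes here:

```python
import time
from typing import Callable

CHECK_INTERVAL_SECONDS = 60   # one check per minute per location
FAILURE_THRESHOLD = 3         # alert only after 3 consecutive failures

def monitor_location(location: str,
                     check: Callable[[str], bool],
                     alert: Callable[[str], None]) -> None:
    """Check a location forever; alert only on sustained failure."""
    consecutive_failures = 0
    while True:
        if check(location):
            consecutive_failures = 0      # any success resets the streak
        else:
            consecutive_failures += 1
            if consecutive_failures == FAILURE_THRESHOLD:
                alert(location)           # fire once, when the streak starts
        time.sleep(CHECK_INTERVAL_SECONDS)
```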
Speaking of failures, it's important to define what exactly should be considered one inside the system you're creating. All the tools come with a default understanding of what a failure is, and it's pretty standardized between them: basically errors in the HTTP request, timeouts, and everything you can imagine around this area. Of course, this can be personalized with things such as custom error codes, tighter timeout or latency thresholds, or even assertions on the response. All of this also depends on the SLAs and the general performance policy in your organization. For starters, going with the defaults in the beginning and learning from how the system behaves before going custom can be a game changer; a custom definition could later look like the sketch below.
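As a sketch of such a custom failure definition, assuming a 5-second timeout and a 2-second latency budget (both values invented for illustration) and using Python's requests library:

```python
import requests

TIMEOUT_SECONDS = 5.0   # hard cut-off for the whole request
LATENCY_BUDGET = 2.0    # anything slower also counts as a failure

def is_failure(url: str) -> bool:
    """Return True if a single check of `url` should count as a failure."""
    try:
        response = requests.get(url, timeout=TIMEOUT_SECONDS)
    except requests.RequestException:
        return True  # timeouts, DNS errors, connection resets, ...
    if response.status_code >= 400:
        return True  # HTTP-level errors
    if response.elapsed.total_seconds() > LATENCY_BUDGET:
        return True  # reachable, but too slow for our budget
    return False
```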
Another important aspect is figuring out what exactly should be monitored: an endpoint, a complete user journey, all of the exposed endpoints? To answer this question, one needs to think about the purpose of the worldwide monitoring system: checking the availability of the system from different parts of the globe. Modern tech organizations already have some testing in production going on that exercises complete or partial user journeys and checks API responses, so configuring another system in the same manner will not bring extra benefits. A good practice here is to keep it simple: if the goal is to check whether a particular part of the platform is accessible from different regions, then this becomes about figuring out the minimum that can be done to achieve that. Most of the time, a simple API call / ping does the trick.
And no, you don't need to monitor all of your endpoints. It might not be feasible to create monitors for each endpoint because of their quantity, their rapid change, and, of course, money. In a monolithic architecture, one monitor on your main URLs should really do the trick. In a microservices architecture, it's key to play an analysis game: understanding how the services are deployed, where, and what the connections between them are. This creates an understanding of the possible risks and allows you to reduce the quantity of monitors to a small yet powerful list, like the hypothetical one below, that will allow you to avoid downtime in different parts of the globe.
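To make that concrete, here is what such a reduced list could look like for a microservices setup. Every name and URL is hypothetical; the idea is the public edge plus the few services whose outage would break the core user journey:

```python
# A deliberately small monitor list: the public edge plus the few
# services whose unavailability would break the core user journey.
# All names and URLs below are hypothetical.
MONITORS = [
    {"name": "edge-gateway", "url": "https://api.example.com/health"},
    {"name": "auth",         "url": "https://api.example.com/auth/health"},
    {"name": "checkout",     "url": "https://api.example.com/checkout/health"},
]
```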
3. Clearing up the process
Rolling out a new system, especially in bigger organizations, requires time, alignment between multiple areas and, almost always, a change of process.
One critical aspect while talking about alerts, monitoring, on-call, and failures is to understand who should react first if there is an availability problem in a specific area of the world, and who should ultimately own the investigation and the fix. In an ideal DevOps environment, it would definitely be the team owning that particular part of the platform. The realities of different Engineering organizations make this question really depend on who has the access and knowledge to investigate and to make those changes. Here we can look at anyone from an SRE team, Ops, or a specific team dedicated to Incident Management, and the list can continue. Nevertheless, it's critical to have someone reacting if the platform is down in an important location.
Not interfering with other systems or with the activities of other areas in the company is key to the success of launching such an initiative, from both a process and a technical perspective. As these monitors will be generating intense traffic, aspects such as web analytics or security rules should be taken into account: one monitor per minute generates over half a million executions per year (60 × 24 × 365 = 525,600), and this multiplies by the number of monitors and locations. Flagging this traffic is probably the best way to move forward, and it can be done in various ways: allow-listing certain IPs, adding certain bypass headers, setting up user agents, parametrizing requests, and so on, as in the sketch below.
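As a minimal sketch of the header / user-agent approach, where the header name, user-agent string, and values are hypothetical placeholders you'd agree on with your Security and Analytics teams:

```python
import requests

# Hypothetical values agreed with Security / Analytics so that monitor
# traffic can be identified, allow-listed, and filtered out downstream.
MONITOR_HEADERS = {
    "User-Agent": "acme-worldwide-monitor/1.0",
    "X-Synthetic-Check": "true",  # lets analytics exclude this traffic
}

def run_check(url: str) -> bool:
    """Perform one clearly labelled availability check."""
    response = requests.get(url, headers=MONITOR_HEADERS, timeout=5)
    return response.ok
```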
When these monitors fail, they will start triggering alerts that will be yet-another-alert in the clutter the on-call engineer needs to go through. It's more than critical to make sure they are aware of these alerts and properly trained and equipped to fix them. Apart from this, integrating those monitors into the same alerting system and grouping their alerts together will help Engineers become fluent at tackling these situations as they appear.
Big or small, Engineering organizations expanding to new markets face a lot of technical challenges, and one of them is keeping their systems available in different parts of the globe. Defining the locations to monitor from, figuring out the technical details, and tackling some critical process changes are just some of the things that will enable a successful launch of a worldwide monitoring system. What would you make sure your Engineering Organization is handling while going global?