amcneil36

Monitoring

Motivation

When you introduce a new feature, you test it and see that it works on your phone. Eventually the feature is rolled out to some users. How do you know whether it is working for these users or not? You could just sit and wait to see whether any bug reports come in or whether anyone complains about the feature on Twitter. But you don’t want to have to wait for a user to report an issue. Often, a user will just be frustrated by the bug and close the app without reporting it. Or maybe the user sees the bug but doesn’t realize it’s a bug. “Maybe this is just how they intended it to work,” the user says to themselves. So we need a way to quickly and easily determine whether our feature is working, and a way to be informed immediately if there is any indication that it is not.

Monitoring overview

This is where monitoring comes in. You want monitoring that you can look at to tell whether a feature is working. Suppose you introduced a feature that displays a table on the user’s screen along with a form that the user can submit. You may have logging in place that records whenever a user submits the form. When releasing this feature to some users, you can then monitor the logging to validate whether the feature is working as intended. So you look at the logging and see that the form is submitted 5M times per day. Does that mean the feature is working? Not necessarily. A high submission count on its own doesn’t tell you how many users should have been able to submit the form but couldn’t. It often takes logging in several different locations to validate whether a feature is working. For example, you might also log when the table is displayed, and then compare the number of table displays to the number of loads of the page that is supposed to contain the table. You may also have logging that checks the data inserted into the DB when the form is submitted, to confirm that the form was submitted properly.
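The comparison between log locations described above can be sketched as a simple funnel calculation. This is only an illustration: the event names (page_loaded, table_displayed, form_submitted) and the funnel_ratios helper are hypothetical, and a real system would read these counts from whatever logging tool it uses instead of an in-memory list.

```python
from collections import Counter

# Hypothetical event names; a real app would use whatever its logging emits.
PAGE_LOADED = "page_loaded"
TABLE_DISPLAYED = "table_displayed"
FORM_SUBMITTED = "form_submitted"


def funnel_ratios(events):
    """Compare log counts from different locations in the feature."""
    counts = Counter(events)
    page_loads = counts[PAGE_LOADED]
    displays = counts[TABLE_DISPLAYED]
    submits = counts[FORM_SUBMITTED]
    return {
        # Share of page loads on which the table was actually displayed.
        "display_rate": displays / page_loads if page_loads else 0.0,
        # Share of table displays that led to a form submission.
        "submit_rate": submits / displays if displays else 0.0,
    }


# Made-up sample data: 100 page loads, 60 table displays, 30 submissions.
events = [PAGE_LOADED] * 100 + [TABLE_DISPLAYED] * 60 + [FORM_SUBMITTED] * 30
ratios = funnel_ratios(events)
print(ratios)
```

A display rate well below 1.0 would suggest that the table is failing to show up on some page loads, even if the raw submission count looks healthy.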

Additional monitoring recommendations

Often, you have logging that tells you how many times a certain line of code is hit. There may be logging in several different places, and a huge dropoff in the number of times a log line is hit in one location may indicate that something is going wrong, such as a component not loading for some users in certain circumstances. For instance, crashes could be preventing a component from loading. How can you become aware of these crashes and their potential root cause? You usually want code to throw an exception when something goes wrong. The exception is then caught, logged, and handled at some location in the code. The exception message and its stack trace can be logged so that a software engineer can troubleshoot. Each company will usually have some tooling that all of its exceptions are logged to, so that an engineer can search for exceptions by exception type, by a file name included in the exception, etc. In addition, there are typically views available where you can, for example, see how many exceptions have been thrown at a certain time. Maybe an exception started getting thrown on August 2nd, so you start looking at the commit history for code changes made around August 2nd, or you check whether any A/B tests started around August 2nd.
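As a minimal sketch of the catch, log, and handle pattern described above, the example below uses Python's standard logging module. The logger name, the load_table_component and render_page functions, and the in-memory log stream are all assumptions made so the example is self-contained; a real application would send these logs to its company's exception tooling.

```python
import io
import logging

# Log to an in-memory stream so the example is self-contained (an assumption;
# a real service would ship logs to its exception-tracking tooling).
log_stream = io.StringIO()
logger = logging.getLogger("feature_monitoring")  # hypothetical logger name
logger.setLevel(logging.INFO)
logger.propagate = False
logger.addHandler(logging.StreamHandler(log_stream))


def load_table_component(rows):
    """Hypothetical component loader that throws when its data is missing."""
    if rows is None:
        raise ValueError("table rows are missing")
    return len(rows)


def render_page(rows):
    try:
        return load_table_component(rows)
    except ValueError:
        # logger.exception logs at ERROR level and appends the stack trace.
        logger.exception("failed to load table component")
        return 0  # handled: fall back to rendering an empty table


result = render_page(None)
logged = log_stream.getvalue()
print(logged)
```

The logged text contains the message, the exception type, and the stack trace, which is the information an engineer would later search for when troubleshooting.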

This monitoring isn’t only useful when you are testing out a new feature but also when you are watching over existing features. Unfortunately, existing features can break at any time, and it may not even be due to a code change. One pitfall is that logging may be in place for monitoring features, but nobody is regularly looking at it. Suppose your logging indicates that a feature of yours stopped working one week ago, but you aren’t aware of that because you haven’t looked at the logs recently. You could check all of the logs manually each morning to make sure everything looks okay, but that is tedious.

What you want instead is a way for your monitoring framework to ‘alert’ you or a specific person if something goes wrong. This alert may come in the form of an email, a text message, a push notification, or another form depending on the severity. This way, you don’t have to check your logging periodically. You can simply go about your other duties and react to an issue when you receive an alert. You also want some way to configure thresholds for alerting. For example, you may not need to be notified about a particular exception happening 3 times over a 24-hour period. But you may want to be notified if a particular exception occurs 5.3M times over a 24-hour period, or if an error occurs 2M times over a five-minute period. There’s a judgment call to be made about what the alerting thresholds should be, based on how bad the impact is when something goes wrong.
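One way to picture a configurable alerting threshold is a sliding time window over error timestamps. The sketch below is an assumption-laden illustration, not how any particular monitoring framework works: the ThresholdAlert class is hypothetical, the notify callback stands in for an email, text message, or push notification, and timestamps are passed in explicitly so the example is deterministic.

```python
from collections import deque


class ThresholdAlert:
    """Notify when more than `threshold` errors occur within `window_seconds`."""

    def __init__(self, threshold, window_seconds, notify):
        self.threshold = threshold
        self.window_seconds = window_seconds
        self.notify = notify  # stand-in for email, text, or push notification
        self.timestamps = deque()
        self.alerting = False

    def record_error(self, now):
        self.timestamps.append(now)
        # Drop errors that have fallen out of the time window.
        while self.timestamps and self.timestamps[0] <= now - self.window_seconds:
            self.timestamps.popleft()
        over = len(self.timestamps) > self.threshold
        # Notify only when crossing the threshold, not on every later error.
        if over and not self.alerting:
            self.notify(
                f"{len(self.timestamps)} errors in the last {self.window_seconds}s"
            )
        self.alerting = over


sent = []
alert = ThresholdAlert(threshold=3, window_seconds=300, notify=sent.append)
for t in [0, 10, 20]:
    alert.record_error(t)  # three errors: at the threshold, no alert yet
alert.record_error(30)     # fourth error inside the window triggers one alert
alert.record_error(40)     # still over the threshold, no duplicate alert
print(sent)
```

The threshold and window are the knobs to tune: a small number of errors per day stays silent, while a burst within a few minutes produces a single notification.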