Custom Logs: What Events Should You Track?
It's not uncommon for developers to code their own logging modules into applications. We do this to compensate for the limitations of standard operating system logging, which doesn't always provide the level of monitoring an enterprise application needs. If you've decided that you need custom logs for your application, here are a few events that you don't want to skip.
Poor Performance and Load Times
An operating system will log events where the application crashes due to a timeout, but these logs only tell you when the application crashes, not when it's showing signs of performance degradation. As your user base grows and the application supports more traffic, your infrastructure and resources should scale. It's not uncommon for infrastructure to have poorly planned capacity, or to be set up so that it doesn't scale at the right triggers. These mistakes can lead to application timeouts.
With custom application logs, you can write events that give Ops a heads-up when the application is taking too long to load. You determine an acceptable time for pages to load, and then create events in a log that indicate the application is taking too long to process. For instance, if a page should take no more than five seconds to load, then you should write events to a log when load times reach three seconds, so that the warning arrives before the limit is reached. SRE can monitor these events and scale resources when the server needs more CPU power, memory, or bandwidth.
This eases stress and limits the impact on business revenue. It also helps you understand the application more deeply: as you track the metrics, often daily, you can fine-tune thresholds, time windows, the type of aggregate you watch (min, max, median, average), and your general runbook or automation strategies.
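The early-warning threshold described above can be sketched with Python's standard logging module. This is a minimal sketch, not a definitive implementation: the five-second and three-second values come from the example in the text, while the logger name and the `classify_load_time` and `timed_page` helpers are hypothetical names chosen for illustration.

```python
import logging
import time
from contextlib import contextmanager

logger = logging.getLogger("app.performance")  # hypothetical logger name

# Values from the example in the text: a five-second limit and a
# three-second early-warning threshold. Tune both for your application.
LIMIT_SECONDS = 5.0
WARN_SECONDS = 3.0


def classify_load_time(elapsed):
    """Return the logging level for a load time, or None if it is acceptable."""
    if elapsed >= LIMIT_SECONDS:
        return logging.ERROR
    if elapsed >= WARN_SECONDS:
        return logging.WARNING
    return None


@contextmanager
def timed_page(page_name):
    """Time the wrapped block and write a log event when it is slow."""
    start = time.monotonic()
    try:
        yield
    finally:
        elapsed = time.monotonic() - start
        level = classify_load_time(elapsed)
        if level is not None:
            logger.log(
                level,
                "slow page load: page=%s elapsed=%.3fs",
                page_name,
                elapsed,
            )
```

A request handler would wrap its work in `with timed_page("/checkout"):` so that fast requests log nothing, and slow ones produce a warning or error event that monitoring can count and alert on.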
Failed Events During Checkout and Sales
Most application logs capture pages that crash, but the most important of these pages are the ones that guide a user through the buying process. It could be a page where a user contacts your sales team, or it could be one that takes a user through an e-commerce store purchase. When one of these pages crashes, you lose revenue. One way to salvage a lost sale is to log events and capture the data that would otherwise be lost. With the captured information, you can then contact the buyer and complete the purchase.
Go to any popular e-commerce store, and you'll see that the shopping cart process is usually cut up into sections where the user enters some information and clicks “Next” or “Continue” to go to the next step in the process. After each step, information is stored in the database. Even with this information, you still don't know why the application failed or at what step the user could no longer complete a purchase. Logging failed sales events lets you review information from a client and identify where the sales page failed. It could be from a user-generated error such as an expired credit card, but it could also be from a bug in your application. For instance, you could have a bug in the payment module in your application that rejects a legitimate credit card. Operating system logs won't record this type of error, but your custom application logs can help you identify these logic errors.
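As a rough illustration of logging failed sales events by step, the sketch below writes one JSON event per failure. The step names, field names, and the `build_checkout_failure` and `log_checkout_failure` helpers are all assumptions made for this example; substitute the steps and identifiers your own checkout flow uses.

```python
import json
import logging

logger = logging.getLogger("app.checkout")  # hypothetical logger name

# Hypothetical step names; use the steps of your own checkout flow.
CHECKOUT_STEPS = ("cart", "shipping", "payment", "confirmation")


def build_checkout_failure(step, session_id, reason, customer_id=None):
    """Build a failure event that records where in the flow the sale stopped."""
    if step not in CHECKOUT_STEPS:
        raise ValueError(f"unknown checkout step: {step}")
    # Card numbers are deliberately left out: the event only needs enough
    # detail to locate the failure and follow up with the buyer.
    return {
        "event": "checkout_failed",
        "step": step,
        "session_id": session_id,
        "customer_id": customer_id,
        "reason": reason,
    }


def log_checkout_failure(step, session_id, reason, customer_id=None):
    """Write the failure event to the log as a single JSON line."""
    event = build_checkout_failure(step, session_id, reason, customer_id)
    logger.error(json.dumps(event))
    return event
```

Because the reason is recorded next to the step, a spike of events with the same reason at the payment step points toward an application bug, while scattered declines look more like ordinary user errors.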
Cyber Security and Suspicious Events
Any application open to the public Internet should have heavy monitoring for security events. Logging suspicious events is one of the ways you can identify when a cybersecurity attack is occurring. You should have other types of security on the server, on the network, and coded into the application, but logged events can give you clues about an ongoing attack. They can also help you with forensics after a critical data breach.
The most difficult part of this type of logging is that identifying attacks from within the application can lead to many false positives. Additionally, developers must know what constitutes a suspicious action and what is legitimate traffic. Developers are not usually cybersecurity experts, so it takes some collaboration between Dev, Ops, and SecEng to get these logs right.
As a simple example, one suspicious event is a user who keeps entering an incorrect password. Some users simply forget their password or need a few attempts before they remember the right one. You could log the number of times a user enters an incorrect password before they are successful. If the number of attempts reaches a certain threshold, the application could alert administrators.
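A minimal sketch of that counter follows. The threshold of five attempts and the `FailedLoginCounter` class are assumptions for illustration, and the in-memory dictionary stands in for whatever shared store a real deployment would use.

```python
import logging
from collections import defaultdict

logger = logging.getLogger("app.security")  # hypothetical logger name

MAX_FAILED_ATTEMPTS = 5  # assumed threshold; tune to your own traffic


class FailedLoginCounter:
    """Track consecutive failed logins per user (in memory, for illustration)."""

    def __init__(self, threshold=MAX_FAILED_ATTEMPTS):
        self.threshold = threshold
        self._failures = defaultdict(int)

    def record_failure(self, username):
        """Count a failed attempt; return True once the threshold is reached."""
        self._failures[username] += 1
        count = self._failures[username]
        logger.info("failed login: user=%s failures=%d", username, count)
        if count >= self.threshold:
            logger.warning(
                "failed login threshold reached: user=%s failures=%d",
                username,
                count,
            )
            return True
        return False

    def record_success(self, username):
        """Log how many failures preceded the success, then reset the count."""
        previous = self._failures.pop(username, 0)
        if previous:
            logger.info(
                "login succeeded after %d failed attempts: user=%s",
                previous,
                username,
            )
        return previous
```

The warning-level event is the one an alerting rule would watch; the info-level events keep the history that makes the alert explainable afterward.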
Logging single events is a good way to identify possible attacks, but many attackers launch attacks that go after numerous users at one time. A better way to identify attacks is logging events that could indicate a bot or program is being used to log into user accounts. These account takeover (ATO) tools take a list of users, their credit card numbers, and passwords, and log into accounts continuously to test their viability. You can code this type of detection by understanding the way bots work.
One way to detect these tools is to log failed login events that have only a few milliseconds between attempts. Another way is to log multiple failed credit card purchases of a specific product in a short amount of time. A third way is to log the user agent along with the failed login attempt. Simple bots often do not send a user agent string to the web application. These types of events don't match common user behavior patterns, so they are useful for forensics or for detecting a possible ongoing data breach.
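Two of these signals, rapid attempts and a missing user agent, can be checked with a few lines of code. The sketch below is illustrative only: the half-second interval is an assumed cutoff, and `suspicious_signals` and `log_failed_login` are hypothetical helper names.

```python
import logging

logger = logging.getLogger("app.security")  # hypothetical logger name

# Assumed cutoff: consecutive failures closer together than this are
# treated as too fast for a person typing a password.
MIN_HUMAN_INTERVAL_SECONDS = 0.5


def suspicious_signals(timestamps, user_agent):
    """Return the reasons a run of failed logins from one source looks automated.

    timestamps: failure times in seconds, in ascending order.
    user_agent: the user agent string sent with the request, or None.
    """
    reasons = []
    if not user_agent or not user_agent.strip():
        reasons.append("missing_user_agent")
    gaps = [later - earlier for earlier, later in zip(timestamps, timestamps[1:])]
    if gaps and min(gaps) < MIN_HUMAN_INTERVAL_SECONDS:
        reasons.append("rapid_attempts")
    return reasons


def log_failed_login(ip, timestamps, user_agent):
    """Write a warning event when a failed login shows bot-like signals."""
    reasons = suspicious_signals(timestamps, user_agent)
    if reasons:
        logger.warning(
            "suspicious failed logins: ip=%s reasons=%s attempts=%d",
            ip,
            ",".join(reasons),
            len(timestamps),
        )
    return reasons
```

Neither signal is proof of an attack on its own, which is why the function returns reasons for the log rather than blocking the request.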
What Should a Logged Event Include?
After you identify the events that you want to log, you need the right information in your log files for them to become effective forensic tools. You can log events to any location: a database, text files (make sure that they are secure), the Windows Event Log (which you browse with Event Viewer), or one of the many popular third-party logging tools available today (Graylog is one of my favorite open source logging tools). The following data points should be the baseline for your logs:
• IP address
• Date and time of event
• The application name (if you have multiple applications)
• The error returned by the application such as a critical failure or a simple warning
• Raw exception data from the operating system
• Customer number if you have it
• Session ID, if you can capture it (great for tracing across systems and microservices)
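One way to carry these baseline fields is a single JSON object per event. The sketch below assumes that format; the field names and the `build_event` and `log_event` helpers are hypothetical, and `repr` of the exception stands in for whatever raw exception data your platform exposes.

```python
import json
import logging
from datetime import datetime, timezone

logger = logging.getLogger("app.events")  # hypothetical logger name

# Map the severity stored in the event to a standard logging level.
LEVELS = {
    "WARNING": logging.WARNING,
    "ERROR": logging.ERROR,
    "CRITICAL": logging.CRITICAL,
}


def build_event(app_name, ip_address, level, message,
                exception=None, customer_id=None, session_id=None):
    """Assemble the baseline data points into one dictionary."""
    return {
        "timestamp": datetime.now(timezone.utc).isoformat(),  # date and time (UTC)
        "app": app_name,
        "ip": ip_address,
        "level": level,            # e.g. "WARNING" or "CRITICAL"
        "message": message,        # the error returned by the application
        "exception": repr(exception) if exception is not None else None,
        "customer_id": customer_id,
        "session_id": session_id,
    }


def log_event(**fields):
    """Build the event and write it to the log as a single JSON line."""
    event = build_event(**fields)
    logger.log(LEVELS.get(event["level"], logging.INFO), json.dumps(event))
    return event
```

Keeping every event in the same shape makes the logs easier to search once they are aggregated, and the session ID lets you follow one request across services.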
After you create your logs, you should aggregate them in a shared location. Remember that log files can themselves become a source of data breaches, so they should be highly secured and restricted to administrators only. You can also use a third-party service to stream your logs to a SaaS system that helps you search them for specific events. Once you set up custom logs, you'll find that you can much more effectively identify security events, salvage lost sales, and debug your system before application errors cause critical downtime.