Komu W - Logging without losing money or context.

Logging without losing money or context. (26 May 2020)

Problem
Proposed Solution
Conclusion

Problem
In your application, you want all flows of execution to be covered with log statements. That way, if something were to go wrong, you can be able to trace at which point the flow broke at.
However, this presents a problem; if your application has lots of traffic, then the amount of logs it generates are going to be gigantic. This in itself is not a problem, were it not for the fact that you are using a logging as a service provider and they all charge an arm and a leg for every log event.
Some of the pricing models are so inscrutable, that the logging service providers offer calculators ( here, here, etc) to try and help you figure out what should ideally have been kindergaten arithmetic.

So it seems like for you to have your cake and eat it too, you have to part with large sums of money every month(and due to elastic pricing you can't even tell in advance how much that will be.)
There are two main ways that people try and solve this problem(usually at the suggestion of the said logging service providers);

Filtering logs by severity and only sending logs above a certain severity to the logging service provider.

Sampling logs so that you only send a certain percentage to the logging service provider.

But these two solutions pose a problem; loss of context.
Consider an application that updates multiple social media platforms with a status message.


func main() {
    updateSocialMedia("Sending out my first social media status message")
}

func updateSocialMedia(msg string) {
	traceID := "sa225Hqk" //should be randomly generated per call
	logger := logrus.WithFields(logrus.Fields{"traceID": traceID})

	tweet(msg, logger)
	facebook(msg, logger)
	linkedin(msg, logger)
}

func tweet(msg string, logger *logrus.Entry) {
	logger.Info("tweet send start")
	// code to call twitter API goes here
	logger.Info("tweet send end.")
}

func facebook(msg string, logger *logrus.Entry) {
	logger.Info("facebook send start")
	err := facebookAPI(msg)
	if err != nil {
		logger.Errorf("facebook send failed. error=%s", err)
	}
	logger.Info("facebook send end.")
}

func linkedin(msg string, logger *logrus.Entry) {
	logger.Info("linkedin send start")
	// code to call linkedin API goes here
	logger.Info("linkedin send end.")
}

If we were filtering logs and only sending logs of ERROR level to our logging service provider, then we would lose context on how the facebook send failed error came to be. I had previosuly written that logs are primarily used to help debug issues in production; thus, context and chronology of events that led to a particular issue are of importance. You do not want to investigate a murder mystery where half the clues have been deliberately wiped out by your earlier self.
In the same way, if we were sampling logs; the chronology leading upto the error would be missing a few INFO log statements since those would have been sampled out.

What we want is;
If the application has not emitted any errors, no logs gets sent to our logging service provider(INFO or otherwise.) However, if there are any errors emitted; all the logs leading to that error irrespective of their severity/level are sent to the logging service.

Proposed Solution
I think we should be able to implement such a logging scheme. The basic idea is; whenever your application emits log events, all the logs go into a circular buffer of size X. Whenever an error log is emitted, the whole circular buffer is flushed and all its contents are sent to the logging service provider.
The circular buffer can be in memory or on disk/whatever and the size is configurable.

I took a stub at implementing this using sirupsen/logrus which is a popular logging library for the Go programming language, but the implementation should be transferable across libraries/languages.
In sirupsen/logrus, you can declare a hook implementing the custom behaviour that you want.


package main

import (
    "io"

    "github.com/sirupsen/logrus"
)

// hook to buffer logs and only send at right severity.
type hook struct {
    writer io.Writer

    // Note: in production, lineBuffer should use a circular buffer instead of a slice.
    // otherwise you may have unbounded memory growth.
    // we are just using a slice of []bytes here for brevity and blogging purposes.
    lineBuffer [][]byte
}

// Fire will append all logs to a circular buffer and only 'flush'
// them when a log of sufficient severity(ERROR) is emitted.
func (h *hook) Fire(entry *logrus.Entry) error {
    line, err := entry.Bytes()
    if err != nil {
        return err
    }
    h.lineBuffer = append(h.lineBuffer, line)

    // if the current log event is of level ERROR  or a higher severity,
    // then flush the buffer; thus emitting all the logs.
    if entry.Level <= logrus.ErrorLevel {
        var writeError error
        for _, line := range h.lineBuffer {
            _, writeError = h.writer.Write(line)
        }
        h.lineBuffer = nil // clear the buffer
        return writeError
    }

    return nil
}

// Levels define on which log levels this hook would trigger
func (h *hook) Levels() []logrus.Level {
    return logrus.AllLevels
}

And the way to use it in your application is;


package main

import (
    "errors"
    "io/ioutil"
    "math/rand"
    "os"
    "time"

    "github.com/sirupsen/logrus"
)

func main() {
    // send logs to nowhere by default
    logrus.SetOutput(ioutil.Discard)
    logrus.SetFormatter(&logrus.JSONFormatter{})
    
    // Use our custom hook that will append logs to a circular buffer
    // and ONLY flush them to stderr when errors occur.
    logrus.AddHook(&hook{writer: os.Stderr})

    updateSocialMedia("Sending out my first social media status message")
}

Now, if any error occurs; all the logs and chronology leading upto the error are available and are sent to the logging service provider.


go run .

{"level":"info","msg":"tweet send start","time":"2020-05-25T21:03:36+03:00","traceID":"sa225Hqk"}
{"level":"info","msg":"tweet send end.","time":"2020-05-25T21:03:36+03:00","traceID":"sa225Hqk"}
{"level":"info","msg":"facebook send start","time":"2020-05-25T21:03:36+03:00","traceID":"sa225Hqk"}
{"level":"error","msg":"facebook send failed. error=http 500","time":"2020-05-25T21:03:36+03:00","traceID":"sa225Hqk"}

Conversely, if there are no errors been emitted by your application then no logs are sent to your logging service provider.


go run .

# no log output

So you do not end up spending tons of money and you also do not lose context when errors occur.

Conclusion:
You can implement a logging strategy that loses you neither money nor context.
Indeed, this is how logging in the Apollo Guidance Computer worked:

Another function served by these auxiliary erasable memories was called the "Coroner,"" which recorded every instruction executed, with its inputs and results, writing over the oldest such record when it filled up. When a program "crashed," you could punch out a full record of what it was doing in most of its last second and analyze the problem at your ease. I have often wished that PCs offered such an advanced feature. - Annotations to Eldon Hall's Journey to the Moon. Chapter 13, page 155

A downside of the presented solution is that it can be hard to tell if there are no logs because the application has not emitted any errors or because the logging pipeline itself has a bug. However this can be solved by emitting a heartbeat log event every Y minutes and letting this heartbeat propagate upto the logging service provider.

All the code in this blogpost, including the full source code, can be found at: https://github.com/komuw/komu.engineer/tree/master/blogs/09

You can comment on this article by clicking here.