On actionable and actually useful logs

Responding to responses about my earlier thoughts on overlogging

A few months ago, I responded to a Twitter thread on application logging. My response was simple and as follows:

There was also a caveat I made about scale. Maybe for a product serving 1,000 requests per day, it’s fine to log everything, stack traces included. Maybe not. But if you have ever worked on something at scale, you’d have figured out how expensive logging can get. I will give two quick examples in the next few paragraphs.

Before starting Fluidcoins, I contracted for a company in Europe. The task was to rewrite a ridiculously slow PHP app in Go. We were working ~14 hours/day. When it was time to launch, the lead devops engineer was away for family reasons, so the junior devops guy had to step up and push this to production. To keep the story short, the K8s manifest didn’t include the LOG_LEVEL env value because he assumed the app would default to error mode when it was unset. It defaulted to trace/debug mode instead. It was time to rest and monitor for bugs and all that stuff, but around 72 hours later the devops team started getting notifications that they were shooting past their monthly cloud budget. Trace/debug mode had put an extra $5K on the cloud bill because production didn’t set the log level to error only.

The second example is from my first software job, in the summer of 2017. It was a very small team and I was also functioning as the devops guy. After a few months of developing a new product for a client, I deployed it on a small DigitalOcean droplet. The setup was extremely simple, just enough for the live demo we were preparing for the client.

My boss was a technical person, so he decided to load test it. To give context on what I built, we needed to consistently generate 100-250 unique images per second and build a unique QR code on each image, amongst a bunch of other stuff. At peak load, we could expect around 750. My boss fired up Hey, and the next thing I knew I got a phone call at 2am: “This is ridiculously slow. What new changes did you push to production? We have tested over and over again and we met the client’s specifications earlier.”

I went into panic mode. I ssh’d into the server and restarted the app with pprof turned on. The logging path was taking a lot of IO to get its work done. We were logging to disk at the time, but the crux of the matter was that we were logging so much that, over a long period of time, the entire machine would lock up whenever buffered writes were flushed to disk. Swap memory would (occasionally) then get used up completely.

You can go ahead and say both of these issues were caused by trace/debug logging in production, which in turn was caused by misconfiguration and by me not knowing any better at my first job. But the same scenario plays out if you log irrelevant data. Logging HTTP route access logs and status codes per request in prod? Logging irrelevant data that is utterly useless if you were to debug any issue?

At the end of the day, it is very easy to think you can log “User just got here”, “User wants to sign up”, “User just signed up”, “entered cache function”, “exited cache function” in production if you have never had to pay for Grafana Cloud/Splunk or you have never been tasked with cutting a 5-digit monthly cloud bill by 15-20%.

How I log

I don’t claim to have all the answers but I push to make all log lines boil down to answering a few questions:

Extra things I look at:

Final words

Software engineering has seen a whole lot of advancements. Use the right tool for the job. You need metrics? Prometheus. You need to visualize these metrics? Grafana. You want tracing across multiple services? OpenTelemetry + SigNoz is what you need. These are all different tools you need to wield well. Metrics are cheap to store. Instrumentation is relatively cheap too. How about considering those to find problems upfront most of the time?

You don’t want to shove everything into your application logs and make a mess of them. You want them to be consistent, easy to filter and grep, extremely detailed and actionable, with a high signal-to-noise ratio.
