
Hunting production bugs V2.0


Back in 2018, things were a little different. AI wasn't that big of a thing, and good old Stack Overflow was still your assistant.

I had to ask why things happen the way they do. In programming, of course; otherwise, you'd get harassed.

In the original article I wrote in 2018, I told three stories—only one of them is still considered valid today.

"It's always in the logs."

Observability

Let's be real—do you actually have an application if you don't know what's happening? And, do you really know what's happening without continuously observing and checking whether the application is running as intended?

I once read:

If you don't have observability, you don't have an app.

How would you know if something is broken?
Wait until a user tells you?
Then you lose credibility.

Wait until the system is down?
You lose accountability every minute it's down, especially if it's critical, like a financial app.

What's observability?

It rests on a few pillars that give you insight into how your application behaves and how the underlying system behaves.

Mostly, those pillars are logs and metrics.

You'll also hear people talk about other telemetry data, such as traces or error tracking. These are part of observability, just not its pillars.

Let's talk about Logs and Metrics and why they matter.


Metrics

💡
I will ELI5 for simplicity; please don't be offended.

When you develop an app, you usually want to host it somewhere central, accessible through the internet 24/7. That "somewhere" is just a computer, usually a powerful one. We call it a server.

This server will host your application for its lifetime—maybe until the end of the internet.

This computer has a disk, memory, processor, etc. These components are utilized and stressed with every request that hits your application.

Take the following diagram as an example:

                     +-----------------------+
                     |     User Request      |
                     |    (Browser / App)    |
                     +----------+------------+
                                |
                                v
                    +-----------+-----------+
                    |       Web Server      |
                    |     (Your App)        |
                    +-----------+-----------+
                                |
            +------------------+------------------+
            |                                     |
            v                                     v
+------------------------+        +---------------------------+
|   Filesystem Access    |        |     Database Querying     |
|  (e.g., static images) |        |  (Read/Write operations)  |
+------------------------+        +---------------------------+
            \                                     /
             \                                   /
              \                                 /
               \                               /
                v                             v
                    +-----------------------+
                    |     Data Processing   |
                    |  (App Logic, CPU use) |
                    +-----------------------+

In this diagram, a request hits the server. The app fetches files, such as images, from the filesystem, reads data from a database hosted on the same machine, and then processes that data.

All of these mechanics require constant observation:
- Is the disk getting full?
- Can the processor handle the traffic?
- Is the database down?

Just like a car dashboard tells you the gas tank is low or the tires need air, your server needs that kind of visibility.

These are the system metrics in this simple system. You store them in a time-series database like Prometheus and act on them. For instance, set an alert when the filesystem is almost full, the CPU is maxing out, or when the database is down.
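To make the dashboard idea concrete, here is a minimal sketch of a check that reads two system metrics and decides whether to raise an alert. It uses only the Python standard library. The thresholds and the function names (`disk_used_ratio`, `evaluate`, `collect`) are assumptions made up for illustration. In a real setup you would more likely let Prometheus scrape an exporter and define alert rules there instead of hand-rolling this.

```python
import os
import shutil

# Hypothetical thresholds, chosen for illustration only.
DISK_ALERT_RATIO = 0.90    # alert when the filesystem is 90% full
LOAD_ALERT_PER_CPU = 1.0   # alert when the 1-minute load reaches one per core


def disk_used_ratio(path="/"):
    """Return the fraction of the filesystem at `path` that is in use."""
    usage = shutil.disk_usage(path)
    return usage.used / usage.total


def evaluate(disk_ratio, load_per_cpu):
    """Return a list of alert messages for the given readings."""
    alerts = []
    if disk_ratio >= DISK_ALERT_RATIO:
        alerts.append(f"disk almost full: {disk_ratio:.0%} used")
    if load_per_cpu >= LOAD_ALERT_PER_CPU:
        alerts.append(f"CPU saturated: load per core {load_per_cpu:.2f}")
    return alerts


def collect():
    """Take one snapshot of the host (os.getloadavg is Unix-only)."""
    cores = os.cpu_count() or 1
    load_1m = os.getloadavg()[0]
    return disk_used_ratio("/"), load_1m / cores


if __name__ == "__main__":
    for message in evaluate(*collect()):
        print("ALERT:", message)
```

Keeping `evaluate` separate from `collect` means the alert logic can be exercised with made-up readings, without needing a server that is actually in trouble.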

Application Metrics

You also have application metrics—these are emitted from your app itself to help you know what's happening inside.
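As a rough illustration, an application metric can be as small as a counter your app bumps on every request. The sketch below is not a real client library. `RequestMetrics` and the metric name `app_requests_total` are hypothetical, and the output only imitates the plain-text format that Prometheus scrapes.

```python
from collections import Counter


class RequestMetrics:
    """In-memory counter of handled requests, keyed by route and status."""

    def __init__(self):
        self._counts = Counter()

    def observe(self, route, status):
        """Record one handled request."""
        self._counts[(route, status)] += 1

    def render(self):
        """Render the counters as Prometheus-style text lines."""
        lines = [
            "# HELP app_requests_total Requests handled by the app.",
            "# TYPE app_requests_total counter",
        ]
        for (route, status), count in sorted(self._counts.items()):
            lines.append(
                f'app_requests_total{{route="{route}",status="{status}"}} {count}'
            )
        return "\n".join(lines) + "\n"


if __name__ == "__main__":
    metrics = RequestMetrics()
    metrics.observe("/login", 200)
    metrics.observe("/login", 200)
    metrics.observe("/login", 500)
    print(metrics.render(), end="")
```

The app would serve the rendered text on an HTTP endpoint, and the metrics system would collect it at each interval.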

You might think, "But we have logs?"

What about logs?

Well, logs are events. They tell you in detail how your system behaves with each request. Metrics are snapshots, collected periodically to give you an overview of performance and health.
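One way to picture the difference: every request produces its own detailed log line, while a metric sample collapses many requests into a few numbers. The sketch below assumes a hypothetical JSON log format and the made-up helpers `request_event` and `snapshot`. Real apps usually emit metrics directly instead of deriving them from logs, so treat this purely as an illustration of events versus snapshots.

```python
import json
import time


def request_event(route, status, duration_ms, now=None):
    """Build one log line: a detailed record of a single request."""
    return json.dumps({
        "ts": time.time() if now is None else now,
        "route": route,
        "status": status,
        "duration_ms": duration_ms,
    })


def snapshot(events):
    """Collapse many log events into one periodic metric sample."""
    parsed = [json.loads(line) for line in events]
    errors = sum(1 for event in parsed if event["status"] >= 500)
    return {"requests": len(parsed), "errors": errors}


if __name__ == "__main__":
    events = [
        request_event("/checkout", 200, 12),
        request_event("/checkout", 500, 30),
    ]
    print(events[1])         # the full story of one failing request
    print(snapshot(events))  # the overview: how many requests, how many errors
```

The log line can tell you which request failed and how long it took. The snapshot only tells you that something failed, which is usually enough to trigger an alert and send you to the logs.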

Logs are expensive. Very expensive.
Metrics are not that expensive.

Why? Because of their nature. Logs are text records of every system event and request. You can store, index, and search them with systems like Elasticsearch, but you pay a lot to keep them.

That's why you usually have a retention policy: you keep logs for a set period, say 3 to 6 months, then delete them.
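A retention policy can be sketched as "delete whatever is older than the cutoff". The example below assumes logs sit as `*.log` files in one directory and uses a hypothetical 90-day window. Log platforms such as Elasticsearch ship their own lifecycle features for this, so this is only a picture of the idea, not a recommendation to manage retention by hand.

```python
import time
from pathlib import Path

RETENTION_DAYS = 90          # roughly 3 months; a hypothetical value
SECONDS_PER_DAY = 24 * 60 * 60


def expired_logs(log_dir, retention_days=RETENTION_DAYS, now=None):
    """Return the *.log files last modified before the retention cutoff."""
    now = time.time() if now is None else now
    cutoff = now - retention_days * SECONDS_PER_DAY
    return sorted(
        path for path in Path(log_dir).glob("*.log")
        if path.stat().st_mtime < cutoff
    )


def purge(log_dir, retention_days=RETENTION_DAYS, now=None):
    """Delete expired log files and return the paths that were removed."""
    removed = expired_logs(log_dir, retention_days, now)
    for path in removed:
        path.unlink()
    return removed
```

Running `purge` on a schedule (once a day, for example) keeps storage costs bounded by the retention window instead of growing forever.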

Metrics have a smaller footprint. They're stored in time-series databases at regular intervals—say, every 15 seconds.
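A quick back-of-the-envelope calculation shows why the footprint differs. The traffic and series counts below are made-up numbers, and the helper names are hypothetical. The point is only that metric volume depends on the scrape interval, while log volume grows with traffic.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def metric_samples_per_day(series, scrape_interval_s=15):
    """Samples stored per day: fixed by the interval, not by traffic."""
    return series * (SECONDS_PER_DAY // scrape_interval_s)


def log_lines_per_day(requests_per_second, lines_per_request=1):
    """Log lines written per day: grows with every request served."""
    return requests_per_second * lines_per_request * SECONDS_PER_DAY


if __name__ == "__main__":
    # One metric scraped every 15 seconds.
    print(metric_samples_per_day(1))      # 5760
    # A hypothetical app exposing 100 metrics.
    print(metric_samples_per_day(100))    # 576000
    # The same app serving a hypothetical 100 requests per second.
    print(log_lines_per_day(100))         # 8640000
```

If traffic doubles, the log count doubles with it, but the metric count stays the same. A metric sample is also typically much smaller than a log line.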

You might keep them for a day, a month, or even 6 months. They're cheaper, but they also lose relevance over time. So you remove them regularly.

Logs stay relevant much longer, but they're costly.

Coming up

In the upcoming series, we'll talk about how to set up metrics, logs, and traces.

How do you scale them, make them highly available, and wire your apps to them?

When they're important—and when they're not.


Last Update: May 19, 2025

Author

Ahmad 3 Articles

👋 I’m Ahmad, an SRE & Programmer with 10+ years of experience scaling reliable, cost-efficient infrastructure. I specialize in automation, observability, and delivering practical, lasting solutions.
