An outage on a Tuesday afternoon used to mean a frantic Slack message from a customer, then a scramble to figure out what broke, then a post-mortem that mostly concluded "we need better monitoring." The loop repeats. Real-time infrastructure monitoring breaks this loop: problems surface before customers notice them, on-call engineers have context when they are paged, and root-cause analysis takes minutes instead of hours.

This guide covers what to instrument, how to build alerts that people actually act on, and how to move from reactive firefighting to proactive operations.

The Cost of Not Monitoring

Before covering how to do it, it is worth naming why it matters beyond the obvious.

Downtime cost: Industry estimates for mid-market SaaS put the cost of unplanned downtime at $5,000–$10,000 per hour. For enterprise segments, the number can be orders of magnitude higher. Even for smaller products, every hour of downtime is customer trust spent.

MTTR vs. MTTD: Mean time to recovery (MTTR) is measured from when a problem starts, not from when it is detected. An outage that is detected in 2 minutes and resolved 8 minutes later (10 minutes of impact) is a much better outcome than one detected after 40 minutes and resolved 10 minutes later (50 minutes of impact). Monitoring primarily compresses MTTD (mean time to detect), which compresses MTTR even if the repair itself takes just as long.
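The arithmetic behind the two outages above can be written out as a minimal sketch. The function name is made up for illustration; the numbers come from the example in the text.

```python
def total_impact_minutes(detect_minutes: float, resolve_minutes: float) -> float:
    """Customer-facing outage length: time to detect plus time to resolve."""
    return detect_minutes + resolve_minutes


fast_detection = total_impact_minutes(detect_minutes=2, resolve_minutes=8)
slow_detection = total_impact_minutes(detect_minutes=40, resolve_minutes=10)

print(fast_detection, slow_detection)  # 10 50
```

The resolution times are nearly identical; almost all of the difference comes from detection.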

Debugging without instrumentation: Diagnosing a production incident without metrics, traces and logs is like diagnosing a mechanical problem with no gauges and no diagnostic tool. It is possible — but slow, error-prone and stressful.

The Three Pillars of Observability

Modern observability is built on three signal types that complement each other:

Metrics

Numeric measurements collected at regular intervals: CPU utilization, memory usage, request rate, error rate, response latency, queue depth, database connection pool usage. Metrics are cheap to store, fast to query and ideal for dashboards and threshold-based alerts.

The key distinction: Metrics tell you that something is wrong; they do not always tell you why.

Traces

Distributed traces follow a request through all the services it touches — from the incoming HTTP request at the edge, through authentication middleware, into the business logic service, into the database, and back. Each span records timing and metadata.

Traces answer questions that metrics cannot: "This API endpoint is slow — is it the database query, the external API call or the message queue? Which user's request? Which SQL query?" Without traces, answering these questions requires log grep and guesswork.

Logs

Structured log records capture events: an exception with its stack trace, a background job completing, a database migration running. Logs are the highest-detail signal — expensive to store at scale, but irreplaceable for diagnosing edge cases.

The principle: Use metrics to detect, traces to locate, logs to understand.
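To make "logs to understand" work in practice, log records need to be structured and joinable with traces. The sketch below uses only the Python standard library to emit one JSON object per log line. The field names `trace_id` and `endpoint` and the `checkout` logger name are assumptions chosen for illustration, not a required schema.

```python
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        # Values passed via `extra=` become attributes on the record.
        for key in ("trace_id", "endpoint"):
            if hasattr(record, key):
                payload[key] = getattr(record, key)
        if record.exc_info:
            payload["exception"] = self.formatException(record.exc_info)
        return json.dumps(payload)


logger = logging.getLogger("checkout")
handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger.addHandler(handler)

logger.error("payment failed", extra={"trace_id": "abc123", "endpoint": "/checkout"})
```

Carrying the trace ID in every log line is what lets you jump from a slow span straight to the log records written during that request.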

What to Instrument

Host and Infrastructure Metrics

Every server should emit:

  • CPU: utilization per core, steal time (for VMs — indicates hypervisor contention)
  • Memory: used, available, swap usage. High swap = memory pressure, which causes latency spikes.
  • Disk: utilization, I/O wait, read/write IOPS, free space. Disk full is one of the most common and most preventable outage causes.
  • Network: bytes in/out, connection count, TCP retransmits
  • Process: open file descriptors, running threads, memory RSS per process

Set up disk-full alerts with lead time — alert at 80 % used, not 95 %. Disks fill faster than you expect when log rotation fails or a runaway process writes to disk.
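As a rough illustration of the 80 % rule, the check below reads disk usage with the standard library. In a real setup this logic would normally live in your metrics agent and alert rules rather than in a script; the function names and the threshold constant are assumptions for this sketch.

```python
import shutil

WARN_THRESHOLD = 0.80  # alert with lead time, well before the disk is full


def disk_used_fraction(path: str = "/") -> float:
    """Fraction of the filesystem containing `path` that is in use (0.0 to 1.0)."""
    usage = shutil.disk_usage(path)
    return usage.used / usage.total


def should_warn(used_fraction: float, threshold: float = WARN_THRESHOLD) -> bool:
    return used_fraction >= threshold


print(should_warn(disk_used_fraction("/")))
```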

Application Metrics

At the application layer, instrument:

  • Request rate: requests per second broken down by endpoint and status code. A drop in request rate is often the first sign of an upstream problem or a routing change.
  • Error rate: 4xx vs 5xx, broken down by endpoint. A spike in 500s is an incident waiting for a page.
  • Latency: P50, P95, P99 by endpoint. P50 tells you the typical experience; P99 is the latency that the slowest 1 % of requests exceed. Enterprise customers often feel the tail disproportionately.
  • Queue depth and lag: For background job processors (BullMQ, Sidekiq, Celery), queue depth is a leading indicator. A queue growing faster than workers can drain it means something is slowing workers or a spike arrived.
  • Database: connection pool usage (near-full = request queueing), slow query count, replication lag for read replicas.
  • Cache hit rate: A cache hit rate drop on a frequently-accessed resource suddenly increases database load — often the first domino in a cascade.
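To see why the latency bullet asks for percentiles rather than an average, here is a small sketch using the nearest-rank definition of a percentile. Metrics backends usually estimate percentiles from histograms instead, so treat this as an illustration of the concept, with made-up sample data.

```python
import math
from statistics import mean


def percentile(samples: list[float], pct: float) -> float:
    """Nearest-rank percentile: the smallest sample with at least pct % of samples at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


# Hypothetical latencies in milliseconds: nine fast requests and one very slow one.
latencies_ms = [12, 15, 14, 13, 16, 18, 11, 17, 19, 950]

print(percentile(latencies_ms, 50))  # 15
print(percentile(latencies_ms, 99))  # 950
print(mean(latencies_ms))            # 108.5
```

The mean of 108.5 ms describes no real request: most finished in under 20 ms and one took almost a second. P50 and P99 together show both facts.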

Business Metrics

Application health metrics can be green while the business is broken. Monitor:

  • Failed payment attempts (a spike may indicate a payment processor issue)
  • Sign-up funnel completion rate (a drop may indicate a broken registration flow)
  • New user activation rate
  • Export job completion (a common "silent failure" that users only discover when they download an empty file)

Business metrics are your canary. A 20 % drop in checkout completions is more alarming than a CPU spike — even if CPU looks fine.

Alert Design: The Art of the Useful Page

Monitoring systems that page too often are monitoring systems that get ignored. On-call engineers learn to dismiss alerts when alert fatigue sets in — and the critical page gets dismissed with the noise.

Alert on Symptoms, Not Causes

Page on user-facing impact, not on internal metrics. "Error rate on /checkout exceeded 5 % for 5 minutes" is a symptom alert — it directly correlates with users experiencing failures. "CPU at 85 %" is a cause alert — it might be causing problems, or it might be an expected peak-load batch job.

Use cause-based alerts as informational (low-priority notifications in Slack) rather than pager events.
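A symptom alert such as "error rate on /checkout exceeded 5 % for 5 minutes" can be sketched as a check over per-minute buckets. The `MinuteBucket` type and function names are hypothetical; most alerting systems express the same idea as a rule with a threshold and a "for" duration.

```python
from dataclasses import dataclass


@dataclass
class MinuteBucket:
    requests: int
    errors_5xx: int


def error_rate(bucket: MinuteBucket) -> float:
    return bucket.errors_5xx / bucket.requests if bucket.requests else 0.0


def should_page(
    buckets: list[MinuteBucket],
    threshold: float = 0.05,
    sustained_minutes: int = 5,
) -> bool:
    """True only when each of the most recent `sustained_minutes` buckets exceeds the threshold."""
    if len(buckets) < sustained_minutes:
        return False
    recent = buckets[-sustained_minutes:]
    return all(error_rate(b) > threshold for b in recent)


sustained = [MinuteBucket(1000, 80)] * 5
blip = [MinuteBucket(1000, 80), MinuteBucket(1000, 10)] + [MinuteBucket(1000, 80)] * 3

print(should_page(sustained), should_page(blip))  # True False
```

Requiring the condition to hold for the whole window is what keeps a one-minute blip from waking someone up.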

Severity Tiers

Maintain distinct severity levels with different notification paths:

| Severity | Trigger | Response |
| --- | --- | --- |
| Critical (P1) | Active user impact (error rate, latency, downtime) | Page on-call immediately |
| High (P2) | Leading indicators of risk (queue depth, disk 80 %, cache miss spike) | Notify in Slack, acknowledge within 30 min |
| Low (P3) | Informational / trend tracking | Daily digest, no immediate action |
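The tiers above amount to a small routing table. A minimal sketch, with channel names that are placeholders rather than the identifiers of any particular alerting tool:

```python
from enum import Enum


class Severity(Enum):
    P1 = "critical"
    P2 = "high"
    P3 = "low"


# Hypothetical channel names; map these to whatever your alerting tool provides.
NOTIFICATION_PATH = {
    Severity.P1: "page-on-call",
    Severity.P2: "slack-channel",
    Severity.P3: "daily-digest",
}


def classify(active_user_impact: bool, leading_indicator: bool) -> Severity:
    """Apply the tiers top-down: user impact outranks everything else."""
    if active_user_impact:
        return Severity.P1
    if leading_indicator:
        return Severity.P2
    return Severity.P3


def route(active_user_impact: bool, leading_indicator: bool) -> str:
    return NOTIFICATION_PATH[classify(active_user_impact, leading_indicator)]


print(route(active_user_impact=False, leading_indicator=True))  # slack-channel
```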

Deduplication and Grouping

A single underlying problem often fires dozens of correlated alerts. Without deduplication, on-call gets paged 30 times for one incident — and spends the first 10 minutes sorting out that all 30 pages are the same problem.

Configure your alerting system to group correlated alerts by service, time window and alert rule. Acknowledge the group, not each individual firing.
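Grouping by service, alert rule and time window can be sketched as follows. This is a simplified illustration with hypothetical names and fixed five-minute buckets; production alert routers implement grouping with more care (for example, sliding windows and group-wait timers).

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Alert:
    service: str
    rule: str
    fired_at: float  # Unix timestamp in seconds


def group_alerts(alerts: list[Alert], window_seconds: int = 300) -> dict:
    """Group alerts that share a service, a rule and a fixed time bucket."""
    groups = defaultdict(list)
    for alert in alerts:
        bucket = int(alert.fired_at // window_seconds)
        groups[(alert.service, alert.rule, bucket)].append(alert)
    return dict(groups)


# Thirty firings of the same rule within half a minute, plus one unrelated alert.
storm = [Alert("checkout", "HighErrorRate", 1000 + i) for i in range(30)]
storm.append(Alert("search", "HighLatency", 1010))

groups = group_alerts(storm)
print(len(storm), "alerts ->", len(groups), "groups")  # 31 alerts -> 2 groups
```

One known weakness of fixed buckets is that an incident straddling a bucket boundary is split into two groups, which is why real systems tend to key the window on the first firing instead.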

Runbooks

Every P1 and P2 alert should link to a runbook: a document that describes what this alert means, the most common causes, and the first 3–5 steps an on-call engineer should take. A runbook does not solve the problem; it compresses MTTR by eliminating the time spent figuring out where to start.

From Reactive to Proactive: Trend Analysis

The highest-leverage monitoring investment is anomaly detection that surfaces problems before they become incidents.

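Disk growth projection: If disk usage is growing at 2 GB/day and 20 GB is free, the disk will be full in about 10 days. Alert on the projected time to full while there is still lead time to act, rather than waiting for a fixed threshold such as 95 % used.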
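The projection is a simple division, shown here as a sketch. The 14-day lead time is an assumed value for illustration; pick whatever gives your team enough room to act.

```python
def days_until_full(free_gb: float, growth_gb_per_day: float) -> float:
    """Linear projection of how long the remaining free space will last."""
    if growth_gb_per_day <= 0:
        return float("inf")  # not growing, so no projected fill date
    return free_gb / growth_gb_per_day


def should_alert(free_gb: float, growth_gb_per_day: float, lead_days: float = 14) -> bool:
    return days_until_full(free_gb, growth_gb_per_day) <= lead_days


print(days_until_full(free_gb=20, growth_gb_per_day=2))  # 10.0
print(should_alert(free_gb=20, growth_gb_per_day=2))     # True
```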

Error rate trends: A 5xx error rate that is 0.1 % and growing 20 % per day is more concerning than a steady 0.5 %. Trend-based alerts catch gradual degradations that threshold-based alerts miss.
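To put numbers on the example above: assuming the growth compounds daily, the sketch below estimates how many days it takes a 0.1 % error rate growing 20 % per day to pass a steady 0.5 %. The function name is hypothetical, and compound growth is an assumption about how the trend continues.

```python
import math


def days_to_reach(current_rate: float, daily_growth: float, target_rate: float) -> float:
    """Whole days until a rate growing by `daily_growth` per day reaches `target_rate`."""
    if current_rate >= target_rate:
        return 0
    if daily_growth <= 0:
        return math.inf
    return math.ceil(math.log(target_rate / current_rate) / math.log(1 + daily_growth))


print(days_to_reach(current_rate=0.001, daily_growth=0.20, target_rate=0.005))  # 9
```

In about nine days the "small" error rate overtakes the steady one, and it keeps climbing afterwards. A static threshold at 0.5 % would stay silent for that whole period.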

Memory leak detection: Memory RSS that climbs steadily over days without ever dropping back is a strong sign of a memory leak (as opposed to a cache that warms up and then plateaus). Alert on this pattern before it causes an OOM crash.
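A rough heuristic for this pattern, assuming one RSS sample per day for a single process. The function name and the 100 MB minimum growth are assumptions made for this sketch, and a positive result is a prompt to investigate rather than proof of a leak.

```python
def looks_like_leak(rss_mb: list[float], min_growth_mb: float = 100.0) -> bool:
    """True when RSS never decreases between samples and has grown by at least `min_growth_mb`."""
    if len(rss_mb) < 3:
        return False  # too few samples to call it a trend
    never_decreases = all(later >= earlier for earlier, later in zip(rss_mb, rss_mb[1:]))
    return never_decreases and (rss_mb[-1] - rss_mb[0]) >= min_growth_mb


print(looks_like_leak([500, 560, 610, 700]))  # True
print(looks_like_leak([500, 560, 480, 700]))  # False
```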

Seasonal anomaly detection: Expected traffic patterns vary by time of day and day of week. A 30 % traffic drop at 2 PM on a Tuesday is normal if your users are on lunch break — or alarming if they are not. Anomaly detection that accounts for seasonality reduces false positives dramatically.
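One simple way to account for seasonality is to compare the current value against the same weekday and hour in previous weeks instead of against a fixed number. The sketch below does exactly that with made-up request counts; real anomaly detectors use more robust statistics, so read it as an illustration of the idea only.

```python
from statistics import mean


def is_anomalous_drop(same_slot_history: list[float], current: float, max_drop: float = 0.30) -> bool:
    """Compare `current` with the mean of the same weekday-and-hour slot in earlier weeks."""
    if not same_slot_history:
        return False
    baseline = mean(same_slot_history)
    if baseline <= 0:
        return False
    return (baseline - current) / baseline >= max_drop


# Hypothetical requests per minute at Tuesday 2 PM over the last three weeks.
busy_slot_history = [1000, 1040, 960]
quiet_slot_history = [700, 680, 720]

print(is_anomalous_drop(busy_slot_history, current=650))   # True
print(is_anomalous_drop(quiet_slot_history, current=650))  # False
```

The same reading of 650 requests per minute is alarming against a baseline of 1,000 and unremarkable against a baseline of 700, which is the whole point of a seasonal baseline.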

Tools: Self-Hosted vs Managed

Self-hosted stack: Prometheus (metrics collection), Grafana (dashboards), Loki (log aggregation), Tempo (traces), Alertmanager (routing). Full control, no per-seat or per-metric cost, runs on your infrastructure. Operational overhead is real — the monitoring system needs to be monitored.

Managed services: Datadog, New Relic, Grafana Cloud. Lower operational overhead, faster time to first dashboard. Cost scales with data volume and seats — can become significant at scale.

Hybrid: Self-host Prometheus + Grafana for infrastructure metrics (predictable, high volume, cost-sensitive), and use a managed service for distributed tracing (lower volume, where a polished trace-exploration UI adds the most value).

For teams without dedicated platform engineering capacity, starting with a managed service and migrating to self-hosted as operational maturity grows is a pragmatic path.

Conclusion

Observability is not a feature — it is the foundation that makes every other feature safer to build and deploy. The teams that recover from incidents in minutes rather than hours are not the ones with fewer bugs; they are the ones who can see their system, understand what changed, and act with confidence.

The investment required is real: instrumentation takes time, alert tuning is an ongoing discipline, and runbooks need to be written and maintained. The return is equally real: lower MTTR, higher deployment confidence, and on-call engineers who can sleep through the night because they trust the system to tell them when something actually needs attention.