
Metrics, Logs, and Traces on EKS: My Open Source Terraform Project (Part 1 of 5)

A modular, production-grade Terraform stack for EKS observability. Metrics, logs, traces, alerting. All AWS-native. All open source. No static credentials.

Dr Salek Ali 10 April 2026

Most EKS clusters I inherit have the same problem: nobody knows what’s happening inside them.

There are pods running. Nodes are healthy (probably). CloudWatch has some logs (somewhere). But when something breaks at 2am, the team is SSH-ing into nodes, tailing logs manually, and guessing which service is the bottleneck.

I got tired of rebuilding the same observability stack from scratch on every engagement. So I built one, open-sourced it, and now I’m writing a five-part series walking through every component.

This is Part 1: the architecture, the design decisions, and why I built it the way I did.

The Problem With “We’ll Add Monitoring Later”

Every team says it. Nobody means it.

What actually happens: the application ships, traffic grows, and the first time something breaks, you discover that your “monitoring” is a CloudWatch dashboard somebody made six months ago that nobody looks at.

EKS makes this worse because Kubernetes already generates a mountain of signals (pod metrics, node metrics, control plane logs, API server audit trails), but none of it is useful unless you collect it, route it somewhere queryable, and set up alerts that actually fire.

The three pillars of observability are not optional on Kubernetes. You need all three:

  1. Metrics: Is the system healthy right now? CPU, memory, request rates, error rates, latency percentiles.
  2. Logs: What happened? Structured application logs, Kubernetes events, audit trails.
  3. Traces: Why is this request slow? Which service in the chain is the bottleneck?

If you’re missing any one of these, you’re flying blind in at least one dimension.

What the Stack Does

The eks-observability-stack is a modular Terraform project that deploys a complete observability pipeline on EKS using AWS-managed services.

Here’s the architecture:

  • Metrics: Amazon Managed Prometheus (AMP) for storage, Amazon Managed Grafana (AMG) for dashboards. Pre-built dashboards for cluster health, node metrics, and application performance included out of the box.
  • Logs: Fluent Bit collecting from every pod, routing to either CloudWatch Logs or OpenSearch. You pick the backend with a single variable. Switching later doesn’t require re-architecting anything.
  • Traces: OpenTelemetry Collector running as a DaemonSet on every node, forwarding to AWS X-Ray. Tail-based sampling so you always capture errors and slow requests (100%) while sampling normal traffic at a configurable rate.
  • Alerting: Prometheus alerting rules in AMP, routed through SNS. Email subscriptions out of the box, optional PagerDuty integration. Seven production-ready rules included (high error rate, pod crash looping, node memory pressure, high latency, disk pressure, pods not ready, OOM kills); one of them is sketched below.
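
Here is a hedged sketch of how one such rule could be expressed as an AMP rule group in Terraform. The resource types are the standard AWS provider ones; the workspace alias, rule name, expression, and thresholds are illustrative rather than the exact rules the module ships.

resource "aws_prometheus_workspace" "this" {
  alias = "eks-observability"
}

# Illustrative only: one crash-loop alert as an AMP rule group. The metric
# comes from kube-state-metrics; the threshold and window are assumptions.
resource "aws_prometheus_rule_group_namespace" "workload_alerts" {
  workspace_id = aws_prometheus_workspace.this.id
  name         = "eks-workload-alerts"

  data = <<-YAML
    groups:
      - name: workload-health
        rules:
          - alert: PodCrashLooping
            expr: increase(kube_pod_container_status_restarts_total[15m]) > 3
            for: 5m
            labels:
              severity: critical
            annotations:
              summary: "{{ $labels.namespace }}/{{ $labels.pod }} is restarting repeatedly"
  YAML
}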

Every component is a separate Terraform module. You enable what you need and skip what you don’t:

module "observability" {
  source = "github.com/salekali/eks-observability-stack"

  # Details of the existing EKS cluster (the OIDC values are what drive IRSA)
  cluster_name              = "my-cluster"
  cluster_oidc_issuer_url   = "oidc.eks.ap-southeast-2.amazonaws.com/id/EXAMPLE"
  cluster_oidc_provider_arn = "arn:aws:iam::123456789012:oidc-provider/..."
  vpc_id                    = "vpc-0123456789abcdef0"

  # Each flag toggles an independent module on or off
  enable_metrics    = true
  enable_logging    = true
  enable_tracing    = true
  enable_alerting   = true
  enable_sample_app = true

  alert_email = "oncall@yourcompany.com"
}

That’s it. One module call, and you have metrics, logs, traces, and alerting deployed across your cluster.

Why AWS-Managed Services (Not Self-Hosted)

I’ve run self-hosted Prometheus, Grafana, and ELK stacks on Kubernetes. In production. For real companies.

It works until it doesn’t. Self-hosted Prometheus needs persistent storage, retention management, and someone to deal with OOM kills when cardinality explodes. Self-hosted Grafana needs its own database, auth, and backup strategy. Self-hosted OpenSearch on Kubernetes is an operational nightmare that I wouldn’t wish on anyone.

AWS-managed services cost more per unit but cost less in total when you factor in the engineering time you’re not spending babysitting infrastructure that exists to monitor your other infrastructure.

The stack uses:

  • AMP instead of self-hosted Prometheus. Same PromQL, same remote-write protocol, zero operational overhead.
  • AMG instead of self-hosted Grafana. SSO integration, managed upgrades, no database to manage.
  • X-Ray instead of self-hosted Jaeger/Tempo. Native AWS integration, no storage to manage.
  • CloudWatch Logs or OpenSearch Service instead of self-hosted ELK. Both are managed, both scale automatically.

Design Decisions Worth Explaining

Everything Uses IRSA (No Static Credentials)

Every workload in the stack authenticates via IAM Roles for Service Accounts. No AWS access keys. No secrets stored in Kubernetes. Each component (Fluent Bit, OTel Collector, Grafana) gets its own IAM role scoped to exactly what it needs.

This is non-negotiable for regulated environments. If your observability stack requires static credentials, it’s a security liability.
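
For concreteness, here is a minimal sketch of the IRSA pattern in Terraform, assuming a Fluent Bit service account named fluent-bit in a logging namespace. The variable names reuse the module's inputs; the role name, namespace, and policy scope are illustrative, not the module's exact resources.

# Illustrative only: an IAM role that can be assumed solely by the Fluent Bit
# service account through the cluster's OIDC provider (IRSA). Names are assumptions.
data "aws_iam_policy_document" "fluent_bit_assume" {
  statement {
    actions = ["sts:AssumeRoleWithWebIdentity"]

    principals {
      type        = "Federated"
      identifiers = [var.cluster_oidc_provider_arn]
    }

    condition {
      test     = "StringEquals"
      variable = "${replace(var.cluster_oidc_issuer_url, "https://", "")}:sub"
      values   = ["system:serviceaccount:logging:fluent-bit"]
    }
  }
}

resource "aws_iam_role" "fluent_bit" {
  name               = "fluent-bit-irsa"
  assume_role_policy = data.aws_iam_policy_document.fluent_bit_assume.json
}

# Scope the role to log writes only; the matching Kubernetes service account is
# annotated with eks.amazonaws.com/role-arn pointing at this role. Tighten the
# Resource to your log group ARNs in practice.
resource "aws_iam_role_policy" "fluent_bit_logs" {
  name = "cloudwatch-logs-write"
  role = aws_iam_role.fluent_bit.id

  policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Effect   = "Allow"
      Action   = ["logs:CreateLogGroup", "logs:CreateLogStream", "logs:PutLogEvents"]
      Resource = "*"
    }]
  })
}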

Modules Are Independent

The metrics module doesn’t depend on the logging module. Tracing doesn’t depend on metrics. You can deploy any combination without the others.

The one useful dependency: if you enable both tracing and metrics, the OTel Collector will export span metrics to AMP so you get trace-derived RED metrics (rate, error, duration) in your Grafana dashboards. But it’s optional, not required.
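
As a rough sketch of how that bridge works, the standard OpenTelemetry Collector "spanmetrics" connector sits between a traces pipeline and a metrics pipeline that remote-writes to AMP. The fragment below is illustrative; the module's rendered collector config may differ.

# Illustrative only: spanmetrics connector wiring, not the module's exact config.
locals {
  otel_spanmetrics_example = <<-YAML
    connectors:
      spanmetrics: {}

    service:
      pipelines:
        traces:
          receivers: [otlp]
          exporters: [awsxray, spanmetrics]
        metrics:
          receivers: [spanmetrics]
          exporters: [prometheusremotewrite]
  YAML
}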

Tail-Based Sampling for Traces

Most tracing setups use head-based sampling: decide at the start of a request whether to trace it. The problem is you end up sampling errors at the same rate as successful requests, which means you might miss the traces you actually need.

The stack uses tail-based sampling in the OTel Collector. It waits for all spans in a trace to arrive (10-second window), then applies three policies:

  1. Always capture errors (100% of traces with error status)
  2. Always capture slow requests (100% of traces over 500ms)
  3. Probabilistically sample the rest (configurable, default 10%)

You never lose visibility into failures, and you keep costs manageable on high-volume clusters.
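
Here is a hedged sketch of what those three policies look like in the collector's configuration, using the stock tail_sampling processor. The ConfigMap name, namespace, and surrounding config are assumptions; the module's rendered config may differ.

# Illustrative only: tail-based sampling policies as they might appear in the
# collector's ConfigMap. Names and the surrounding pipeline are assumptions.
resource "kubernetes_config_map" "otel_collector" {
  metadata {
    name      = "otel-collector-config"
    namespace = "observability"
  }

  data = {
    "collector.yaml" = <<-YAML
      processors:
        tail_sampling:
          decision_wait: 10s            # wait for all spans in a trace to arrive
          policies:
            - name: always-sample-errors
              type: status_code
              status_code:
                status_codes: [ERROR]
            - name: always-sample-slow
              type: latency
              latency:
                threshold_ms: 500
            - name: probabilistic-baseline
              type: probabilistic
              probabilistic:
                sampling_percentage: 10
    YAML
  }
}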

Private Cluster Support

Setting private_cluster_mode = true deploys VPC endpoints for every AWS service the stack needs (APS, CloudWatch, X-Ray, STS, EC2, S3). The entire observability pipeline works without any internet access. This matters for government, financial services, and anyone running in locked-down environments.
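
As a rough sketch of what that flag provisions, one of those endpoints, for the AMP remote-write API, looks roughly like this. The subnet and security-group references are assumptions; the module creates one endpoint per required service.

# Illustrative only: an interface endpoint for the AMP (APS) data plane.
# Subnet and security-group references are assumptions.
resource "aws_vpc_endpoint" "aps_workspaces" {
  vpc_id              = var.vpc_id
  service_name        = "com.amazonaws.ap-southeast-2.aps-workspaces"
  vpc_endpoint_type   = "Interface"
  private_dns_enabled = true
  subnet_ids          = var.private_subnet_ids
  security_group_ids  = [aws_security_group.vpc_endpoints.id]
}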

What’s Coming in This Series

This is a five-part series. Each post digs into one layer of the stack with real Terraform code and production configuration:

  1. This post: Architecture overview and design decisions
  2. Amazon Managed Prometheus + Grafana: Setting up the metrics pipeline with pre-built dashboards
  3. Structured Logging: Fluent Bit configuration, CloudWatch vs OpenSearch, and how to switch backends
  4. Distributed Tracing: OpenTelemetry Collector setup, tail-based sampling policies, X-Ray integration
  5. Production Hardening: IRSA scoping, alerting rules, SNS routing, and what to check before going live

Every post links to the GitHub repo with the full Terraform source. This isn’t theory. It’s running in production.

Try It Yourself

Clone the repo and start with the metrics-only example:

git clone https://github.com/salekali/eks-observability-stack.git
cd eks-observability-stack/examples/metrics-only
terraform init && terraform apply

If you want the full stack (metrics, logs, traces, alerting, sample app), use the complete example:

cd examples/complete
terraform init && terraform apply

The README covers prerequisites, the two-phase apply process (needed for Grafana provider configuration), and all available variables.

Want This Deployed in Your Environment?

If you want this stack deployed on your EKS clusters, configured for your compliance requirements, and integrated with your existing infrastructure, book a free 30-minute discovery call. I’ll scope what you need and give you a straight answer on timeline and cost.

The full Terraform stack is on GitHub. Star it, fork it, open issues. It’s Apache 2.0 licensed.