IaC · February 14, 2025

Terraform at Scale: Patterns That Actually Work

Raffael Hühnerschulte

Terraform is one of those tools where getting started takes an afternoon and getting it right takes years. The initial appeal is obvious: describe your infrastructure in HCL, run apply, and it exists. The problems start when you have three environments, a team of eight, and a state file that everyone is afraid to touch.

I've worked with teams at various stages of this journey. The structural mistakes are almost always the same — and they're almost always fixable before they become genuinely painful, if you know what to look for.

The State Problem Is Always Bigger Than You Think

Terraform state is the source of truth for what Terraform believes your infrastructure looks like. In a single-person project, a local terraform.tfstate file works fine. In any team context, it's a disaster waiting to happen: two people run apply at the same time, states diverge, resources get orphaned or double-created.

Remote state with locking is non-negotiable from the moment a second person touches your Terraform. The standard setup on AWS is an S3 bucket with a DynamoDB table for locking; the GCS and Azure Blob Storage backends lock natively. HCP Terraform (formerly Terraform Cloud) handles both storage and locking for you if you prefer a managed option.
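A minimal sketch of the S3-plus-DynamoDB setup, to make the moving parts concrete. The bucket name, key, region, and table name are placeholders I made up; the DynamoDB table is assumed to already exist with a string partition key named LockID, which is what the S3 backend expects.

```hcl
terraform {
  backend "s3" {
    # Hypothetical names: substitute your own bucket, key, and region.
    bucket  = "example-org-terraform-state"
    key     = "foundation/network/terraform.tfstate"
    region  = "eu-central-1"
    encrypt = true

    # Lock table, assumed to exist with a partition key "LockID" (String).
    # Recent Terraform releases also offer S3-native locking via
    # use_lockfile; check the backend docs for your version.
    dynamodb_table = "terraform-locks"
  }
}
```

Create the bucket and lock table in a separate, tiny bootstrap configuration so the backend does not depend on resources it is supposed to store state for.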

Non-obvious problem

A single large state file is itself an anti-pattern. When every resource in your infrastructure is in one state, every plan refreshes everything, which is slow, noisy, and risky. A mistake in one module can affect resources across the whole system. Split state by blast radius.

State Splitting: The Right Granularity

The goal of splitting state is to minimize the blast radius of a bad apply. A useful heuristic: split where the rate of change differs significantly and where resources don't need to reference each other directly at creation time.

| State boundary | Contains | Change frequency |
| --- | --- | --- |
| Foundation | VPCs, DNS zones, shared IAM, KMS keys | Rarely (months) |
| Platform | Kubernetes cluster, databases, registries | Occasionally (weeks) |
| Application | Deployments, services, app-specific IAM | Frequently (daily) |
| Per-environment | Dev / staging / prod variants of the above | Parallel to above |

Cross-state references use terraform_remote_state data sources or (better) SSM Parameter Store / Vault for loose coupling. Hard-coding ARNs or IDs across state boundaries defeats the purpose.
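As an illustration of the loosely coupled variant, here is a sketch in which the foundation state publishes a VPC ID to SSM Parameter Store and the platform state reads it back. The parameter path, resource names, and CIDR range are hypothetical, and the two halves live in separate root configurations with separate state.

```hcl
# --- foundation/main.tf (foundation state) ---
resource "aws_vpc" "main" {
  cidr_block = "10.0.0.0/16" # illustrative range
}

# Publish the ID under a well-known, hypothetical parameter path.
resource "aws_ssm_parameter" "vpc_id" {
  name  = "/infra/foundation/vpc_id"
  type  = "String"
  value = aws_vpc.main.id
}

# --- platform/main.tf (platform state) ---
# Look the ID up by path instead of reading the foundation state file.
data "aws_ssm_parameter" "vpc_id" {
  name = "/infra/foundation/vpc_id"
}

resource "aws_security_group" "database" {
  name   = "platform-database"
  vpc_id = data.aws_ssm_parameter.vpc_id.value
}
```

The consumer only needs read access to one parameter, not to the producer's entire state, which is the main argument for this over terraform_remote_state.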

Module Design: The Mistakes Everyone Makes

Too much in one module

The first instinct is to write a module that creates everything a service needs: the database, the Kubernetes namespace, the IAM roles, the DNS record, the monitoring dashboard. This feels clean until you need to create a database without a monitoring dashboard, or you want to reuse the IAM role pattern independently. Modules should have a single responsibility — usually one infrastructure concept, not one application's needs.

Modules that are just wrappers with no abstraction

The opposite mistake: a module with 40 input variables that maps 1:1 to the underlying resource arguments. If you're not hiding complexity or enforcing constraints, you don't have a module — you have extra indirection. A good module exposes fewer variables than the underlying resource and encodes organizational decisions (naming conventions, tagging standards, security defaults) so callers don't have to think about them.
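To show what "fewer variables, more decisions" can look like, here is a sketch of a hypothetical bucket module. The module path, the "acme" naming prefix, and the tag keys are assumptions for illustration, not a recommended standard.

```hcl
# modules/s3-bucket/main.tf (hypothetical module)

variable "name" {
  type        = string
  description = "Short logical name, e.g. \"invoices\"."
}

variable "team" {
  type        = string
  description = "Owning team, used for tagging."
}

variable "environment" {
  type = string

  validation {
    condition     = contains(["dev", "staging", "prod"], var.environment)
    error_message = "environment must be one of: dev, staging, prod."
  }
}

locals {
  # Naming convention and tags are decided here, not by the caller.
  bucket_name = "acme-${var.environment}-${var.name}"
  tags = {
    Team        = var.team
    Environment = var.environment
    ManagedBy   = "terraform"
  }
}

resource "aws_s3_bucket" "this" {
  bucket = local.bucket_name
  tags   = local.tags
}

# Security defaults the caller cannot switch off.
resource "aws_s3_bucket_public_access_block" "this" {
  bucket                  = aws_s3_bucket.this.id
  block_public_acls       = true
  block_public_policy     = true
  ignore_public_acls      = true
  restrict_public_buckets = true
}

resource "aws_s3_bucket_server_side_encryption_configuration" "this" {
  bucket = aws_s3_bucket.this.id

  rule {
    apply_server_side_encryption_by_default {
      sse_algorithm = "aws:kms"
    }
  }
}

output "bucket_arn" {
  value = aws_s3_bucket.this.arn
}
```

Three inputs replace dozens of resource arguments, and the caller cannot create a public or unencrypted bucket by accident.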

Not versioning modules

If your modules live in the same repository as your environments and you reference them with source = "../modules/vpc", every change to the module immediately affects every environment that uses it. Publish modules to a registry (Terraform Registry, a private registry, or even a separate Git repo with version tags) and pin versions in your environment configs. Upgrading then becomes an explicit, reviewable decision.
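Pinning might look like one of the two sketches below. The repository URL, registry namespace, version numbers, and the cidr_block input are all hypothetical; the source syntax itself (the //subdirectory and ?ref= parts for Git, and the version argument for registry sources) is standard Terraform.

```hcl
# Option A: Git repository with version tags.
module "vpc_from_git" {
  source = "git::https://example.com/acme/terraform-modules.git//vpc?ref=v1.4.0"

  cidr_block = "10.0.0.0/16" # hypothetical module input
}

# Option B: private registry. The version argument only works
# with registry sources, not with Git or local paths.
module "vpc_from_registry" {
  source  = "app.terraform.io/acme/vpc/aws"
  version = "~> 1.4"

  cidr_block = "10.0.0.0/16" # hypothetical module input
}
```

With either form, a module upgrade shows up as a one-line diff in the environment config, which is exactly the reviewable decision you want.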

Anti-pattern

Running terraform apply against production manually from a developer's workstation. Manual applies with local credentials, no audit trail, and no review step cause the majority of production incidents I've seen in Terraform-managed infrastructure.

CI/CD for Terraform: The Baseline

The minimum viable pipeline for Terraform in production:

  1. On pull request: terraform fmt -check, terraform validate, terraform plan with the output posted as a PR comment. Nobody merges without a reviewed plan.
  2. On merge to main: terraform apply in a locked environment, with output logged and surfaced. No human runs apply locally against production.
  3. State locking enforced: The CI role has write access; developer roles have read-only or no access to production state.

Tools for this: Atlantis (self-hosted, excellent for GitHub/GitLab), Spacelift, HCP Terraform, or a custom pipeline in your existing CI system. The tool matters less than the discipline — plan-before-apply, always reviewed, never local.
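The access split from point 3 can be sketched as an IAM policy for developer roles, assuming state lives in an S3 bucket. The bucket name and policy name are placeholders; the CI role would get a separate policy that additionally allows writes and lock-table access.

```hcl
# Read-only access to the (hypothetical) production state bucket.
data "aws_iam_policy_document" "state_read_only" {
  statement {
    sid       = "ListStateBucket"
    actions   = ["s3:ListBucket"]
    resources = ["arn:aws:s3:::example-org-terraform-state"]
  }

  statement {
    sid       = "ReadStateObjects"
    actions   = ["s3:GetObject"]
    resources = ["arn:aws:s3:::example-org-terraform-state/*"]
  }
}

resource "aws_iam_policy" "developer_state_read_only" {
  name   = "terraform-state-read-only"
  policy = data.aws_iam_policy_document.state_read_only.json
}
```

Keep in mind that read access to state also means read access to any secrets stored in it, so "no access" is often the safer default for production.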

Secrets: What Not to Do

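Terraform state stores everything it manages, including resource attributes marked as sensitive. The sensitive flag only redacts values from plan and apply output; it does not encrypt or omit them from state. This means database passwords, API keys, and certificates end up in your state file in plaintext. Two rules: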

Drift Detection

Infrastructure drift — the gap between what Terraform believes exists and what actually exists — is inevitable in any real environment. Console-created resources, hotfixes applied directly to cloud APIs, failed destroys that left orphaned resources: all of these cause drift.

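Run terraform plan on a schedule against your environments even when nothing is being changed. With the -detailed-exitcode flag, exit code 0 means no changes and exit code 2 means the plan is non-empty, which makes the check easy to automate. If the plan is non-empty and no code change is pending, you have drift. It's better to discover it proactively than during the next actual apply, when unexpected changes appear alongside your intended ones.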

From experience

The teams with the healthiest Terraform codebases share one trait: they treat their infrastructure code with the same discipline they apply to application code. Code review, version control, automated testing (Terratest or similar), and a cultural norm that "if it's not in Terraform, it doesn't exist." That norm is harder to establish than any technical pattern, but it's the one that matters most.

Where to Go From Here

If you're just starting out: remote state and a basic CI pipeline are the two things that will save you the most pain. Get those right before anything else.

If you're hitting scale problems: audit your module boundaries, split your state files by change frequency, and implement drift detection. These three steps fix the majority of the structural issues I see in mature Terraform codebases.

Terraform scales well when you treat it as a software project — not as a collection of scripts that happen to provision infrastructure. The good news: the patterns that make it scale are well-established and learnable. The investment pays off quickly.
