Terraform at Scale: Patterns That Actually Work

Terraform is one of those tools where getting started takes an afternoon and getting it right takes years. The initial appeal is obvious: describe your infrastructure in HCL, run apply, and it exists. The problems start when you have three environments, a team of eight, and a state file that everyone is afraid to touch.

I've worked with teams at various stages of this journey. The structural mistakes are almost always the same — and they're almost always fixable before they become genuinely painful, if you know what to look for.

The State Problem Is Always Bigger Than You Think

Terraform state is the source of truth for what Terraform believes your infrastructure looks like. In a single-person project, a local terraform.tfstate file works fine. In any team context, it's a disaster waiting to happen: two people run apply at the same time, states diverge, resources get orphaned or double-created.

Remote state with locking is non-negotiable from the moment a second person touches your Terraform. On AWS, the standard setup is an S3 bucket with a DynamoDB table for locking; the GCS and Azure Blob backends handle locking natively. HCP Terraform (formerly Terraform Cloud) handles all of this for you if you prefer a managed option.
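A minimal sketch of the S3 plus DynamoDB setup, to make the moving parts concrete. The bucket name, state key, region, and table name are placeholders invented for illustration, and both the bucket and the table are assumed to exist already, since a backend block does not create them.

```hcl
# backend.tf
# Hypothetical names throughout; substitute your own bucket, key, and table.
terraform {
  backend "s3" {
    bucket  = "example-org-terraform-state"        # assumed to exist already
    key     = "foundation/prod/terraform.tfstate"  # one key per state boundary
    region  = "eu-west-1"
    encrypt = true                                 # server-side encryption for the state object

    # Lock table; it needs a string partition key named "LockID".
    dynamodb_table = "terraform-locks"
  }
}
```

Recent Terraform releases also offer S3-native lock files through the backend's use_lockfile setting. Check the S3 backend documentation for the version you run before choosing between the two.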

Non-obvious problem: A single large state file is itself an anti-pattern. When every resource in your infrastructure is in one state, every plan refreshes everything, which is slow, noisy, and risky. A mistake in one module can affect resources across the whole system. Split state by blast radius.

State Splitting: The Right Granularity

The goal of splitting state is to minimize the blast radius of a bad apply. A useful heuristic: split where the rate of change differs significantly and where resources don't need to reference each other directly at creation time.

State boundary	Contains	Change frequency
Foundation	VPCs, DNS zones, shared IAM, KMS keys	Rarely (months)
Platform	Kubernetes cluster, databases, registries	Occasionally (weeks)
Application	Deployments, services, app-specific IAM	Frequently (daily)
Per-environment	Dev / staging / prod variants of the above	Parallel to above

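Cross-state references use terraform_remote_state data sources or (better) SSM Parameter Store / Vault for loose coupling. A terraform_remote_state data source needs read access to the producer's entire state snapshot, whereas a parameter store exposes only the values you choose to publish. Hard-coding ARNs or IDs across state boundaries defeats the purpose: nothing updates the copy when the underlying resource is replaced.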
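The loose-coupling approach might look like the sketch below, which assumes AWS and SSM Parameter Store. The parameter path, the resource names, and the existence of an aws_vpc.main resource in the foundation configuration are all assumptions made for illustration. The two halves live in separate root configurations with separate state.

```hcl
# --- Foundation configuration (producer) ---
# Publish the VPC ID under a well-known parameter name.
resource "aws_ssm_parameter" "vpc_id" {
  name  = "/infra/prod/foundation/vpc_id" # hypothetical naming scheme
  type  = "String"
  value = aws_vpc.main.id                 # assumes a VPC resource named "main" here
}

# --- Platform configuration (consumer, separate state) ---
# Look the value up by name; no access to the foundation state is needed.
data "aws_ssm_parameter" "vpc_id" {
  name = "/infra/prod/foundation/vpc_id"
}

resource "aws_security_group" "db" {
  name = "platform-db" # hypothetical

  # The provider marks this value as sensitive, so plans redact it;
  # nonsensitive() can unwrap it if the redaction gets in the way.
  vpc_id = data.aws_ssm_parameter.vpc_id.value
}
```

The consumer depends only on the parameter name, so the foundation state can be reorganized or moved without touching the platform configuration.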

Module Design: The Mistakes Everyone Makes

Too much in one module

The first instinct is to write a module that creates everything a service needs: the database, the Kubernetes namespace, the IAM roles, the DNS record, the monitoring dashboard. This feels clean until you need to create a database without a monitoring dashboard, or you want to reuse the IAM role pattern independently. Modules should have a single responsibility — usually one infrastructure concept, not one application's needs.

Modules that are just wrappers with no abstraction

The opposite mistake: a module with 40 input variables that maps 1:1 to the underlying resource arguments. If you're not hiding complexity or enforcing constraints, you don't have a module — you have extra indirection. A good module exposes fewer variables than the underlying resource and encodes organizational decisions (naming conventions, tagging standards, security defaults) so callers don't have to think about them.
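As an illustration of a module that hides decisions rather than forwarding arguments, here is a sketch of a hypothetical private-bucket module. The module path, the example-org naming prefix, the tag keys, and the choice of KMS encryption are invented for the example; the point is that callers supply two inputs and cannot opt out of the security defaults.

```hcl
# modules/private-bucket/main.tf (hypothetical module)

variable "name" {
  type        = string
  description = "Short purpose of the bucket, for example \"logs\"."
}

variable "environment" {
  type        = string
  description = "Deployment environment."

  validation {
    condition     = contains(["dev", "staging", "prod"], var.environment)
    error_message = "Environment must be one of: dev, staging, prod."
  }
}

locals {
  # Naming convention and tagging standard are decided once, here.
  bucket_name = "example-org-${var.environment}-${var.name}"
  tags = {
    Environment = var.environment
    ManagedBy   = "terraform"
  }
}

resource "aws_s3_bucket" "this" {
  bucket = local.bucket_name
  tags   = local.tags
}

# Security defaults are deliberately not exposed as variables.
resource "aws_s3_bucket_public_access_block" "this" {
  bucket                  = aws_s3_bucket.this.id
  block_public_acls       = true
  block_public_policy     = true
  ignore_public_acls      = true
  restrict_public_buckets = true
}

resource "aws_s3_bucket_server_side_encryption_configuration" "this" {
  bucket = aws_s3_bucket.this.id

  rule {
    apply_server_side_encryption_by_default {
      sse_algorithm = "aws:kms"
    }
  }
}

output "bucket_arn" {
  value = aws_s3_bucket.this.arn
}
```

A caller passes only name and environment. Everything else about the bucket is an organizational decision encoded in the module.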

Not versioning modules

If your modules live in the same repository as your environments and you reference them with source = "../modules/vpc", every change to the module immediately affects every environment that uses it. Publish modules to a registry (Terraform Registry, a private registry, or even a separate Git repo with version tags) and pin versions in your environment configs. Upgrading then becomes an explicit, reviewable decision.
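For contrast, pinned references could look like the following sketch. The repository URL, registry address, version numbers, and the cidr_block input are placeholders, not real modules.

```hcl
# Unpinned: every change under ../modules/vpc reaches this environment immediately.
# module "vpc" {
#   source = "../modules/vpc"
# }

# Pinned to a Git tag (hypothetical repository and tag).
module "vpc" {
  source = "git::https://example.com/infra/terraform-modules.git//vpc?ref=v1.4.0"

  cidr_block = "10.0.0.0/16" # hypothetical module input
}

# Pinned through a registry. The version argument only applies to registry sources.
module "vpc_from_registry" {
  source  = "app.terraform.io/example-org/vpc/aws" # hypothetical private registry address
  version = "~> 1.4"                               # allows 1.x releases from 1.4 up, but not 2.0

  cidr_block = "10.1.0.0/16"
}
```

With either form, moving an environment to a new module release is a one-line diff that shows up in review alongside its plan.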

Anti-pattern: Running terraform apply manually from a developer's workstation in production. Manual applies with local credentials, no audit trail, and no review step cause the majority of production incidents I've seen in Terraform-managed infrastructure.

CI/CD for Terraform: The Baseline

The minimum viable pipeline for Terraform in production:

On pull request: terraform fmt -check, terraform validate, terraform plan with the output posted as a PR comment. Nobody merges without a reviewed plan.
On merge to main: terraform apply in a locked environment, with output logged and surfaced. No human runs apply locally against production.
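State access restricted: Only the CI role can write production state; developer roles have read-only access or none. Locking keeps concurrent runs from colliding, and access control keeps local runs from writing state at all.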
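The access split in the last item can be sketched as two IAM policies, assuming an S3 backend with a DynamoDB lock table. The bucket name, table ARN, account ID, and policy names are placeholders; the actions listed are the ones the S3 backend is documented to need, but verify them against your own setup.

```hcl
# Hypothetical bucket and lock-table identifiers; adjust to your backend.
locals {
  state_bucket_arn = "arn:aws:s3:::example-org-terraform-state"
  lock_table_arn   = "arn:aws:dynamodb:eu-west-1:123456789012:table/terraform-locks"
}

# For the CI role only: read and write state, acquire and release locks.
data "aws_iam_policy_document" "ci_state_rw" {
  statement {
    actions   = ["s3:ListBucket"]
    resources = [local.state_bucket_arn]
  }

  statement {
    actions   = ["s3:GetObject", "s3:PutObject"]
    resources = ["${local.state_bucket_arn}/*"]
  }

  statement {
    actions   = ["dynamodb:GetItem", "dynamodb:PutItem", "dynamodb:DeleteItem"]
    resources = [local.lock_table_arn]
  }
}

# For developer roles: read-only, so a local apply cannot write production state.
data "aws_iam_policy_document" "developer_state_ro" {
  statement {
    actions   = ["s3:ListBucket"]
    resources = [local.state_bucket_arn]
  }

  statement {
    actions   = ["s3:GetObject"]
    resources = ["${local.state_bucket_arn}/*"]
  }
}

resource "aws_iam_policy" "ci_state_rw" {
  name   = "terraform-state-rw" # hypothetical
  policy = data.aws_iam_policy_document.ci_state_rw.json
}

resource "aws_iam_policy" "developer_state_ro" {
  name   = "terraform-state-ro" # hypothetical
  policy = data.aws_iam_policy_document.developer_state_ro.json
}
```

Read-only state access stops a workstation from writing state, but it does not stop someone with broad cloud credentials from changing the infrastructure itself. Those credentials need the same restriction.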

Tools for this: Atlantis (self-hosted, excellent for GitHub/GitLab), Spacelift, HCP Terraform, or a custom pipeline in your existing CI system. The tool matters less than the discipline — plan-before-apply, always reviewed, never local.

Secrets: What Not to Do

Terraform state stores everything it manages, including resource attributes marked as sensitive. This means database passwords, API keys, and certificates end up in your state file in plaintext. Two rules:

Never pass secrets in as Terraform input variables. Any value that reaches a resource attribute is written to state in plaintext, and it appears in plan output unless the variable is marked sensitive. Use a secrets manager (HashiCorp Vault, AWS Secrets Manager) and reference secrets at runtime rather than provisioning them through Terraform variables.
Encrypt your state backend. S3 buckets should have server-side encryption and bucket policies that restrict access to the CI role. Treat your state bucket with the same care as a secrets store — because that's effectively what it is.
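One way to follow the first rule on AWS is to let the service generate and hold the credential, so Terraform only ever handles a reference to it. The sketch below assumes RDS with its Secrets Manager integration and a reasonably recent AWS provider; the identifier, engine, sizing, and username are placeholders.

```hcl
# RDS generates the master password and stores it in Secrets Manager.
# No password variable exists, so there is nothing to leak through inputs.
resource "aws_db_instance" "app" {
  identifier        = "app-db" # hypothetical
  engine            = "postgres"
  instance_class    = "db.t3.micro"
  allocated_storage = 20
  username          = "app_admin"

  manage_master_user_password = true
}

# Hand the application the secret's ARN, not its value.
# The application fetches the password from Secrets Manager at runtime.
output "db_secret_arn" {
  value = aws_db_instance.app.master_user_secret[0].secret_arn
}
```

The state then records the secret's ARN rather than the password. That does not remove the need for the second rule, because the state still describes everything else about the database.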

Drift Detection

Infrastructure drift — the gap between what Terraform believes exists and what actually exists — is inevitable in any real environment. Console-created resources, hotfixes applied directly to cloud APIs, failed destroys that left orphaned resources: all of these cause drift.

Run terraform plan on a schedule against your environments even when nothing is being changed. If the plan is non-empty, you have drift, and it's better to discover it proactively than during the next actual apply, when unexpected changes appear alongside your intended ones. With the -detailed-exitcode flag, plan exits with code 2 when changes are pending, which makes the check easy to automate.

From experience: The teams with the healthiest Terraform codebases share one trait: they treat their infrastructure code with the same discipline they apply to application code. Code review, version control, automated testing (Terratest or similar), and a cultural norm that "if it's not in Terraform, it doesn't exist." That norm is harder to establish than any technical pattern, but it's the one that matters most.

Where to Go From Here

If you're just starting out: remote state and a basic CI pipeline are the two things that will save you the most pain. Get those right before anything else.

If you're hitting scale problems: audit your module boundaries, split your state files by change frequency, and implement drift detection. These three steps fix the majority of the structural issues I see in mature Terraform codebases.

Terraform scales well when you treat it as a software project — not as a collection of scripts that happen to provision infrastructure. The good news: the patterns that make it scale are well-established and learnable. The investment pays off quickly.
