Terraform is one of those tools where getting started takes an afternoon and getting it right takes years. The initial appeal is obvious: describe your infrastructure in HCL, run apply, and it exists. The problems start when you have three environments, a team of eight, and a state file that everyone is afraid to touch.
I've worked with teams at various stages of this journey. The structural mistakes are almost always the same — and they're almost always fixable before they become genuinely painful, if you know what to look for.
The State Problem Is Always Bigger Than You Think
Terraform state is the source of truth for what Terraform believes your infrastructure looks like. In a single-person project, a local terraform.tfstate file works fine. In any team context, it's a disaster waiting to happen: two people run apply at the same time, states diverge, resources get orphaned or double-created.
Remote state with locking is non-negotiable from the moment a second person touches your Terraform. The standard AWS setup is an S3 bucket with a DynamoDB table for locking; the GCS and Azure Blob backends provide locking natively. HCP Terraform (formerly Terraform Cloud) handles this for you if you prefer a managed option.
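A minimal sketch of the S3 setup described above. The bucket name, state key, region, and table name are placeholders, not values from any real environment, and the DynamoDB table is assumed to exist already with a string hash key named LockID. Recent Terraform releases also support S3-native locking through the use_lockfile backend argument, so check the backend documentation for your version before copying this.

```hcl
terraform {
  backend "s3" {
    # Hypothetical bucket name; create the bucket before running `terraform init`.
    bucket = "example-org-terraform-state"

    # One key per state boundary keeps each state file small and independent.
    key    = "foundation/prod/terraform.tfstate"
    region = "us-east-1"

    # Server-side encryption for the state object at rest.
    encrypt = true

    # Hypothetical lock table; it must have a string hash key named "LockID".
    dynamodb_table = "terraform-locks"
  }
}
```

With this in place, a second apply against the same key fails to acquire the lock instead of racing the first one.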
The other common failure is a single monolithic state file for everything. Every plan refreshes every resource, which is slow, noisy, and risky, and a mistake in one module can affect resources across the whole system. Split state by blast radius.
State Splitting: The Right Granularity
The goal of splitting state is to minimize the blast radius of a bad apply. A useful heuristic: split where the rate of change differs significantly and where resources don't need to reference each other directly at creation time.
| State boundary | Contains | Change frequency |
|---|---|---|
| Foundation | VPCs, DNS zones, shared IAM, KMS keys | Rarely (months) |
| Platform | Kubernetes cluster, databases, registries | Occasionally (weeks) |
| Application | Deployments, services, app-specific IAM | Frequently (daily) |
| Per-environment | Dev / staging / prod variants of the above | Parallel to above |
Cross-state references use terraform_remote_state data sources or (better) SSM Parameter Store / Vault for loose coupling. Hard-coding ARNs or IDs across state boundaries defeats the purpose.
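As a rough illustration of the loosely coupled option, the sketch below publishes a VPC ID from the foundation state to SSM Parameter Store and reads it back in an application state. The parameter path, resource names, and CIDR range are all hypothetical, and the two halves are assumed to live in separate root modules with separate state files.

```hcl
# --- Foundation configuration (its own state file) ---
resource "aws_vpc" "main" {
  cidr_block = "10.0.0.0/16" # placeholder range
}

# Publish the ID under a well-known, hypothetical path.
resource "aws_ssm_parameter" "vpc_id" {
  name  = "/infra/prod/foundation/vpc_id"
  type  = "String"
  value = aws_vpc.main.id
}

# --- Application configuration (a different state file) ---
# Look the value up by path instead of hard-coding the ID.
data "aws_ssm_parameter" "vpc_id" {
  name = "/infra/prod/foundation/vpc_id"
}

resource "aws_security_group" "app" {
  name   = "app"
  vpc_id = data.aws_ssm_parameter.vpc_id.value
}
```

The application state only needs read access to one parameter path, not to the foundation state file itself, which is the main advantage over terraform_remote_state.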
Module Design: The Mistakes Everyone Makes
Too much in one module
The first instinct is to write a module that creates everything a service needs: the database, the Kubernetes namespace, the IAM roles, the DNS record, the monitoring dashboard. This feels clean until you need to create a database without a monitoring dashboard, or you want to reuse the IAM role pattern independently. Modules should have a single responsibility — usually one infrastructure concept, not one application's needs.
Modules that are just wrappers with no abstraction
The opposite mistake: a module with 40 input variables that maps 1:1 to the underlying resource arguments. If you're not hiding complexity or enforcing constraints, you don't have a module — you have extra indirection. A good module exposes fewer variables than the underlying resource and encodes organizational decisions (naming conventions, tagging standards, security defaults) so callers don't have to think about them.
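To make the contrast concrete, here is a sketch of a small bucket module that exposes three inputs and bakes in the rest. The "acme" prefix, tag keys, and allowed environment names are invented for illustration; the point is only that naming, tagging, and security defaults live inside the module, not in every caller.

```hcl
variable "name" {
  type        = string
  description = "Short logical name for the bucket."
}

variable "environment" {
  type = string

  # Enforce a constraint instead of trusting every caller.
  validation {
    condition     = contains(["dev", "staging", "prod"], var.environment)
    error_message = "The environment must be one of: dev, staging, prod."
  }
}

variable "team" {
  type        = string
  description = "Owning team, used for tagging."
}

locals {
  # Hypothetical naming convention and tagging standard.
  bucket_name = "acme-${var.environment}-${var.name}"
  tags = {
    Environment = var.environment
    Team        = var.team
    ManagedBy   = "terraform"
  }
}

resource "aws_s3_bucket" "this" {
  bucket = local.bucket_name
  tags   = local.tags
}

# Security defaults the caller cannot forget or override.
resource "aws_s3_bucket_public_access_block" "this" {
  bucket                  = aws_s3_bucket.this.id
  block_public_acls       = true
  block_public_policy     = true
  ignore_public_acls      = true
  restrict_public_buckets = true
}

resource "aws_s3_bucket_server_side_encryption_configuration" "this" {
  bucket = aws_s3_bucket.this.id

  rule {
    apply_server_side_encryption_by_default {
      sse_algorithm = "aws:kms"
    }
  }
}

output "bucket_arn" {
  value = aws_s3_bucket.this.arn
}
```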
Not versioning modules
If your modules live in the same repository as your environments and you reference them with source = "../modules/vpc", every change to the module immediately affects every environment that uses it. Publish modules to a registry (Terraform Registry, a private registry, or even a separate Git repo with version tags) and pin versions in your environment configs. Upgrading then becomes an explicit, reviewable decision.
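Pinning might look like either of the following. The repository URL, registry address, version numbers, and the cidr_block input are hypothetical, and the two blocks are alternatives, not something you would use together.

```hcl
# Option 1: a separate Git repository, pinned to a version tag via ?ref=.
module "vpc_from_git" {
  source = "git::https://example.com/acme/terraform-modules.git//modules/vpc?ref=v1.4.0"

  cidr_block = "10.0.0.0/16" # hypothetical module input
}

# Option 2: a registry source. The version argument is only valid
# for registry sources, not for Git or local paths.
module "vpc_from_registry" {
  source  = "app.terraform.io/acme/vpc/aws"
  version = "~> 1.4" # any 1.x release at or above 1.4

  cidr_block = "10.0.0.0/16" # hypothetical module input
}
```

Either way, moving an environment to a new module version shows up as a one-line diff in a pull request.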
The most dangerous anti-pattern: running terraform apply manually against production from a developer's workstation. Manual applies with local credentials, no audit trail, and no review step cause the majority of production incidents I've seen in Terraform-managed infrastructure.
CI/CD for Terraform: The Baseline
The minimum viable pipeline for Terraform in production:
- On pull request: terraform fmt -check, terraform validate, terraform plan with the output posted as a PR comment. Nobody merges without a reviewed plan.
- On merge to main: terraform apply in a locked environment, with output logged and surfaced. No human runs apply locally against production.
- State locking enforced: The CI role has write access; developer roles have read-only or no access to production state.
Tools for this: Atlantis (self-hosted, excellent for GitHub/GitLab), Spacelift, HCP Terraform, or a custom pipeline in your existing CI system. The tool matters less than the discipline — plan-before-apply, always reviewed, never local.
Secrets: What Not to Do
Terraform state stores everything it manages, including resource attributes marked as sensitive. This means database passwords, API keys, and certificates end up in your state file in plaintext. Two rules:
- Never pass secrets as Terraform input variables. They'll end up in the state and, unless the variable is marked sensitive, in your plan output. Use a secrets manager (HashiCorp Vault, AWS Secrets Manager) and reference secrets at runtime rather than provisioning them through Terraform variables.
- Encrypt your state backend. S3 buckets should have server-side encryption and bucket policies that restrict access to the CI role. Treat your state bucket with the same care as a secrets store — because that's effectively what it is.
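One way to apply the first rule on AWS, offered as a sketch, not the only approach: let RDS generate and store the master password in Secrets Manager, so the password itself never passes through a Terraform variable. The identifier, engine, instance size, and username below are placeholders.

```hcl
resource "aws_db_instance" "main" {
  identifier        = "app-db" # hypothetical name
  engine            = "postgres"
  instance_class    = "db.t3.micro"
  allocated_storage = 20
  username          = "app_admin"

  # RDS creates the password and keeps it in Secrets Manager;
  # no password argument or variable is supplied here.
  manage_master_user_password = true
}

# Hand the application a reference to the secret, not the secret itself.
output "db_secret_arn" {
  value = aws_db_instance.main.master_user_secret[0].secret_arn
}
```

The application then fetches the password from Secrets Manager at runtime using the ARN, and the state file holds only the reference.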
Drift Detection
Infrastructure drift — the gap between what Terraform believes exists and what actually exists — is inevitable in any real environment. Console-created resources, hotfixes applied directly to cloud APIs, failed destroys that left orphaned resources: all of these cause drift.
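Run terraform plan on a schedule against every environment, even when nothing is being changed. If the plan is non-empty, you have drift, and it's better to discover it proactively than during the next actual apply, when unexpected changes appear alongside your intended ones. The -detailed-exitcode flag makes this easy to automate: plan exits with 0 when there are no changes, 1 on error, and 2 when changes are present.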
Where to Go From Here
If you're just starting out: remote state and a basic CI pipeline are the two things that will save you the most pain. Get those right before anything else.
If you're hitting scale problems: audit your module boundaries, split your state files by change frequency, and implement drift detection. These three steps fix the majority of the structural issues I see in mature Terraform codebases.
Terraform scales well when you treat it as a software project — not as a collection of scripts that happen to provision infrastructure. The good news: the patterns that make it scale are well-established and learnable. The investment pays off quickly.