DevOps Cloud AWS Terraform

Infrastructure as Code with Terraform: A Practical Guide

Terraform is easy to start with and surprisingly easy to get wrong at scale. This is what I've learned about state management, module design, and multi-account deployments from using it across production infrastructure — not from the documentation.


Damilare Adekunle

· 5 min read


Short link: https://ddadekunle.com/p/2


Terraform is one of those tools where the getting-started experience is deceptively smooth. You write a resource block, run terraform apply, and infrastructure appears. The rough edges show up later — when you're managing multiple environments, sharing infrastructure across teams, or trying to make changes to live production resources without breaking anything.

Most Terraform guides cover the syntax. This one covers the decisions.

Remote state is not optional

The default terraform.tfstate file written to your local directory is fine for a tutorial. It's a liability in production. State files contain sensitive data — resource IDs, output values, sometimes secrets — and a local state file means only one person can run Terraform safely at a time. If two people apply concurrently, the state corrupts.

Remote state with locking solves both problems. On AWS the standard setup is S3 for storage and DynamoDB for locking:

terraform {
  backend "s3" {
    bucket         = "my-terraform-state"
    key            = "prod/terraform.tfstate"
    region         = "us-east-1"
    dynamodb_table = "terraform-locks"
    encrypt        = true
  }
}

The DynamoDB table needs a single attribute — LockID (string) — as the partition key. When a terraform apply starts, it writes a lock entry. Any concurrent apply fails immediately with a lock conflict instead of silently corrupting state.

encrypt = true ensures state is encrypted at rest in S3. Always set this.
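The backing resources have to exist before the backend can use them. A minimal sketch, using the bucket and table names from the backend block above (these belong in a separate bootstrap root with local state, since the backend can't store its own state):

```hcl
# Versioned state bucket — versioning gives you a recovery path if
# state is ever damaged.
resource "aws_s3_bucket" "state" {
  bucket = "my-terraform-state"
}

resource "aws_s3_bucket_versioning" "state" {
  bucket = aws_s3_bucket.state.id
  versioning_configuration {
    status = "Enabled"
  }
}

# Lock table: a single string attribute, LockID, as the partition key.
resource "aws_dynamodb_table" "locks" {
  name         = "terraform-locks"
  billing_mode = "PAY_PER_REQUEST"
  hash_key     = "LockID"

  attribute {
    name = "LockID"
    type = "S"
  }
}
```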

Partial backend configuration for multi-account setups

When you're managing infrastructure across multiple AWS accounts — say, separate dev and production accounts — you need separate state for each. The naive approach is a different backend block per environment, but backend blocks can't reference variables, so that means maintaining per-environment copies of your Terraform source, which defeats the point.

The better pattern is partial backend configuration: leave the backend block empty in source code and pass the actual values at init time.

# In your Terraform source — intentionally empty
terraform {
  backend "s3" {}
}

# Dev init
terraform init \
  -backend-config="bucket=dev-terraform-state" \
  -backend-config="key=service/terraform.tfstate" \
  -backend-config="region=us-east-1"

# Prod init
terraform init \
  -backend-config="bucket=prod-terraform-state" \
  -backend-config="key=service/terraform.tfstate" \
  -backend-config="region=us-east-1"

This keeps the backend target out of source code entirely. The source is identical for both environments; the backend is chosen at init time by whoever runs the command, which makes it much harder to accidentally apply against the wrong account's state. In a CI/CD pipeline, the backend values become parameters passed to the workflow, not something baked into the Terraform source.
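One way to wire this up is a small init wrapper. The script name and the environment-prefixed bucket convention here are illustrative, not from the original setup:

```shell
#!/usr/bin/env sh
# Hypothetical tf-init.sh: choose the backend by environment name.
# Usage: ./tf-init.sh dev   or   ./tf-init.sh prod
set -eu
ENV="$1"

# -reconfigure discards any previously initialised backend, so
# switching between dev and prod in the same checkout is safe.
terraform init -reconfigure \
  -backend-config="bucket=${ENV}-terraform-state" \
  -backend-config="key=service/terraform.tfstate" \
  -backend-config="region=us-east-1"
```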

Provider version pinning is not a style preference

Every provider has a version constraint that should be explicit and tight:

terraform {
  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "~> 5.40"
    }
  }
  required_version = ">= 1.7"
}

~> 5.40 permits increments of the rightmost component you specify, so it allows any 5.x release from 5.40 upward (5.40, 5.41, and so on) but blocks the jump to 6.0. To allow patch updates only (5.40.x), pin tighter with ~> 5.40.0. Pinning matters because AWS provider upgrades can introduce breaking changes in resource behaviour: an upgrade that worked fine in your test environment can fail against a live resource with different state.

The .terraform.lock.hcl file that Terraform generates should be committed to version control. It pins the exact provider version used at init time, so every environment — local, CI, production — uses the same binary. Don't add it to .gitignore.
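One wrinkle worth knowing: by default the lock file records provider checksums only for the platform you ran init on. If developers are on macOS and CI runs Linux, record hashes for every platform explicitly:

```shell
# Add provider checksums for each platform that will run init,
# then commit the updated .terraform.lock.hcl.
terraform providers lock \
  -platform=linux_amd64 \
  -platform=darwin_arm64 \
  -platform=darwin_amd64
```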

Modules: what belongs in one and what doesn't

A Terraform module should encapsulate a unit of infrastructure that is always deployed together and always configured the same way — with inputs only for the things that genuinely vary between deployments.

The test I use: if every callsite would pass the same value, it shouldn't be an input; it should be a default or a hardcoded internal. Forcing the caller to think about something that never varies is complexity the module should have absorbed.

For a standard ECS service module, the internal defaults might be CPU allocation, memory, log retention policy, and ECR lifecycle rules. The inputs are the service name, environment, container env vars, and secrets. The caller specifies what's unique to their service; the module handles everything else.
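As a sketch, the module's variables.tf might look like this. The names and baseline values are illustrative, not a prescribed standard:

```hcl
# Inputs: only what genuinely varies between services.
variable "service_name" {
  type = string
}

variable "environment" {
  type = string
}

variable "container_environment" {
  type    = map(string)
  default = {}
}

variable "secrets" {
  type    = map(string)
  default = {}
}

# Deliberately not inputs: fixed inside the module so every
# service gets the same baseline.
locals {
  cpu                = 256
  memory             = 512
  log_retention_days = 30
}
```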

What doesn't belong in a module: shared infrastructure. VPCs, ECS clusters, S3 buckets that multiple things use — these belong in their own Terraform root, applied once per environment, with outputs that modules consume. A module that creates its own VPC will fight with every other module that also wants to create a VPC.
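Consuming those outputs typically goes through a terraform_remote_state data source. The bucket and key here are illustrative:

```hcl
# Shared infrastructure lives in its own root; consumers read its
# outputs rather than creating the resources themselves.
data "terraform_remote_state" "shared" {
  backend = "s3"
  config = {
    bucket = "prod-terraform-state"
    key    = "shared/terraform.tfstate"
    region = "us-east-1"
  }
}

# Callers then thread the shared IDs into service modules, e.g.:
#   vpc_id      = data.terraform_remote_state.shared.outputs.vpc_id
#   cluster_arn = data.terraform_remote_state.shared.outputs.ecs_cluster_arn
```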

Plan output is your audit trail

terraform plan produces a diff of what will change. Always review it before applying, and in a production context, save the plan output and apply from the saved plan:

terraform plan -out=tfplan
# Review the plan
terraform apply tfplan

Applying from a saved plan guarantees that what you reviewed is exactly what gets applied — no changes between plan and apply due to provider drift or a concurrent resource change. Without this, there's a window between plan and apply where something external can change the actual state of a resource, resulting in an apply that looks different from what you reviewed.

In CI/CD pipelines, the saved plan file can be uploaded as an artifact, reviewed by a second engineer, and then applied in a downstream job. This gives you a human-reviewed infrastructure change process without requiring manual Terraform runs.
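Stripped of CI syntax, the two jobs reduce to something like this (artifact upload and the approval gate are pipeline-specific and elided):

```shell
# Job 1: produce the plan and a human-readable rendering of it.
terraform plan -out=tfplan -input=false
terraform show -no-color tfplan > tfplan.txt  # reviewers read this

# ...upload tfplan and tfplan.txt as artifacts; wait for approval...

# Job 2: apply exactly the artifact that was reviewed.
terraform apply -input=false tfplan
```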

terraform destroy is usually the wrong tool

Destroying and recreating infrastructure is rarely the right approach outside of ephemeral environments. For most resource changes, Terraform handles the update in-place or will replace only the specific resource that changed.

Where destroy is valuable: migration infrastructure that was purpose-built for a one-time operation. When I provision AWS DMS replication infrastructure for a database migration — the replication instance, endpoints, VPCs, bastion host — all of it is provisioned via Terraform with the explicit intent of destroying it cleanly after cutover. terraform destroy at the end of a migration is the correct and expected exit path.

For long-lived production infrastructure, prefer targeted replacements (terraform apply -replace=resource.name) over wholesale destroy-and-recreate. And never run terraform destroy on a production environment from a CI/CD pipeline without a mandatory approval gate.
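The -replace flag also works on plan, which combines nicely with the saved-plan workflow above: you review exactly which resource will be replaced before anything happens. The resource address here is illustrative:

```shell
# Force replacement of one resource; everything else is untouched.
terraform plan -replace=aws_instance.bastion -out=tfplan
terraform apply tfplan
```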
