Moving from development to production usually means one thing: taking an architecture where everything is public and locking it down. I recently did this for a containerised API running on ECS Fargate — migrating from a public-facing Application Load Balancer to a private architecture using API Gateway and an internal Network Load Balancer. On paper the plan was solid. In Terraform it looked clean. After terraform apply, the service was down.
Here's what actually happened.
The Architecture Shift
Before:
Public Internet → Internet Gateway → Public ALB → ECS Fargate (Public Subnet)
After:
Public Internet → API Gateway → VPC Link → Internal NLB → ECS Fargate (Private Subnet)
The motivation for this change was twofold. First, I wanted to use API Gateway's usage plans for per-client rate limiting — something an ALB can't do natively. Second, moving ECS tasks into private subnets with no public IP is the correct security posture for a production service: the application should be unreachable from anything except the load balancer in front of it.
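A minimal sketch of the first point in Terraform — a usage plan attaches per-API-key throttling to a stage. The resource names here are illustrative, not from my actual config:

```hcl
# Per-client rate limiting via an API Gateway usage plan (illustrative names)
resource "aws_api_gateway_usage_plan" "clients" {
  name = "standard-clients"

  api_stages {
    api_id = aws_api_gateway_rest_api.api.id
    stage  = aws_api_gateway_stage.prod.stage_name
  }

  throttle_settings {
    rate_limit  = 50  # steady-state requests per second
    burst_limit = 100
  }
}

# Bind a client's API key to the plan
resource "aws_api_gateway_usage_plan_key" "client_a" {
  key_id        = aws_api_gateway_api_key.client_a.id
  key_type      = "API_KEY"
  usage_plan_id = aws_api_gateway_usage_plan.clients.id
}
```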
API Gateway communicates with private resources through a VPC Link, which requires a Network Load Balancer. The ALB had to go.
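The wiring, roughly, looks like this in Terraform (names illustrative): a REST API VPC Link targets the internal NLB's ARN, and the proxy integration routes through it.

```hcl
# Internal NLB in the private subnets (illustrative names)
resource "aws_lb" "internal" {
  name               = "api-nlb"
  internal           = true
  load_balancer_type = "network"
  subnets            = var.private_subnet_ids
  security_groups    = [aws_security_group.nlb.id]
}

# VPC Link bridging API Gateway into the VPC
resource "aws_api_gateway_vpc_link" "this" {
  name        = "api-vpc-link"
  target_arns = [aws_lb.internal.arn]
}

# Proxy integration routed through the VPC Link
resource "aws_api_gateway_integration" "proxy" {
  rest_api_id             = aws_api_gateway_rest_api.api.id
  resource_id             = aws_api_gateway_resource.proxy.id
  http_method             = "ANY"
  integration_http_method = "ANY"
  type                    = "HTTP_PROXY"
  uri                     = "http://${aws_lb.internal.dns_name}/{proxy}"
  connection_type         = "VPC_LINK"
  connection_id           = aws_api_gateway_vpc_link.this.id
}
```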
After applying the Terraform plan, the API Gateway returned a generic 500 Internal Server Error. CloudWatch integration logs showed the backend was failing, and the target group in the EC2 console showed all Fargate tasks as unhealthy.
Trap 1: NLBs Preserve the Source IP
My first instinct was a security group problem. I looked at the ECS task security group — it allowed inbound traffic from the VPC CIDR, which is where the VPC Link lives. That should cover it.
It didn't.
The thing about Network Load Balancers is that they operate at Layer 4 (TCP), not Layer 7 (HTTP). Unlike an Application Load Balancer, which terminates the connection and forwards requests from its own IP, an NLB passes through the original client's source IP. When a request comes in from API Gateway, the ECS task sees the source as the VPC Link's ENI IP address. My security group rule allowed that.
The health checks, however, are a different story. NLB health checks don't originate from the VPC Link. They originate from the load balancer nodes themselves — specifically from the NLB's own IP addresses within the subnet. Because I hadn't created a rule allowing traffic from the NLB's security group, every health check was being silently dropped by the security group. All targets stayed permanently unhealthy. No traffic was ever forwarded.
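For context, the health check in question lives on the NLB target group. A minimal sketch (names illustrative) — with target_type = "ip", the Fargate tasks register by ENI address:

```hcl
resource "aws_lb_target_group" "app" {
  name        = "api-tg"
  port        = 8000
  protocol    = "TCP"
  target_type = "ip" # Fargate tasks register by ENI IP
  vpc_id      = var.vpc_id

  health_check {
    protocol = "TCP"
    port     = "traffic-port" # checks the same port the app listens on
  }
}
```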
The fix was updating the ECS task security group to accept inbound from two sources:
# Allow application traffic via VPC Link
ingress {
  from_port   = 8000
  to_port     = 8000
  protocol    = "tcp"
  cidr_blocks = [var.vpc_cidr]
}

# Allow NLB health checks from the load balancer nodes
ingress {
  from_port       = 8000
  to_port         = 8000
  protocol        = "tcp"
  security_groups = [aws_security_group.nlb.id]
}
After this change, health checks started passing and traffic began flowing. But there was still an intermittent connectivity issue.
Trap 2: Over-Securing an Internal Load Balancer
In trying to tighten the perimeter, I restricted the NLB security group's inbound rules to specific IP ranges. This caused intermittent failures: traffic arriving through a VPC Link doesn't come from a fixed, predictable set of source IPs, so some requests matched the allow-list and others were silently dropped.
The realisation was simple once I thought through the topology: an internal NLB has no public IP address. It is, by definition, only reachable from within the private network. Setting its inbound security group to 0.0.0.0/0 doesn't expose it to the internet — it just means "if you're already inside this private network and can route to me, you're allowed in."
The VPC boundary and subnet routing are what enforce the isolation. The security group on an internal load balancer is a second layer, not the primary control. Trying to use it as the primary control introduces brittleness without adding meaningful security.
resource "aws_security_group" "nlb" {
  name   = "nlb-sg"
  vpc_id = var.vpc_id

  ingress {
    from_port   = 443
    to_port     = 443
    protocol    = "tcp"
    cidr_blocks = ["0.0.0.0/0"] # Internal load balancer — not reachable from outside the VPC
  }

  egress {
    from_port   = 0
    to_port     = 0
    protocol    = "-1"
    cidr_blocks = ["0.0.0.0/0"]
  }
}
What to Remember
Private networking on AWS requires thinking in layers. The network topology — private subnets, VPC-only routing, internal load balancers — provides one layer of isolation. Security groups provide a second. Mixing up which layer is responsible for what leads to either broken connectivity or a false sense of security.
The two rules that would have saved me the debugging session:
1. NLBs preserve source IPs. Your ECS security group needs to allow traffic from both the upstream source (API Gateway / VPC Link) and the load balancer nodes themselves for health checks. These are different origins.
2. Internal load balancers are isolated by topology, not by security groups. If it has internal = true, it has no public IP. The internet cannot reach it regardless of what the security group says. Don't over-restrict and introduce fragility in the name of security that's already provided elsewhere.