
The Hidden Trap of Migrating to Private Network Load Balancers on AWS

A deep dive into the security group pitfalls of moving a public ALB behind AWS API Gateway, a VPC Link, and an internal Network Load Balancer.

Damilare Adekunle

· 3 min read

Moving from Development to Production often means one thing: taking your "everything is public" architecture and locking it down.

Recently, I migrated my Recommendation API from a public-facing Application Load Balancer (ALB) to a private architecture using AWS API Gateway and an Internal Network Load Balancer (NLB).

On paper, the plan was solid. In Terraform, it looked clean. In reality, it took down the service. Here is the technical breakdown of why it failed and how I fixed it.

The Architecture Shift

Before: Public Internet -> Internet Gateway -> Public ALB -> ECS Fargate (Public Subnet)

After: Public Internet -> API Gateway -> VPC Link -> Internal NLB -> ECS Fargate (Private Subnet)

I switched to a REST API Gateway to leverage API Keys and Usage Plans for rate limiting. Because API Gateway communicates with private resources via a "VPC Link," I had to swap my Layer 7 ALB for a Layer 4 NLB.
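For context, the wiring between the REST API Gateway and the internal NLB looks roughly like this in Terraform. This is a sketch, not my actual config: resource names, variables, and the `/{proxy}` route are illustrative.

```hcl
# Internal NLB living in the private subnets.
resource "aws_lb" "internal_nlb" {
  name               = "recommendation-api-nlb"
  internal           = true
  load_balancer_type = "network"
  subnets            = var.private_subnet_ids
}

# VPC Link that lets the REST API Gateway reach the internal NLB.
resource "aws_api_gateway_vpc_link" "this" {
  name        = "recommendation-api-vpc-link"
  target_arns = [aws_lb.internal_nlb.arn]
}

# Proxy integration routed through the VPC Link.
resource "aws_api_gateway_integration" "proxy" {
  rest_api_id             = aws_api_gateway_rest_api.api.id
  resource_id             = aws_api_gateway_resource.proxy.id
  http_method             = "ANY"
  type                    = "HTTP_PROXY"
  integration_http_method = "ANY"
  connection_type         = "VPC_LINK"
  connection_id           = aws_api_gateway_vpc_link.this.id
  uri                     = "http://${aws_lb.internal_nlb.dns_name}/{proxy}"
}
```

Note that REST API Gateway VPC Links target an NLB ARN directly, which is why the ALB had to go.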

The "500 Error" Mystery

After applying the Terraform plan, the API Gateway returned a generic 500 Internal Server Error.
The API Gateway logs said the integration was failing.
The Target Group showed all Fargate tasks as Unhealthy.
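For reference, the target group behind the NLB was defined roughly like this (names and the port are illustrative). With an NLB, the default health check is a plain TCP connect against the target:

```hcl
resource "aws_lb_target_group" "api" {
  name        = "recommendation-api-tg"
  port        = 8080
  protocol    = "TCP"
  target_type = "ip" # required for Fargate tasks
  vpc_id      = var.vpc_id

  health_check {
    protocol            = "TCP"
    healthy_threshold   = 3
    unhealthy_threshold = 3
  }
}
```

Every task sat at Unhealthy even though the containers themselves were fine, which pointed at the network path rather than the application.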

Trap #1: The Source IP Confusion

I configured my ECS Security Group to allow traffic from the VPC CIDR (where the API Gateway VPC Link lives). I assumed this would cover everything.

The Gotcha: Network Load Balancers (NLBs) preserve the client IP address.
When a request comes from the API Gateway, the ECS task sees the Source IP as the VPC Link ENI. My Security Group allowed this.

However, NLB Health Checks do not come from the client. They originate from the Load Balancer nodes themselves. Because I hadn't explicitly allowed traffic from the Load Balancer's Security Group, the health checks were being blocked by the firewall.

The Fix:
I modified the ECS Security Group to accept ingress from two sources:

The VPC CIDR (for the actual API traffic).

The Load Balancer Security Group (specifically for health checks).
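Expressed in Terraform, the two ingress rules look something like this. It's a sketch with hypothetical names, and it assumes the NLB was created with its own security group (NLBs have supported security groups since mid-2023):

```hcl
# Ingress for real API traffic, which arrives from the VPC Link ENIs
# because the NLB preserves the client IP.
resource "aws_vpc_security_group_ingress_rule" "api_traffic" {
  security_group_id = aws_security_group.ecs_tasks.id
  ip_protocol       = "tcp"
  from_port         = 8080
  to_port           = 8080
  cidr_ipv4         = var.vpc_cidr
}

# Ingress for NLB health checks, which originate from the load
# balancer nodes themselves, not from the client.
resource "aws_vpc_security_group_ingress_rule" "health_checks" {
  security_group_id            = aws_security_group.ecs_tasks.id
  ip_protocol                  = "tcp"
  from_port                    = 8080
  to_port                      = 8080
  referenced_security_group_id = aws_security_group.nlb.id
}
```

If your NLB predates security group support (or was created without one), the equivalent fix is to allow the private IPs or subnet CIDRs of the load balancer nodes.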

Trap #2: Over-Securing the Internal

In my zeal for security, I tried to lock down the Internal NLB's security group to specific IP ranges. This caused intermittent connectivity issues, because traffic arriving through a VPC Link comes from ENI addresses that are hard to predict and pin down.

The Realization:
An AWS Load Balancer with internal = true has no public IP address. There is simply no route to it from the public internet.

By setting the Internal NLB's Security Group ingress to 0.0.0.0/0, I wasn't opening it to the world. I was simply saying, "If you are already inside this private network and can route to me, come on in."
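That rule looks alarming in code review but is safe here. A minimal sketch, assuming a TLS listener on 443 (names and port are illustrative):

```hcl
resource "aws_security_group" "nlb" {
  name   = "recommendation-api-nlb-sg"
  vpc_id = var.vpc_id
}

resource "aws_vpc_security_group_ingress_rule" "nlb_ingress" {
  security_group_id = aws_security_group.nlb.id
  ip_protocol       = "tcp"
  from_port         = 443
  to_port           = 443
  # 0.0.0.0/0 does not expose this NLB publicly: with internal = true
  # there is no route from the internet, so this only admits traffic
  # that is already inside (or routed into) the VPC.
  cidr_ipv4         = "0.0.0.0/0"
}
```

The real perimeter is the routing topology, not this rule.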

Conclusion

When working with AWS networking, "Least Privilege" is the goal, but "Functional Connectivity" is the requirement.

Remember that NLBs preserve source IPs (unlike ALBs).

Ensure your Security Groups account for both User Traffic AND Infrastructure Traffic (Health Checks).

Don't over-complicate rules for resources that are already isolated by the network topology.

Happy coding!
