
At Buffer, we have been working on a better admin dashboard for customer advocacy. The new dashboard features a much more powerful search function. As the project was nearing its end, we were prompted to replace managed Elasticsearch on AWS with managed OpenSearch. Our project was built on a version of the elasticsearch client that suddenly no longer supported OpenSearch.

The OpenSearch clients for the languages we use did not yet support transparent AWS SigV4 signing. SigV4 signing is required to authenticate to an OpenSearch cluster using AWS credentials.

The only way forward was to choose one of these options:

Leave our search cluster unauthenticated, in which case it would work with the OpenSearch client. This is a no-go for obvious reasons.
Refactor our code to send raw HTTP requests and implement AWS SigV4 ourselves on those requests. This is impractical, and we didn’t want to write a client library from scratch!
Build a plugin/middleware for the client that implements AWS SigV4 signing. This might work at first, but Buffer is not large enough to maintain it reliably.
Switch our infrastructure to an Elasticsearch cluster hosted on Elastic Cloud. This would have required a lot of work, since we would have had to review Elastic’s Terms and Conditions, pricing, requirements for secure networking, and other time-consuming matters.

It seemed like this project was stuck for the long term! Or was it?

Looking at the situation, these are the things we could not change:

We could no longer use the elasticsearch client.
Switching to the OpenSearch client would work only if the cluster were open and required no authentication.
For obvious reasons, we could not leave the OpenSearch cluster unprotected.

It would be great if the OpenSearch cluster were only available to those applications that actually need it.

Those applications could then connect to the cluster without authentication, while the cluster would be unreachable for everything else.

We designed the following solution with this end goal in mind.

Piggybacking on our recent migration from self-managed Kubernetes to Amazon EKS

We recently moved our compute infrastructure from a self-managed Kubernetes cluster to a cluster managed by Amazon EKS.

As part of that migration, we also changed our Container Network Interface (CNI) plugin from flannel to VPC CNI. This eliminated the overlay/underlay network split, and all our pods now receive VPC-routable IP addresses.

This becomes important later in the solution.

Block cluster access from the outside world

We created the OpenSearch cluster in a private VPC, with no internet-facing IP addresses. The cluster’s IP addresses are therefore not reachable over the internet, only from internal VPC-routable IP addresses.

To control which VPC IP addresses can reach the cluster, we added three security groups.

Build automations to control access to the cluster

We built two automations that run as AWS Lambdas.

Security Group Manager: This automation can perform two operations on demand.
-> Add an IP address to one of the three security groups, whichever has the fewest rules at the time.
-> Remove an IP address from all three security groups.
Pod Lifecycle Auditor: This automation runs on a schedule; we’ll get to exactly what it does in a moment.
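To make the two Security Group Manager operations concrete, here is a minimal sketch of the selection logic only. It is not our actual Lambda code: the security groups are modeled as an in-memory dictionary, the names `add_ip` and `remove_ip` are hypothetical, and the idempotency check is an assumption. A real Lambda would read and change the rules through the EC2 API (for example boto3's `authorize_security_group_ingress` and `revoke_security_group_ingress`) instead.

```python
from typing import Dict, List, Set

# Hypothetical in-memory model: security group ID -> set of whitelisted CIDRs.
# A real Lambda would read and modify these through the EC2 API instead.
Groups = Dict[str, Set[str]]

MAX_RULES_PER_GROUP = 50  # per-group rule limit quoted in this article


def add_ip(groups: Groups, ip: str) -> str:
    """Whitelist `ip` in the group with the fewest rules; return that group's ID."""
    cidr = f"{ip}/32"
    # Assumption: adding an IP that is already whitelisted is a no-op.
    for group_id, cidrs in groups.items():
        if cidr in cidrs:
            return group_id
    # Pick the group with the fewest rules (first one wins on a tie).
    target = min(groups, key=lambda group_id: len(groups[group_id]))
    if len(groups[target]) >= MAX_RULES_PER_GROUP:
        raise RuntimeError("all security groups are full")
    groups[target].add(cidr)
    return target


def remove_ip(groups: Groups, ip: str) -> List[str]:
    """Remove `ip` from every group that contains it; return the affected group IDs."""
    cidr = f"{ip}/32"
    affected = []
    for group_id, cidrs in groups.items():
        if cidr in cidrs:
            cidrs.remove(cidr)
            affected.append(group_id)
    return affected
```

Always adding to the least-loaded group keeps the three groups roughly balanced, so no single group reaches its rule limit while the others still have room.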

How it all connects together

We added an InitContainer to every pod that needs access to the OpenSearch cluster. On start, it invokes the Security Group Manager automation and asks it to add the pod’s IP address to one of the security groups. This gives the pod access to the OpenSearch cluster.

In real life, things happen: pods get killed and get new IP addresses. So, on a schedule, the Pod Lifecycle Auditor lists all the whitelisted IP addresses in the three security groups that allow access to the cluster. It then checks which of those IP addresses should no longer be there and asks the Security Group Manager to remove them.
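As a rough illustration of the InitContainer step, the sketch below builds the request such a container might send. Everything specific here is an assumption: the `POD_IP` environment variable (which could be populated through the Kubernetes Downward API field `status.podIP`), the `build_add_request` name, and the JSON payload shape are all hypothetical, and how the request reaches the Lambda is left out.

```python
import ipaddress
import json
import os
from typing import Mapping


def build_add_request(env: Mapping[str, str] = os.environ) -> str:
    """Build a JSON request asking for this pod's IP address to be whitelisted."""
    # Assumption: POD_IP is injected into the init container's environment.
    pod_ip = env.get("POD_IP")
    if not pod_ip:
        raise RuntimeError("POD_IP is not set")
    # Raises ValueError if the value is not a valid IP address.
    ipaddress.ip_address(pod_ip)
    # Hypothetical payload shape; the real automation's contract may differ.
    return json.dumps({"action": "add", "ip": pod_ip}, sort_keys=True)
```

Validating the address before sending keeps a misconfigured pod from asking for a malformed rule; the init container would then exit non-zero and the pod would not start without access.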
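The auditing step boils down to a set difference. The sketch below shows that core idea under stated assumptions: the function name `find_stale_ips` is hypothetical, the security groups are modeled as a dictionary of `/32` CIDRs, and the list of live pod IPs is passed in rather than fetched from the Kubernetes API.

```python
from typing import Dict, Iterable, List, Set


def find_stale_ips(groups: Dict[str, Set[str]], live_pod_ips: Iterable[str]) -> List[str]:
    """Return whitelisted IPs that no longer belong to a running pod."""
    live = set(live_pod_ips)
    whitelisted = set()
    for cidrs in groups.values():
        for cidr in cidrs:
            # Rules are modeled as "<ip>/32"; keep only the address part.
            whitelisted.add(cidr.split("/")[0])
    # Anything whitelisted but not live should be removed.
    return sorted(whitelisted - live)
```

Each address returned would then be handed to the Security Group Manager's remove operation.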

This diagram shows how it all connects together.


Diagram of our solution to OpenSearch access problems using automated whitelisting. Source: Peter Emil, Buffer’s Infrastructure Team

Extra Gotchas

Why did we create three security groups to control access to the OpenSearch cluster?

Security groups are limited to 50 ingress/egress rules. We expect that no more than 70-90 pods will need access to the cluster at any one time. Three security groups give us a limit of 150 rules, which feels like a safe place to start.
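The sizing can be checked with a few lines of arithmetic. This sketch assumes one rule per pod IP address and uses the upper end of the 70-90 estimate; the variable names are illustrative.

```python
import math

RULES_PER_GROUP = 50     # per-group limit quoted above
MAX_EXPECTED_PODS = 90   # upper end of the 70-90 estimate
GROUP_COUNT = 3

# Total rules available across all groups.
capacity = GROUP_COUNT * RULES_PER_GROUP
# Fewest groups that could hold one rule per pod.
minimum_groups = math.ceil(MAX_EXPECTED_PODS / RULES_PER_GROUP)
# Spare rules left at the expected peak.
headroom = capacity - MAX_EXPECTED_PODS

print(capacity, minimum_groups, headroom)  # prints: 150 2 60
```

Two groups would cover the expected peak with only 10 rules to spare, so the third group is what provides a comfortable margin.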

Do I need to host the OpenSearch cluster in the same VPC as the EKS cluster?

It depends on your networking setup! If your VPC has private subnets with NAT gateways, you can host the OpenSearch cluster in any VPC you like. If you don’t have private subnets, you need to host both clusters in the same VPC, because by default VPC CNI NATs VPC-external pod traffic to the hosting node’s IP address, which invalidates this solution. If you turn off that NAT configuration, your pods can’t reach the internet, which is a much bigger problem.

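If a pod gets stuck in CrashLoopBackOff, won’t the repeated restarts use up the security group rules?

No, because a container that crashes within a pod is restarted within the same pod, with the same IP address. The IP address does not change, so the restarts do not consume additional rules.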

Aren’t those automations a single point of failure?

They are indeed, which is why it is important to approach them with an SRE mindset. Adequate monitoring of these automations, combined with rolling deployments, is key to keeping them reliable. Since the automations were put in place, they have been stable and we have not had any incidents. Still, I rest easy knowing that if any of the automations fails, I will be notified before it becomes a noticeable problem.

Conclusion

This solution is not perfect, but it was the quickest and easiest one to implement that required neither continuous maintenance nor the time needed to onboard a new cloud provider.

Over to you

What do you think of this approach? Have you run into similar situations at your company? We’d love to hear from you!
