From Spot Ocean to Karpenter - One Year Later

Script:

--- slide 1 ---

Hi and welcome. At adjoe we have been running almost exclusively on spot instances for well over six years now. For the vast majority of this time we used Spot Ocean as our solution for cluster autoscaling, but as we grew in traffic and in EC2 instances we encountered more and more problems with it, and we became less and less fond of relying on a third-party provider for one of the most critical parts of our infrastructure. If a traffic spike hits you and you cannot scale your cluster because a third-party service happens to be unavailable at the time, well, tough luck. That, on top of the ever-increasing cut we had to pay to Spot Ocean, led us to explore alternatives. What we found is Karpenter, and we have now been running Karpenter in production across multiple clusters for more than a year. During this time we ran into all kinds of issues, and today I want to share with you the lessons we learned along the way. But first, a few sentences about myself.

--- slide 2 ---

I am Marius. I work as a DevOps Engineer at adjoe, a fast-growing adtech scale-up based in Hamburg, Germany, where I spend most of my time optimizing our infrastructure for reliability and cost efficiency. I also contribute to open source: I've worked on CoreDNS in the past, have made some small contributions to Karpenter by now, and maintain a few Nix packages. And following the latest trends not only in tech but also in my personal life, I got really into padel recently.

--- slide 3 ---

When it comes to adjoe, we have grown our headcount by over 50% since the last time I spoke at an SREday event, which was in March this year - so really not that long ago. Our backend now runs on more than 300 EC2 instances - sometimes it's more, sometimes it's less, and I will go into more detail on that - and with all this infrastructure we are now handling more than 5 billion API requests daily.

--- slide 4 ---

--- slide 20 ---

You would think everything would be great from here: we successfully migrated and would live as happy Karpenter users until the end of our days. But actually we would have alerts go off at completely random times - and of course by completely random I mean at 4 am on a Friday - because some nodes would get stuck in an unreachable state indefinitely.

--- slide 21 ---

This can happen for a variety of reasons: the cloud provider could have handed you a bad node with hardware issues on their end, or the kubelet could have been OOM-killed - either way, these cases now required manual intervention from us every time to terminate the node.
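To make it a bit more concrete, here is a rough sketch in Go (using client-go) of what that manual cleanup boils down to once you script it: find the nodes whose Ready condition has been Unknown for too long and delete them, so Karpenter can drain them and terminate the instance behind them via its finalizer. The 15-minute threshold and the standalone-program shape are just assumptions for illustration, not our actual tooling.

```go
package main

import (
	"context"
	"fmt"
	"time"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Load the local kubeconfig; in-cluster config would work the same way.
	config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	clientset, err := kubernetes.NewForConfig(config)
	if err != nil {
		panic(err)
	}

	ctx := context.Background()
	nodes, err := clientset.CoreV1().Nodes().List(ctx, metav1.ListOptions{})
	if err != nil {
		panic(err)
	}

	for _, node := range nodes.Items {
		for _, cond := range node.Status.Conditions {
			// A node whose Ready condition has been "Unknown" for a while is
			// unreachable: the kubelet stopped reporting, e.g. it was OOM-killed
			// or the underlying hardware went bad.
			if cond.Type == corev1.NodeReady &&
				cond.Status == corev1.ConditionUnknown &&
				time.Since(cond.LastTransitionTime.Time) > 15*time.Minute { // threshold is an assumption
				fmt.Printf("deleting unreachable node %s\n", node.Name)
				// Deleting the Node object is enough: Karpenter's finalizer takes
				// over, drains the node, and terminates the backing EC2 instance.
				if err := clientset.CoreV1().Nodes().Delete(ctx, node.Name, metav1.DeleteOptions{}); err != nil {
					fmt.Printf("failed to delete %s: %v\n", node.Name, err)
				}
			}
		}
	}
}
```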

--- slide 22 ---

After some research we found that both of these cases were handled automatically for us by Cluster Autoscaler and Spot Ocean, but Karpenter before version 1.1.0 did not - they had some reasoning for it of course: they wanted to prioritize letting people debug these kinds of nodes and help them find the root cause. But in production that's rarely what we want; we want to move on and get back into a good state as quickly as possible. Now, as I said, Karpenter fixed this problem in version 1.1.0.