Tuesday, November 12, 2024

Q&A: Solving the issue of stale feature flags

As last week’s bad update from CrowdStrike showed, it’s clearer than ever that companies releasing software need a way to roll back updates if things go wrong.

In the most recent episode of our podcast, What the Dev?, we spoke with Konrad Niemiec, founder and CEO of the feature flagging tool Lekko, about the importance of adding feature flags to your code, and about what can go wrong if flags aren’t properly maintained.

Here is an edited and abridged version of that conversation:

David Rubinstein, editor-in-chief of SD Times: For years we’ve been talking about feature flagging in the context of code experimentation, where you can release to a small cohort of people. And if they like it, you can spread it out to more people, or you can roll it back without really doing any damage if it doesn’t work the way you thought it would. What’s your take on the whole feature flag situation?

Konrad Niemiec, founder and CEO of Lekko: Feature flagging is now considered the mainstream way of releasing software features. So it’s definitely a practice that we want people to continue doing and continue evangelizing.  

When I was at Uber, we used a dynamic configuration tool called Flipper. I then left Uber for a smaller startup called Sisu, where we used one of the leading feature flagging tools on the market. Although it let us feature flag and did solve a bunch of problems for us, we encountered different issues that added risk and complexity to our system.

So we ended up having a bunch of stale flags littered around our codebase, and things we needed to keep around because the business needed them. And so we ended up in a situation where code became very difficult to maintain, and it was very hard to keep things clean. And we just ended up causing issues left and right.

DR: What do you mean by a stale flag?

KN: An implementation of a feature flag often looks like an if statement in the code. It’ll say if the feature flag is enabled, I’ll do one thing; otherwise, I’ll do the old version of the code. This is what it looks like when you’re actually adding it as an engineer. And what a stale flag means is that the flag is all the way on. So you’ve fully rolled it out, but you’re leaving that ‘else’ code path in there. So you basically have some code that’s pretty much never going to get run, but it’s still sitting in your binaries. And it almost turns into this zombie. We like to call them zombie flags, because they pop up when you least expect them. You think they’re dead, but they come back to life.
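As a minimal sketch of the if/else pattern described here (not Lekko's API): the in-memory store `FLAGS`, the helper `is_enabled`, the flag name `new_checkout`, and the `checkout_total` function are all hypothetical names chosen for illustration.

```python
# Hypothetical in-memory flag store; a real system would query a flag service.
FLAGS = {"new_checkout": True}  # fully rolled out: the flag is on for everyone


def is_enabled(flag_name: str) -> bool:
    """Return the flag's value, treating unknown flags as off."""
    return FLAGS.get(flag_name, False)


def checkout_total(prices: list[float]) -> float:
    if is_enabled("new_checkout"):
        # New code path: runs on every call once the flag is fully on.
        return round(sum(prices), 2)
    else:
        # Stale ("zombie") path: never runs while the flag stays on,
        # but it still ships and wakes up if the flag is ever turned off.
        return float(int(sum(prices)))  # old behavior: drops the cents
```

Once the flag has been fully on for a while, cleaning it up would mean deleting both the `else` branch and the flag check, so only the new path remains.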

And this often happens in startups that are trying to move fast. You want to get features out as soon as possible, so you don’t have time to do a flag cleanup and go through and categorize what you should remove from the code. And the flags end up accumulating and potentially causing issues because of these stale code paths.

DR: What kind of issues?

KN: So an easy example is you have some sort of untested code based on a combination of feature flags. Let’s say you have two feature flags that are in a similar part of the codebase, so there are now four different paths. And if one of them hasn’t been executed in a while, odds are there’s a bug. So one thing that happened at Sisu was that one of our largest customers encountered an issue when we mistakenly turned off the wrong flag. We thought we were rolling back a new feature for them, but we jumped into a stale code path, and we ended up causing a big issue for that customer.
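The arithmetic behind "two flags, four paths" can be sketched by enumerating every on/off combination. The flag names below are hypothetical; the point is only that n independent boolean flags give 2**n combinations to test.

```python
from itertools import product

# Hypothetical flag names, for illustration only.
flag_names = ["new_pricing", "new_invoice_layout"]

# Each on/off assignment of the flags is a distinct path through the code.
combinations = [
    dict(zip(flag_names, values))
    for values in product([False, True], repeat=len(flag_names))
]

for combo in combinations:
    print(combo)

print(len(combinations))  # 2 flags -> 2**2 = 4 paths
```

Adding a third flag in the same area would double the count to eight, which is why rarely exercised combinations tend to go untested.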

DR: Is that something that artificial intelligence could take on as a way to go through the code and suggest removing these zombie flags?

KN: With current tools, it is a very manual process. You’re expected to just go through and clean things up yourself. And this is exactly what we’re seeing. We think that generative AI has a big role to play here. Right now we’re starting off with simple heuristic approaches, as well as some generative AI approaches, to figure out: What are some really complicated code paths here? Can we flag these and potentially bring the number of stale code paths down significantly? Can we define allowable configurations?

Something we see as a big difference between dynamic configuration and feature flagging itself is that you can combine different flags or different pieces of dynamic behavior in the code together as one defined configuration. And that way, you can reduce the number of possible options out there, and different code paths that you have to worry about. And we think that AI has a huge place in improving safety and reducing the risk of using this kind of tooling.
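One way to picture "one defined configuration" is a small set of named, predefined bundles instead of independently toggled flags. This is only a sketch under assumptions: `BillingConfig`, its fields, the configuration names, and `load_config` are hypothetical and do not represent how Lekko or any other tool works.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BillingConfig:
    """One named bundle of dynamic behavior (hypothetical fields)."""
    new_pricing: bool
    new_invoice_layout: bool


# Only these combinations are defined and tested; mixed states are not reachable.
ALLOWED_CONFIGS = {
    "legacy": BillingConfig(new_pricing=False, new_invoice_layout=False),
    "current": BillingConfig(new_pricing=True, new_invoice_layout=True),
}


def load_config(name: str) -> BillingConfig:
    """Look up a configuration by name, refusing anything not predefined."""
    try:
        return ALLOWED_CONFIGS[name]
    except KeyError:
        raise ValueError(f"unknown configuration: {name!r}") from None
```

With two independent flags there would be four reachable states; here only two are, so a rollback means switching from "current" to "legacy" rather than flipping one flag into an untested mix.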

DR: How widely adopted is the use of feature flags at this point?

KN: We think that, especially among mid-market to large tech companies, it’s probably a majority of companies that are currently using feature flagging in some capacity. You do find a significant portion of companies building their own. Often engineers will take it into their own hands and build a system. But when you grow to some level of complexity, you quickly realize there’s a lot involved in making the system both scalable and workable across a variety of use cases. And lots of problems end up coming up as a result. So we think it’s a good portion of companies, but they may not all be using third-party feature flagging tools. Some companies even go through the whole lifecycle: they start off with a feature flagging tool, they rip it out, and then they spend significant effort building tooling similar to what Google, Uber, and Facebook have, these dynamic configuration tools.

