How is self-healing to be reconciled with Infrastructure as Code?

HOW TO -️ October 18, 2021

As a relative newcomer to the developments happening in Operations - DevOps, SRE, etc. - I'm struggling with a big-picture problem. The following is a specific example, using a network load-balancer.

From an IaC perspective, I take it that we should hold the configuration (ie. the pools for the different virtual servers) outside of the load-balancer itself. Changes to the load-balancer should be made in the first place to this external configuration and applied to the actual application via a pipeline.

From a 'self-healing' perspective - taking automated remedial actions to problems - we might want code to change the network load-balancer configuration in response to some event. If some sensor indicates a problem with a system, for example, we might want to have it removed from a pool.

Now, the obvious way of reconciling these two requirements is for the automated remedial actions to take place via a change to the load-balancer's external configuration. However, some of the reasons for wanting the IaC approach, and some possible implementations of IaC, can make this problematic. For instance, suppose that the external configuration of the load-balancer is held in yaml files in a Git repository, and there are controls over the check-ins (someone senior has to review them). In this case, the remedial action could get stuck in someone's to-do list.

Obviously the problem gets worse the more frequently and more rapidly we want such automated responses.

I was wondering what people's responses to this might be. Is the main problem using something like Git to hold the configuration? Is it a mistake to apply the same controls to all check-ins? Is it a mistake to try to treat the load-balancer set-up as 'configuration' in this way (note that it has some run-time state which isn't captured anywhere, like internal responses to health-checks that it runs). Would appreciate any thoughts.


I believe this is a very reasonable question, also I believe this is an open problem in many aspects. However, for most practical issues you may treat IAC as a target (desired) state, and then work around that with other tools.