
Ensuring safe AI deployment in critical infrastructure

Managing AI that touches pumps, valves or power racks is not the same as running a chatbot. I treat any AI that influences physical processes as part of the control system, and that changes how I design, test and change it. The AI risks in operational technology and critical infrastructure are slow failures, hidden configuration errors and silent model drift, not just dramatic hacks. I write this as a practical how-to: no theory, only steps I would follow.

Start with strict Configuration Management for models and their runtime settings. Inventory every model, every pipeline, every config file and tag them with immutable identifiers. Store model binaries and configuration in a version control system with signed releases. Use git tags, signed container images and a small change database that records who changed what, why and the rollback point. Make sure every config change has two-person sign-off before it moves into the test rig. Test changes in a replica environment that uses the same telemetry and timing as the live system. Run the proposed change against historical telemetry for at least 72 hours to spot slow divergences. Use canary rollouts for live changes: route 1 to 5 per cent of traffic to the new model for a 24–72 hour window and watch behaviour. Automate rollback triggers so a simple metric breach flips the change back without manual scripts.
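The automated rollback trigger for a canary rollout can be sketched like this. This is a minimal illustration, not a production implementation; the metric source and the `max_ratio` threshold are assumptions you would tune to your own telemetry.

```python
def should_roll_back(canary_error_rate: float,
                     baseline_error_rate: float,
                     max_ratio: float = 1.5) -> bool:
    """Flag a rollback when the canary model's error rate exceeds the
    baseline model's error rate by more than the allowed ratio.

    Both rates are assumed to come from the same 24-72 hour canary
    window, with 1-5 per cent of traffic on the new model."""
    if baseline_error_rate == 0:
        # Baseline is clean: any canary error at all is a breach.
        return canary_error_rate > 0
    return canary_error_rate / baseline_error_rate > max_ratio
```

Wire the boolean into the deployment pipeline so a breach flips traffic back to the signed previous release automatically, with no manual scripting in the loop.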

Human oversight is not optional. I require a hardened kill-switch, physically or logically separate from the AI stack, that returns actuators to a safe state. The switch must be limited to named operators and use multi-factor authentication. Exercise that switch quarterly and log the result. Create a simple override UI that makes the safe state obvious, and test that UI on the same shift patterns your engineers actually work. Define worst-case behavioural scenarios for each AI-enabled control. For a pump controller that uses ML to schedule starts, list scenarios like “stuck-on”, “stuck-off”, “oscillation” and “slow drift over 48 hours”. Write runbooks that map each scenario to a single operator action and a time-to-action target. I aim for an emergency rollback or safe state within 15 minutes for functions that directly affect safety or load.
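The scenario-to-runbook mapping can live as a small table that operators and code both read. A sketch, assuming the pump-controller scenarios above; the specific actions and the second target are illustrative placeholders, not prescribed responses.

```python
# Runbook: each worst-case scenario maps to exactly one operator
# action and a time-to-action target in minutes. Scenario names follow
# the pump-controller example; actions are hypothetical.
RUNBOOK = {
    "stuck-on":    {"action": "trip kill-switch to safe state",   "target_min": 15},
    "stuck-off":   {"action": "switch to manual start schedule",  "target_min": 15},
    "oscillation": {"action": "freeze setpoint and page on-call", "target_min": 15},
    "slow-drift":  {"action": "roll back to last signed model",   "target_min": 60},
}

def operator_instruction(scenario: str) -> str:
    """Return the single action and deadline for a detected scenario."""
    entry = RUNBOOK[scenario]
    return f"{entry['action']} within {entry['target_min']} min"
```

Keeping the table in version control alongside the model config means runbook changes go through the same two-person sign-off as everything else.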

Monitoring is the only way to catch model drift and subtle telemetry shifts. Instrument both model inputs and outputs. Record raw inputs, preprocessed inputs, confidence scores and final actuator commands. Keep at least 90 days of high-resolution telemetry for trending. Use simple statistical drift detectors alongside any specialised tooling. For example, set an alert if the mean of a key input moves by more than 3 standard deviations over 24 hours, or if output confidence drops by 30 per cent compared with the previous week. Feed those alerts into both the operational NOC and the AI governance channel so a single page gets attention from both teams. Also monitor configuration integrity: hash the active config and compare it to the signed release once per hour. Fail closed if the hash differs.
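Both checks are a few lines of standard-library Python. A sketch of the 3-sigma input-drift detector and the hourly config-integrity comparison; the function names and the shape of the inputs are assumptions, and real pipelines would pull these series from the telemetry store.

```python
import hashlib
import statistics

def input_drifted(recent_24h: list[float],
                  baseline: list[float],
                  n_sigmas: float = 3.0) -> bool:
    """Alert when the 24-hour mean of a key input moves more than
    n_sigmas baseline standard deviations from the baseline mean."""
    mu = statistics.mean(baseline)
    sigma = statistics.stdev(baseline)
    return abs(statistics.mean(recent_24h) - mu) > n_sigmas * sigma

def config_intact(active_config: bytes, signed_sha256: str) -> bool:
    """Hourly integrity check: compare the active config's hash with
    the signed release hash. Callers must fail closed on a mismatch."""
    return hashlib.sha256(active_config).hexdigest() == signed_sha256
```

The confidence-drop check against the previous week follows the same pattern with a ratio test instead of a sigma band.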

Governance ties the technical controls to decisions. I expect a small governing body that meets weekly during rollouts and monthly otherwise. Give that body two responsibilities: approve change windows and measure maturity. Track a short list of metrics: percentage of config changes with two approvals, time from detection to rollback, frequency of kill-switch exercises. Keep the governance charter tight. Limit who can change AI settings to named engineers and named approvers. Enforce role separation in the CI/CD pipeline: developers prepare the artefact, a separate gate signs it, operations deploys it. Automate audits where possible. Keep logs immutable and retained for at least one year.
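Two of those maturity metrics fall straight out of the change log. A sketch assuming a hypothetical log where each change records its approval count and detection/rollback timestamps in minutes; the field names are illustrative.

```python
# Hypothetical change log entries; timestamps are minutes since
# detection-window start. Real entries would come from the change
# database described earlier.
changes = [
    {"approvals": 2, "detected_at": 0, "rolled_back_at": 9},
    {"approvals": 1, "detected_at": 0, "rolled_back_at": 40},
    {"approvals": 2, "detected_at": 5, "rolled_back_at": 17},
]

# Percentage of config changes with two-person sign-off.
pct_two_approvals = 100 * sum(c["approvals"] >= 2 for c in changes) / len(changes)

# Mean time from detection to rollback, in minutes.
mean_rollback_min = sum(c["rolled_back_at"] - c["detected_at"]
                        for c in changes) / len(changes)
```

Reporting these weekly during rollouts gives the governing body a trend line rather than anecdotes.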
