A dark lab build only works when the same input gives the same image every time. If the builder drifts, the snapshot is just a prettier way to lose time later.
A build is only reproducible when the inputs are pinned and boring
Lock the base image, package versions, and build tool versions. Pin them in the image manifest, the package lock file, or the provisioning manifest, and stop pretending that “latest” is a strategy.
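A pinning rule is easier to keep when something checks it before the build starts. The sketch below is one minimal way to do that, assuming a hypothetical package manifest with one `name==version` entry per line and a base image referenced by `sha256` digest. The function names and the manifest format are illustrative, not taken from any particular tool.

```python
import re

# Assumption: packages are listed one per line as "name==version".
PINNED_PACKAGE = re.compile(r"^[A-Za-z0-9_.\-]+==[A-Za-z0-9_.\-+!]+$")
# Assumption: an image counts as pinned only when referenced by sha256 digest.
PINNED_IMAGE = re.compile(r"^[^\s@]+@sha256:[0-9a-f]{64}$")


def unpinned_packages(lines):
    """Return the manifest lines that do not pin an exact version."""
    problems = []
    for raw in lines:
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blanks and comments
        if not PINNED_PACKAGE.match(line):
            problems.append(line)
    return problems


def image_is_pinned(reference):
    """True when the base image is referenced by digest rather than by tag."""
    return bool(PINNED_IMAGE.match(reference.strip()))
```

Run as a first pipeline step, a non-empty result from `unpinned_packages` or a `False` from `image_is_pinned` would stop the build before a floating version gets the chance to change the output.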
Keep build steps ordered and explicit. A build script that depends on whatever happened to be on the box at 02:00 is not a pipeline; it is a gamble with logs.
Remove hidden state from the builder. Clear caches that affect output, set the same environment variables every run, and rebuild on a clean node or a clean container image so stray files do not sneak in.
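One way to keep the environment identical between runs is to build it from scratch instead of inheriting it from the calling shell. The following is a rough sketch under that assumption. The variable names and values in `FIXED_ENV` are illustrative choices, and `clean_env` and `run_clean` are hypothetical helpers rather than part of any existing tool.

```python
import subprocess
import tempfile

# Assumed fixed environment; adjust names and values to the real build.
FIXED_ENV = {
    "PATH": "/usr/bin:/bin",
    "LANG": "C.UTF-8",
    "TZ": "UTC",
    "SOURCE_DATE_EPOCH": "0",  # fixed timestamp for tools that honour it
}


def clean_env(extra=None):
    """Build the environment from FIXED_ENV only; nothing is inherited."""
    env = dict(FIXED_ENV)
    if extra:
        env.update(extra)
    return env


def run_clean(command, extra=None):
    """Run a build command in an empty scratch directory with a clean env."""
    with tempfile.TemporaryDirectory() as workdir:
        result = subprocess.run(
            command,
            cwd=workdir,           # no stray files from earlier runs
            env=clean_env(extra),  # replaces the caller's environment
            capture_output=True,
            text=True,
            check=True,            # raise if the command fails
        )
    return result.stdout
```

Because `env` replaces the process environment rather than extending it, a variable exported by hand on the build node never reaches the build command.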
Treat the “same commit, same image, same result” rule as non-negotiable. If the output changes without a source change, the pipeline has already failed, even if the final artefact looks fine.
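The rule can be checked mechanically by building the same commit twice and comparing digests of the two outputs. This is a minimal sketch, assuming each build writes its output to a directory. `tree_digest` and `is_reproducible` are hypothetical names, and the hashing scheme (relative path plus content hash, in sorted order) is one reasonable choice among several.

```python
import hashlib
from pathlib import Path


def tree_digest(root):
    """Hash every file under root, in sorted order, into one SHA-256 digest."""
    root = Path(root)
    digest = hashlib.sha256()
    for path in sorted(p for p in root.rglob("*") if p.is_file()):
        rel = path.relative_to(root).as_posix()
        # Include the relative path so renamed files change the digest.
        digest.update(rel.encode("utf-8") + b"\0")
        digest.update(hashlib.sha256(path.read_bytes()).digest())
    return digest.hexdigest()


def is_reproducible(first_output, second_output):
    """True when two build output directories have identical contents."""
    return tree_digest(first_output) == tree_digest(second_output)
```

This sketch compares file names and contents only. File modes and timestamps are ignored, so a stricter check would be needed if those end up in the final image.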
Rollback only works when snapshots match the build, not just the app
Take snapshots at the same point in the pipeline every time. Pick one point, such as after config has been applied and before the image is promoted, then stick to it so recovery lands on a known state.
Pair artefacts with a known-good runtime state. Store the image digest, the config revision, and the snapshot ID together, because a rollback without that mapping turns into guesswork fast.
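The mapping can be as plain as an append-only file with one record per promoted build. The sketch below assumes a JSON-lines ledger. `ReleaseRecord`, `append_record`, and `last_known_good` are hypothetical names, and the field values in the usage example are placeholders.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class ReleaseRecord:
    image_digest: str
    config_revision: str
    snapshot_id: str


def append_record(path, record):
    """Append one record to the ledger as a single JSON line."""
    with open(path, "a", encoding="utf-8") as handle:
        handle.write(json.dumps(asdict(record), sort_keys=True) + "\n")


def last_known_good(path):
    """Return the most recently appended record, or None if the ledger is empty."""
    latest = None
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            if line.strip():
                latest = ReleaseRecord(**json.loads(line))
    return latest
```

A pipeline would call `append_record("ledger.jsonl", ReleaseRecord("sha256:...", "rev-12", "snap-0007"))` only after a build passes its checks, so `last_known_good` always names a digest, config revision, and snapshot that belong together.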
Keep rollback steps short and rehearsed. If you need half a page of notes to revert a failed run, the recovery path is already too messy for a dark, fully automated lab build.
Test failure recovery against real rebuilds, not a happy-path clone. Break the config, lose the node, or invalidate a package cache, then watch whether the self-hosted orchestration can restore the last known good state without manual rescue.
Configuration management matters here because it is part of the build, not a side note. A clean snapshot taken after a drifted config has been baked in only preserves the mistake more neatly.
Failure recovery also needs a hard boundary between build state and runtime state. If a rollback only rewinds the app image but leaves the datastore, kernel, or service config out of step, the next boot can fail in a new and noisier way.
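A rollback can be verified by comparing what the record says should be running against what is actually running, component by component. This is a small sketch, assuming both sides can be expressed as a mapping from component name to revision. The component names and revision strings in the usage note are made up for illustration.

```python
def out_of_step(expected, actual):
    """Return the components whose actual revision differs from the recorded one.

    A component missing from `actual` counts as out of step.
    """
    return sorted(
        name
        for name, revision in expected.items()
        if actual.get(name) != revision
    )
```

For example, with `expected = {"app": "img-41", "kernel": "k-6", "service_config": "rev-12"}` and an `actual` mapping where only the app image was rewound, the function returns the names that still need attention, and an empty list means the rollback landed on a consistent state.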
A dark factory automation setup is useful only when the rebuild path and the rollback path are both dull. Dull means pinned versions, fixed steps, matched snapshots, and no mystery state hiding under the floorboards.

