Why testing Ceph changes without full replication matters
Testing Ceph changes matters because you do not want to find the mistake in production. A bad config can mean data loss or downtime, and neither is fun. The trouble is that when you run Ceph inside a Proxmox cluster, mistakes get expensive fast, and building a full duplicate lab to test against is usually a non-starter.
The point is to make the change in a way that leaves you a route back. I would rather spend time on that than rebuild the whole stack because one setting was wrong.
What I actually use for testing
The useful bits are fairly simple.
- Snapshots: I take a snapshot before a change so I have a known point to roll back to if the new setup misbehaves.
- Staging environments: A smaller Proxmox Ceph staging setup lets me try changes without pointing them at live data.
- OSD hot-swap testing: Replacing drives without taking the cluster down is part of keeping the thing running.
- Configuration rollbacks: If the change goes sideways, I want a clear rollback path rather than a vague plan and a sinking feeling.
Using a Proxmox Ceph staging setup
A staging cluster does not need to be a perfect clone. It just needs to be close enough to show me whether the change behaves the way I expect.
- Create a smaller cluster: I mirror the production layout as closely as I can, using fewer nodes or lower-spec hardware where needed.
- Apply the change: I make the planned Ceph config change in staging first.
- Watch the cluster: I check the impact on performance and stability with Ceph’s built-in monitoring.
- Write it down: If it works or fails, I keep notes. Memory is a terrible change log.
This is not glamorous, but it is cheaper than rebuilding a cluster because I skipped the boring bit.
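As a rough sketch, here is what the bring-up and the watching step can look like with Proxmox's pveceph tooling; the network range, device, and pool name are placeholders, and the exact commands vary a little between Proxmox versions.

```bash
# On each staging node: install the Ceph packages via Proxmox tooling
pveceph install

# On the first node: initialise Ceph on a dedicated network (placeholder CIDR)
pveceph init --network 10.10.10.0/24

# Add a monitor and an OSD on an empty disk (placeholder device)
pveceph mon create
pveceph osd create /dev/sdb

# A throwaway pool to point the change at (placeholder name)
pveceph pool create staging-pool

# Apply the planned change, then watch how the cluster reacts
ceph -s               # overall cluster state
ceph health detail    # warnings, spelled out
ceph osd df           # per-OSD usage and balance
```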
Hot-swapping OSDs without taking everything down
Replacing an OSD drive is one of those jobs that looks simple until you do it wrong. My usual process is cautious and dull, which is the point.
- Identify the OSD: I check the OSD status first with `ceph osd tree`.
- Prepare for replacement: I drain the OSD so data moves elsewhere before I touch the drive.
- Replace the drive: Once the OSD is drained, I swap the drive out.
- Bring it back: After that, I re-add the OSD with `ceph orch apply osd`.
That keeps the cluster running while the hardware change happens. It is still a faff, just less of one than a full outage.
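A minimal sketch of that sequence with plain Ceph commands; the OSD id 12 is a placeholder, and the re-add line assumes the cephadm orchestrator, so on a pveceph-managed cluster the last step usually goes through pveceph or the GUI instead.

```bash
# Mark the OSD out so Ceph moves its data onto other OSDs
ceph osd out 12

# Poll until the cluster says the OSD can go without risking data
while ! ceph osd safe-to-destroy osd.12; do sleep 60; done

# Stop the daemon and destroy the OSD record, keeping the id for reuse
systemctl stop ceph-osd@12
ceph osd destroy 12 --yes-i-really-mean-it

# Swap the drive, then re-add; with the cephadm orchestrator this can be
# a matter of re-applying the OSD spec
ceph orch apply osd --all-available-devices
```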
Snapshots and rollback
Snapshots are handy because they give you a known state before the change. If the new config breaks something, I can go back to the snapshot rather than starting from scratch.
For me, the important part is not the snapshot itself. It is having a clear rollback path before the change goes in. If I cannot describe the rollback in one breath, I have not thought it through enough.
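In practice that known state is a named snapshot taken just before the change. A minimal sketch, where the VM id, pool, and image names are placeholders:

```bash
# Proxmox level: snapshot a VM whose disks live on Ceph (placeholder VM id)
qm snapshot 100 pre-change
# ...apply the change, test, and roll back if it misbehaves
qm rollback 100 pre-change

# Ceph level: snapshot an RBD image directly (with the VM stopped)
rbd snap create vm-pool/vm-100-disk-0@pre-change
rbd snap rollback vm-pool/vm-100-disk-0@pre-change
```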
Keeping a rollback process that actually works
- Back up the current config: I save the current settings before changing anything.
- Record the change: I note what changed and why.
- Revert if needed: If the change causes trouble, I restore the previous config and restart the relevant services.
- Check cluster health: After the rollback, I run `ceph health` and check the state of the cluster.
That does not remove the risk. It just keeps the risk manageable.
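A minimal sketch of the backup-and-revert part; the paths and dated suffixes are just my convention, not anything Proxmox requires.

```bash
# Before the change: keep the config file and the runtime options
cp /etc/pve/ceph.conf /root/ceph.conf.$(date +%F)
ceph config dump > /root/ceph-config-dump.$(date +%F)

# If it goes sideways: restore the file and restart the affected daemons
cp /root/ceph.conf.2025-06-01 /etc/pve/ceph.conf   # placeholder date
systemctl restart ceph-mon@$(hostname -s)          # example: the local monitor

# Confirm the cluster settles
ceph health
```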
What I would tighten up next
I still want more automation around the testing and rollback steps, mostly because fewer manual steps means fewer stupid mistakes. I also keep an eye on Ceph updates and community notes, because cluster behaviour has a habit of changing when you least want it to.