Document what your backup excludes, not just what it covers

I've built backup jobs that looked fine right up until the restore failed. The gap between what you think is covered and what actually is covered lives in the space where documentation should be; write down what is excluded, not just what you protect, and you stop arguing about it later.

The argument starts before the incident. A restore fails, a client reports missing data, and the signed contract says “website files” with no mention of the database sitting two directories above the web root. Nobody lied. Nobody was careless. The scope was just never written down clearly enough to survive contact with reality.

What you see

The client reports data missing that you assumed was covered

The client calls after a failure and names a path or volume that never appeared in any backup job you can find. Your assumption was that it fell inside the agreed scope. Their assumption was the same. Neither assumption was written down, so now you’re both standing in a gap that exists precisely because nothing was ever put on paper.

Recovery fails on a path or volume not named in any signed document

The restore job runs, returns no errors, and recovers nothing useful because the target path was never included in the backup configuration. The client expected it to be there. The job never knew it existed. This is not a technical failure; it is a documentation failure wearing a technical mask.

Incremental jobs complete without errors but exclude entire directories silently

Most backup tools will not warn you when a directory is excluded. The job finishes, the log says success, and /var/lib/mysql has not been touched since the initial full run three months ago. Silent exclusions are the nastiest variant of scope creep because the monitoring stays green right up until the moment you need it not to.

The contract says “website files” and the database was never mentioned

“Website files” is not a backup scope. It is a phrase that means something different to whoever reads it. I’ve seen contracts where it was interpreted as /var/www/html only, and others where the client believed it covered the entire server. Write the actual paths. Write what is excluded. “Website files” belongs in a marketing brochure, not a service agreement.

A legacy server stays live because nobody wrote down who owns deletion authority

The decommission date passed. The server is still running. The backup job is still running against it, consuming retention space, and nobody wants to be the one to pull the plug without written authority. Ownership clauses get skipped at onboarding because they feel premature. They matter most at end-of-life.

Where it happens

Scope defined verbally at onboarding and never written down

Verbal scope is the root cause of most of these disputes. A conversation happens, both parties leave with a shared understanding that quietly diverges within six months, and neither has a document to refer back to. Get it in writing before the first job runs.

Backup jobs configured by one person, reviewed by nobody, inherited by a third

The person who built the job knew what it included and excluded. That knowledge lived in their head. When they left, the next person inherited a running job with no context. When something went wrong, there was no record of intent, just a config file and a gap.

Client-side infrastructure changes that your job schedule never knew about

A new volume gets mounted. A service migrates to a new directory. A database moves hosts. None of these trigger a notification to the backup configuration. The job keeps running against the old paths, completing without error, protecting nothing new. Without a change notification clause in the contract, this happens silently every time.

Retention periods set at deployment and not revisited when data classes changed

A 30-day retention period made sense for the original data set. Two years later the data classification changed, a contractual obligation now requires 90 days, and the backup tool is still rolling off at 30. Retention is not a set-and-forget value. It needs a review trigger tied to data class changes.
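A minimal sketch of that review trigger, comparing required retention per data class against what the tool is configured to keep. The data class names and day counts here are illustrative, not from any real contract:

```python
# Retention review sketch: flag data classes whose configured retention
# falls short of the current contractual requirement. All values illustrative.
required_retention_days = {
    "web-content": 30,
    "customer-records": 90,   # obligation changed after deployment
    "transaction-logs": 90,
}

configured_retention_days = {
    "web-content": 30,
    "customer-records": 30,   # still the original deployment value
    "transaction-logs": 90,
}

def retention_gaps(required, configured):
    """Return {data_class: (configured_days, required_days)} for shortfalls."""
    return {
        cls: (configured.get(cls, 0), days)
        for cls, days in required.items()
        if configured.get(cls, 0) < days
    }

for cls, (have, need) in retention_gaps(
    required_retention_days, configured_retention_days
).items():
    print(f"{cls}: configured {have} days, required {need} days")
```

Run this at each data class change and the 30-versus-90 gap above surfaces immediately instead of at restore time.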

RTO and RPO agreed in principle but absent from the actual contract text

“We’ll get you back up quickly” is not an RPO. An RPO is a number: four hours, 24 hours, 15 minutes per data class. If it is not in the contract text as a specific figure, it does not exist as an obligation. I’ve watched this go badly in both directions: clients assuming near-zero RPO on a daily incremental schedule, and providers assuming a 24-hour window on data that changes hourly.

Find the cause

Compare the signed scope document against the live backup job include and exclude paths

Pull the include and exclude paths from the live job configuration and lay them next to the signed scope document line by line. Every path in the job that has no corresponding entry in the document is an undocumented assumption. Every path in the document that has no corresponding job entry is an unfulfilled promise.
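The comparison is a two-way set difference. A sketch, with illustrative paths standing in for the real contract and job config:

```python
# Scope comparison sketch: paths from the signed scope document versus
# include paths pulled from the live job configuration. Paths illustrative.
scope_document_paths = {"/var/www/html", "/var/lib/mysql", "/etc/nginx"}
job_include_paths = {"/var/www/html", "/etc/nginx", "/opt/app-cache"}

# In the job but not in the document: undocumented assumptions.
undocumented = job_include_paths - scope_document_paths
# In the document but not in the job: unfulfilled promises.
unfulfilled = scope_document_paths - job_include_paths

print("Undocumented assumptions:", sorted(undocumented))
print("Unfulfilled promises:", sorted(unfulfilled))
```

Both lists should come out empty. Anything in either direction goes on the agenda for the next scope review.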

Check whether incremental chains reference a full backup that predates the current server state

If the last full backup ran before a significant infrastructure change, the incremental chain may be extending from a base that no longer reflects the live environment. Check the full backup date against any known infrastructure changes. If the full predates a migration, volume addition, or directory restructure, the chain is suspect.

Identify which party last modified the infrastructure the backup job points to

This matters for liability. If the client moved /var/lib/mysql to a new mount point and did not notify you, and the job kept running against the old path, the failure origin is on their side. If you reconfigured the job and introduced an exclude rule without written sign-off, it is on yours. Get a change log. If one does not exist, build one now and accept that historical attribution is gone.

Confirm whether the RPO interval matches the backup schedule frequency per data class

An hourly-change database on a nightly incremental schedule has an effective RPO of up to 24 hours regardless of what the contract says. Map each data class to its change rate and compare it against the actual job frequency. Where they do not match, either the schedule needs adjusting or the RPO commitment needs correcting.
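The mapping reduces to one rule: a schedule interval longer than the RPO cannot meet it. A sketch with assumed values:

```python
# RPO check sketch: compare the contractual RPO per data class against the
# effective RPO implied by the job schedule. All figures illustrative.
contract_rpo_hours = {
    "database": 4,        # contract promises a 4-hour RPO
    "web-content": 24,
}

schedule_interval_hours = {
    "database": 24,       # but the job only runs nightly
    "web-content": 24,
}

def rpo_violations(rpo, schedule):
    """A schedule interval longer than the RPO cannot meet the commitment."""
    return [
        cls for cls, hours in rpo.items()
        if schedule.get(cls, float("inf")) > hours
    ]

print(rpo_violations(contract_rpo_hours, schedule_interval_hours))
```

Here the database fails: a nightly run against a four-hour promise. Either the schedule tightens or the contract figure changes.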

Establish whether any exclusions were added by the backup tool automatically, such as locked files or temp directories

Most backup agents add automatic exclusions for locked files, temp directories, and certain system paths. These are not always visible in the main job configuration; check the agent logs or the tool’s default exclusion list. If /tmp, swap files, or open database handles were silently excluded, document it explicitly so it is no longer a surprise during a dispute.
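The list that belongs in the contract is the union of both sources. A sketch; the default-exclusion paths below are assumptions, so pull the real list from your agent's documentation or logs:

```python
# Exclusion merge sketch: combine the tool's default exclusions with the
# job's configured exclusions into one explicit, documentable list.
# Default-exclusion paths are assumed examples, not any specific tool's list.
tool_default_exclusions = {"/tmp", "/proc", "/sys", "swapfile"}
configured_exclusions = {"/var/cache/app", "/tmp"}

# The documented exclusion list is the union of both.
documented_exclusions = sorted(tool_default_exclusions | configured_exclusions)
for path in documented_exclusions:
    print(path)
```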

Fix

Write a named exclusion list into the contract, not just the inclusion list

The inclusion list tells you what is protected. The exclusion list closes the argument about everything else. Name the specific paths excluded: /tmp, swap partitions, any ephemeral storage, application cache directories. If the database is excluded because it requires an agent-aware backup method, name the database, name the exclusion, and name the separate process that covers it.

Define RTO and RPO per data class and attach them as a schedule to the service agreement

RTO and RPO need a number, a data class, and a signature. A schedule attached to the service agreement works well because it can be updated independently of the main contract body. List each data class, its RPO interval, its RTO target, and the backup method that achieves it. If daily incremental only supports a 24-hour RPO, write 24 hours. Do not write “best efforts.”

Record ownership of each infrastructure component that the backup job touches

For each server, volume, directory, or service covered by the backup job, record who owns it: who can authorise changes to it, who is responsible for notifying you when it changes, and who has authority to decommission it. This does not need to be a complex document. A table with four columns works: component, owner, change notification contact, decommission authority.

Version-control the backup configuration and tie each change to a named decision

Store the backup job configuration in a git repository. Every commit should reference the decision that drove the change: a contract amendment, a client request, an infrastructure notification. This gives you a timestamped record of who changed what and why, which is the only thing that matters when the dispute arrives. A flat config file with no history is evidence of nothing.

Add a clause covering legacy or decommissioned systems: who retains them, who deletes them, and by when

Write a decommission clause that names the process: when a system goes out of scope, who sends written confirmation, who removes it from the backup job, how long the last backup set is retained, and who authorises deletion. Without this clause, you get the situation described above: a server that should have been decommissioned months ago, still running, still being backed up, because nobody has authority on paper to stop it.

Check it’s fixed

Run a restore test against the documented scope and record which paths succeeded and which failed

Do a full restore test, not a spot check. Run it against every path listed in the signed scope document and record the result per path: succeeded, failed, or absent from backup. Any path that fails or is absent needs a corresponding entry in the exclusion list or a fix to the job configuration. Do not sign off until the restore results match the documented scope exactly.
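The per-path record can be this simple. A sketch with illustrative statuses; in practice each status comes from actually attempting the restore:

```python
# Restore test record sketch: one status per documented path. A path that
# did not restore cleanly blocks sign-off. Paths and statuses illustrative.
scope_document_paths = ["/var/www/html", "/var/lib/mysql", "/etc/nginx"]
restore_results = {
    "/var/www/html": "succeeded",
    "/var/lib/mysql": "absent",   # never in the backup set
    "/etc/nginx": "succeeded",
}

# Anything not "succeeded" needs either a job fix or an exclusion-list entry.
blocking = [
    p for p in scope_document_paths
    if restore_results.get(p, "absent") != "succeeded"
]
print("Blocking paths:", blocking)
```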

Verify the exclusion list in the contract matches the exclude paths in the live job configuration

Pull the exclude rules from the live job, compare them against the exclusion list in the contract, and confirm they match. If the job excludes /var/log but the contract does not mention it, add it. If the contract lists an exclusion that no longer applies, remove it. The two lists must be identical.
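Since the two lists must be identical, the check is a symmetric difference. A sketch with illustrative paths:

```python
# Exclusion-list match sketch: the contract list and the live job's exclude
# rules must be identical. Paths illustrative.
contract_exclusions = {"/tmp", "/var/cache/app"}
job_exclusions = {"/tmp", "/var/cache/app", "/var/log"}

# Anything in one list but not the other is a discrepancy to resolve:
# add it to the contract or remove it from the job.
discrepancies = contract_exclusions ^ job_exclusions
print(sorted(discrepancies))
```

An empty result is the pass condition; here /var/log needs a decision.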

Confirm the client has signed off on the updated scope, including any systems explicitly out of scope

Get a signature on the updated scope document, including the exclusion list and the decommission clause. A client who has signed an exclusion list cannot later claim they assumed those paths were covered. This is the whole point of the exercise.

Test that the incremental chain restores to within the agreed RPO window, not just to the last snapshot

A restore test that recovers the most recent snapshot proves nothing about RPO compliance. Pick a point in time within the agreed RPO window and restore to that point. If the agreed RPO is four hours and you cannot restore to a point four hours before the test, the backup strategy does not meet the contract. Fix the schedule or fix the commitment.
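The compliance condition is that some restore point falls inside the window between the target time and the target minus the RPO. A sketch with assumed snapshot timestamps:

```python
# RPO window check sketch: given an agreed RPO and a target point in time,
# is there a restore point no older than target - RPO? Timestamps illustrative.
from datetime import datetime, timedelta

snapshots = [
    datetime(2024, 6, 1, 0, 0),
    datetime(2024, 6, 2, 0, 0),   # nightly schedule
]
agreed_rpo = timedelta(hours=4)
target = datetime(2024, 6, 2, 12, 0)  # restore to noon

def meets_rpo(snapshots, target, rpo):
    """True if some snapshot falls inside [target - rpo, target]."""
    return any(target - rpo <= s <= target for s in snapshots)

print(meets_rpo(snapshots, target, agreed_rpo))
```

On this nightly schedule the four-hour check fails: the newest snapshot before noon is twelve hours old. A 24-hour RPO against the same snapshots would pass, which is exactly the schedule-versus-commitment mismatch the test exists to catch.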

Schedule a quarterly review date in the contract so scope drift gets caught before it causes a dispute

Put a review date in the contract. Quarterly works for most environments; monthly is reasonable where infrastructure changes frequently. At each review, compare the live job configuration against the signed scope, check whether any new systems or volumes need adding, and confirm retention periods still match current data class requirements. Scope drift is not dramatic. It accumulates slowly, one unnoticed change at a time, and a regular review is the only thing that catches it before it becomes a recovery failure.