Backup retention policy: writing it down saves arguments

A backup schedule that works is not the same as a documented one. When recovery fails and a client asks why their data is gone, a working cron job and a vague memory are not a defence; a policy document and a tested restore procedure are.

Document your backup retention policy before the client blames you for old data

Retention schedules are where sysadmin trust goes to die. The backup job runs nightly, the crontab looks fine, and then six months later someone asks why they can only recover three weeks of data when the contract says four. At that point, you need a document, not a memory.

Why the audit trail breaks without a named policy owner

The gap between what a backup job actually does and what a policy document says it does tends to grow quietly. A schedule gets changed during a server migration, nobody updates the doc, and the doc slowly becomes fiction. The job runs, ticks green in the monitoring dashboard, and the discrepancy sits unnoticed until recovery is needed.

The more specific problem is ownership. When no one is named as responsible for the retention schedule, changes happen informally. A cron job gets tweaked. A retention script gets rewritten and dropped into /etc/cron.d/ by whoever was on call that weekend. There is no commit message, no change log, no ticket. Months later, the script is running on three servers and nobody is sure which version is canonical.

Orphaned retention scripts

This is genuinely common. A retention script gets written to solve an immediate problem, it works, it gets forgotten. When an incident occurs and someone pulls the script to verify it matches the policy, the script has no author comment, no date, no version, and often no relation to whatever the policy document claims the retention window is. Fix this now by adding a header block to every retention script you maintain:

```bash
#!/bin/bash
# Retention policy: 28-day rolling window
# Covers: /mnt/backups/client-name/
# Policy document: docs/backup-policy-v2.md
# Last reviewed: 2024-11-01
# Owner: ops@yourdomain.com
```

That header costs two minutes and saves a very unpleasant conversation.
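If you want to catch scripts that are missing the header, a check along these lines is one option. It is a minimal sketch: the function name `check_retention_header` is made up for illustration, and it assumes each field sits on its own comment line starting with `#`.

```bash
#!/bin/bash
# Sketch: flag a retention script that lacks any of the five header fields.
# check_retention_header is a hypothetical helper, not part of any tool.

check_retention_header() {
    local script="$1" field missing=0
    for field in "Retention policy:" "Covers:" "Policy document:" "Last reviewed:" "Owner:"; do
        # Only comment lines count as header fields.
        if ! grep -q "^# *${field}" "$script"; then
            echo "${script}: missing '${field}'" >&2
            missing=1
        fi
    done
    # Returns 0 when every field is present, 1 otherwise.
    return "$missing"
}
```

Run it in a loop over wherever your retention scripts live, for example `for f in /usr/local/sbin/retention-*.sh; do check_retention_header "$f"; done` (that path is a placeholder, not a convention).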

ISO 27001 Annex A 8.13

If you are working towards ISO 27001:2022 certification, Annex A control 8.13 requires a documented policy that specifies backup frequencies, retention periods, and a tested restoration process. It replaced the older A.12.3.1 control and added explicit expectations around encryption and restore verification. An auditor will ask to see the written policy and evidence that restores have been tested against it. A cron job and a verbal explanation do not satisfy that.

Where client expectations diverge from your actual data recovery SLAs

The 28-day assumption is very common. Clients hear “daily backups” and conclude they can recover anything from the past month. The actual default for many common tools sits at 7 days. AWS RDS, for example, defaults to a 7-day retention window when you create a DB instance through the console. Power Platform non-production environments default to 7 days. If you never discussed the retention window explicitly, the client filled in the gap with their own number.

Getting RPO and RTO out of the contract and into the config

A contract that states a 4-hour RTO and a 1-hour RPO is only useful if the backup schedule, retention window, and restore procedure are configured to match. Check three things:

  1. The backup frequency matches the RPO. If the RPO is 1 hour, a nightly backup does not meet it.
  2. The retention window matches what the contract specifies. If the contract says 28 days, verify the actual retention setting in your backup tool, not just the cron schedule.
  3. The restore procedure has been timed. An RTO claim is unverifiable until you have run a restore and measured it.

Write the actual figures into the policy document. “RPO: 1 hour. Retention: 28 days. Last tested RTO: 2h 14m on 2024-10-15.” Vague language in a policy document means nothing to anyone and protects no one.
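The first check, backup frequency against RPO, can be approximated by looking at how old the newest backup is. The sketch below assumes GNU `find`, one file per backup, and that file modification time reflects when the backup was taken. The function names are hypothetical.

```bash
#!/bin/bash
# Sketch: is the newest file in a backup directory younger than the RPO?
# Assumes GNU find (-printf) and that mtime equals backup time.

newest_backup_age_minutes() {
    local dir="$1" newest now
    # Print each file's mtime in epoch seconds and keep the latest one.
    newest=$(find "$dir" -type f -printf '%T@\n' | sort -n | tail -n 1)
    [ -n "$newest" ] || return 1      # no backups found at all
    now=$(date +%s)
    # Strip the fractional seconds before doing integer arithmetic.
    echo $(( (now - ${newest%.*}) / 60 ))
}

within_rpo() {
    local dir="$1" rpo_minutes="$2" age
    age=$(newest_backup_age_minutes "$dir") || return 1
    [ "$age" -le "$rpo_minutes" ]
}
```

For a 1-hour RPO, `within_rpo /mnt/backups/client-name/ 60` succeeds only if something was written in the last hour. It says nothing about whether that backup is restorable, which is what the third check is for.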

Documenting backup schedule automation so it can be verified

If you are using a tool like Proxmox Backup Server, Veeam, or a cloud-native scheduler, export the job configuration and store it alongside the policy document. For PBS, that means capturing the datastore settings and the prune job parameters:

```bash
proxmox-backup-manager datastore list --output-format json > docs/pbs-datastore-config.json
```

For a GFS (Grandfather-Father-Son) retention scheme, document the exact keep values explicitly. Do not rely on the UI screenshot. A plain text record such as:

keep-daily: 7
keep-weekly: 4
keep-monthly: 6
keep-yearly: 1

is readable, diffable, and can sit in a git repository next to the policy doc.
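Because the record is plain `key: value` text, comparing the documented values against what the backup tool is actually configured with takes only a few lines. This sketch assumes you have already written the live values to a second file in the same format. How you produce that file depends on your tool and is not shown. Both function names are hypothetical.

```bash
#!/bin/bash
# Sketch: compare documented GFS keep values against a second file
# holding the live values in the same "key: value" format.

keep_value() {
    local key="$1" file="$2"
    # Print the value for an exact key match; prints nothing if absent.
    awk -F': *' -v k="$key" '$1 == k { print $2 }' "$file"
}

gfs_matches() {
    local documented="$1" actual="$2" key rc=0
    for key in keep-daily keep-weekly keep-monthly keep-yearly; do
        if [ "$(keep_value "$key" "$documented")" != "$(keep_value "$key" "$actual")" ]; then
            echo "mismatch on ${key}" >&2
            rc=1
        fi
    done
    return "$rc"
}
```

Comparing key by key rather than with a raw `diff` means line order and unrelated settings in the live file do not raise false alarms.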

Version-controlling the policy document

Put the retention policy document in git. A flat Markdown file in a private repository gives you a full change history, blame output, and a reference point if a dispute arises. When the client asks whether the retention window was 14 or 28 days in March, git log answers that question without ambiguity.
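To answer the "what did the policy say in March" question directly, you can ask git for the file as it stood on a given date. This is a sketch, assuming you run it from the repository root and that commit dates are trustworthy. `policy_as_of` is a hypothetical helper name.

```bash
#!/bin/bash
# Sketch: print a file as it existed at the last commit on or before a date.
# Run from the repository root; the path is relative to that root.

policy_as_of() {
    local file="$1" date="$2" commit
    # Newest commit touching the file whose commit date is before $date.
    commit=$(git rev-list -1 --before="$date" HEAD -- "$file")
    [ -n "$commit" ] || return 1      # file did not exist yet
    git show "${commit}:${file}"
}
```

For example, `policy_as_of docs/backup-policy-v2.md '2024-03-31 23:59:59'`. Give an explicit time of day, since git otherwise fills in the current time when it parses a bare date.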

Tag each version. Use a simple header block at the top of the document:

Version: 2.1
Effective: 2024-09-01
Approved by: [name]
Previous version: 2.0 (2024-03-15)

Any change to a retention window, backup frequency, or covered scope requires a new version and a new effective date.

Testing the policy against the actual restore time

A policy that has never been tested is a guess with formatting. Schedule a restore test at least once per quarter, cover at minimum one full restore and one file-level recovery, and record the results in a test log stored with the policy document.

The test log entry needs four things: the date, the data set restored, the time taken from trigger to verified recovery, and whether the result matched the SLA. If the restore took 3 hours and the SLA says 2, that is a finding, not a footnote. Update either the procedure or the SLA figure, then re-test.
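A log entry like that is easy to standardise. The sketch below appends one tab-separated line per test and marks anything over the SLA as a finding. The function name, the field order, and the use of whole minutes are choices made for illustration, not a standard.

```bash
#!/bin/bash
# Sketch: append a restore test result to a log file.
# Fields: date, data set, time taken, SLA, result (tab-separated).

log_restore_test() {
    local log="$1" dataset="$2" taken_min="$3" sla_min="$4" result
    if [ "$taken_min" -le "$sla_min" ]; then
        result="PASS"
    else
        result="FINDING"
    fi
    printf '%s\t%s\t%s\t%s\t%s\n' \
        "$(date +%F)" "$dataset" "${taken_min}m" "SLA ${sla_min}m" "$result" >> "$log"
    # Non-zero exit status when the restore missed the SLA.
    [ "$result" = "PASS" ]
}
```

Called as `log_restore_test docs/restore-tests.log "client-name full restore" 180 120`, it records the entry and returns non-zero, so a wrapper script can open a ticket instead of letting the overrun pass quietly. The log path here is a placeholder.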

For backup schedule automation, verify that the job actually ran on the days the policy claims. Pull the job history, cross-reference against the retention window, and confirm that the oldest recoverable backup matches what the document says. If the policy states 28 days and the oldest restore point is 19 days old, find out why before the client does.
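For file-based backups, the oldest restore point can be checked with a short script. This is a sketch under the same assumptions as any mtime-based check: GNU `find`, one file per backup, and modification times that reflect when the backup was taken. The function names are hypothetical.

```bash
#!/bin/bash
# Sketch: how many whole days old is the oldest file in a backup directory?
# Assumes GNU find (-printf) and that mtime equals backup time.

oldest_backup_age_days() {
    local dir="$1" oldest now
    # Print each file's mtime in epoch seconds and keep the earliest one.
    oldest=$(find "$dir" -type f -printf '%T@\n' | sort -n | head -n 1)
    [ -n "$oldest" ] || return 1      # no backups found at all
    now=$(date +%s)
    echo $(( (now - ${oldest%.*}) / 86400 ))
}

retention_reaches() {
    local dir="$1" min_age_days="$2" age
    age=$(oldest_backup_age_days "$dir") || return 1
    [ "$age" -ge "$min_age_days" ]
}
```

A 28-day window of daily backups has its oldest restore point about 27 days back, so `retention_reaches /mnt/backups/client-name/ 27` is the sensible call for that policy. A directory whose oldest file is 19 days old fails it.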