
Cleaning duplicate data to streamline backups

Optimising Your Homelab: Storage and Backup Strategies

I keep my homelab practical and lean. Storage costs and power draw matter more than shiny specs. My aim here is clear: reduce wasted capacity, cut backup time, and make restores predictable. I use concrete steps and numbers from my own reshuffle so you can copy the bits that fit your setup.

Introduction

Importance of storage optimisation

Storage is the biggest ongoing cost in a homelab. Drives age, power bills add up, and an overcomplicated layout makes restores slow and error-prone. I target three things: usable capacity, power draw, and recoverability. Hitting all three reduces headaches and keeps the lab useful rather than decorative.

Common challenges with duplicate data

Duplicates hide everywhere. Photo libraries, VM snapshots, and automated media grabs create multiple copies. In my case I started with about 50 TB raw across five arrays. That included parity, two backup arrays and a lot of duplicated content. The obvious result was wasted space and longer backup windows.

Overview of backup strategies

Backups are a risk-management choice. For me the plan was three replicated copies of the core 4–5 TB of primary data, plus a separate backup array sized for about 10 TB. That gives quick local restores and a second copy for disaster recovery. Pick a policy that matches how long you can tolerate downtime and how much you can afford in storage and power.

Strategies for Cleaning Duplicate Data

Identifying duplicate data sources

Start by profiling. Make a list of the services that hold large files: media servers, photo sync tools, VM stores, and old archive folders. Then run a few simple checks; a sketch of them follows the list:

  • Find the largest directories with du -sh /* and drill down.
  • Use find to spot old snapshot chains.
  • Check application configs, for example Immich and Jellyfin, to see where they store originals versus transcodes.
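
A rough sketch of those checks, assuming data is mounted under /mnt (the paths here are placeholders, adjust them to your own mounts):

    # Largest top-level directories, biggest first
    du -sh /mnt/* 2>/dev/null | sort -rh | head -20

    # Drill into a suspect directory
    du -sh /mnt/media/* | sort -rh | head -20

    # Files over 1 GB untouched for a year; old snapshot exports
    # and forgotten archives tend to show up here
    find /mnt -type f -size +1G -mtime +365 -printf '%s\t%p\n' | sort -rn | head -40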

Look for patterns. If a folder holds many identical files across date stamps, that is likely duplication from sync tools. If VMs have frequent snapshots, those snapshots may duplicate whole disks.

Example: I found two arrays holding the same media collection because one was the primary and the other was a casual copy made years earlier. Deleting that copy freed several terabytes.

Tools for data management

Pick tools that match file types and volume.

  • For photos and media, deduplication tools like rmlint or fdupes work at the filesystem level.
  • For large block-level duplication across arrays, consider zfs send/receive with checksums to validate transfers.
  • For catalogue-driven systems such as Immich, export a manifest and compare hashes rather than relying on filenames.

Use checksums, not timestamps. SHA-1 or xxHash are fine for large sets. Run a dry-run mode first to list candidates, then confirm before removal. Keep a temporary quarantine area for 48–72 hours in case you remove something you still need.
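
As a minimal example, assuming fdupes is installed and the media sets live under /mnt/media and /mnt/copy (hypothetical paths), a dry run that only lists candidates might look like this:

    # List duplicate sets without touching anything (dry run)
    fdupes -r /mnt/media > /tmp/dupe-candidates.txt

    # Or compare two trees by content hash rather than by name or timestamp
    find /mnt/media -type f -exec sha1sum {} + | awk '{print $1}' | sort > /tmp/primary.hashes
    find /mnt/copy  -type f -exec sha1sum {} + | awk '{print $1}' | sort > /tmp/copy.hashes

    # Hashes present only in the copy are unique content worth keeping
    comm -13 /tmp/primary.hashes /tmp/copy.hashes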

Best practices for data handling

Adopt simple rules and stick to them.

  • One source of truth: decide which dataset is primary and mark others as copies.
  • Archive policy: move rarely used data to a lower tier rather than keep multiple hot copies.
  • Snapshot hygiene: limit snapshot retention to what you can justify. Old snapshots are duplicate storage.
  • Naming and metadata: keep consistent folder layouts so automated tools do not create accidental duplicates.

When deleting, move rather than purge. I move candidates to a temporary hold on a single array for a week. After validation I remove them. That prevents accidental data loss.
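
A small sketch of that move-before-purge step. It assumes you have already reviewed the dry-run output and trimmed it to one path per line of files you genuinely intend to remove (the /tmp/to-quarantine.txt name and array path are placeholders):

    # Move candidates into a dated quarantine instead of deleting them outright
    HOLD=/mnt/array1/_quarantine/$(date +%Y-%m-%d)
    mkdir -p "$HOLD"

    # to-quarantine.txt: one reviewed path per line
    while IFS= read -r f; do
        mv --no-clobber "$f" "$HOLD/"
    done < /tmp/to-quarantine.txt

    # Once the hold period has passed and restores still work, purge the quarantine
    # rm -rf "$HOLD"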

Reorganising storage arrays

Consolidation reduces complexity and power draw. I moved from multiple R0x6 arrays to a single R5x10 array. That change reduced management overhead and improved reliability for my workload. It also let me retire an older VNX5300 in favour of a KTN-STL3 shelf, dropping power consumption by roughly 75–100 W in the new setup.

When you reorganise:

  • Plan capacity with growth in mind. Target usable capacity after parity and replication.
  • Map services to arrays by access profile: fast arrays for VMs and active media, larger slower arrays for archives.
  • Balance parity and drive count. Higher drive counts per RAID set improve capacity efficiency but increase rebuild times. For my media, a single R5 array offered a good mix of reliability and usable space.

If you are migrating, use rsync with checksums for file data or zfs send/receive for ZFS datasets. Test a restore from the new array before retiring the old one.
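
Two hedged examples of that migration step; the pool, dataset, and mount names below are placeholders for your own layout:

    # File-level copy with checksum comparison (slower, but catches silent differences)
    rsync -aHAX --checksum --progress /mnt/old-array/media/ /mnt/new-array/media/

    # ZFS dataset move: snapshot, then send and receive into the new pool
    zfs snapshot oldpool/media@pre-migrate
    zfs send oldpool/media@pre-migrate | zfs receive newpool/media

    # Spot-check the new array against the old one before retiring it;
    # empty output means the trees match by checksum
    rsync -anci /mnt/new-array/media/ /mnt/old-array/media/ | head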

Network optimisation techniques

Networking limits migrations and backup windows. Upgrade links only where you can saturate them. In my lab I added a mix of 10 G and 2.5 G links. The 10 G link is for large dataset moves between servers. The 2.5 G links handle daily traffic and media streaming.

Small optimisations make a big difference:

  • Jumbo frames can increase throughput for large file copies. Test with iperf3.
  • Tune TCP window sizes for long transfers over the local LAN if you see high latency.
  • Use parallel streams for rsync or rclone to keep links busy during migrations.

If your storage array can serve multiple clients, ensure it has enough CPU and network buffer to cope; otherwise add a small dedicated mover host to absorb the migration load.
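
A quick sketch of how I would validate a link and keep it busy during a migration. The IP address, interface name, and directory layout are assumptions, and the parallel rsync trick assumes directory names without spaces:

    # Measure raw throughput between the two hosts first
    # On the receiving host:
    iperf3 -s
    # On the sending host: 4 parallel streams for 30 seconds
    iperf3 -c 192.168.1.20 -P 4 -t 30

    # Enable jumbo frames on both ends and the switch, then re-test
    ip link set dev enp3s0 mtu 9000

    # Keep a 10 G link busy with several rsync jobs in parallel, one per top-level directory
    ls /mnt/old-array/media | xargs -P 4 -I{} \
        rsync -a /mnt/old-array/media/{}/ /mnt/new-array/media/{}/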

Future considerations for backup systems

Aim for predictable restores, not maximum redundancy. For me that meant:

  • Keep at least one off-site copy for disaster recovery.
  • Automate verification. A backup that is not regularly verified is a liability; a minimal verification sketch follows at the end of this section.
  • Review power costs when adding arrays. Larger drives reduce per-TB power and chassis overhead.
  • Consider migrating to Proxmox VE if you need better networking or VM density than Hyper-V offered in my previous setup.

Think of backups as a flow: primary data -> local replica -> off-site copy. Cut the amount of primary data with good duplicate handling and the rest becomes cheaper and faster.
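
As an example of automated verification, a hedged sketch that scrubs a ZFS backup pool and spot-checks a random sample of files against the primary copy (the pool name and paths are placeholders):

    # Let ZFS verify every block on the backup pool against its checksums
    zpool scrub backuppool
    zpool status backuppool | grep -A2 scan:

    # Spot-check a random sample of files against the primary copy
    find /mnt/primary/documents -type f | shuf -n 50 > /tmp/sample.txt
    while IFS= read -r f; do
        rel=${f#/mnt/primary/}
        cmp --silent "$f" "/mnt/backup/$rel" || echo "MISMATCH: $rel"
    done < /tmp/sample.txt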

Final takeaways

  • Audit first. Know where the duplicates live and quantify how much space they use.
  • Use checksums and dry runs before deleting. Move before purge.
  • Consolidate arrays where reliability gains justify the change. I went from five arrays to a plan that centres on a single R5x10 array and backup arrays sized for about 10 TB.
  • Tune your network for large transfers, but upgrade only when you can use the extra throughput.
  • Set a recovery policy and verify backups regularly.

Follow those steps and your homelab storage optimisation will pay in saved power, less time babysitting backups, and predictable restores.
