Originally written on: 2021-11-05
Rewritten on: 2024-02-02

Backups and Corruption

For years, the recommended practice for backups has been the 3-2-1 Backup Strategy: keep 3 copies of your data, with 2 on-site copies on different devices plus 1 off-site copy. There are variations on this theme but they are all quite similar.

Most of this article documents how I have implemented the 3-2-1 strategy for my family's data. This is unlikely to be of interest to anyone else, though perhaps it will give you some ideas if you haven't already done this for your own data. But first I'd like to consider the problem of data corruption.

Some filesystems, such as ZFS, use checksums to detect data corruption. If the data that is read from disk produces the wrong checksum, it must have been corrupted. Once the corruption has been detected, it may be possible to automatically repair the data or at least to alert the user so that they can restore it from a backup. If your filesystem does not do this automatically, you can create a similar system with AIDE or Tripwire. However you choose to set it up, it's not hard to automatically detect file changes.
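
The checksum idea can be illustrated with a few lines of shell. This is only a minimal sketch, not how ZFS, AIDE or Tripwire work internally: the function names make_manifest and verify_manifest are made up for the example, and it assumes that either sha256sum or shasum is installed.

```shell
#!/bin/sh
# Sketch only: record SHA-256 checksums for a directory tree, then verify them later.
# make_manifest and verify_manifest are illustrative names, not part of any tool.

# Use sha256sum where it exists (most Linux systems), otherwise shasum (macOS).
if command -v sha256sum >/dev/null 2>&1; then
    SHA256="sha256sum"
else
    SHA256="shasum -a 256"
fi

# make_manifest DIR MANIFEST
# Write one "checksum  ./relative/path" line per file under DIR.
# Keep MANIFEST outside DIR so that it does not checksum itself.
make_manifest() {
    ( cd "$1" && find . -type f -exec $SHA256 {} + ) > "$2"
}

# verify_manifest DIR MANIFEST
# Re-hash every listed file; the exit status is non-zero if any file
# no longer matches its recorded checksum or has gone missing.
verify_manifest() {
    ( cd "$1" && $SHA256 -c - ) < "$2"
}
```

For example, make_manifest /Users/cc/Pictures /var/tmp/pictures.sha256 (a hypothetical manifest path) records the current state, and a later verify_manifest with the same arguments reports any file whose contents have changed since then.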

The tricky thing — the problem which originally motivated me to write this article — is that such systems can only tell you that your data has changed, not why your data has changed. A filesystem can assume that if the data has changed then something has gone wrong. When the data is read from disk, it should be the same as it was when it was written to disk: if it isn't, then clearly it was corrupted. This is great, and it should really be required functionality in any modern filesystem, but it obviously doesn't protect you from data corruption at any layer above the filesystem. The filesystem cannot tell the difference between a user deliberately editing a document and some malicious or buggy application overwriting the user's file with junk. Similarly, both AIDE and Tripwire would detect a change to the file — as the checksum would have changed — but neither can automatically determine whether the change was desired or erroneous. The only thing that can tell whether a change to a document was a deliberate edit or unwanted corruption is the human that owns the file.

On a computer that is used by multiple people, such as my family's shared desktop, different users make many different changes to their photos and documents every day. A daily list of modified files would be quite large and no single person would be able to vet the list for unexpected entries. Even worse, these days not every document or image is stored as its own file with a meaningful name: many of my family's photographs live inside Apple's proprietary Photos Library, which uses UUIDs for filenames. How could anyone tell whether a file change is deliberate or not if all they have is an opaque identifier? When documents are stored in a database, the problem is even worse.

As long as we have complete backups, any corrupted files can be restored once the corruption is noticed. However, I was prompted to write this article after I discovered that many of my family photos had been corrupted months earlier. Photographs are a prime example of data that are both valuable and long-lived (decades into the future, I want to be able to look at photos of my children when they were growing up) but that are rarely checked by hand: I do not go over my entire photo collection every day. Restoring individual files from backup, when many other files had genuinely changed over the same period, was an extremely painful and tedious process. The sooner the corruption is detected, the easier it is to repair.

I have not seen any discussion of this problem — all technical discussions around data corruption seem to begin and end with checksums — and I am not aware of any good solutions. Applications such as Beyond Compare are good at graphically displaying the differences between multiple file types: perhaps what we need is the ability to use such a tool to compare the current state of a user's files with the last manually-verified backup. Some way to compare different backups in order to determine the last uncorrupted version would also be helpful. Maybe this could be semi-automated with ZFS snapshots and Beyond Compare? With my current tooling, restoring multiple backups from an external drive and comparing them all is very slow and laborious.
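
One small piece of this can be sketched in shell: given a copy that has been checked by hand and a series of later backups, find the first backup in which a file no longer matches. The function name first_change is hypothetical, and the sketch assumes each backup is available as a plain directory tree (for example a restored backup or a mounted snapshot). It cannot tell a deliberate edit from corruption; it only narrows down where a human needs to look.

```shell
#!/bin/sh
# Sketch only: find the first backup in which a file stops matching a known-good copy.
# first_change is an illustrative name; snapshots are assumed to be plain directories.

# first_change RELPATH GOOD_SNAPSHOT LATER_SNAPSHOT...
# Print the first later snapshot whose copy of RELPATH differs from the copy in
# GOOD_SNAPSHOT (a missing file also counts as different).
# Returns 1 if every later snapshot still matches.
first_change() {
    relpath="$1"
    good="$2"
    shift 2
    for snap in "$@"; do
        if ! cmp -s "$good/$relpath" "$snap/$relpath"; then
            printf '%s\n' "$snap"
            return 0
        fi
    done
    return 1
}
```

The later snapshots must be passed in chronological order. The version to restore is then the one in the snapshot just before the one printed, provided the change was not a deliberate edit.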

Anyway, the rest of this article documents my mundane backup strategy.

Collection in one place

The first step is to collect all our data in a single place. One day, I would like to get a NAS that we can all use from our various devices. For now, the family's shared desktop, a 2014 iMac, serves as the shared store. Files from our other computers are periodically copied to the iMac using Unison or rsync.

On-site backup

The "2" of the "3-2-1" backup strategy is provided by Apple's Time Machine software, backing up to an external USB drive. Time Machine cannot be relied on to save all your data (see below) so it is definitely not sufficient alone — but it has a very convenient UI and works well as the first step when looking for a backup.

Off-site backup

Off-site storage is provided by Backblaze B2. I originally chose B2 simply because it is so much cheaper than most other cloud storage services, but I have since been very happy with the functionality that it offers.

The backups are performed using rclone with an extremely simple script:

$ cat /usr/local/bin/offsite-backup.sh
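#!/bin/sh
# Stop at the first error (-e), reject unset variables (-u) and trace each
# command to stderr (-x).
set -eux

# Mirror each directory to its bucket. --links stores symlinks as small
# .rclonelink files rather than following them.
/usr/local/bin/rclone --fast-list --transfers=32 sync --links /Users/cc/Documents b2:cc-docs
/usr/local/bin/rclone --fast-list --transfers=32 sync --links /Users/cc/Pictures b2:cc-pics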

Here cc-docs and cc-pics are B2 buckets. The real script has more lines in it: everyone in the family has a user account on the iMac and a couple of B2 buckets for their documents and their pictures.
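
The per-user repetition could also be written as a function. This is a sketch rather than the real script: the function name backup_user is made up, the user names in the usage example are hypothetical, and it assumes that every user's buckets follow the same naming pattern as cc-docs and cc-pics.

```shell
#!/bin/sh
# Sketch only: one function call per family member instead of repeated lines.
# RCLONE can be overridden (for example with "echo") to see what would be run.
RCLONE="${RCLONE:-/usr/local/bin/rclone}"

# backup_user USER
# Sync USER's Documents and Pictures to buckets named USER-docs and USER-pics.
# The bucket naming pattern is an assumption based on cc-docs and cc-pics.
backup_user() {
    user="$1"
    "$RCLONE" --fast-list --transfers=32 sync --links "/Users/${user}/Documents" "b2:${user}-docs"
    "$RCLONE" --fast-list --transfers=32 sync --links "/Users/${user}/Pictures" "b2:${user}-pics"
}
```

The body of the script then reduces to a loop such as: for user in cc alice bob; do backup_user "$user"; done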

The buckets have versioning enabled, so all old versions of files are retained. That's very important! Apple's Time Machine does not retain all versions of the files that it backs up: it keeps hourly backups for a day, then daily backups for a month, and then weekly backups after that. This means that you cannot rely on Time Machine to repair corrupted data. If you make a deliberate change to a file mid-week and then the file is corrupted before the end of the week, your change will be lost: Time Machine will only retain the versions before your change and the corrupted version. If the change was actually the original creation of the file, then the whole file could be lost. When my family photos were corrupted, there were a few cases in which Time Machine had lost the photo entirely and the only surviving copy was in Backblaze B2.
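
Old versions in a versioned bucket can be inspected with rclone's --b2-versions flag, which lists them with a timestamp added to the file name. The following is a minimal sketch with made-up function names; it assumes the b2 remote from my configuration and is not a tested recovery procedure.

```shell
#!/bin/sh
# Sketch only: inspect and restore old file versions kept by a versioned B2 bucket.
# Function names are illustrative; RCLONE can be overridden for a dry run.
RCLONE="${RCLONE:-/usr/local/bin/rclone}"

# list_versions REMOTE_PATH
# List files under REMOTE_PATH, including old versions. With --b2-versions,
# rclone shows old versions with a timestamp added to the file name.
list_versions() {
    "$RCLONE" --b2-versions lsl "b2:$1"
}

# restore_version VERSIONED_REMOTE_FILE LOCAL_FILE
# Copy one specific old version, named as it appears in the output of
# list_versions, to a local file.
restore_version() {
    "$RCLONE" --b2-versions copyto "b2:$1" "$2"
}
```

For example, list_versions cc-pics shows every retained version in that bucket, and restore_version can then be given one of the timestamped names from that listing together with a local destination path.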

The rclone configuration is also very simple:

$ cat /Users/cc/.config/rclone/rclone.conf
[b2]
type = b2
account = c981ad2e78f9
key = 619d8af581bb867f157de44f2fae976e3cdbb1d180

The backup script is run automatically every night at 3:30am by launchd. I want it to run regardless of who is logged in, so it needs to be a LaunchDaemon rather than a LaunchAgent. It runs as root rather than as my own user because it backs up files from other family members' home directories as well as my own.

Ideally, I would have a separate user for running backups and everyone would ensure that the files they want backed up are readable by that user. But that's not going to happen any time soon in this household, so I'd rather take the risk of running rclone as root than the risk of failing to back up something important. Since it is running as root, I need to set XDG_CONFIG_HOME so that rclone will find the right configuration file:

$ cat /Library/LaunchDaemons/com.cc.rclone.plist 
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN" "http://www.apple.com/DTDs/PropertyList-1.0.dtd">
<plist version="1.0">
	<dict>
		<key>Label</key>
		<string>com.cc.rclone</string>
		<key>EnvironmentVariables</key>
		<dict>
			<key>XDG_CONFIG_HOME</key>
			<string>/Users/cc/.config</string>
		</dict>
		<key>Program</key>
		<string>/usr/local/bin/offsite-backup.sh</string>
		<key>StandardOutPath</key>
		<string>/var/log/rclone.stdout</string>
		<key>StandardErrorPath</key>
		<string>/var/log/rclone.stderr</string>
		<key>StartCalendarInterval</key>
		<dict>
			<key>Hour</key>
			<integer>3</integer>
			<key>Minute</key>
			<integer>30</integer>
		</dict>
	</dict>
</plist>

Though I use Apple's Photos application to organise my photographs, solely because it is the only way to conveniently share photos between Apple devices, I do not like having my data locked up in Apple's proprietary format. So the final step in my backup system is a small AppleScript program to export photos. I run this to copy all my photographs out of Photos, retaining the album and folder structure as directories, and it is this export that is backed up to Backblaze B2.