Written on: 2021-11-05

Backups and Corruption

For years, the recommended practice for backups has been the 3-2-1 Backup Strategy. That means that you have 3 copies of your data: a couple of on-site copies on 2 different devices, plus 1 off-site copy. There are variations on this theme but they are all quite similar.

Having a local backup of your data is easy these days. You can buy an external USB hard drive pretty cheaply, and there are lots of free backup applications that work well. If you're using Apple's macOS or Microsoft's Windows, your OS has a decent backup application built-in.

For those of us on a limited budget, the off-site backup can be trickier. For a long time, I had two external hard drives which I swapped every so often. While one wasn't in use, I stored it somewhere else (such as an office, or my parents' house). This works, but it means that your off-site backup is always quite old (depending on how often you go into the office, or visit your parents).

Today, "cloud storage" is fairly cheap. At first, I used it only as a mirror of my data rather than a proper backup: that means that I stored a copy of the current version of all my files in "the cloud" but I did not store all previous versions there too. This works well for emergencies: if my house burns down, I can restore my data from the cloud mirror. But if my data becomes corrupted, or I accidentally delete a file, then that corruption or deletion will be mirrored too. A proper backup stores all previous versions of data, so that you can recover from corruption or accidental deletions. The downside is that a proper backup obviously requires more storage space, so I used a mirror to save money. But as we get towards the end of 2021, perhaps cloud storage is now cheap enough to use as a proper backup.

My family's main computer is an Apple iMac running macOS, so I back up to an external hard drive with Apple's Time Machine. This works reasonably well. I then synchronize my laptop and other devices with the iMac (using Unison or rsync) so that everything is backed up in one place. This gives me the "2" of the "3-2-1".

For the off-site copy, I back up to Backblaze B2 with rclone. I originally chose B2 just because it is so much cheaper than most other cloud storage services, but I've since been really happy with the functionality that it offers. I have a very simple script:

$ cat /usr/local/bin/offsite-backup.sh
#!/bin/sh
set -eux

/usr/local/bin/rclone --fast-list --transfers=32 sync /Users/cc/Documents b2:cc-docs
/usr/local/bin/rclone --fast-list --transfers=32 sync /Users/cc/Pictures b2:cc-pics

Where cc-docs and cc-pics are B2 buckets, and my rclone configuration looks like this:

$ cat /Users/cc/.config/rclone/rclone.conf
[b2]
type = b2
account = c981ad2e78f9
key = 619d8af581bb867f157de44f2fae976e3cdbb1d180
hard_delete = true

Note the hard_delete option. It means that if I delete a file locally, it will also be permanently deleted from my B2 backup (making it more of a mirror than a backup). Omitting that option would leave deleted files marked as hidden in B2 but still retrievable if necessary. You can also choose to enable versioning when you create the B2 buckets, which will preserve old versions of files that you modify.
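
To check what B2 is actually retaining, rclone can include old versions in its listings. The following is only a minimal sketch, assuming the b2 remote and the cc-docs bucket shown above. The function name list_versions and the RCLONE override variable are hypothetical, introduced here for illustration.

```shell
#!/bin/sh
# Hypothetical helper: list every stored version of the files in a B2 bucket.
# --b2-versions makes rclone include old versions in listings; they show up
# with the upload timestamp appended to the file name.
# Set RCLONE=echo to preview the command instead of running it.
list_versions() {
    bucket=$1
    "${RCLONE:-/usr/local/bin/rclone}" --fast-list --b2-versions ls "b2:$bucket"
}
```

Calling list_versions cc-docs should then show each current file alongside any older versions that B2 still holds.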

This script is automatically run nightly at 3:30am by launchd. I want it to run regardless of who is logged in, so it needs to be a LaunchDaemon rather than a LaunchAgent. And it will run as root rather than my own user because it will back up files from other family members' home directories as well as my own.

Ideally, I would have a separate user for running backups and everyone would ensure that the files they want backed up are readable by that user. But that's not going to happen any time soon in this household so I'd rather take the risk of running rclone as root than the risk of failing to back up something important:

$ cat /Library/LaunchDaemons/com.cc.rclone.plist 
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN" "http://www.apple.com/DTDs/PropertyList-1.0.dtd">
<plist version="1.0">
	<dict>
		<key>Label</key>
		<string>com.cc.rclone</string>
		<key>EnvironmentVariables</key>
		<dict>
			<key>XDG_CONFIG_HOME</key>
			<string>/Users/cc/.config</string>
		</dict>
		<key>Program</key>
		<string>/usr/local/bin/offsite-backup.sh</string>
		<key>StandardOutPath</key>
		<string>/var/log/rclone.stdout</string>
		<key>StandardErrorPath</key>
		<string>/var/log/rclone.stderr</string>
		<key>StartCalendarInterval</key>
		<dict>
			<key>Hour</key>
			<integer>3</integer>
			<key>Minute</key>
			<integer>30</integer>
		</dict>
	</dict>
</plist>
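
Writing the plist is not quite enough: launchd also has to be told about it, and it is picky about the file's ownership and permissions. Here is a hedged sketch of an install step, assuming the plist path shown above. The helper names run and install_daemon, and the DRY_RUN variable, are hypothetical.

```shell
#!/bin/sh
# Hypothetical install helper for a LaunchDaemon plist.
# With DRY_RUN=1 the commands are printed instead of executed.
run() {
    if [ "${DRY_RUN:-0}" = 1 ]; then
        echo "$*"
    else
        "$@"
    fi
}

install_daemon() {
    plist=$1
    # 1. plutil -lint checks that the file parses as a property list.
    # 2-3. launchd refuses daemons that are not owned by root or that are
    #      writable by group or others.
    # 4. launchctl load -w registers and enables the job (legacy syntax,
    #    which macOS 10.15 still accepts).
    run plutil -lint "$plist" &&
        run sudo chown root:wheel "$plist" &&
        run sudo chmod 644 "$plist" &&
        run sudo launchctl load -w "$plist"
}
```

Running it first with DRY_RUN=1 prints the four commands without touching anything, which is a cheap way to check the path before doing it for real.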

So now I have my complete 3-2-1 backup strategy. All is well, right? Apparently not. One day in September this year, my iMac upgraded itself from macOS 10.15.7 to... macOS 10.15.7. It was apparently a security patch, which doesn't seem to warrant a version number change in Apple's naming scheme. After this upgrade, I found that a large number of files in my home directory were corrupted. Parts of the files were overwritten with what looked like random bytes. By comparing the Time Machine backups from right before and right after the upgrade, I could see that all the data corruption occurred during the upgrade. This is bad: I feel like I shouldn't have to say this, but an OS upgrade - especially one so minor that it doesn't even change the version number - really shouldn't write random bytes all over my documents and photos. But I have backups, so I can just restore from the backups and all will be well, right?

Unfortunately, I did not notice the corruption right away. It was only weeks later when I was going through old photos that I discovered that many of them were now junk. By this time, I had made many deliberate changes to my data: I had edited photos and documents. So I couldn't just restore from a backup: I had to work out which changes were deliberate and which were corruption, then retain the deliberate changes but overwrite the corruption with a pre-upgrade backup. This was a real pain.

How could I have avoided this? How could I have detected the corruption immediately, so that I could restore from a backup immediately before any deliberate changes were made? I don't see how this is possible without manual intervention. Clever file systems such as ZFS wouldn't really help here, as ZFS wouldn't know the difference between me deliberately changing a file and a macOS upgrade writing crap all over my data. Only I know which files I meant to change and which I didn't.
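
A checksum manifest does not solve this either, since it cannot tell a deliberate edit from corruption. It can, however, turn "which files changed?" into a short list for a human to review. The following is a minimal sketch of that idea and not something from my actual setup. The function names and the manifest format are hypothetical, and it assumes file names contain no newlines or backslashes.

```shell
#!/bin/sh
# Hypothetical helpers; names and manifest format are illustrative only.

# Hash files with whichever SHA-256 tool is installed
# (sha256sum on most Linux systems, shasum on macOS).
sha256() {
    if command -v sha256sum >/dev/null 2>&1; then
        sha256sum "$@"
    else
        shasum -a 256 "$@"
    fi
}

# Print one "<hash>  <path>" line per file under directory $1, sorted.
make_manifest() {
    ( cd "$1" && find . -type f ! -name .DS_Store -print |
        while IFS= read -r f; do sha256 "$f"; done ) | LC_ALL=C sort
}

# Print the paths that are new, or whose content differs, in manifest $2
# compared with manifest $1. Each line is a 64-character hash plus two
# separator characters, so the path starts at column 67.
changed_files() {
    LC_ALL=C comm -13 "$1" "$2" | cut -c67-
}
```

Taking a manifest right before an OS upgrade and another right after it would give a list of changed files at a moment when no deliberate edits are expected.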

So I have now added a fourth component to my backup strategy. I have an account with rsync.net to which I make manual backups using rsync. I have configured an alert in the rsync.net Account Manager to email me if my data doesn't change in more than a week, to remind me to run the manual backup at least weekly. Then I run the backup using a script like this:

$ cat backup-to-rsync.net.sh
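#!/bin/sh
set -eu

# The first argument selects the mode. Default to empty so that set -u
# does not abort before the usage message can be printed.
arg=${1:-}

if [ -z "$arg" ]; then
	echo "Usage: $0 {dry|real}" >&2
	exit 1
fi

# Only the literal argument "real" performs the sync; anything else is a dry run.
case "$arg" in
	real)
		dry_run_option=
		;;
	*)
		dry_run_option=-n
		;;
esac

echo "Syncing Documents..."
# $dry_run_option is deliberately unquoted so that it disappears when empty.
rsync -avzh --progress $dry_run_option --delete --delete-excluded --exclude=.DS_Store /Users/cc/Documents rsync.net:cc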

This syncs my local ~/Documents directory with ~/cc/Documents in my rsync.net account. Most importantly, I can run it like this:

./backup-to-rsync.net.sh dry

And it will show me a list of all the files that have changed since the last backup. If and only if those changes are the changes that I expect to see, I can run the script again with the argument real to actually perform the sync. If I see changes that I don't expect to see, it's time to investigate.
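
When the dry run lists a lot of files, a summary can make unexpected changes stand out before reading the full list. This is a hedged sketch, assuming rsync's -i (--itemize-changes) option is added to the dry-run command. The function name summarise_changes is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: summarise "rsync -n -i ..." output read from stdin.
# In itemized output the first field of each line describes the change:
#   <f or >f followed by plus signs : a file that is new on the receiving side
#   <f or >f with other flags       : an existing file that would be re-sent
#   *deleting                       : a file that --delete would remove
# Directories, attribute-only changes and rsync's header and footer lines
# are not counted.
summarise_changes() {
    awk '
        $1 == "*deleting"  { deleted++;  next }
        $1 ~ /^[<>]f[+]/   { added++;    next }
        $1 ~ /^[<>]f/      { modified++; next }
        END { printf "modified=%d added=%d deleted=%d\n", modified, added, deleted }
    '
}
```

Piping the dry-run rsync command (with -i added) into summarise_changes gives a single line of counts. A week with three expected edits and three hundred "modified" files is an obvious cue to investigate.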

This is a bit shit. Even if I run this every week, I could still have a whole week's worth of deliberate changes to unpick if I do detect corruption. But I don't want to run this more often than weekly, because it is a manual process and I am lazy. Unfortunately, I cannot see a way to automate this: only I can look at a list of changes and decide whether or not they are the changes that I meant to make.

You might also wonder why I am using rsync.net for this, rather than just creating duplicate B2 buckets and running rclone --dry-run. That would also work, but I am wary of depending too much on a single vendor.

One other thing I discovered when trying to undo this corruption is that Time Machine had not backed up all my changes. Although Time Machine runs hourly, it does not retain all the backups. It keeps hourly backups for a day, then daily backups for a month, and then weekly backups after that. So once the backups have been thinned out, any version of a file that existed only between two retained backups is lost: if you edit a file and it changes again (in this case, gets corrupted) before the next backup that Time Machine keeps, your edit is gone. In contrast, a B2 bucket with versioning will retain all previous versions of every file. For a few files, I had to go to B2 to recover my edits.
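
The retention schedule is easier to reason about as a small function. This is only an illustration of the tiers described above and not how Time Machine is implemented. The name tm_granularity is hypothetical, and a "month" is approximated as 30 days.

```shell
#!/bin/sh
# Hypothetical helper: which granularity of Time Machine backup is still
# kept for a moment that is $1 hours in the past, following the schedule
# described in the text. A month is approximated as 30 days (an assumption).
tm_granularity() {
    age_hours=$1
    if [ "$age_hours" -lt 24 ]; then
        echo hourly
    elif [ "$age_hours" -lt $((24 * 30)) ]; then
        echo daily
    else
        echo weekly
    fi
}
```

An edit made 1000 hours (about six weeks) ago falls into the weekly tier, so it survives only if one of the retained weekly backups happened to capture it before the file changed again.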

That means that the off-site backup is not just for house-burning-down emergencies: it sometimes needs to function as a regular backup if Time Machine has missed something. In that case, as great as Backblaze has been so far, I don't want to put all my eggs in one basket. The cloud is just someone else's computer and all that.

So I decided to use rsync.net for my off-site golden record, to add a bit of redundancy. I like that they give me SSH access to a ZFS file system and I can do whatever I want with it, including setting my own schedule for creating immutable ZFS snapshots. They aren't so cheap... but I was pretty pissed off when I discovered that some of my wedding photos and baby photos had been corrupted, so I'm now willing to pay for it.