Tar vs Rsync

Files getting lost or corrupted? A most heinous challenge, dude! So when strange things are afoot at the Linux-workstation, we totally hit our backups. We need to get started…but what commands should we use? We can use:

 

1. tar to make daily backups of all files,
2. or combine find and tar to back up changed files…
3. or use the magical rsync.

Some need more scripting to back up and some need more scripting to restore. Let’s hop in the phone booth and zoom through our options…

Tar: Whole Directories

Excellent: The clear advantage to tarring whole directories is that when you restore from any particular point in time, you get exactly the files that existed then, including ones that later went missing. You can also do a wholesale recovery very quickly, just by expanding the tarball for that directory. Below is an example of applying parallel compression to speed up archive creation:

tar cf $backup.tbz2 --use-compress-program lbzip2 .
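
And the wholesale recovery is just the reverse; a minimal sketch (the archive name here is only an example):

cd /home
tar xf /mnt/backups/home-2016-11-07.tbz2 --use-compress-program lbzip2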

Bogus: Many of your data files are pretty large: megabytes, even gigabytes. Backing one of those up every day is likely going to be costly, especially when it rarely changes.

Tar: Changed Files

Excellent: We immediately start saving space. We can even detect when directories have had no changes since the last backup and avoid making a misleading empty archive (see the sketch after the commands below). You can detect when files disappear by creating a list of files with find:

find -mindepth 0 -maxdepth 1 -type f | sort > .files
find -type f -newer .recent > backup-list   # files changed since the last backup
touch .recent                               # record this backup's timestamp
tar cf $backup.tbz2 --use-compress-program lbzip2 -T backup-list
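
And here is a minimal sketch of the "avoid a misleading empty archive" idea mentioned above: only run tar when the list is non-empty.

if [ -s backup-list ]; then
    tar cf $backup.tbz2 --use-compress-program lbzip2 -T backup-list
else
    echo "No changes since last backup; skipping archive"
fi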

The .files file will be picked up in the backup-list file. And while we’re here, let’s make shortcut functions for our tar commands, so we can save our righteous keystrokes:

function Tarup() {
    local archive="$1"; shift      # first argument is the archive name
    tar cf "$archive" --use-compress-program lbzip2 "$@"
}
function Untar() {
    local archive="$1"; shift
    tar xf "$archive" --use-compress-program lbzip2 "$@"
}
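
For example (the archive path here is just an illustration), creating today's incremental and expanding it again later looks like:

Tarup /mnt/backups/home-2016-11-03.tbz2 -T backup-list
Untar /mnt/backups/home-2016-11-03.tbz2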

Bogus: Restoring this type of backup is more difficult. You have to begin with a full backup (a grandfather backup, yearly full, monthly full, or such). Then you apply each incremental archive in order, and at the end of that series, use your .files list to remove any files that were not present at the last backup.

cd /home
yesterday=2016-11-02
# Start from the last full backup, then apply each incremental in order.
Untar /mnt/backups/yearly-home.tbz2
for f in /mnt/backups/home-*.tbz2 ; do
    Untar $f
    # Stop once we have applied the archive for our restore point.
    [ x$f == x/mnt/backups/home-$yesterday.tbz2 ] && break
done
# Compare what is on disk now with the .files list from the last backup.
find -mindepth 0 -maxdepth 1 -type f | sort > .newfiles
diff .files .newfiles | grep '^>'
read -p "These files will be deleted, proceed? " ANS
if [ x$ANS == xy ] ; then
    diff .files .newfiles \
    | grep '^>' \
    | tr -d '>' \
    | while read F; do rm -f $F ; done
fi

You will have to verify this process. File names with spaces and subdirectories might not work with this example as I have coded it. This is why you totally verify your backup and restore process!
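
If you want to harden that deletion step a bit, here is one sketch: sed strips the diff marker exactly, and quoting keeps names with spaces intact (files in subdirectories still need more thought):

diff .files .newfiles \
| sed -n 's/^> //p' \
| while IFS= read -r F; do rm -f "$F" ; done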

Rsync and daily backups

Excellent: there are a lot of advantages to rsync:

  • Rsync decides what to copy by comparing each file against the copy on the backup side (size and mtime by default, full checksums with the -c switch), so files whose mtimes would fool a find -newer approach can still get backed up.
  • For simple backups you typically need little to no scripting.
  • It is most excellent with ssh!
  • A clever exclude syntax comes with the command.
  • With the --delete switch you can remove files from the backup that were deleted on your disk.

So, rsync is great if you want to mirror directories.

Totally cool: rsync can also do hard links across directories! You can save radical space by pointing it at a previous backup directory on the backup file system, so unchanged files are stored only once. You use the --link-dest switch:

rsync -a --delete --link-dest=/mnt/backups/home-$yesterday.d \
/home /mnt/backups/home-$today.d

This requires a home-2016-11-06.d directory and creates the home-2016-11-07.d directory. Files deleted on the seventh are still there in the sixth’s directory. Files that are the same are just hard links between the directories home-2016-11-06.d and home-2016-11-07.d. (Refer to Jeff Layton’s article about using rsync for incremental backups.)
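
The $yesterday and $today variables are nothing magic; a minimal sketch of setting them (GNU date assumed):

today=$(date +%F)                   # e.g. 2016-11-07
yesterday=$(date -d yesterday +%F)  # e.g. 2016-11-06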

Bogus: Rsync might not be excellent for your needs:

  • No on-disk compression of backups (compression only over ssh)
  • A point-in-time set of backups uses a new directory for each day with the hard-link technique above. That requires scripting.
  • Be careful of the -b option! It means backup…but it renames the old copy of each changed file in the directory you’re backing up to. Keep that up for two years of backups and you could end up with 730 copies:
.ssh/known-hosts.729
.ssh/known-hosts.728
.ssh/known-hosts.727
...
.ssh/known-hosts
  • Millions of small files? Whoa, dude: rsync can slow down with very large sets of small files. You might need to run a few rsync commands in parallel (see the sketch after this list).
  • Limited memory? I’ve seen rsync take up hundreds of megabytes of memory when backing up hundreds of thousands of files (it stores the list of file attributes for comparison). You might have to script your rsync(s) so they walk the directory tree and only back up a subset of files at a time.
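
Here is a rough sketch of that split-it-up idea; the paths are only examples, and it assumes GNU xargs for the -P (parallel) switch:

# Mirror each top-level directory under /home with its own rsync,
# running up to four of them at a time.
find /home -mindepth 1 -maxdepth 1 -type d -print0 \
| xargs -0 -P4 -I{} rsync -a --delete {} /mnt/backups/home-mirror/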

Remember to backup! Stay Excellent!

Backing Up Only Recent Files

Gnarly Backup School Series

Getting backups done is important. Sometimes what you have to back up and what you want to back up make quite the contrast. Consider that I never want to back up my .mozilla/firefox//cache directory. Let’s cover how to avoid that. Because if you don’t, things get really bunk, little dude.

Before you freak out, man, prepare for some regular expressions. More regular than “huh” and “eh.” Everything separated by the gnarly pipe | becomes an alternative in the pattern that egrep matches.

IGNORE="$LOGNAME/(\.mozilla/firefox/aq0d3bz0xy/cache|\.gvfs|\.cache|\.Trash)"
cd $HOME/..
find $HOME -type f \
| egrep -v "$IGNORE" \
> /tmp/backup_list
tar cf /mnt/backups/backup.tbz2 --use-compress-program lbzip2 -T /tmp/backup_list

That will back up your whole home directory except anything you slap into IGNORE. If that still includes a whole git tree or two, well, you might be waiting for a long time.
Therefore, let’s concentrate on things that just changed since the last backup.

cd $HOME/..
if [ ! -f $HOME/.recent ]; then
   # first run: no timestamp yet, so take everything that isn't ignored
   find $HOME -type f \
   | egrep -v "$IGNORE" \
   > /tmp/backup_list
else
   # later runs: only files changed since the last backup
   find $HOME -type f \
   -newer $HOME/.recent \
   | egrep -v "$IGNORE" \
   > /tmp/backup_list
fi
touch $HOME/.recent
tar cf /mnt/backups/backup.tbz2 --use-compress-program lbzip2 -T /tmp/backup_list
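
One thing to watch: as written, every run overwrites the same archive. A minimal tweak (the path is just an example) date-stamps each run so the incrementals pile up instead of replacing each other:

backup=/mnt/backups/backup-$(date +%F).tbz2
tar cf "$backup" --use-compress-program lbzip2 -T /tmp/backup_list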

Now we have a way to back up only recent things, using all our cores courtesy of lbzip2.

Finding the Offending Directories

Gnarly Backup School Series

 

We have washed up on the shores of find…all these files drifting around us in the surf. Remember: In the last article I asked you to try finding all the files in your home directory more recent than your last backup. Here, we’ll pull the tab on that can and take a sip:

 $ find ~ -type f -newer /var/backup/home-jed-2014-12-11.tgz | wc -l
3067

Oh man…3067 files! That’s way more than I’ve worked on recently, right? Something smells fishy…why are we getting so many files? Let’s snorkel around with the find program for a bit.

Follow along here: we are going to be doing two finds. The first prints out the directory each file lives in. The second pipes that same output through sort and uniq, which gives us a shorter list with every directory just once. We then loop through that unique list and count each occurrence in the long file:

 

 $ find -type f -mtime -2 -printf "%h\n" \
   | sort \
   > /tmp/s-s.txt

The %h\n format prints each file’s directory (the leading path, without the file name itself) followed by a newline. Any time you see printf, no matter the language, you have to add your own newlines. (The \ line continuations stand out. That’s not how I type at the terminal, but it is how I write shell scripts. It’s quite legible, too.)

That find command gives us one directory entry per file found. Next, we do the same thing again but distill it with uniq so our loop won’t chew through duplicates. Any time you want to use uniq, make sure you give it sorted input…it’s a stupidly simple program.
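
A quick illustration of why the sort matters (uniq only collapses adjacent duplicates):

 $ printf "a\nb\na\n" | uniq         # prints a, b, a -- the second a survives
 $ printf "a\nb\na\n" | sort | uniq  # prints a, b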

 

 $ find -type f -mtime -2 -printf "%h\n" \
   | sort \
   | uniq \
   > /tmp/s-u.txt

The difference between the s-s.txt file and the s-u.txt file is that the former has one entry per file, while the latter lists each directory just once. We’re going to make a totally radical histogram of where those files show up.

 $ while read d ; do 
      N=`grep -c $d /tmp/s-s.txt`
      echo  "$N $d"
done \
< /tmp/s-u.txt \
| sort -n \
| tail 

6 ./.mozilla/firefox/grestf1r.default/weave/logs
8 ./.mozilla/firefox/grestf1r.default/datareporting
9 ./Documents
12 ./.mozilla/firefox/grestf1r.default/weave
66 ./.mozilla/firefox/grestf1r.default
2963 ./.cache/mozilla/firefox/grestf1r.default/cache2/entries
2964 ./.cache/mozilla/firefox/grestf1r.default/cache2
2968 ./.cache/mozilla/firefox/grestf1r.default
2970 ./.cache
3073 .

We’re awash in browser cache files! And we typically don’t want to waste disk or processor time
backing those up. (Unless company policy dictates.)

 

 

But is it accurate?

 

Whoa…dude…great point! Our grep of the file s-s.txt above

 

N=`grep -c $d /tmp/s-s.txt`

matches anywhere in the line, as a substring. That means ./.cache matches almost every line in there, even though ./.cache by itself probably holds very few files. A good system surfer can avoid this spray by anchoring the pattern with some regular expression goodness:

N=`grep -c "^$d\$" /tmp/s-s.txt`

Awesome difference! We got something almost totally different:

5 ./.mozilla/firefox/grestf1r.default/weave/changes
6 ./.lastpass
7 ./.mozilla/firefox/grestf1r.default/datareporting/archived/2015-11
7 ./.mozilla/firefox/grestf1r.default/saved-telemetry-pings
11 .
12 ./.thunderbird/qfp0l4ts.default/ImapMail/imap.googlemail.com
20 ./.thunderbird/qfp0l4ts.default/ImapMail/mail.candelatech.com
21 ./.mozilla/firefox/grestf1r.default
22 ./.thunderbird/qfp0l4ts.default
2963 ./.cache/mozilla/firefox/grestf1r.default/cache2/entries

So…dudes: really smart system surfers
need that shell trick above. It’s useful elsewhere, too.
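
And if you don't need the intermediate files at all, uniq -c builds the same histogram in one pass (same -mtime window as before):

 $ find -type f -mtime -2 -printf "%h\n" | sort | uniq -c | sort -n | tail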

A Systems Guru point of view…

If I could mount the .cache directory on a different device I probably would. It would be easier to write backup scripts…and not so much of this browser kelp, right? You can do the same thing with software build systems: often a project has a src/ directory and a build/ directory. Mounting build/ separately moves the…like…totally ephemeral output of the build out of the backup’s way. Best of all, things like build/ directories can be backed by RAM drives (using the tmpfs driver), as sketched below.
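
A minimal sketch of that tmpfs idea; the project path and size are made up, so adjust them for your setup:

# One-off: put a 2GB RAM-backed file system over the build directory.
sudo mount -t tmpfs -o size=2G tmpfs /home/jed/project/build

# Or make it stick across reboots with an /etc/fstab line like:
# tmpfs  /home/jed/project/build  tmpfs  size=2G  0  0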

Righteous!

Other uses

This totally killer histogram technique is useful when parsing logs, too. Consider these codes:
404 is your typical file not found condition, 500 is your server error condition, and 200 is your normal success condition. (There are more, but this is enough for now.) Let’s grep through our /var/log/apache2/access_log file for how our web server is doing:

 $ echo -e "200\n404\n500" > codes.txt
 $ while read code ; do
     N=`fgrep -c " $code " access_log`
     echo "$code $N"
 done < codes.txt | sort -n

I’m not going to go deeper into that example because it’s straying off topic from find. (It’s cool to contact me to suggest a topic, right?)

But, Jed, dude? How do we avoid backing up those cache files? That’s for next episode. Go grab your boards and race down the beach, I’ll see you next week.

 

Finding Recent Files

Gnarly Backup School Series

Before you suggest that it is better to use a backup program like Bacula or Amanda, I shall insist that making backups from the command line is mighty useful. In scenarios where you are running an embedded system (RPi, BeagleBone) or a headless server where you want to keep a minimal install footprint, writing a just-fits backup script can often be the correct choice.

This will be the first of a series of backup articles: we will learn various aspects of the find command, and move onto ways of using rsync and find in conjunction. I’m totally sure it will be a killer trip, man.

A bunk trip for most people is reading the man page for find. It’s not a command you learn from the man page: there are numerous options and endless possibilities for confusion…unless you have a tutorial. So, little dudes, let’s begin with the general features of find:

  • There’s a way to mark paths (-path) and not descend into them (-prune).
  • You can make file name patterns with -name "*thing1*" and -iname "*thing2*"
  • And they can be combined with AND, OR and NOT: -a, -o, !
  • But don’t stop there. ALL aspects of the Unix-style file system can be queried. The most likely things we want to know are file type (-type) and last modified time (-mtime). A combined example follows this list.
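
Here is a quick taste of combining these; the chromium path is just a hypothetical directory to skip:

find ~/.config -path ~/.config/chromium -prune -o -type f -name "*.conf" -print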

Excellent! Or well, that’s the theory. Let’s do three beginner examples:

  1. Find assumes the current directory when you don’t name a path (the search path, when you do give one, is always the first argument).
    cd ~; find -type f

    That shows way too many files. Great for a first backup…not useful for a daily backup.

  2. Getting sharper: what config files did we modify in the last day?
    find .config -type f -mtime -1

    and the last two days?

    find .config -type f -mtime -2
  3. You and I know we don’t back up every day: that’s just life. Being smart, we’ll use the date of our last backup to build the list of files we have modified since then. Getting gnarly:
    find ~/.config -type f -newer /var/backup/home-jed-2014-12-11.tgz

I’ve done my best to keep this one simple. Next edition, we’ll be making more of our backup by taking less. Dwell on using the last command above, but on your home directory. Is that what you really want to back up? Let me know below in the comments section.