
Finding the Offending Directories

Posted on November 25, 2015
Jed Reynolds has been known to void warranties and super glue his fingers together. When he is not doing photography or fixing his bike, he can be found being a grey beard programmer analyst for Candela Technologies. Start stalking him at https://about.me/jed_reynolds.
(Last Updated On: February 24, 2017)

Gnarly Backup School Series


We have washed up on the shores of find…all these files drifting around us in the surf. Remember: In the last article I asked you to try finding all the files in your home directory more recent than your last backup. Here, we’ll pull the tab on that can and take a sip:

 $ find ~ -type f -newer /var/backup/home-jed-2014-12-11.tgz | wc -l
3067

Oh man…3067 files! That’s way more than I’ve worked on recently, right? Something smells fishy…why are we getting so many files? Let’s snorkel around with the find program for a bit.
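Before we dive in, here's -newer in miniature, using a throwaway sandbox (the paths are made up for the demo). Only a file touched after the reference file gets printed:

```shell
# Sandbox demo of -newer (paths are made up): only files with a
# modification time later than the reference file's are printed
mkdir -p /tmp/newer-demo
touch /tmp/newer-demo/backup.tgz
sleep 1
touch /tmp/newer-demo/fresh.txt
find /tmp/newer-demo -type f -newer /tmp/newer-demo/backup.tgz
# prints /tmp/newer-demo/fresh.txt
```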

Follow along here: we are going to be doing two finds. The first prints out the directory of each file and saves the sorted list. The second pipes the same output through uniq as well, which gives us a more useful list with every entry just once. Then we loop through that unique list and get a count of each occurrence in the long file:


 $ find -type f -mtime -2 -printf "%h\n" \
| sort \
> /tmp/s-s.txt

The %h\n format string prints each file's parent directory (not the file itself) with a newline at the end. Any time you see
printf, no matter the language, you have to add your own newlines. (The \ line continuations stand out. That's not how I type at the terminal, but it is how I write shell scripts. It's quite legible, too.)
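To see what %h gives you versus the full path (%p), here's a tiny made-up sandbox:

```shell
# %p prints the whole path as found; %h prints just the parent directory
mkdir -p /tmp/fmt-demo/sub
touch /tmp/fmt-demo/sub/note.txt
find /tmp/fmt-demo -name note.txt -printf "%p\n"   # /tmp/fmt-demo/sub/note.txt
find /tmp/fmt-demo -name note.txt -printf "%h\n"   # /tmp/fmt-demo/sub
```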

That find command gives us the list of directories found. Next, we do the same thing again but distill it with uniq so our loop won't churn through duplicates. Any time you want to use
uniq, make sure you give it sorted input…it's a stupidly simple program that only collapses adjacent duplicate lines.
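Here's that adjacency rule in miniature: unsorted duplicates slip right past uniq.

```shell
# uniq only removes *adjacent* duplicates, so unsorted input slips through
printf "b\na\nb\n" | uniq | wc -l         # 3 -- the two b's aren't adjacent
printf "b\na\nb\n" | sort | uniq | wc -l  # 2
printf "b\na\nb\n" | sort -u | wc -l      # 2 -- sort -u does both in one pass
```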


 $ find -type f -mtime -2 -printf "%h\n" \
| sort \
| uniq \
> /tmp/s-u.txt

The difference between the s-s.txt file and the s-u.txt file is that the former has as many entries as files. We’re going to make a totally radical histogram of where those files show up.

 $ while read d ; do 
N=`grep -c $d /tmp/s-s.txt`
echo  "$N $d"
done \
< /tmp/s-u.txt \
| sort -n \
| tail 
6 ./.mozilla/firefox/grestf1r.default/weave/logs
8 ./.mozilla/firefox/grestf1r.default/datareporting
9 ./Documents
12 ./.mozilla/firefox/grestf1r.default/weave
66 ./.mozilla/firefox/grestf1r.default
2963 ./.cache/mozilla/firefox/grestf1r.default/cache2/entries
2964 ./.cache/mozilla/firefox/grestf1r.default/cache2
2968 ./.cache/mozilla/firefox/grestf1r.default
2970 ./.cache
3073 .

We’re awash in browser cache files! And we typically don’t want to waste disk or processor time
backing those up. (Unless company policy dictates.)


But is it accurate?


Woah…dude…great point! Our grep of the s-s.txt file above,

 N=`grep -c $d /tmp/s-s.txt`

is going to count every line that merely contains the pattern. This means ./.cache matches almost every entry in the file, when by itself that directory
probably holds very few files. A good system surfer can avoid this spray by using some regular expression goodness to anchor the match to the whole line:

N=`grep -c "^$d\$" /tmp/s-s.txt`

Awesome difference! We got something almost totally different:

5 ./.mozilla/firefox/grestf1r.default/weave/changes
6 ./.lastpass
7 ./.mozilla/firefox/grestf1r.default/datareporting/archived/2015-11
7 ./.mozilla/firefox/grestf1r.default/saved-telemetry-pings
11 .
12 ./.thunderbird/qfp0l4ts.default/ImapMail/imap.googlemail.com
20 ./.thunderbird/qfp0l4ts.default/ImapMail/mail.candelatech.com
21 ./.mozilla/firefox/grestf1r.default
22 ./.thunderbird/qfp0l4ts.default
2963 ./.cache/mozilla/firefox/grestf1r.default/cache2/entries

So…dudes: really smart system surfers need that shell trick above. It's useful elsewhere, too.
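Here's the spray in miniature, with a made-up three-line file: the unanchored pattern counts every line that merely contains the string, while the anchored version counts only the exact line.

```shell
# Three made-up directory entries, each containing ".cache" as a substring
printf "./.cache\n./.cache/mozilla\n./.cache/mozilla/firefox\n" > /tmp/dirs.txt
grep -c "./.cache" /tmp/dirs.txt       # 3 -- substring match hits every line
grep -c "^\./\.cache$" /tmp/dirs.txt   # 1 -- anchored, only the exact line
```

For what it's worth, sort | uniq -c would hand you the whole histogram in one pipeline with no grep at all, but the loop version generalizes to fancier counting.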

A Systems Guru point of view…

If I could mount the .cache directory on a different device, I probably would. It would make backup scripts easier to write…and there'd be a lot less of this browser kelp, right? You can do the same thing with software build systems:
often a project has a src/ directory and a build/ directory. Splitting them off moves the…like…totally ephemeral build output where backups can ignore it. Best of all, things like build/ directories can be backed by RAM drives (using the tmpfs driver).
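As a sketch, a tmpfs mount for a build directory looks like this. The path and size here are made-up example values, and remember that tmpfs contents evaporate on unmount or reboot, which is exactly what you want for build output:

```shell
# One-off mount (as root); path and size are example values
mount -t tmpfs -o size=2G tmpfs /home/jed/project/build

# Or make it permanent with an /etc/fstab line:
# tmpfs  /home/jed/project/build  tmpfs  size=2G,mode=0755  0  0
```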

Righteous!

Other uses

This totally killer histogram technique is useful when parsing logs, too. Consider these codes:
404 is your typical file not found condition, 500 is your server error condition, and 200 is your normal success condition. (There are more, but this is enough for now.) Let’s grep through our /var/log/apache2/access_log file for how our web server is doing:

 $ echo -e "200\n404\n500" > codes.txt
$ while read code ; do
N=`fgrep -c " $code " access_log`
echo "$code $N"
done < codes.txt | sort -n
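A related trick, for what it's worth: in Apache's common/combined log format the status code is the ninth whitespace-separated field, so awk plus the sort | uniq -c idiom gives the full histogram without a codes file at all. The sample log lines below are made up:

```shell
# Made-up access_log sample in common log format
cat > /tmp/access_log <<'EOF'
1.2.3.4 - - [25/Nov/2015:10:00:00 +0000] "GET / HTTP/1.1" 200 512
1.2.3.4 - - [25/Nov/2015:10:00:01 +0000] "GET /x HTTP/1.1" 404 128
1.2.3.4 - - [25/Nov/2015:10:00:02 +0000] "GET / HTTP/1.1" 200 512
EOF

# $9 is the status code; uniq -c counts each one
awk '{print $9}' /tmp/access_log | sort | uniq -c | sort -n
```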

I’m not going to go deeper into that example because it’s straying off topic from find. (It’s cool to contact me to suggest a topic, right?)

But, Jed, dude? How do we avoid backing up those cache files? That’s for next episode. Go grab your boards and race down the beach, I’ll see you next week.


