Puzzles with coreutils (Part 1)

Reading through an article, What every computer science major should know, I came across a couple of interesting coreutils brain teasers under “The Unix philosophy” to play with of an evening. They turned out to be a great way to learn about some more tools, and to brush up on some of those I already knew.

Find the five folders in a given directory consuming the most space.

This one was pretty simple, though I still learned a new trick! First, I count the amount of space each directory is using, sort from largest to smallest, strip the first line (which will be .), keep the top five, then cut out the directory sizes.

du | sort -nr | tail -n +2 | head -n 5 | cut -f2-

I couldn’t find a good way to display the human readable size without counting twice, though this serverfault answer suggests that it might be possible with GNU coreutils >= 7.5.
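For anyone on GNU coreutils >= 7.5, I believe the trick is sort's -h flag, which understands human-readable sizes. Something like this should work, though I can't test it with the BSD tools I have:

```shell
# GNU coreutils >= 7.5 only: sort -h sorts human-readable sizes like 2.5K or 1G
du -h | sort -hr | tail -n +2 | head -n 5
```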

But, I did learn something new! I wasn’t aware that you could pass -n +x to tail to have it start from the xth line.
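For example:

```shell
# start printing at the 3rd line, instead of printing the last n lines
seq 5 | tail -n +3
# prints 3, 4, 5
```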

Report duplicate MP3s (by file contents, not file name) on a computer.

This one was quite a bit more complicated! At a high level, it seemed like what I wanted here was to hash each file, then compare hashes across all files to find duplicates.

Hashing each file is trivial, as is figuring out which of the hashes are duplicates, but mapping them back to files is a little difficult (given that I’d also decided I was only going to use coreutils for this!).

The solution I ended up with is far from ideal - I end up hashing all the files twice. My original solution also contained a second call to xargs sh to get a subshell, till Theo showed me this cool bash-only subshell trick!

The core idea is that we find hashes for each file, cut out all but those that are repeated at least once, then search for files with any of those hashes.

find . -name '*.mp3' | xargs shasum | \
grep -f <(find . -name '*.mp3' | xargs shasum | cut -f1 -d' ' | sort | uniq -d) | \
cut -d' ' -f3-
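As an aside: with GNU uniq (not what ships on a Mac), I believe the double hashing could be avoided, since its -w flag compares only the first N characters of each line and -D prints every repeated line. A SHA-1 digest is 40 hex characters, so something like this should do it in one pass:

```shell
# GNU uniq only: sort the hash output, then keep every line whose first
# 40 characters (the SHA-1 digest) match another line's
find . -name '*.mp3' | xargs shasum | sort | uniq -w 40 -D | cut -d' ' -f3-
```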

I learned a bunch from this task!

  • I finally bothered to look up why find has such strange syntax, and as a consequence never works as I expect. Turns out, the -name part is a pattern, which has to come after the folder you’re searching in.
  • I discovered that find supports -exec as an option, and that it’s terrifying and should only be used with lots of care.
  • While looking these up, I discovered that on macOS a better command to find your IP address than just running ifconfig is ipconfig getifaddr en0. Relatedly, ipconfig is a thing on macOS, not just Windows!
  • xargs is much less intimidating than I always thought it was, it turns out it’s really easy to use!
  • grep can take patterns from a file! I’d seen this before, but it totally slipped my mind till Theo brought it up.
  • tr -d '\n' is the best way to get rid of newlines from stdin (from an earlier solution)
  • <(...) is a magic bash trick to create a pseudo-file, that will act as a file object, without actually touching disk (via the magic of /dev/fd/)
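The classic use of that last trick is diffing the output of two commands without any temporary files:

```shell
# each <(...) shows up as a /dev/fd/N path that diff opens like a regular file
diff <(printf 'b\na\nc\n' | sort) <(printf 'a\nb\nc\n') && echo identical
# prints "identical"
```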

Take a list of names whose first and last names have been lower-cased, and properly recapitalize them.

I did some research into this, and as far as I can tell it’s only possible with the GNU version of sed, which supports \u (upper-case the next character) in replacements. Unfortunately, I don’t have that, so I resorted to cheating:

python -c 'import sys; sys.stdout.write(sys.stdin.read().title())'
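For anyone who does have GNU sed, I believe the non-cheating version looks something like this (untested on my machine):

```shell
# GNU sed only: \< anchors at the start of a word, \u upper-cases the
# next character in the replacement, and & stands for the whole match
sed 's/\<./\u&/g'
# e.g. echo 'ada lovelace' | sed 's/\<./\u&/g'  ->  Ada Lovelace
```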

Find all words in English that have x as their second letter, and n as their second-to-last.

This one was also very familiar to me - I often use my computer to cheat at Scrabble.

egrep '^.[xX].*[nN].$' /usr/share/dict/words

This does break for words shorter than four letters (where the second-to-last letter doesn’t come after the second), but the only such word the constraints allow is the two-letter “nx” (and that isn’t a word), so I don’t think it’s a huge deal.
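A quick sanity check of the pattern without the dictionary file (grep -E is the modern spelling of egrep):

```shell
# "expansions" has x second and n second-to-last; the other two don't qualify
printf 'expansions\nexample\noxen\n' | grep -E '^.[xX].*[nN].$'
# prints "expansions"
```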