Nuking boot-sectors and LVs

Why you don’t sysadmin when you’re biologically on the fritz.

October 27th, 2017

About two weeks ago I needed to do a clean installation of a particular system (decided I’d guinea-pigged it enough, and wanted to configuration-management it before deployment).

Same ol’ fish out the USB thumb-drive, plug it in, and dd the ISO onto the drive. Done! Plug USB into the target system, and boot…

What in Sam Hill? This ain’t the ISO I wrote!

Wait. What was that dd again?

# dd if=beware_dragons.iso \
> of=/dev/sda \
> bs=1M \
> status=progress

Well whoop-de-doo. I just nuked my hard-drive.

What’s the damage?

Let’s keep the system on to back up all the important data onto a temporary system, while I assess the situation.

The ISO I wrote was about 600MiB. There goes:

  1. the boot-sector
  2. the partition-table
  3. the 500MiB boot partition
  4. 100MiB of the Linux LVM partition

The first three are easy/trivial to replace, and don’t really cost much (besides time).

But number four there is a problem; no doubt I’ve lost meta-data describing the layout of my logical-volumes and some actual data on the volumes themselves (hopefully just root and var).

Let’s see…

# lvs
#
# vgs
# 
# pvs
#

Yep. No meta-data right down to the PVs.

100MiB might have just overwritten some data on the root LV and left all the other LVs alone. But now that the meta-data is gone, everything on them is almost as good as gone.

Good thing I didn’t shut this machine down. There might be a way to get these LVs back. The partitions I can restore with fdisk (and lucky me has a simple partitioning scheme).

There should be something for LVs, right?

Unexpected tools and backups

And there is! Enter vgcfgrestore, which can restore a VG along with its LVs from a VG description file. LVM keeps copies of these description files in the /etc/lvm/backup and /etc/lvm/archive directories.

Just one extra step: the man-page says that if you’ve lost access to the PV meta-data, you can just recreate it with the correct UUID and the description file. Doesn’t touch any data: it just writes a fresh PV label and header at the start of the PV.
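The UUID in question lives in the description file itself, so there’s no need to remember it. Here’s a rough sketch for fishing it out, assuming the usual text layout of those files (a physical_volumes section holding pv0, pv1, … blocks, each with an id and a device hint). The function name pv_ids is made up for this example.

```shell
#!/bin/bash

# Print "<pv-name> <uuid> <device-hint>" for every PV described in an
# LVM description file. Assumes the plain-text layout described above.
pv_ids() {
  awk '
    $1 == "physical_volumes" { in_pvs = 1; depth = 0; next }
    !in_pvs { next }
    /[{]/ {                       # entering a pvN block
      depth++
      if (depth == 1) { name = $1; id = ""; dev = "" }
      next
    }
    /[}]/ {                       # leaving a block, or the whole section
      if (depth == 0) { in_pvs = 0; next }
      if (depth == 1) print name, id, dev
      depth--
      next
    }
    depth == 1 && $1 == "id"     { gsub(/"/, "", $3); id = $3 }
    depth == 1 && $1 == "device" { gsub(/"/, "", $3); dev = $3 }
  ' "$1"
}
```

Running pv_ids /etc/lvm/backup/ural should list one line per PV. The device is only a hint recorded by LVM, so double-check it against your actual partitions before handing anything to pvcreate.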

Applies to us, so here we go:

# pvcreate --uuid "bcec4c02-3d82-4d15-ba89-ca5ab28678ae" \
> --restorefile /etc/lvm/backup/ural /dev/sda2
# pvs
  PV         VG   Fmt  Attr PSize    PFree
  /dev/sda2  ural lvm2 a--  <931.02g <465.02g

Okay. PV shows up fine. Now for the restore:

# vgcfgrestore -f /etc/lvm/backup/ural ural
# lvs
  LV           VG   Attr       LSize  …
  root         ural -wi-ao----  15.00g
  swap         ural -wi-ao----  10.00g
  var          ural -wi-ao----  10.00g
  data         ural -wi-ao----  50.00g
  srv          ural -wi-ao----  10.00g
  home         ural -wi-ao---- 450.00g

Well, that went good. Let’s check to see which file-systems went to kaput-land.

# for f in root var data srv home
> do
>   e2fsck -f -n /dev/ural/${f} &> /dev/null \
>     && echo "${f}: OKAY" \
>     || echo "${f}: FAIL"
> done

root: FAIL
var: FAIL
data: OKAY
srv: OKAY
home: OKAY

Well, that went good too. Re-installation ain’t as bad as losing data.

You learn something new every day

I can count my lucky stars that the ISO was only large enough to do minimal damage. Even so, there are things we can take from this.

Drumroll.

Back up your disk layouts. If I had gotten into a panic and immediately shut my laptop down, I’d have lost all that data (assuming a low level scan wouldn’t work).

The partitioning scheme was simple, but if it wasn’t I probably would not have been able to do this so easily.

With LVM, I probably wouldn’t have been able to restore the logical-volumes if there hadn’t been a backup, or if the dd had managed to go forward into /etc territory.

Suffice it to say, the boot-sector and the LVM configuration have since been backed up.
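For the curious, a backup like that can be sketched roughly as below. It leans on dd, sfdisk --dump and vgcfgbackup; the function names and file names are made up for this example, and the destination should of course live on a different disk or machine.

```shell
#!/bin/bash

# Copy the first 512-byte sector (boot code plus the MBR partition table).
# usage: save_boot_sector <disk> <outfile>
save_boot_sector() {
  dd if="$1" of="$2" bs=512 count=1 status=none
}

# Save the boot sector, a replayable partition-table dump, and a fresh
# VG description file into <destdir>. Needs root for real devices.
# usage: save_layout <disk> <vg> <destdir>
save_layout() {
  local disk="$1" vg="$2" dest="$3"
  mkdir -p "$dest" || return 1
  save_boot_sector "$disk" "$dest/boot-sector.bin" &&
    sfdisk --dump "$disk" > "$dest/partition-table.sfdisk" &&
    vgcfgbackup -f "$dest/$vg.vg" "$vg"
}
```

On this system that would be something like save_layout /dev/sda ural /path/on/another/disk. The sfdisk dump can later be fed back to sfdisk to recreate the partitions, and the .vg file plays the same role as the copies under /etc/lvm/backup.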

Don’t sysadmin if you ain’t feeling too good. I’d been down with a bad case of the flu a week before this and had just gotten better. Aside from some weakness, I thought I was fine.

Nope. Bad idea. Weakness can do wonders to cognition. If you have to though, you should…

Actively practice that “hands off keyboard” wisdom, namely:

If you’re going to execute something that is dangerous or could harm your system, type it out and stop.

Take your hands off the keyboard and read the command again.

If it does what you want it to, hit enter.

Heavily paraphrased since I don’t remember the exact quote. Regardless, it’s a good safeguard against the instinct of hitting the enter button immediately after typing out a command.

Now you might read the command anyway. But taking your hands off the keyboard means that you won’t be hitting the enter button reflexively in that split second where you realize it’s wrong.

Captain obvious: dd is dangerous. Heck, so is every useful program (at least in this domain). But stuff like this happens, and we need a safe-guard: a wrapper script.

Here’s one on the fly:

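#!/bin/bash

# isolate the of= argument by walking the arguments one at a time,
# so that if= paths and a missing of= can't confuse the check
OF_ARG=""
for arg in "$@"; do
  case "$arg" in
    of=*) OF_ARG="${arg#of=}" ;;
  esac
done

# see if OF_ARG is a drive we shouldn't touch
# (messages go to stderr so they never end up in dd's own output)
case "$OF_ARG" in
  /dev/sda*|/dev/ural*|/dev/mapper/ural*)
    echo "No siree. Ain't touching that." >&2
    exit 1
    ;;
  *)
    echo "Okay. Going ahead." >&2
    ;;
esac

#exec /bin/dd "$@"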

Go ahead and save that as /usr/local/bin/dd and try running dd with an output file we shouldn’t touch.

# dd if=/dev/urandom of=/dev/sda bs=1M status=progress
No siree. Ain't touching that.

There! Now all you have to do is uncomment the last line and save it. Sure, we could do better and look to see if the target is mounted or not, but for now this is enough.
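As for the mounted check, here’s a rough sketch of what it could look like, reading /proc/mounts. The function name target_in_use is made up, and the optional second argument exists only so the check can be tried against a sample file.

```shell
#!/bin/bash

# Succeed if <target>, or any device whose name starts with <target>
# (e.g. /dev/sda1 for /dev/sda), shows up as a mount source.
# usage: target_in_use <target> [mounts-file]
target_in_use() {
  local target="$1" mounts="${2:-/proc/mounts}"
  local src rest
  while read -r src rest; do
    [[ "$src" == "$target"* ]] && return 0
  done < "$mounts"
  return 1
}
```

In the wrapper this would sit next to the case statement, something like: target_in_use "$OF_ARG" && exit 1. It’s far from airtight. LVs are listed under their /dev/mapper names rather than the disk they live on, and swap isn’t listed in /proc/mounts at all, so the hard-coded list above still earns its keep. Something like lsblk -n -o MOUNTPOINT on the target might cover more ground, since it also lists whatever sits on top of the device.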

Well, that was fun. Just goes to show how screwing up can help you learn.