Dinosaurs are Forever: howto

Debugging hung machines can be a bit tricky. Here I'll document methods to trigger a crashdump when these hangs occur.

What exactly does it mean when a machine 'hangs' or 'freezes up'? More information can be found in the kernel documentation [1], but overall there are a few types of hangs. A "Soft Lock-Up" is when the kernel loops in kernel mode for longer than a threshold without giving other tasks a chance to run. A "Hard Lock-Up" is when the kernel loops in kernel mode for longer than a threshold without letting other interrupts run. In addition, a "Hung Task" is a task that has been stuck in uninterruptible sleep for longer than a timeout. Thankfully the kernel has options to panic on these conditions and thus create a proper crashdump.

To set up crashdump on an Ubuntu machine we can do the following. First we need to install and configure crashdump; more info can be found here [2].

sudo apt-get install linux-crashdump

During installation you'll be asked whether kexec-tools should handle reboots. Select NO unless you really would like to use kexec for your reboots.

Next we need to enable it since by default it is disabled.

sudo sed -i 's/USE_KDUMP=0/USE_KDUMP=1/' /etc/default/kdump-tools

Reboot to ensure the kernel cmdline options are set up properly.

sudo reboot

After reboot run the following:

sudo kdump-config show

If this command shows 'ready to dump', then we can test a crash to ensure kdump has enough memory and will dump properly. This command will crash your computer, so hopefully you are doing this on a test machine.

echo c | sudo tee /proc/sysrq-trigger

The machine will reboot and you'll see a crash in /var/crash.

All of this is already documented in [2]. Now we need to make the kernel panic on hang and lockup conditions, so we'll enable many cases at once.

Edit /etc/default/grub and change the GRUB_CMDLINE_LINUX line to the following:

GRUB_CMDLINE_LINUX="nmi_watchdog=panic hung_task_panic=1 softlockup_panic=1 unknown_nmi_panic"

In addition, you could enable these at runtime via /proc/sys/kernel or sysctl. For more information about these parameters, see the documentation here [3].
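As a rough sketch of the runtime route, the snippet below prints the current value of each panic-related knob under /proc/sys/kernel. The function name show_panic_sysctls and its optional directory argument are my own invention for illustration, and hardlockup_panic is, as far as I know, the runtime counterpart of nmi_watchdog=panic; not every kernel build exposes all four files.

```shell
#!/bin/sh
# Print the current value of each panic-related sysctl, or note that
# this kernel does not expose it. The directory argument is optional
# (hypothetical convenience) and defaults to /proc/sys/kernel.
show_panic_sysctls() {
    dir="${1:-/proc/sys/kernel}"
    for name in hung_task_panic softlockup_panic hardlockup_panic unknown_nmi_panic; do
        if [ -r "$dir/$name" ]; then
            echo "$name=$(cat "$dir/$name")"
        else
            echo "$name: not available"
        fi
    done
}

show_panic_sysctls

# To turn one on until the next reboot (needs root), for example:
#   sudo sysctl -w kernel.hung_task_panic=1
```

Values set this way are lost on reboot, which is why I went with the grub command line above.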

If you've used the command line change, update grub and then reboot.

sudo update-grub && sudo reboot
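Once the machine is back up, it's worth checking that the parameters actually made it onto the kernel command line. Here is a small sketch for that; check_cmdline is a hypothetical helper name, and the parameter list is just the one from the grub line above.

```shell
#!/bin/sh
# Report which of the panic-related parameters appear on a kernel
# command line. check_cmdline is a hypothetical helper; pass the
# command line as a single string.
check_cmdline() {
    cmdline="$1"
    for param in nmi_watchdog=panic hung_task_panic=1 softlockup_panic=1 unknown_nmi_panic; do
        # Pad with spaces so only whole parameters match.
        case " $cmdline " in
            *" $param "*) echo "present: $param" ;;
            *)            echo "MISSING: $param" ;;
        esac
    done
}

# On the rebooted machine, check the running kernel's command line.
if [ -r /proc/cmdline ]; then
    check_cmdline "$(cat /proc/cmdline)"
fi
```

If anything shows up as MISSING, re-check /etc/default/grub and run update-grub again.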

Now your machine should crash when it locks up, and you'll get a nice crashdump to analyze. If you want to test such a setup I wrote a module [4] that induces a hang to see if this works properly.

Happy hacking.

I purchased a cheap two-drive USB enclosure in order to set up an external drive with RAID-1 so I could back up photos and recordings.

First, I formatted both drives. Then I ran extended smart self-tests to ensure I had decent drives. With RAID-1 and two drives I can only tolerate 1 drive failure.
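For the self-tests, something along these lines should work, assuming smartctl from the smartmontools package is installed. The function name start_long_tests and the DRY_RUN switch are hypothetical, and /dev/sdx and /dev/sdy are placeholders for your own drives. Some USB enclosures may also need an extra option such as '-d sat' before smartctl can talk to the drive.

```shell
#!/bin/sh
# Start an extended SMART self-test on each given drive.
# Set DRY_RUN=1 to print the commands instead of running them.
start_long_tests() {
    for dev in "$@"; do
        if [ -n "$DRY_RUN" ]; then
            echo "smartctl -t long $dev"
        else
            sudo smartctl -t long "$dev"
        fi
    done
}

# Example (placeholder device names, substitute your own):
#   start_long_tests /dev/sdx /dev/sdy
# Once the tests finish, read the self-test log for each drive:
#   sudo smartctl -l selftest /dev/sdx
```

The extended test runs in the background on the drive itself and can take hours on large disks.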

Next ensure mdadm is installed.

sudo apt-get install mdadm

Determine which /dev nodes the disks show up as (for example with lsblk). Next, create the RAID device, substituting the correct device names for /dev/sdx and /dev/sdy, and put a filesystem on it.

sudo mdadm --create /dev/md0 --level=1 --raid-devices=2 /dev/sdx /dev/sdy
sudo mkfs.ext4 /dev/md0

The device will start a resync process, which on my system takes a really long time (days). If you want to skip this initial resync you can pass '--assume-clean' to mdadm --create, but I would recommend letting it resync.
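To keep an eye on the resync, the kernel reports progress in /proc/mdstat. A minimal sketch for pulling out just the progress line is below; resync_progress is a hypothetical helper name, and the optional file argument is only there so it can be pointed at a saved copy.

```shell
#!/bin/sh
# Print resync/recovery progress lines from an mdstat file
# (default: /proc/mdstat), or a short note if there are none.
resync_progress() {
    file="${1:-/proc/mdstat}"
    grep -E 'resync|recovery' "$file" || echo "no resync in progress"
}

# /proc/mdstat only exists once the md driver is loaded.
if [ -r /proc/mdstat ]; then
    resync_progress
fi

# For a fuller picture of the array (needs root):
#   sudo mdadm --detail /dev/md0
```

The array is usable while it resyncs, just slower.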

And there ya go.


Friday, October 31, 2014

getting kernel crashdumps for hung machines

Monday, December 16, 2013

setup an external drives as raid1
