Sunday, March 3, 2024

Rant: Ubuntu RAID checking

In addition to a 40-year career as a software engineer specializing in operating systems, I've also managed corporate and personal computer systems. It should be no surprise, then, that my primary home computer system has RAID 1 arrays for several critical file systems.

Some years ago, while far from home, I wanted to demonstrate the utility of RAID arrays to a client. As I explained the up-time benefits in case of a failure, I logged into this system remotely and displayed the array status. To my surprise I found that one of the drives in the array had failed! The system continued to run, of course, because of the redundancy. This perfectly illustrated my point to my client.
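
For the curious: on Linux software RAID, a degraded array shows up in /proc/mdstat as an underscore in the member map, for example [U_] instead of [UU]. Here is a minimal sketch of a check you could run over a remote shell. The function name degraded_arrays is made up for illustration, and it assumes the usual md layout of /proc/mdstat.

```shell
# Print the name of every md array whose member map contains a missing
# device, e.g. [U_] or [_U]. Reads /proc/mdstat unless a file is given.
degraded_arrays() {
    awk '
        # Array header lines look like: "md0 : active raid1 sdb1[1] sda1[0]"
        $1 ~ /^md/ && $2 == ":" { name = $1 }
        # The status line carries a map such as [UU] (healthy) or [U_] (degraded)
        /\[[U_]*_[U_]*\]/ { print name }
    ' "${1:-/proc/mdstat}"
}
```

Calling degraded_arrays with no arguments reads the live status; no output means every array has all of its members.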

RAID arrays need periodic checking to identify any errors that may have developed. For years this was kicked off by a cron job at 1 am local time on the first Sunday of each month. However, apparently with the adoption of systemd a few years ago, the start time changed. It now starts at a random time in the 24-hour period after 1 am.

Today it randomly started at 9:33 am, which was shortly after I sat down at my computer. It's now 12:30 pm, and these checks will run for another two hours. While this is running, applications randomly freeze as they compete for access to the file systems.
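
While a check is running, /proc/mdstat includes a progress line with the percentage done and an estimated time to finish. The sketch below pulls out just those two numbers per array. The function name check_progress is hypothetical, and it assumes the progress line has the usual "check = NN.N% (...) finish=...min speed=..." shape.

```shell
# Print "array percent-done time-remaining" for each md array that is
# currently being checked. Reads /proc/mdstat unless a file is given.
check_progress() {
    awk '
        # Remember which array the following lines belong to
        $1 ~ /^md/ && $2 == ":" { name = $1 }
        # Progress line: "[==>....]  check = 12.6% (...) finish=120.5min speed=..."
        $2 == "check" && $3 == "=" {
            eta = "unknown"
            for (i = 4; i <= NF; i++)
                if ($i ~ /^finish=/) eta = substr($i, 8)
            print name, $4, eta
        }
    ' "${1:-/proc/mdstat}"
}
```

Wrapping it in something like watch would give a running view of how much longer the slowdown is going to last.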

What brain-dead idiot thought that starting this at a random time on a Sunday was a good idea?

I found a discussion on StackExchange that describes how to control this. In summary, use this command to edit the start configuration:

sudo systemctl edit --full mdcheck_start.timer

I changed the check so it fires at 2 am on the first Sunday of the month, and reduced the randomness from 24 hours to 10 minutes:

[Timer]
#OnCalendar=Sun *-*-1..7 1:00:00
#RandomizedDelaySec=24h
OnCalendar=Sun *-*-1..7 2:00:00
RandomizedDelaySec=10m
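
The "Sun *-*-1..7" part works because the first Sunday of any month always lands on day 1 through 7, so the expression only matches when both the weekday and the day-of-month range agree. Here is a small sketch of the same test in shell, purely to illustrate the logic. The function name is_first_sunday is made up, and it assumes GNU date (the -d option and the %-d format), which is what Ubuntu ships.

```shell
# Succeed (exit 0) if the given date is the first Sunday of its month,
# mirroring the weekday-plus-day-range logic of "Sun *-*-1..7".
is_first_sunday() {
    local dow dom
    dow=$(date -d "$1" +%u) || return 2    # 1 = Monday ... 7 = Sunday
    dom=$(date -d "$1" +%-d) || return 2   # day of month, no leading zero
    [ "$dow" -eq 7 ] && [ "$dom" -le 7 ]
}
```

For example, is_first_sunday 2024-03-03 succeeds, while 2024-03-10 (a Sunday, but the second one) does not. On a systemd machine, systemd-analyze calendar 'Sun *-*-1..7 2:00:00' should print the next time the real expression elapses, which is a quicker way to check an edited timer.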

If the check runs for more than six hours, it stops and is continued later. The default is to restart the next day between midnight and noon. Again, this is stupid. Can you imagine starting work on a Monday morning, only for your computer to become nearly useless later that morning?

To change this behavior, use this command:

sudo systemctl edit --full mdcheck_continue.timer

Experience shows my system's checks shouldn't take six hours. But to be thorough, I've changed this to restart the next day at 2 am:

[Timer]
#OnCalendar=daily
#RandomizedDelaySec=12h
OnCalendar=*-*-* 02:00:00
RandomizedDelaySec=10m
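
An alternative worth sketching, which is not what I did above: running systemctl edit without --full creates a drop-in override instead of a full copy of the unit, so only the changed settings live in /etc. The fragment below assumes the drop-in lands in the usual override.conf location. Because OnCalendar= is a list, the drop-in has to clear it with an empty assignment before setting the new value.

```ini
# /etc/systemd/system/mdcheck_continue.timer.d/override.conf
[Timer]
# An empty assignment clears the packaged OnCalendar= entries first
OnCalendar=
OnCalendar=*-*-* 02:00:00
RandomizedDelaySec=10m
```

The same pattern should apply to mdcheck_start.timer. Either way, systemctl list-timers shows when each timer is next due to fire.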

Why have a random start time at all? The idea is that it's bad to have a bunch of activities all fire off at the same time. The randomness spreads out their start times a bit.

To forestall the inevitable bleating of "RAID is not backup!!1!", let me assure you I'm well aware of this. When I started my consulting practice 20 years ago, my first client was a company that manufactured RAID controllers. 'Nuf said.

My use of RAID on this system helps protect against disk failures causing the system to crash and become unusable for an extended period. When (not if) a disk fails, I have the opportunity to replace it without major disruption of my life. Backups, on the other hand, go to a network-attached storage system with a 30TB RAID 5 array, and that gets backed up nightly to off-site storage.
