Tuesday, June 24, 2014

Managing bad disks in a Hadoop cluster

Hi all,

I was tasked with massaging some log files and piping the relevant entries into a dedicated custom log. The idea is to detect any disk failures on the Hadoop cluster. For example, suppose we run a Hadoop cluster of 30+ nodes, each with 10 local disks attached. The chance of having worn or bad disks is high once the cluster has been serving the business for a long time. So, instead of proactively scanning the disks on 30+ nodes every day, we can use existing tools to achieve our goal.

These are the existing tools:

1. rsyslog
2. any monitoring tool, such as Nagios, an OVO agent, or a BMC Patrol agent.


I want to document the steps I used to harvest the disk failure messages from the standard system log, /var/log/messages*.

1. Put a conf file like this in /etc/rsyslog.d/.

[root@centos65-1 ~]# cat /etc/rsyslog.d/hadoop.conf
:msg, contains, "offline"    /var/log/hadoop_disk.log


2. Touch /var/log/hadoop_disk.log.

3. Restart the rsyslog daemon.
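
Steps 1 and 2 can be scripted if you have many nodes to set up. Below is a minimal sketch, not a finished tool: the function name setup_disk_log is hypothetical, and the directory and log path are passed as parameters so you can try it against a scratch directory before touching a real host.

```shell
# Hypothetical helper (not part of rsyslog): writes the filter rule
# and creates the target log file.
#   $1 = rsyslog drop-in directory (/etc/rsyslog.d on the real host)
#   $2 = log file to receive the matching messages
setup_disk_log() {
    conf_dir="$1"
    log_file="$2"

    # Step 1: property-based filter, same rule as shown above.
    printf ':msg, contains, "offline"    %s\n' "$log_file" \
        > "$conf_dir/hadoop.conf" || return 1

    # Step 2: make sure the target log file exists.
    touch "$log_file" || return 1
}
```

On the real host, as root, this would be called as `setup_disk_log /etc/rsyslog.d /var/log/hadoop_disk.log`, followed by step 3, `service rsyslog restart`. If you want to check the configuration before restarting, `rsyslogd -N1` should report syntax problems without starting the daemon.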


Voila, you are done!


Now we want to test it. Here are the steps.

1. Create a small disk in the VM, partition it, format it, and mount it.

[root@centos65-1 ~]# df /mnt/test
Filesystem     1K-blocks  Used Available Use% Mounted on
/dev/sda1        2063504 35840   1922844   2% /mnt/test


2. Set the disk state to offline.

[root@centos65-1 ~]# echo "offline" > /sys/block/sda/device/state
[root@centos65-1 ~]#


3. cd into the mount point and try to list it.

[root@centos65-1 ~]# cd /mnt/test
[root@centos65-1 test]# ls
ls: reading directory .: Input/output error


Now you should see the matching kernel messages written to /var/log/hadoop_disk.log.

[root@centos65-1 ~]# tail -f /var/log/hadoop_disk.log
Jun 25 10:15:06 centos65-1 kernel: sd 0:0:0:0: rejecting I/O to offline device
Jun 25 10:15:12 centos65-1 kernel: sd 0:0:0:0: rejecting I/O to offline device
Jun 25 10:15:12 centos65-1 kernel: sd 0:0:0:0: rejecting I/O to offline device
Jun 25 10:15:12 centos65-1 kernel: sd 0:0:0:0: rejecting I/O to offline device
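
Once the test is done, you will probably want the disk back. The same sysfs file used in step 2 also accepts "running", so here is a small sketch that wraps both states. The function name set_disk_state is hypothetical, and the optional third argument (the sysfs root) exists only so the function can be tried against a scratch directory; on a real host it defaults to /sys.

```shell
# Hypothetical helper: set the SCSI device state through sysfs.
#   $1 = block device name, e.g. sda
#   $2 = "offline" or "running"
#   $3 = sysfs root (optional, defaults to /sys)
set_disk_state() {
    dev="$1"
    state="$2"
    sysfs_root="${3:-/sys}"

    # Only pass through the two states used in this post.
    case "$state" in
        offline|running) ;;
        *) echo "unsupported state: $state" >&2; return 1 ;;
    esac

    echo "$state" > "$sysfs_root/block/$dev/device/state"
}
```

As root, `set_disk_state sda running` should bring the test disk back. Depending on what the filesystem saw while the disk was offline, you may still need to unmount and remount /mnt/test afterwards.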


Good stuff! This is what we can achieve. As a next step, we can configure our monitoring tool to watch this log file and eventually generate a ticket to alert the business.
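
To give an idea of what the monitoring side could look like, here is a rough sketch of a Nagios-style check. It is only an assumption about how you might wire it up, not something I have running: the function name check_disk_log is hypothetical, and the return values follow the usual Nagios plugin convention (0 OK, 2 CRITICAL, 3 UNKNOWN).

```shell
# Hypothetical Nagios-style check: report CRITICAL when the custom
# log contains any "offline" message.
#   $1 = log file to inspect
check_disk_log() {
    log_file="$1"

    if [ ! -r "$log_file" ]; then
        echo "UNKNOWN: cannot read $log_file"
        return 3
    fi

    # grep -c prints the number of matching lines; it exits non-zero
    # when there are none, hence the "|| true".
    count=$(grep -c "offline" "$log_file" || true)

    if [ "$count" -gt 0 ]; then
        echo "CRITICAL: $count offline-device messages in $log_file"
        return 2
    fi

    echo "OK: no offline-device messages in $log_file"
    return 0
}
```

A wrapper script would call `check_disk_log /var/log/hadoop_disk.log` and exit with its return value. Note that the check stays CRITICAL until the log is rotated or cleared, so the log needs to be cleaned up once the bad disk has been replaced.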

