Tuesday, September 19, 2017

Potential problem starting the NameNode after kerberizing HDP 2.6

Hi all,

I am not sure whether you will run into this problem, but I have hit it twice when enabling Kerberos on HDP 2.6. Here is what you will find in the log file.

2017-09-19 02:56:17,375 - Execute['ambari-sudo.sh su hdfs -l -s /bin/bash -c 'ulimit -c unlimited ; /usr/hdp/current/hadoop-client/sbin/hadoop-daemon.sh --config /usr/hdp/current/hadoop-client/conf start namenode''] {'environment': {'HADOOP_LIBEXEC_DIR': '/usr/hdp/current/hadoop-client/libexec'}, 'not_if': 'ambari-sudo.sh -H -E test -f /var/run/hadoop/hdfs/hadoop-hdfs-namenode.pid && ambari-sudo.sh -H -E pgrep -F /var/run/hadoop/hdfs/hadoop-hdfs-namenode.pid'}
2017-09-19 02:56:21,432 - Execute['/usr/bin/kinit -kt /etc/security/keytabs/hdfs.headless.keytab hdfs-hiuy@EXAMPLE.COM'] {'user': 'hdfs'}
2017-09-19 02:56:25,454 - Waiting for this NameNode to leave Safemode due to the following conditions: HA: False, isActive: True, upgradeType: None
2017-09-19 02:56:25,454 - Waiting up to 19 minutes for the NameNode to leave Safemode...
2017-09-19 02:56:25,454 - Execute['/usr/hdp/current/hadoop-hdfs-namenode/bin/hdfs dfsadmin -fs hdfs://ip-172-31-9-254.ap-southeast-1.compute.internal:8020 -safemode get | grep 'Safe mode is OFF''] {'logoutput': True, 'tries': 115, 'user': 'hdfs', 'try_sleep': 10}
safemode: Call From ip-172-31-9-254.ap-southeast-1.compute.internal/172.31.9.254 to ip-172-31-9-254.ap-southeast-1.compute.internal:8020 failed on connection exception: java.net.ConnectException: Connection refused; For more details see: http://wiki.apache.org/hadoop/ConnectionRefused
2017-09-19 02:56:27,484 - Retrying after 10 seconds. Reason: Execution of '/usr/hdp/current/hadoop-hdfs-namenode/bin/hdfs dfsadmin -fs hdfs://ip-172-31-9-254.ap-southeast-1.compute.internal:8020 -safemode get | grep 'Safe mode is OFF'' returned 1. safemode: Call From ip-172-31-9-254.ap-southeast-1.compute.internal/172.31.9.254 to ip-172-31-9-254.ap-southeast-1.compute.internal:8020 failed on connection exception: java.net.ConnectException: Connection refused; For more details see: http://wiki.apache.org/hadoop/ConnectionRefused

If you then dig into hadoop-hdfs.log, you will find the following.

Caused by: javax.security.auth.login.LoginException: Receive timed out
    at com.sun.security.auth.module.Krb5LoginModule.attemptAuthentication(Krb5LoginModule.java:808)
    at com.sun.security.auth.module.Krb5LoginModule.login(Krb5LoginModule.java:617)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at javax.security.auth.login.LoginContext.invoke(LoginContext.java:755)
    at javax.security.auth.login.LoginContext.access$000(LoginContext.java:195)


Please pay attention to the first line of the stack trace, "LoginException: Receive timed out". That hint tells us the NameNode never got as far as listening on port 8020 (hence the "Connection refused" above) because its Kerberos login timed out while talking to the KDC server. There is a useful link that documents the problem.



The solution to this problem is to add the following line to krb5.conf under the [libdefaults] section. It makes the Kerberos client talk to the KDC over TCP instead of UDP:

udp_preference_limit = 1
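
Since udp_preference_limit = 1 makes the Kerberos client use TCP, it may be worth confirming that the KDC host accepts TCP connections from the NameNode host. Here is a minimal sketch in Python, assuming the KDC listens on the default Kerberos port 88. The function name kdc_accepts_tcp and the host name in the usage note are made up for illustration, not part of any Hadoop or Ambari tool.

```python
import socket


def kdc_accepts_tcp(host: str, port: int = 88, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port opens within `timeout` seconds.

    Port 88 is the default Kerberos port; adjust it if your KDC uses another one.
    This only checks that the port is reachable, not that Kerberos login works.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, unreachable host, DNS failure and timeout.
        return False
```

For example, kdc_accepts_tcp("kdc.example.com") run on the NameNode host (with your own KDC host name) should return True if a firewall or security group is not blocking TCP to the KDC.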

There is a small trick to roll this setting out to every node in the cluster: instead of editing the file on each node by hand, make the change in Ambari under Kerberos > Configs > Advanced krb5-conf.
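
After Ambari has pushed the configuration, you may want to check on a node that the setting really landed in the right section. The following is a rough sketch in Python, not an official tool: the function name udp_preference_limit is my own, and the parser only understands plain "[section]" headers and "key = value" lines, which is enough for this one setting.

```python
import re
from typing import Optional


def udp_preference_limit(krb5_text: str) -> Optional[int]:
    """Return udp_preference_limit from [libdefaults], or None if it is not set there."""
    section = None
    value = None
    for raw in krb5_text.splitlines():
        line = raw.strip()
        # Skip blank lines and comments.
        if not line or line.startswith(("#", ";")):
            continue
        # A line such as "[libdefaults]" starts a new section.
        header = re.fullmatch(r"\[(\w+)\]", line)
        if header:
            section = header.group(1)
            continue
        # Only look at key = value lines inside [libdefaults].
        if section == "libdefaults" and "=" in line:
            key, _, val = line.partition("=")
            if key.strip() == "udp_preference_limit":
                value = int(val.strip())
    return value
```

To use it, read the file on the node (typically /etc/krb5.conf, which is an assumption about your layout) and pass its contents in, for example udp_preference_limit(open("/etc/krb5.conf").read()). A result of 1 means the setting is in place; None means it is missing from [libdefaults].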

Hope that helps. Thanks for reading.