Wednesday, August 2, 2017

Parsing Hadoop XML files

Hi all,

I find reading Cloudera's Hadoop XML files one of the most tedious jobs in the world, partly because some of them, e.g. hdfs-site.xml and yarn-site.xml, are so long and repetitive. So I wrote a small Python parser that reads an XML file and prints every property in a name = value pattern, both to stdout and to an output file (.out). With that, my eyes are far more comfortable reading it. I would like to share the simple Python code with you.

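import os
import sys
from xml.etree import ElementTree

# Usage: decode_xml.py <file.xml>
input_file_name = sys.argv[1]
full_file_name = os.path.abspath(input_file_name)
# Swap the extension for .out; splitext copes with dots elsewhere in the path
output_file_name = os.path.splitext(input_file_name)[0] + '.out'

dom = ElementTree.parse(full_file_name)
# Named "properties" so the built-in property is not shadowed
properties = dom.findall('property')

with open(output_file_name, 'w') as write_xml_conf:
    for p in properties:
        # findtext returns the default instead of failing on a missing element
        name = p.findtext('name', default='')
        value = p.findtext('value', default='')
        line = "{} = {}".format(name, value)
        print(line)
        write_xml_conf.write(line + "\n")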

Here is the output of the script. The same lines that are printed on the screen are also written to a file (hdfs-site.out in this case).

[root@ip-172-31-9-77 21-yarn-NODEMANAGER]# decode_xml.py hdfs-site.xml
dfs.namenode.name.dir = file:///dfs/nn
dfs.namenode.servicerpc-address = ip-172-31-14-234.ap-southeast-1.compute.internal:8022
dfs.https.address = ip-172-31-14-234.ap-southeast-1.compute.internal:50470
dfs.https.port = 50470
dfs.namenode.http-address = ip-172-31-14-234.ap-southeast-1.compute.internal:50070
dfs.replication = 3
dfs.blocksize = 134217728
dfs.client.use.datanode.hostname = false
fs.permissions.umask-mode = 022
dfs.namenode.acls.enabled = false
dfs.client.use.legacy.blockreader = false
dfs.client.read.shortcircuit = false
dfs.domain.socket.path = /var/run/hdfs-sockets/dn
dfs.client.read.shortcircuit.skip.checksum = false
dfs.client.domain.socket.data.traffic = false
dfs.datanode.hdfs-blocks-metadata.enabled = true

With this small script, I could explore more of Cloudera's XML files, especially those that reside under /var/run/cloudera-scm-agent/process/:

[root@ip-172-31-9-77 process]# cd /var/run/cloudera-scm-agent/process/
[root@ip-172-31-9-77 process]# find . -type f -name '*.xml'
./38-yarn-NODEMANAGER/yarn-site.xml
./38-yarn-NODEMANAGER/mapred-site.xml
./38-yarn-NODEMANAGER/ssl-server.xml
./38-yarn-NODEMANAGER/hdfs-site.xml
./38-yarn-NODEMANAGER/hadoop-policy.xml
./38-yarn-NODEMANAGER/ssl-client.xml
./38-yarn-NODEMANAGER/core-site.xml
./43-yarn-RESOURCEMANAGER/yarn-site.xml
./43-yarn-RESOURCEMANAGER/ssl-server.xml
./43-yarn-RESOURCEMANAGER/fair-scheduler.xml
./43-yarn-RESOURCEMANAGER/hdfs-site.xml
./43-yarn-RESOURCEMANAGER/core-site.xml
./43-yarn-RESOURCEMANAGER/capacity-scheduler.xml
./43-yarn-RESOURCEMANAGER/hadoop-policy.xml
./43-yarn-RESOURCEMANAGER/mapred-site.xml
./43-yarn-RESOURCEMANAGER/ssl-client.xml
./34-hdfs-DATANODE/hdfs-site.xml
./34-hdfs-DATANODE/ssl-client.xml
./34-hdfs-DATANODE/core-site.xml
./34-hdfs-DATANODE/hadoop-policy.xml
./34-hdfs-DATANODE/ssl-server.xml
./34-hdfs-DATANODE/hdfs-site-refreshable.xml
./40-yarn-JOBHISTORY/yarn-site.xml
./40-yarn-JOBHISTORY/mapred-site.xml
./40-yarn-JOBHISTORY/core-site.xml
./40-yarn-JOBHISTORY/ssl-server.xml
./40-yarn-JOBHISTORY/hdfs-site.xml
./40-yarn-JOBHISTORY/hadoop-policy.xml
./40-yarn-JOBHISTORY/ssl-client.xml
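
Running the script by hand on each of those files gets tedious again, so here is a rough sketch of how the same idea could be applied to a whole directory tree at once. The helper names read_properties and decode_tree are my own inventions for illustration, not part of the script above, and I am assuming every file of interest uses the same property/name/value layout as hdfs-site.xml. Files with a different layout (fair-scheduler.xml probably is one) would simply produce no lines.

```python
import os
from xml.etree import ElementTree


def read_properties(xml_path):
    """Return (name, value) pairs for each <property> directly under the root."""
    root = ElementTree.parse(xml_path).getroot()
    pairs = []
    for prop in root.findall('property'):
        name = prop.findtext('name')
        if name is None:
            continue  # skip entries that have no <name> element
        # A missing or empty <value> becomes an empty string
        pairs.append((name, prop.findtext('value', default='')))
    return pairs


def decode_tree(top_dir):
    """Hypothetical helper: print name = value lines for every .xml file under top_dir."""
    for dir_path, _dir_names, file_names in os.walk(top_dir):
        for file_name in sorted(file_names):
            if not file_name.endswith('.xml'):
                continue
            xml_path = os.path.join(dir_path, file_name)
            try:
                pairs = read_properties(xml_path)
            except ElementTree.ParseError as err:
                # Report unparsable files and carry on with the rest
                print("# skipped {}: {}".format(xml_path, err))
                continue
            print("# {}".format(xml_path))
            for name, value in pairs:
                print("{} = {}".format(name, value))
```

Calling decode_tree('/var/run/cloudera-scm-agent/process') should then print one commented header line per file followed by its properties. Unlike the original script, this sketch only prints to stdout; redirect it to a file if you want to keep the result.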