Error
Error reported for HDFS in CM (because CM was down, this could not be viewed in the admin UI; it was taken from the logs):
The health test result for HDFS_CANARY_HEALTH has become bad: Canary test failed to create parent directory for /opt/tmp/.cloudera_health_monitoring_canary_files.
Troubleshooting and fix
(1) The CM node of the CDH cluster is down
[root@ec6s-logshedcl58manager-01 cloudera-scm-agent]# /etc/init.d/cloudera-scm-server status
cloudera-scm-server dead but pid file exists
[root@ec6s-logshedcl58manager-01 cloudera-scm-agent]# /usr/java/jdk1.8.0_111/bin/jps
20656 Main
20626 Main
25667 Jps
20630 EventCatcherService
20632 AlertPublisher
29995 Main
10619 -- process information unavailable
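# The ss output below shows no listener on port 7180, so CM did not start properly; the jps output above is also one Main process short. Note that the pattern 718* is a regex meaning "71" followed by zero or more "8", which is why unrelated lines (for example PID 17152) also match.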
[root@ec6s-logshedcl58manager-01 cloudera-scm-agent]# ss -nltup|grep 718*
tcp LISTEN 0 50 *:7184 *:* users:(("java",20630,233))
tcp LISTEN 0 50 *:7185 *:* users:(("java",20630,241))
tcp LISTEN 0 5 *:4433 *:* users:(("python2.6",17152,8))
tcp LISTEN 0 5 127.0.0.1:7190 *:* users:(("python2.6",17152,11))
tcp LISTEN 0 5 *:7191 *:* users:(("python2.6",17152,7))
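The port check above can be made stricter than a loose grep. The following is a minimal sketch, not part of the original procedure: `port_is_listening` is a hypothetical helper name, and it assumes `ss` prints the local address in the third field after the LISTEN state, as in the output above.

```shell
#!/bin/sh
# Hypothetical helper: succeed if a listener on the given port appears in
# `ss -nlt` (or `ss -nltup`) style output read from stdin.
port_is_listening() {
    port=$1
    awk -v p="$port" '
        {
            for (i = 1; i <= NF; i++) {
                # The local Address:Port field is the third field after LISTEN.
                if ($i == "LISTEN" && i + 3 <= NF) {
                    n = split($(i + 3), a, ":")
                    # Compare only the part after the last colon, so 7180
                    # does not match 7184 or a PID that contains "718".
                    if (a[n] == p) found = 1
                }
            }
        }
        END { exit(found ? 0 : 1) }'
}

# Usage on the CM host:
#   ss -nlt | port_is_listening 7180 && echo "CM UI port is open" || echo "CM UI port is closed"
```

Matching the exact port field avoids the false positives that the `718*` pattern produces.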
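# Our CDH metadata is stored in a MySQL database. With CM down, the other CDH components cannot be inspected through CM, so query the database to see which nodes belong to this CDH cluster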
mysql> select * from hosts;
+---------+-------------------------+--------------------------------------+----------------------------+----------------+----------+--------+---------------------+-------------------+--------------------+------------+-----------+----------------------+-------------+-------------------+----------------+
| HOST_ID | OPTIMISTIC_LOCK_VERSION | HOST_IDENTIFIER                      | NAME                       | IP_ADDRESS     | RACK_ID  | STATUS | CONFIG_CONTAINER_ID | MAINTENANCE_COUNT | DECOMMISSION_COUNT | CLUSTER_ID | NUM_CORES | TOTAL_PHYS_MEM_BYTES | PUBLIC_NAME | PUBLIC_IP_ADDRESS | CLOUD_PROVIDER |
+---------+-------------------------+--------------------------------------+----------------------------+----------------+----------+--------+---------------------+-------------------+--------------------+------------+-----------+----------------------+-------------+-------------------+----------------+
|       1 |                      11 | 264b10bb-b488-4ee7-8fcd-3c68f7a8860a | ec6s-logshedcl58manager-01 | 10.177.101.146 | /default | NA     |                   1 |                 0 |                  0 |          5 |         2 |           8251195392 | NULL        | NULL              | NULL           |
|       2 |                      17 | b584457b-705d-4b1f-8000-df0e6da1838d | ec6s-logshedcl58dn-03      | 10.177.102.38  | /default | NA     |                   1 |                 0 |                  0 |          5 |         2 |           8251195392 | NULL        | NULL              | NULL           |
|       3 |                      16 | e28dabc1-c105-464e-8bf6-0bd0435ace9a | ec6s-logshedcl58dn-02      | 10.177.102.193 | /default | NA     |                   1 |                 0 |                  0 |          5 |         2 |           8251195392 | NULL        | NULL              | NULL           |
|       4 |                      17 | 994cf04e-2510-426a-8336-6e2d28a3001d | ec6s-logshedcl58nn-02      | 10.177.102.218 | /default | NA     |                   1 |                 0 |                  0 |          5 |         2 |           8251195392 | NULL        | NULL              | NULL           |
|       5 |                      16 | a9cab0d5-5e48-49a7-8fb0-e57a0bac16db | ec6s-logshedcl58nn-01      | 10.177.101.60  | /default | NA     |                   1 |                 0 |                  0 |          5 |         2 |           8251195392 | NULL        | NULL              | NULL           |
|       6 |                      16 | 60bf1721-d6db-4d72-9164-41d89f81e789 | ec6s-logshedcl58dn-01      | 10.177.101.64  | /default | NA     |                   1 |                 0 |                  0 |          5 |         2 |           8251195392 | NULL        | NULL              | NULL           |
+---------+-------------------------+--------------------------------------+----------------------------+----------------+----------+--------+---------------------+-------------------+--------------------+------------+-----------+----------------------+-------------+-------------------+----------------+
6 rows in set (0.00 sec)

mysql> select * from roles;
+---------+----------------------------------------------------------+---------+--------------------+-------------------+------------+---------------+-------------------+--------------------+-------------------------+----------------------+------------------+
| ROLE_ID | NAME                                                     | HOST_ID | ROLE_TYPE          | CONFIGURED_STATUS | SERVICE_ID | MERGED_KEYTAB | MAINTENANCE_COUNT | DECOMMISSION_COUNT | OPTIMISTIC_LOCK_VERSION | ROLE_CONFIG_GROUP_ID | HAS_EVER_STARTED |
+---------+----------------------------------------------------------+---------+--------------------+-------------------+------------+---------------+-------------------+--------------------+-------------------------+----------------------+------------------+
|      14 | mgmt-HOSTMONITOR-92f15c379891f3c8dbdbbcbe57db9067        |       1 | HOSTMONITOR        | RUNNING           |          4 | NULL          |                 0 |                  0 |                       6 |                   25 |                1 |
|      15 | mgmt-EVENTSERVER-92f15c379891f3c8dbdbbcbe57db9067        |       1 | EVENTSERVER        | RUNNING           |          4 | NULL          |                 0 |                  0 |                       6 |                   21 |                1 |
|      16 | mgmt-ACTIVITYMONITOR-92f15c379891f3c8dbdbbcbe57db9067    |       1 | ACTIVITYMONITOR    | RUNNING           |          4 | NULL          |                 0 |                  0 |                       6 |                   22 |                1 |
|      17 | mgmt-SERVICEMONITOR-92f15c379891f3c8dbdbbcbe57db9067     |       1 | SERVICEMONITOR     | RUNNING           |          4 | NULL          |                 0 |                  0 |                       6 |                   24 |                1 |
|      18 | mgmt-ALERTPUBLISHER-92f15c379891f3c8dbdbbcbe57db9067     |       1 | ALERTPUBLISHER     | RUNNING           |          4 | NULL          |                 0 |                  0 |                       6 |                   20 |                1 |
|      19 | zookeeper-SERVER-5779e83332b2c66cc02029a8ab2c3628        |       3 | SERVER             | RUNNING           |          5 | NULL          |                 0 |                  0 |                       9 |                   27 |                1 |
|      20 | zookeeper-SERVER-c103ed4dcdd93fc8bbaf467aa1c6d927        |       2 | SERVER             | RUNNING           |          5 | NULL          |                 0 |                  0 |                       9 |                   27 |                1 |
|      21 | zookeeper-SERVER-dc971e0a60f4e798e85e2ab9bd57a041        |       6 | SERVER             | RUNNING           |          5 | NULL          |                 0 |                  0 |                       9 |                   27 |                1 |
|      23 | hdfs-NAMENODE-ed39ed17d751bee1bd6ad84c0db46ca1           |       5 | NAMENODE           | RUNNING           |          6 | NULL          |                 0 |                  0 |                      22 |                   30 |                1 |
|      24 | hdfs-DATANODE-c103ed4dcdd93fc8bbaf467aa1c6d927           |       2 | DATANODE           | RUNNING           |          6 | NULL          |                 0 |                  0 |                      10 |                   28 |                1 |
|      25 | hdfs-DATANODE-5779e83332b2c66cc02029a8ab2c3628           |       3 | DATANODE           | RUNNING           |          6 | NULL          |                 0 |                  0 |                      10 |                   28 |                1 |
|      26 | hdfs-DATANODE-dc971e0a60f4e798e85e2ab9bd57a041           |       6 | DATANODE           | RUNNING           |          6 | NULL          |                 0 |                  0 |                      10 |                   28 |                1 |
|      27 | hdfs-NAMENODE-16c21945a5f07e23a510dd5e32caa6dd           |       4 | NAMENODE           | RUNNING           |          6 | NULL          |                 0 |                  0 |                       6 |                   30 |                1 |
|      28 | hdfs-FAILOVERCONTROLLER-ed39ed17d751bee1bd6ad84c0db46ca1 |       5 | FAILOVERCONTROLLER | RUNNING           |          6 | NULL          |                 0 |                  0 |                       4 |                   29 |                1 |
|      29 | hdfs-FAILOVERCONTROLLER-16c21945a5f07e23a510dd5e32caa6dd |       4 | FAILOVERCONTROLLER | RUNNING           |          6 | NULL          |                 0 |                  0 |                       2 |                   29 |                1 |
|      30 | hdfs-JOURNALNODE-c103ed4dcdd93fc8bbaf467aa1c6d927        |       2 | JOURNALNODE        | RUNNING           |          6 | NULL          |                 0 |                  0 |                       2 |                   34 |                1 |
|      31 | hdfs-JOURNALNODE-dc971e0a60f4e798e85e2ab9bd57a041        |       6 | JOURNALNODE        | RUNNING           |          6 | NULL          |                 0 |                  0 |                       2 |                   34 |                1 |
|      32 | hdfs-JOURNALNODE-5779e83332b2c66cc02029a8ab2c3628        |       3 | JOURNALNODE        | RUNNING           |          6 | NULL          |                 0 |                  0 |                       2 |                   34 |                1 |
|      36 | kafka-KAFKA_BROKER-c103ed4dcdd93fc8bbaf467aa1c6d927      |       2 | KAFKA_BROKER       | RUNNING           |          8 | NULL          |                 0 |                  0 |                       9 |                   40 |                1 |
|      37 | kafka-KAFKA_BROKER-ed39ed17d751bee1bd6ad84c0db46ca1      |       5 | KAFKA_BROKER       | RUNNING           |          8 | NULL          |                 0 |                  0 |                      10 |                   40 |                1 |
|      38 | kafka-KAFKA_BROKER-16c21945a5f07e23a510dd5e32caa6dd      |       4 | KAFKA_BROKER       | RUNNING           |          8 | NULL          |                 0 |                  0 |                      10 |                   40 |                1 |
+---------+----------------------------------------------------------+---------+--------------------+-------------------+------------+---------------+-------------------+--------------------+-------------------------+----------------------+------------------+
21 rows in set (0.00 sec)

mysql> select * from services;
+------------+-------------------------+-----------+--------------+------------+-------------------+-----------------------------+------------+
| SERVICE_ID | OPTIMISTIC_LOCK_VERSION | NAME      | SERVICE_TYPE | CLUSTER_ID | MAINTENANCE_COUNT | DISPLAY_NAME                | GENERATION |
+------------+-------------------------+-----------+--------------+------------+-------------------+-----------------------------+------------+
|          4 |                      14 | mgmt      | MGMT         |       NULL |                 0 | Cloudera Management Service |          1 |
|          5 |                       7 | zookeeper | ZOOKEEPER    |          5 |                 0 | ZooKeeper                   |          1 |
|          6 |                      23 | hdfs      | HDFS         |          5 |                 0 | HDFS                        |          1 |
|          8 |                      15 | kafka     | KAFKA        |          5 |                 0 | Kafka                       |          1 |
+------------+-------------------------+-----------+--------------+------------+-------------------+-----------------------------+------------+
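The three tables above can be combined into a single host-to-role map instead of being read side by side. This is a sketch only: the table and column names are the ones visible in the query output above, while the function name `role_map_sql`, the database name, and the MySQL user in the usage line are assumptions to replace with your own.

```shell
#!/bin/sh
# Hypothetical helper: print a query that lists every role together with
# the host it runs on and the service it belongs to.
role_map_sql() {
    cat <<'SQL'
SELECT h.NAME AS host_name,
       h.IP_ADDRESS,
       s.NAME AS service_name,
       r.ROLE_TYPE,
       r.CONFIGURED_STATUS
FROM roles r
JOIN hosts h ON h.HOST_ID = r.HOST_ID
JOIN services s ON s.SERVICE_ID = r.SERVICE_ID
ORDER BY h.NAME, s.NAME, r.ROLE_TYPE;
SQL
}

# Usage (the user "scm" and database "scm" are assumptions):
#   role_map_sql | mysql -u scm -p scm
```

Joining on HOST_ID and SERVICE_ID gives one row per role, which is easier to scan than the raw tables when CM itself is unavailable.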
# Restart the cloudera-scm-server service
[root@ec6s-logshedcl58manager-01 cloudera-scm-agent]# /etc/init.d/cloudera-scm-server status
cloudera-scm-server dead but pid file exists
[root@ec6s-logshedcl58manager-01 cloudera-scm-agent]# /etc/init.d/cloudera-scm-server stop
cloudera-scm-server is already stopped
[root@ec6s-logshedcl58manager-01 cloudera-scm-agent]# cat /var/run/cloudera-scm-server.pid
10617
[root@ec6s-logshedcl58manager-01 cloudera-scm-agent]# ps -ef|grep 10617
root 28331 27755 0 19:02 pts/3 00:00:00 grep 10617
[root@ec6s-logshedcl58manager-01 cloudera-scm-agent]# kill 20656
[root@ec6s-logshedcl58manager-01 cloudera-scm-agent]# kill 20626
[root@ec6s-logshedcl58manager-01 cloudera-scm-agent]# kill 20630
[root@ec6s-logshedcl58manager-01 cloudera-scm-agent]# kill 29995
[root@ec6s-logshedcl58manager-01 cloudera-scm-agent]# kill 20632
[root@ec6s-logshedcl58manager-01 cloudera-scm-agent]# /etc/init.d/cloudera-scm-server start
[root@ec6s-logshedcl58manager-01 ~]# /etc/init.d/cloudera-scm-server status
cloudera-scm-server (pid 1378) is running...
# Started normally
[root@ec6s-logshedcl58manager-01 ~]# /usr/java/jdk1.8.0_111/bin/jps
1380 Main
2469 Main
2471 EventCatcherService
7272 Jps
2473 AlertPublisher
2475 Main
2462 Main
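The "dead but pid file exists" state seen in this section can also be detected by a script. The following is a rough sketch; `pid_file_is_stale` is a hypothetical helper name, and the path in the usage line is the pid file read in the session above.

```shell
#!/bin/sh
# Hypothetical helper: succeed if the pid file exists but the process it
# names is no longer running.
pid_file_is_stale() {
    pid_file=$1
    [ -f "$pid_file" ] || return 1        # no pid file, so nothing is stale
    pid=$(cat "$pid_file")
    case $pid in
        ''|*[!0-9]*) return 0 ;;          # empty or non-numeric content: treat as stale
    esac
    # ps -p exits non-zero when no process with that PID exists.
    if ps -p "$pid" >/dev/null 2>&1; then
        return 1                          # process is alive
    fi
    return 0
}

# Usage:
#   pid_file_is_stale /var/run/cloudera-scm-server.pid && echo "stale pid file"
```

This only reports the condition; what to do about it (here, restarting the service) is left to the operator.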
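(2) The two NameNodes cannot communicate with each other, but neither has crashed
Once CM is back up, the NameNodes can be managed through the web UI again. The UI shows that the NameNodes cannot communicate with each other and cannot write edit logs to the JournalNodes.
Error in the log: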
Jul 18, 5:38:09.355 PM FATAL org.apache.hadoop.hdfs.server.namenode.FSEditLog
Error: flush failed for required journal (JournalAndStream(mgr=QJM to [10.177.101.64:8485, 10.177.102.193:8485, 10.177.102.38:8485], stream=QuorumOutputStream starting at txid 1338050))
java.io.IOException: Timed out waiting 20000ms for a quorum of nodes to respond.
    at org.apache.hadoop.hdfs.qjournal.client.AsyncLoggerSet.waitForWriteQuorum(AsyncLoggerSet.java:137)
    at org.apache.hadoop.hdfs.qjournal.client.QuorumOutputStream.flushAndSync(QuorumOutputStream.java:107)
    at org.apache.hadoop.hdfs.server.namenode.EditLogOutputStream.flush(EditLogOutputStream.java:113)
    at org.apache.hadoop.hdfs.server.namenode.EditLogOutputStream.flush(EditLogOutputStream.java:107)
    at org.apache.hadoop.hdfs.server.namenode.JournalSet$JournalSetOutputStream$8.apply(JournalSet.java:533)
    at org.apache.hadoop.hdfs.server.namenode.JournalSet.mapJournalsAndReportErrors(JournalSet.java:393)
    at org.apache.hadoop.hdfs.server.namenode.JournalSet.access$100(JournalSet.java:57)
    at org.apache.hadoop.hdfs.server.namenode.JournalSet$JournalSetOutputStream.flush(JournalSet.java:529)
    at org.apache.hadoop.hdfs.server.namenode.FSEditLog.logSync(FSEditLog.java:651)
    at org.apache.hadoop.hdfs.server.namenode.FSEditLog.logSync(FSEditLog.java:585)
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startFileInt(FSNamesystem.java:2752)
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startFile(FSNamesystem.java:2624)
    at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.create(NameNodeRpcServer.java:599)
    at org.apache.hadoop.hdfs.server.namenode.AuthorizationProviderProxyClientProtocol.create(AuthorizationProviderProxyClientProtocol.java:112)
    at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.create(ClientNamenodeProtocolServerSideTranslatorPB.java:401)
    at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
    at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:617)
    at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1073)
    at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2141)
    at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2137)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Unknown Source)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1783)
    at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2135)
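The "QJM to [...]" list in the error message names the JournalNodes the NameNode was waiting on. As a small sketch (the function name `qjm_endpoints` is hypothetical, and it assumes the message format shown above), the addresses can be pulled out of a log so that each one can then be checked for reachability:

```shell
#!/bin/sh
# Hypothetical helper: read NameNode log text on stdin and print each
# JournalNode address from the first "QJM to [...]" list, one per line.
qjm_endpoints() {
    sed -n 's/.*QJM to \[\([^]]*\)].*/\1/p' |   # keep only the bracketed list
        head -n 1 |                             # first matching log line only
        tr ',' '\n' |                           # one address per line
        sed 's/^ *//; s/ *$//'                  # trim surrounding spaces
}

# Usage (the log file name is an assumption):
#   qjm_endpoints < namenode.log
```

Each printed host:port pair can then be tested from the NameNode host with whichever TCP probe tool is available.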
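The log shows that the NameNode timed out while flushing edits to the JournalNodes: no quorum responded within 20000 ms. The company runs on AWS EC2, and network maintenance may have been in progress at the time, making instance networking unstable. If such timeouts occur, the default of 20 s can be raised to 60 s, for example: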
#vim /etc/hadoop/conf/hdfs-site.xml
<property>
<name>dfs.qjournal.write-txns.timeout.ms</name>
<value>60000</value>
</property>
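After editing, it can help to confirm what value the file now holds. The following is a minimal sketch rather than a real XML parser: `get_hadoop_property` is a hypothetical helper, and it assumes the simple layout shown above, where each `<name>` is followed by its `<value>` on the same or a later line.

```shell
#!/bin/sh
# Hypothetical helper: print the value of one property from a Hadoop
# *-site.xml file. Prints nothing if the property is absent.
get_hadoop_property() {
    file=$1
    name=$2
    awk -v want="$name" '
        /<name>/ {
            n = $0
            sub(/.*<name>[ \t]*/, "", n)
            sub(/[ \t]*<\/name>.*/, "", n)
            found = (n == want)       # remember whether this is the wanted property
        }
        found && /<value>/ {
            v = $0
            sub(/.*<value>[ \t]*/, "", v)
            sub(/[ \t]*<\/value>.*/, "", v)
            print v
            exit
        }
    ' "$file"
}

# Usage:
#   get_hadoop_property /etc/hadoop/conf/hdfs-site.xml dfs.qjournal.write-txns.timeout.ms
```

An empty result means the property is not set in that file, in which case the 20000 ms default applies.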
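The two NameNodes can then be restarted one at a time from the CM admin console at http://10.177.101.146:7180. Note that CM generates each role's configuration itself, so on a CM-managed cluster the property usually also needs to be set in CM (for example through the HDFS advanced configuration snippet for hdfs-site.xml) for the restarted NameNodes to pick it up.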