System and versions:
====================
Novell Cluster with 9 nodes and 44 cluster volumes under
SLES11SP1/OES11 with the latest patches applied
Hardware: HP Blades and HP Storage (EVA7000)
Problem:
========
For some weeks we have had a problem with cluster volumes and the ndsd daemons. Almost daily, the cluster volumes of one cluster node become unavailable to the clients. The command
# rcndsd status
returns "Unable to get server status".
In this case the server in question has to be rebooted. The cluster volumes then migrate (because of the reboot) to other cluster nodes and become functional again.
After some hours another server shows the same symptoms. It seems that one or two "favorite" cluster volumes are always involved.
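For reference, here is a rough watchdog sketch that could be run on each node to pin down the exact moment the failure starts (assumptions: it runs as root on the node, ndsstat sits in the default eDirectory path, and the log file name /var/log/ndsd-watch.log is just an example):

#!/bin/bash
# Minimal status watchdog (sketch): every 5 minutes, record ndsd and
# eDirectory status so the failure time can be correlated with
# /var/log/messages. Log file name and interval are assumptions.
LOG=/var/log/ndsd-watch.log
while true; do
    echo "==== $(date '+%b %d %H:%M:%S') ====" >> "$LOG"
    rcndsd status                      >> "$LOG" 2>&1
    /opt/novell/eDirectory/bin/ndsstat >> "$LOG" 2>&1
    sleep 300
done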
Additional information:
=======================
1.)
I applied TD 7012793 to one cluster node. The only change: when the cluster volumes become unavailable to clients, the command
# rcndsd status
no longer returns an error. But when the cluster volume is migrated (via iManager), the ndsd of the server from which it is migrated crashes, and rcndsd status then returns "dead".
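To have something for support to analyze when ndsd dies in libncpengine, core dumps could be enabled for ndsd before the next crash. A minimal sketch using standard Linux facilities (the core directory and pattern are assumptions, and the settings only last until the next reboot):

# run as root on the affected node
mkdir -p /var/cores
echo "/var/cores/core.%e.%p.%t" > /proc/sys/kernel/core_pattern
ulimit -c unlimited    # applies to this shell and the processes it starts
rcndsd restart         # restart ndsd so it runs with the raised core limit
# note: whether the ndsd init script overrides the core limit should be verified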
2.)
An excerpt from /var/log/messages.
Using iManager, I migrated the cluster volume C3-NL3K12P-SERVER, which had become unavailable to clients, away from the server nc308:
Sep 25 06:13:01 nc308 /usr/sbin/cron[22602]: (root) CMD (/usr/sbin/smt-agent)
Sep 25 06:14:48 nc308 [XTCOM]: pam_sm_authenticate in pam_ncl.c (novell-client's pam)is called
Sep 25 06:15:01 nc308 /usr/sbin/cron[22639]: (root) CMD ( /opt/hp/hp-health/bin/check-for-restart-requests)
Sep 25 06:16:15 nc308 sshd[22665]: Accepted keyboard-interactive/pam for root from 172.20.144.40 port 58548 ssh2
Sep 25 06:19:28 nc308 smdrd[16219]: Received Leave Event for C3-NL3K12P-SERVER
Sep 25 06:19:28 nc308 smdrd[16219]: Target name C3-NL3K12P-SERVER successfully de-advertised from SLP
Sep 25 06:19:28 nc308 kernel: [54445.897985] ndsd[22110]: segfault at 58 ip 00007fb6b44962b9 sp 00007fb69cec1be0 error 4 in libncpengine.so.0.0.0[7fb6b4429000+105000]
Sep 25 06:19:29 nc308 smdrd[16219]: Could not start TCP listener on 172.20.144.50
Sep 25 06:19:32 nc308 adminus daemon: umounting volume NL3K12S lazy=1
Sep 25 06:19:34 nc308 kernel: [54451.742301] NSSLOG ==> [MSAP] comnLog[201]
Sep 25 06:19:34 nc308 kernel: [54451.742303] Pool "NL3K12P" - MSAP deactivate.
Sep 25 06:20:01 nc308 /usr/sbin/cron[22848]: (root) CMD ( /opt/hp/hp-health/bin/check-for-restart-requests)
Sep 25 06:21:50 nc308 shutdown[22906]: shutting down for system reboot
Sep 25 06:21:51 nc308 init: Switching to runlevel: 6
Sep 25 06:21:53 nc308 kernel: [54591.102010] bootsplash: status on console 0 changed to on
Sep 25 06:21:57 nc308 multipathd: 36001438012599fc20000400000c40000: stop event checker thread (140680465872640)
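To check whether the same libncpengine segfault precedes every incident, the messages files of all nodes could be scanned from one machine. A simple sketch (the node names other than nc308 are placeholders, and passwordless root ssh between the nodes is assumed):

for n in nc301 nc302 nc308; do
    echo "--- $n ---"
    ssh root@"$n" "grep -E 'ndsd.*segfault|MSAP deactivate|Leave Event' /var/log/messages"
done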