We very frequently see SPU0401 lose its heartbeat for a couple of seconds. The messages below are written to the sysmgr log on a regular basis.
2014-12-09 01:24:16.796875 EST Info: [spu hwid=1232 sn="Y011UN19G131" SPA=4 Parent=1004 Position=3 spuName= spu0401] got a heartbeat after 16 seconds, last heartbeat: Dec 09 01:22:52 2014
2014-12-09 11:04:49.742910 EST Info: [spu hwid=1232 sn="Y011UN19G131" SPA=4 Parent=1004 Position=3 spuName= spu0401] got a heartbeat after 24 seconds, last heartbeat: Dec 09 11:03:16 2014
2014-12-09 11:04:49.742974 EST Info: [spu hwid=1232 sn="Y011UN19G131" SPA=4 Parent=1004 Position=3 spuName= spu0401] missed 3 heartbeat messages - seq: 349669, last: 349666
These sysmgr log entries are informational only; they are not warnings.
The SPUs and sysmgr exchange ACK/reply messages continuously. The log message means that the blade named in the entry was under a heavy workload at that time and was too busy to respond to these messages promptly. In that case sysmgr simply records the delay for information, and this is normal behavior. If an SPU does not answer within 900 seconds (based on the system registry setting sysmgr.spuPollReplyWarningInterval), sysmgr logs a "warning", but this is still not a problem: after the warning, sysmgr keeps sending messages and counting until a second setting (sysmgr.spuPollReplyTimeout) is reached. If the delay reaches 1800 seconds, sysmgr signals the blade to restart, because it can no longer tell what state the blade is in.
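The escalation described above can be sketched as a small shell function. This is only an illustration of the two thresholds, not how sysmgr is actually implemented: the function name classify_reply_delay and the labels it prints are made up for this example, the defaults are the 900 and 1800 second values quoted above, and treating a delay exactly equal to a threshold as having reached it is an assumption.

```shell
# Illustrative only: classify a reply delay (in seconds) against the two
# sysmgr thresholds described above. Function name and labels are hypothetical.
classify_reply_delay() {
  delay=$1
  warn=${2:-900}      # sysmgr.spuPollReplyWarningInterval
  timeout=${3:-1800}  # sysmgr.spuPollReplyTimeout
  if [ "$delay" -ge "$timeout" ]; then
    echo "restart"    # sysmgr signals the blade to restart
  elif [ "$delay" -ge "$warn" ]; then
    echo "warning"    # logged as a warning, sysmgr keeps waiting
  else
    echo "info"       # informational log entry only
  fi
}

classify_reply_delay 24     # prints: info
classify_reply_delay 1000   # prints: warning
```

By this reading, the 16 and 24 second gaps in the log entries above are far below the warning interval, which is why they are logged as Info.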
To find the system-level settings on your system:
[nz@Host1]$ nzsystem showregistry | grep -i spupollreply
sysmgr.spuPollReplyWarningInterval = 900
sysmgr.spuPollReplyTimeout = 1800
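With those two values in hand, you can check how far the logged heartbeat gaps are from the warning threshold. Below is a rough sketch, assuming the log line format shown earlier and the "name = value" registry output shown above. The helper names registry_value and heartbeat_gaps are made up for this example, and the log path in the usage comment is an assumption that may differ on your system.

```shell
# Read `nzsystem showregistry` output on stdin and print the value of one setting.
# Hypothetical helper; the name match is case-insensitive, like the grep -i above.
registry_value() {
  awk -v key="$1" -F'=' '
    { k = $1; gsub(/[ \t]/, "", k) }
    tolower(k) == tolower(key) { v = $2; gsub(/[ \t]/, "", v); print v; exit }'
}

# Read sysmgr log text on stdin and print "date time spuName seconds" for every
# "got a heartbeat after N seconds" entry. Other entries are skipped.
heartbeat_gaps() {
  sed -n 's/^\([0-9-]* [0-9:.]*\) .*spuName= *\([A-Za-z0-9]*\)] got a heartbeat after \([0-9][0-9]*\) seconds.*/\1 \2 \3/p'
}

# Usage sketch (log path is an assumption, adjust for your installation):
#   warn=$(nzsystem showregistry | registry_value sysmgr.spuPollReplyWarningInterval)
#   heartbeat_gaps < /nz/kit/log/sysmgr/sysmgr.log |
#     awk -v w="$warn" '{ print $0, "(" int(100 * $4 / w) "% of warning interval)" }'
```

Both helpers read from standard input, so they can be tried on a saved copy of the log or registry output without touching the running system.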