
SPU losing heartbeat very frequently – Netezza

ISSUE:

Very frequently we see spu0401 lose its heartbeat for a couple of seconds. The message below is written to the sys manager log on a frequent basis.

2014-12-09 01:24:16.796875 EST Info: [spu hwid=1232 sn="Y011UN19G131" SPA=4 Parent=1004 Position=3 spuName= spu0401] got a heartbeat after 16 seconds, last heartbeat: Dec 09 01:22:52 2014
2014-12-09 11:04:49.742910 EST Info: [spu hwid=1232 sn="Y011UN19G131" SPA=4 Parent=1004 Position=3 spuName= spu0401] got a heartbeat after 24 seconds, last heartbeat: Dec 09 11:03:16 2014
2014-12-09 11:04:49.742974 EST Info: [spu hwid=1232 sn="Y011UN19G131" SPA=4 Parent=1004 Position=3 spuName= spu0401] missed 3 heartbeat messages - seq: 349669, last: 349666
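To get a quick feel for how often and how long a SPU goes quiet, the delay values can be pulled out of these messages with a small awk pass. A sketch, using the sample lines above written to a temporary file (on a live system you would point the awk at the actual sysmgr log):

```shell
# Sample sysmgr log lines (format copied from the messages above)
cat > /tmp/sysmgr_sample.log <<'EOF'
2014-12-09 01:24:16.796875 EST Info: [spu hwid=1232 sn="Y011UN19G131" SPA=4 Parent=1004 Position=3 spuName= spu0401] got a heartbeat after 16 seconds, last heartbeat: Dec 09 01:22:52 2014
2014-12-09 11:04:49.742910 EST Info: [spu hwid=1232 sn="Y011UN19G131" SPA=4 Parent=1004 Position=3 spuName= spu0401] got a heartbeat after 24 seconds, last heartbeat: Dec 09 11:03:16 2014
EOF

# Print timestamp, SPU name, and delay for each late heartbeat
awk '/got a heartbeat after/ {
    name = "unknown"
    for (i = 1; i <= NF; i++) {
        if ($i == "spuName=") { name = $(i+1); sub(/\]$/, "", name) }
        if ($i == "after")    delay = $(i+1)
    }
    print $1, $2, name, delay "s"
}' /tmp/sysmgr_sample.log
```

For the sample above this prints one line per late heartbeat, with the delay in seconds at the end, which makes it easy to eyeball whether the delays are drifting toward the thresholds discussed below.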

RESOLUTION:

This message in the sysmgr log is informational only and is not a warning.
The SPUs and sysmgr exchange ACK/reply messages continuously. The log message means that the blade mentioned in the log was under a high workload at the time and was too busy to respond to these messages; sysmgr simply logs this for information, and it is normal behavior. If a SPU does not answer within 900 seconds (based on the system registry setting sysmgr.spuPollReplyWarningInterval), sysmgr logs this as a “warning”, but it is still not an issue: after the warning the system continues to send messages and count until another setting (sysmgr.spuPollReplyTimeout) is hit. If 1800 seconds pass with no reply, sysmgr sends a signal to the blade to restart it, since it no longer knows the status of the blade.

To find the system-level settings on your system:
[nz@Host1]$ nzsystem showregistry | grep -i spupollreply
sysmgr.spuPollReplyWarningInterval = 900
sysmgr.spuPollReplyTimeout = 1800
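The escalation logic described above can be sketched as a small classifier: given an observed reply delay, compare it against the two registry thresholds. The classification labels here are my own shorthand for the behavior described above, not sysmgr's actual log wording:

```shell
# Thresholds from the registry output above
WARN=900      # sysmgr.spuPollReplyWarningInterval
TIMEOUT=1800  # sysmgr.spuPollReplyTimeout

classify_delay() {
    # Classify an observed SPU reply delay (in seconds)
    delay=$1
    if [ "$delay" -ge "$TIMEOUT" ]; then
        echo "$delay s: timeout - sysmgr signals the blade to restart"
    elif [ "$delay" -ge "$WARN" ]; then
        echo "$delay s: warning logged, counting continues"
    else
        echo "$delay s: informational only"
    fi
}

classify_delay 24      # the kind of delay seen in the log above
classify_delay 950
classify_delay 1800
```

The 16- and 24-second delays from the log are far below the 900-second warning threshold, which is why sysmgr treats them as informational.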

Mustang – How to troubleshoot bad SPU

Here is the procedure to diagnose and find a bad SPU in a Mustang box that may be generating errors for some queries.

1) Note the hwid reported in the error

2) cd to /nz/kit/log/postgres and grep for the string "ERROR\:" in pg.log. If you see errors that look like:

2014-09-10 09:31:33.443100 EDT [27619] ERROR: 23 : spu 10.0.32.3 disk error DISK_SATA_RX_ERROR at 12343123

This indicates that these disk errors are causing queries to fail.
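To see which SPU IPs the disk errors point at, the grep can be extended slightly. A sketch, using the sample line above written to a temporary file (on a live system, point this at /nz/kit/log/postgres/pg.log):

```shell
# Sample pg.log line (copied from the error above)
cat > /tmp/pg_sample.log <<'EOF'
2014-09-10 09:31:33.443100 EDT [27619] ERROR: 23 : spu 10.0.32.3 disk error DISK_SATA_RX_ERROR at 12343123
EOF

# Count disk errors per SPU IP
grep 'ERROR:' /tmp/pg_sample.log |
    awk '/disk error/ { for (i = 1; i <= NF; i++) if ($i == "spu") print $(i+1) }' |
    sort | uniq -c
```

An IP that shows up repeatedly in this count is the one to cross-check against nzinventory in the next step.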

3) Confirm that the error is related to the SPU by using the hwid from the alert:

nz@host1:/nz/kit/log/postgres->nzinventory | grep 1962
SPU 1962 Active Online 41 4 10.0.41.4 372.61 GB

Now we know that this SPU will continue to cause queries to fail.
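The same cross-check can be scripted: match the IP from the pg.log error against the inventory line to recover the hwid. A sketch against the sample output copied from above; the column positions are an assumption based on that single line of nzinventory output:

```shell
# Sample nzinventory output line (copied from above)
cat > /tmp/nzinventory_sample.txt <<'EOF'
SPU 1962 Active Online 41 4 10.0.41.4 372.61 GB
EOF

# Given a SPU IP from pg.log, print the matching hwid
# (on a live system, feed real nzinventory output into the awk instead)
ip=10.0.41.4
awk -v ip="$ip" '$1 == "SPU" && $7 == ip { print "hwid:", $2 }' /tmp/nzinventory_sample.txt
```

For the sample line this prints the hwid 1962, confirming the IP-to-hwid mapping before any failover is issued.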

4) Issue the command:

nz@host1:/export/home/nz-> nzsystem pause

The default timeout is 5 minutes, so NPS will wait five minutes for in-flight queries to complete. If in-flight queries don’t complete in this time, NPS will kill the active queries and then pause.

To override the default timeout of 300 seconds, use the -timeout flag, e.g.:

nz@host1:/export/home/nz-> nzsystem pause -timeout 600

5) Now that the system has paused, fail over the bad SPU:

nz@host1:/export/home/nz->nzspu failover -id 1962

6) Resume NPS:

nz@host1:/export/home/nz->nzsystem resume

7) Monitor the regen with the command:

nz@host1:/export/home/nz->watch nzinventory -type regenTasks

NOTE: Regen is much faster on Mustang systems than on TwinFin systems. On Mustang, the system goes through a synchronization process to apply the transactions for that SPU that were submitted while the SPU was being regenerated. The busier the system is, the longer the synchronization process will take.
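Steps 4–7 above can be collected into a small wrapper. Shown here in dry-run form (it only echoes the commands, so it can be reviewed before anything is executed); the commands themselves are exactly the ones from the steps above, with the hwid and timeout parameterized:

```shell
#!/bin/sh
# Dry-run sketch of the failover procedure above.
# Remove the echo prefixes (or pipe to sh) to actually execute on a live system.

SPU_ID=1962        # hwid from the error / nzinventory
PAUSE_TIMEOUT=600  # seconds to wait for in-flight queries before they are killed

failover_plan() {
    echo nzsystem pause -timeout "$PAUSE_TIMEOUT"
    echo nzspu failover -id "$SPU_ID"
    echo nzsystem resume
    echo nzinventory -type regenTasks   # repeat (or wrap in watch) until regen completes
}

failover_plan
```

Keeping the procedure in one reviewable script makes it harder to skip the resume step or fail over the wrong hwid under pressure.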