Mustang – How to troubleshoot bad SPU

Here is the procedure to diagnose and locate a bad SPU in a Mustang box that may be causing errors for some queries.

1) Note the hwid reported in the error

2) cd to /nz/kit/log/postgres and grep for the string "ERROR:" in pg.log. Look for errors like:

2014-09-10 09:31:33.443100 EDT [27619] ERROR: 23 : spu disk error DISK_SATA_RX_ERROR at 12343123

Errors like this indicate that SPU disk errors are causing queries to fail.
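When pg.log is large, it can help to summarize the disk errors instead of reading them one by one. The following is a minimal sketch, assuming every relevant line has the same layout as the sample above ("spu disk error <NAME> at <number>"). The function name count_spu_disk_errors is hypothetical and not part of the Netezza tooling; it uses only standard grep, sed, sort and uniq.

```shell
# Hypothetical helper: count SPU disk errors per error name in a pg.log-style file.
# Assumes lines look like:
#   ... ERROR: 23 : spu disk error DISK_SATA_RX_ERROR at 12343123
count_spu_disk_errors() {
    logfile="${1:-/nz/kit/log/postgres/pg.log}"
    grep 'ERROR:' "$logfile" \
        | grep 'spu disk error' \
        | sed 's/.*spu disk error \([A-Z_]*\) at.*/\1/' \
        | sort | uniq -c | sort -rn
}
```

Calling count_spu_disk_errors with no argument reads /nz/kit/log/postgres/pg.log; pass another path to inspect an archived log. Each output line is a count followed by the error name.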

3) Confirm that the error is related to an SPU by looking up the hwid from the alert:

nz@host1:/nz/kit/log/postgres->nzinventory | grep 1962
SPU 1962 Active Online 41 4 372.61 GB

Now we know that this SPU will continue to cause queries to fail.
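A plain grep for the hwid can also match the same digits in other columns. The following is a minimal sketch of a stricter check, assuming nzinventory prints the device type in the first column and the hwid in the second, as in the sample line above. The function name find_spu_by_hwid is hypothetical.

```shell
# Hypothetical helper: read nzinventory-style output on stdin and print only
# the row whose first column is "SPU" and whose second column is the given hwid.
# Returns non-zero when no such row exists.
find_spu_by_hwid() {
    awk -v id="$1" '
        $1 == "SPU" && $2 == id { print; found = 1 }
        END { exit !found }
    '
}
```

Usage would be along the lines of: nzinventory | find_spu_by_hwid 1962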

4) Issue the command:

nz@host1:/export/home/nz-> nzsystem pause

The default timeout is 5 minutes (300 seconds), so the command waits up to five minutes for in-flight queries to complete. If they don't complete in that time, NPS kills the active queries and then pauses.

To override the default timeout, pass the -timeout flag, for example:

nz@host1:/export/home/nz-> nzsystem pause -timeout 600

5) Now that the system has paused, fail over the bad SPU:

nz@host1:/export/home/nz->nzspu failover -id 1962

6) Resume NPS:

nz@host1:/export/home/nz->nzsystem resume

7) Monitor the regen with the command:

nz@host1:/export/home/nz->watch nzinventory -type regenTasks

NOTE: Regen is much faster on Mustang systems than on TwinFins. On Mustang, the regen is followed by a synchronization process that applies the transactions submitted for that SPU while it was being regenerated. The busier the system is, the longer the synchronization process will take.
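
Steps 4 through 6 can be sketched as a small wrapper that stops at the first failing command, so a failover is never attempted on a system that did not pause. This is only a sketch under stated assumptions: the names run, failover_spu and DRY_RUN are hypothetical, the nz commands are assumed to return a non-zero exit status on failure, and any interactive confirmation they may ask for is not handled. DRY_RUN defaults to 1, which only prints the commands.

```shell
# Hypothetical wrapper around steps 4-6. DRY_RUN and the function names are
# illustrative and not part of the Netezza tooling.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "$*"      # dry run: only print the command
    else
        "$@"           # really execute the command
    fi
}

failover_spu() {
    hwid="$1"
    timeout="${2:-300}"    # 300 seconds is the default pause timeout noted in step 4
    if [ -z "$hwid" ]; then
        echo "usage: failover_spu <hwid> [timeout_seconds]" >&2
        return 2
    fi
    run nzsystem pause -timeout "$timeout" || return 1
    if ! run nzspu failover -id "$hwid"; then
        echo "failover failed; the system is still paused" >&2
        return 1
    fi
    run nzsystem resume
}
```

Running failover_spu 1962 600 with DRY_RUN unset prints the three commands in order for review; setting DRY_RUN=0 would execute them.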