Netezza TwinFin SAS switch restart error message
You may receive false SAS switch restart messages from your NPS TwinFin system, such as:
Message Header Host : MySERVER-1a. Event : Hardware Restarted. Event Rule Detail : . Start : 08-21-13 14:19:27 EDT. Reporting Interval : 2 minutes. Activity Duration : 00:00:57. Number of events : 2.
1 hwType= Sas Switch, hwId=1008, spaId=5, spaSlot=1, eventSource=system, devSerial=, devHwRev=, devFwRev=, eventSource=System initiated, Hardware Restarted on 08-21-13 14:19:32 EDT.
2 hwType= Sas Switch, hwId=1007, spaId=4, spaSlot=2, eventSource=system, devSerial=, devHwRev=, devFwRev=, eventSource=System initiated, Hardware Restarted on 08-21-13 14:20:29 EDT.
These error messages will be generated for almost all the SAS switches on the server.
There are 2 SAS switches per chassis. A TwinFin 6 has 1 chassis, i.e. 2 SAS switches. Similarly, a TwinFin 24 has 4 chassis, i.e. 8 SAS switches.
So the number of SAS switch restart errors you receive depends on the size of your server. The reason is that each SAS switch has an internal uptime counter that starts at 0, counts up, and rolls over to 0 after about 497 days. When it rolls over, the system thinks the SAS switch has restarted and starts generating these false restart messages.
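The 497-day figure is what you would expect from a 32-bit counter of hundredths of a second, which is how SNMP reports uptime (timeticks). That the switch counts this way internally is an assumption, not something confirmed here, but the arithmetic is easy to check:

```shell
# Assumption: the uptime counter is a 32-bit value counted in hundredths
# of a second (SNMP timeticks), so it wraps after 2^32 ticks.
ticks_per_wrap=$(( 1 << 32 ))              # 4294967296 timeticks
ticks_per_day=$(( 100 * 60 * 60 * 24 ))    # 8640000 timeticks in one day

# Integer division gives the number of whole days before the rollover.
wrap_days=$(( ticks_per_wrap / ticks_per_day ))
echo "Counter wraps after about ${wrap_days} days"
```

Under that assumption the script prints "Counter wraps after about 497 days".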
This bug is present in NPS version 6.* and older running on FDT version 2.4* and older. It was fixed in NPS version 7.0 on FDT 2.5.
More info on this bug:
There is an internal counter within the switch itself. The sysmgr queries the switch, stores the uptime from the counter, and compares it with the previously stored uptime. Because of a bug, the counter rolls over to 0 at 497 days. On the next comparison the new value is lower than the stored one, so the sysmgr thinks the switch restarted. The SAS switch did not restart; only its counter rolled over to 0.
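The comparison described above can be sketched as follows. This is only an illustration of the idea: the function name ‘looks_restarted’ and the sample values are hypothetical, and it is not the actual sysmgr code.

```shell
# Hypothetical sketch of a naive uptime comparison (not the real sysmgr logic).
# Arguments: previously stored uptime and current uptime, both in timeticks.
looks_restarted() {
    prev=$1
    cur=$2
    # Naive rule: a current uptime lower than the stored one means "restart".
    if [ $(( cur < prev )) -eq 1 ]; then
        echo "restart"
    else
        echo "running"
    fi
}

# Shortly before the rollover the counter is close to 2^32 ...
looks_restarted 4294900000 4294960000    # prints "running"
# ... and shortly after it the counter is small again, so the rule misfires.
looks_restarted 4294960000 5000          # prints "restart"
```

A rule like this cannot tell a wrapped counter from a genuine restart, which is why the restart messages have to be cross-checked against the state of the system.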
Note that the SAS switch counter is not reset when NPS is stopped and started. It is reset only when the machine is completely shut down and brought up again, or when the SAS switch is replaced.
NOTE: You can also get these error messages in the case of an actual failure.
How to identify whether a SAS switch has really restarted?
Answer: When a SAS switch restarts, your NPS system is also restarted. Running ‘nzstate’ should then return a state other than Online (like discovering, starting, etc.). You can also check whether the system was restarted by looking in /nz/kit/log/startupsvr.
Run ‘nzstats’ to check the uptime. From that you can also tell whether NPS restarted or not.
Also, you can run ‘nzhw -issues’ to check for hardware issues and ‘nzds -issues’ to check for data slice issues.
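The first of these checks can be wrapped in a small helper. This is a minimal sketch: the function name ‘nps_state_check’ is made up for illustration, and because the exact wording of the ‘nzstate’ output is assumed rather than confirmed here, the helper only looks for the word Online anywhere in the text it is given.

```shell
# Illustrative helper (hypothetical name). It takes the text printed by
# nzstate as its argument and reports whether it mentions the Online state.
# Assumption: an online system prints a message containing the word "Online".
nps_state_check() {
    case "$1" in
        *Online*) echo "online" ;;
        *)        echo "not-online" ;;
    esac
}
```

On the NPS host it could be called as ‘nps_state_check "$(nzstate)"’. A result of not-online right after a restart message points to a real restart; a result of online should still be confirmed against the uptime and the startup logs.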
How to check the current counter value?
To do that, run an SNMP command on your server. Run ‘snmpwalk -Os -c public -v 1 sassw01a’ and check the value under sysUpTimeInstance. This command returns the current counter value for SAS switch 1a.
-bash-3.2$ snmpwalk -Os -c public -v 1 sassw01a
sysDescr.0 = STRING: IBM BladeCenter SAS Connectivity Module
sysObjectID.0 = OID: enterprises.2.3.111
truncating unsigned value to 32 bits (2)
sysUpTimeInstance = Timeticks: (3506813348) 405 days, 21:08:53.48
sysContact.0 = STRING: Unassigned
sysName.0 = STRING: IBM-SAS-MODULE
sysLocation.0 = STRING: Unassigned
sysServices.0 = INTEGER: 8
Similarly, to check the counter value for switch 1b, run ‘snmpwalk -Os -c public -v 1 sassw01b’.
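To see roughly how long a switch has before its counter rolls over, the sysUpTimeInstance line can be parsed with a few lines of shell. This is a sketch under stated assumptions: the function names are hypothetical, the rollover limit is assumed to be 2^32 timeticks (hundredths of a second), and only the two host names shown above are used.

```shell
# Read snmpwalk output on stdin and print the raw timetick count from a line
# such as: sysUpTimeInstance = Timeticks: (3506813348) 405 days, 21:08:53.48
uptime_ticks() {
    sed -n 's/^sysUpTimeInstance = Timeticks: (\([0-9][0-9]*\)).*/\1/p'
}

# Whole days left before the counter wraps.
# Assumption: it wraps at 2^32 timeticks; one day is 8640000 timeticks.
days_to_rollover() {
    echo $(( (4294967296 - $1) / 8640000 ))
}

# Hypothetical wrapper: query both switches of the first chassis.
check_switches() {
    for sw in sassw01a sassw01b; do
        ticks=$(snmpwalk -Os -c public -v 1 "$sw" 2>/dev/null | uptime_ticks)
        if [ -n "$ticks" ]; then
            echo "$sw: $(days_to_rollover "$ticks") days until rollover"
        else
            echo "$sw: no uptime value returned"
        fi
    done
}
```

For the sample output above (3506813348 timeticks, about 405 days), ‘days_to_rollover 3506813348’ prints 91. Run ‘check_switches’ on the NPS host to get the figure for both switches.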