IJ44909 |
High Importance
|
afmRecoveryVer2 cannot trigger a policy scan on the remote contact node for an NSD backend AFM fileset. The policy scan code is disabled for NSD backend filesets.
Symptom |
Unexpected Results |
Environment |
Linux |
Trigger |
Running afmRecoveryVer2 on an NSD backend AFM fileset. |
Workaround |
Do not set the afmRecoveryVer2 tunable on an NSD backend AFM fileset. |
|
5.1.6.1 |
AFM |
IJ44890 |
High Importance
|
Signal 11 occurs when an attempt is made to create a dependent fileset under an AFM HPT independent fileset.
Symptom |
Abend/Crash |
Environment |
ALL |
Trigger |
Creation of dependent fileset under an AFM HPT independent fileset |
Workaround |
None |
|
5.1.6.1 |
AFM (HPT) |
IJ44887 |
High Importance
|
The mmfsd daemon asserts with logAssertFailed: dataBlockNum != lastDataBlock || !newDA.isALLOC() || newDA.getNSubblocks() == inode.getLastBlockSubblocks(isWideDAFS), resulting in an mmfsd daemon process crash.
Symptom |
Abend/Crash |
Environment |
ALL |
Trigger |
Parallel writes to the same file while one write is updating the last data block. |
Workaround |
None |
|
5.1.6.1 |
All Scale Users |
IJ44867 |
High Importance
|
Kernel lockup due to dentries being added to lookup cache before complete initialization
Symptom |
Kernel deadlock |
Environment |
Linux |
Trigger |
For the race condition to occur, multiple threads must look up the same nonexistent file, although this scenario does not guarantee that the problem is reproduced. |
Workaround |
Disable deferred negative dentry invalidation with mmchconfig deferNegativeDcacheInvalidation=0 --force
|
|
5.1.6.1 |
All Scale Users |
IJ44857 |
Critical |
The cp command fails to copy data from an AFM uncached file on RHEL 9.1 because the command tries to locate the data using lseek (SEEK_DATA), which fails on AFM uncached files.
Symptom |
Unexpected results |
Environment |
ALL |
Trigger |
Using the cp command to copy AFM uncached files on RHEL 9.1 |
Workaround |
Use the dd command, or any other command that does not seek to data sections, to copy the data. |
|
5.1.6.1 |
AFM |
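The failure described in IJ44857 hinges on lseek with SEEK_DATA. As a rough illustration only (this is not how cp is implemented), the Python sketch below probes whether a file answers a SEEK_DATA request. The function name `seek_data_works` is hypothetical. The entry does not say which error an AFM uncached file returns, so the sketch assumes that any OSError other than ENXIO means the request failed; on Linux, ENXIO only means that no data exists at or after the offset.

```python
import errno
import os


def seek_data_works(path, offset=0):
    """Return True if lseek(SEEK_DATA) succeeds or reports 'no data' (ENXIO).

    Any other OSError returns False. Treating that as the failure seen on
    AFM uncached files is an assumption, not something the entry states.
    """
    fd = os.open(path, os.O_RDONLY)
    try:
        os.lseek(fd, offset, os.SEEK_DATA)
        return True
    except OSError as exc:
        # ENXIO: the offset is at or past the last data region (e.g. empty file).
        return exc.errno == errno.ENXIO
    finally:
        os.close(fd)
```

A tool that avoids SEEK_DATA and simply reads sequentially, as the dd workaround does, never reaches this code path.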
IJ44856 |
Suggested |
A lookup operation found a dependent create operation and pushed the create to completion. This caused a deadlock because the lookup had already acquired the mutex on the file and the create tried to stat the same file.
Symptom |
Waiters |
Environment |
Linux |
Trigger |
Waiters are seen and the fileset stops making progress |
Workaround |
None |
|
5.1.6.1 |
AFM-COS |
IJ44839 |
Suggested |
A node (kernel) crash can occur when the vinfoLockOnWrite config option is enabled.
Symptom |
Crash |
Environment |
ALL |
Trigger |
Timing hole when enabling the undocumented config option vinfoLockOnWrite, likely triggered by using snapshots |
Workaround |
Do not enable the undocumented vinfoLockOnWrite config option |
|
5.1.6.1 |
Core GPFS |
IJ44838 |
High Importance
|
The special .afmctl file at home/secondary loses its Control attribute and is treated as a normal file. As a result, a buffer of size 2048 is returned, which overflows the buffer of size 1100 that the cache provides because the cache expects CTL file treatment at the home/secondary.
Symptom |
Crash |
Environment |
Linux (AFM Gateway nodes) |
Trigger |
Invalid .afmctl control file at home. |
Workaround |
Manually disable and re-enable mmafmconfig at the home/secondary and then stop/start the cache fileset to pick up the new changes from home. |
|
5.1.6.1 |
AFM |
IJ44837 |
Suggested |
After the mmrestorefs command (or an mmafmctl command that internally calls the file system or fileset restore functionality) successfully restores a file system or an independent fileset, a segmentation fault from the tsapolicy process is sometimes observed if the cipherList configuration variable is set to AUTHONLY or to a real cipher value.
Symptom |
Error output/message
After the mmrestorefs command is run, the system error log (dmesg on Linux or errpt on AIX) may display messages like these:
"[16342239.952383] tsapolicy[1627940]: segfault at a ip 00007f2fda7915f5 sp 00007ffe171250c8 error 4 in libc-2.28.so[7f2fda634000+1bc000]" |
Environment |
AIX, Linux |
Trigger |
Running mmrestorefs with multiple nodes when the cipherList configuration variable is set to AUTHONLY or to a real cipher value |
Workaround |
1) This problem can be ignored because it has no impact on the mmrestorefs functionality
2) Run mmrestorefs with -N |
|
5.1.6.1 |
mmrestorefs |
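To check whether a node logged the tsapolicy message quoted in IJ44837, a log scan along these lines may help. It is a minimal sketch: the pattern is modeled only on the single sample line in the entry, the function name `tsapolicy_segfault_pids` is hypothetical, and other kernels may format segfault messages differently.

```python
import re

# Pattern modeled on the one sample line quoted in the entry (assumption:
# the "name[pid]: segfault at" prefix is stable across affected systems).
SEGFAULT_RE = re.compile(r"tsapolicy\[(\d+)\]: segfault at ")


def tsapolicy_segfault_pids(log_text):
    """Return the PIDs of tsapolicy processes that logged a segfault."""
    return [int(m.group(1)) for m in SEGFAULT_RE.finditer(log_text)]


sample = ("[16342239.952383] tsapolicy[1627940]: segfault at a ip "
          "00007f2fda7915f5 sp 00007ffe171250c8 error 4 in "
          "libc-2.28.so[7f2fda634000+1bc000]")
print(tsapolicy_segfault_pids(sample))
```

The input would typically be captured dmesg text; collecting it is left out of the sketch.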
IJ44836 |
High Importance
|
Starting with the GPFS 5.1.2 release, memory from the token management subpool may leak on some token manager nodes.
This can be observed from output of mmfsadm dump malloc:
Statistics for MemoryPool id 3 ("UNPINNED_TM") at 0xF1000012C00246C8:
...
Memory subpool 'HolderList' at 0xF1000012C00258B0
objSize 16 spObjectsPerChunk 65536 expandInProgress 0
inUse 140052583 free 63385 total 140115968 limit 2147483647
The "inUse" field increases gradually.
Symptom |
Out-of-memory, Unexpected Results/Behavior |
Environment |
ALL |
Trigger |
During token management, one type of object is not freed when the token is destroyed. |
Workaround |
None |
|
5.1.6.1 |
All Scale Users |
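The gradual growth of "inUse" described in IJ44836 can be tracked by extracting the counter from saved mmfsadm dump malloc output. The sketch below assumes the text layout matches the excerpt in the entry; the function names `subpool_in_use` and `grows_steadily` are hypothetical, and capturing the dump at intervals is left to the reader.

```python
import re

EXCERPT = """Memory subpool 'HolderList' at 0xF1000012C00258B0
objSize 16 spObjectsPerChunk 65536 expandInProgress 0
inUse 140052583 free 63385 total 140115968 limit 2147483647
"""


def subpool_in_use(dump_text, subpool):
    """Return the inUse count for the named memory subpool, or None."""
    in_subpool = False
    for line in dump_text.splitlines():
        header = re.match(r"\s*Memory subpool '([^']+)'", line)
        if header:
            # Track whether the following lines belong to the wanted subpool.
            in_subpool = header.group(1) == subpool
            continue
        if in_subpool:
            m = re.match(r"\s*inUse (\d+)", line)
            if m:
                return int(m.group(1))
    return None


def grows_steadily(samples):
    """True if every successive inUse sample is larger than the previous one."""
    return len(samples) > 1 and all(b > a for a, b in zip(samples, samples[1:]))


print(subpool_in_use(EXCERPT, "HolderList"))
```

Feeding `grows_steadily` a list of values collected over time gives a simple indication of whether the subpool matches the leak pattern.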
IJ44832 |
Suggested |
The mmwatch plugin to mmhealth can print or log excess error messages if there is a filesystem that is offline for some reason.
Symptom |
Error output/message |
Environment |
Linux |
Trigger |
Running mmhealth when there is an unmountable filesystem defined. |
Workaround |
The mmwatch plugin to mmhealth can be disabled. |
|
5.1.6.1 |
Admin Commands |
IJ44678 |
High Importance
|
Remote error 2 occurs while replicating a link operation if the parent directory is deleted before the create/link operation is replicated.
Symptom |
The AFM queue is dropped and the fileset goes into the resync state. |
Environment |
Linux |
Trigger |
Create, link, and parent directory remove operations in the queue with the Fast Create config option enabled. |
Workaround |
None |
|
5.1.6.1 |
AFM |
IJ44629 |
Critical |
Due to a race condition between the RDMA software layer and IBM Spectrum Scale, it is possible that an application running on an IBM Spectrum Scale client may read incorrect data from files stored on GPFS under certain conditions.
Symptom |
Unexpected Results/Behavior |
Environment |
Linux |
Trigger |
Race condition between the RDMA software layer and IBM Spectrum Scale. |
Workaround |
Disable RDMA. |
|
5.1.6.1 |
RDMA |
IJ44611 |
Suggested |
With a GPFS backend, cleanup took handlerListLockExclusive on SGPanic while, at the same time, a handler was trying to set up (setupctl) the fileset mount path using HandlerMutex and was waiting for SG cleanup.
Symptom |
Long waiters |
Environment |
Linux |
Trigger |
Waiters are seen and the fileset stops making progress. |
Workaround |
None |
|
5.1.6.1 |
AFM with GPFS backend |
IJ44607 |
High Importance
|
GNR RPCs fail when a GPFS daemon at version 5.1.3 or later receives them from a GPFS daemon older than version 5.1.3. A kernel assert goes off: privVfsP != NULL
Symptom |
Hang in the command |
Environment |
ALL |
Trigger |
Any GNR-related command. |
Workaround |
None |
|
5.1.6.1 |
GNR |
IJ44574 |
Suggested |
The afmRecoveryVer2 code needs the latest 5.1.6 release to be present at both the cache/primary and the home/secondary. The code that checks whether the home/secondary supports afmRecoveryVer2 has no effect, which results in error 121 when recovery is run against that home/secondary.
Symptom |
Unexpected Behavior |
Environment |
Linux Only (acting as AFM Gateway nodes) |
Trigger |
Running afmRecoveryVer2 from the cache against a home/secondary site that does not support afmRecoveryVer2 yet. |
Workaround |
Make sure the home/secondary is also running an afmRecoveryVer2-compatible version when afmRecoveryVer2 is enabled at the cache/primary. |
|
5.1.6.1 |
AFM |
IJ44492 |
High Importance
|
The GPFS daemon could fail unexpectedly with the assert regP->owner!=fromNode, in allocM.C. This could happen when a file system is unmounted on a node due to an error.
Symptom |
Abend/Crash |
Environment |
ALL |
Trigger |
File system unmounted due to error |
Workaround |
Disable the assert via the disableAssert configuration
|
5.1.6.1 |
All Scale Users |
IJ44491 |
High Importance
|
Enable or disable the ptrash local bit setting code through the afmRevalOpWaitTimeout configuration option.
Symptom |
Unexpected Behavior |
Environment |
Linux (serving as AFM Gateway nodes) |
Trigger |
Setting afmRevalOpWaitTimeout to a non-default value, which causes the ptrash local bit setting code not to take effect. |
Workaround |
Setting afmRevalOpWaitTimeout to its default value of 180 ensures that ptrash is set to local
|
5.1.6.1 |
AFM |
IJ44489 |
Suggested |
A 4U102 IOM failure currently does not call home (MAPS), although it should.
Symptom |
Unexpected Results/Behavior |
Environment |
Linux |
Trigger |
N/A |
Workaround |
Customer must create a ticket manually. |
|
5.1.6.1 |
ESS/GNR |
IJ44459 |
Suggested |
Updates to CLUSTER_PERF_SENSOR_CANDIDATES were not taken into account and certain failure scenarios did not trigger a CLUSTER_PERF_SENSOR failover.
Symptom |
Component Level Outage |
Environment |
ALL |
Trigger |
Using the CLUSTER_PERF_SENSOR_CANDIDATES nodeclass to control which nodes are considered for that role and then updating that nodeclass. A failover scenario such as mmshutdown on the current CLUSTER_PERF_SENSOR node. |
Workaround |
Running "mmsysmoncontrol restart" also forces updates and pending failover actions
|
5.1.6.1 |
perfmon (Zimon) |
IJ44441 |
High Importance
|
Signal 11 occurs in the mmfsd process if a file system whose original version is <= 11.01 is upgraded to the latest version, resulting in an mmfsd daemon crash and the file system becoming inaccessible.
Symptom |
Abend/Crash |
Environment |
ALL |
Trigger |
Upgrading a file system with original version <= 11.01 to the latest version. |
Workaround |
None |
|
5.1.6.1 |
All Scale Users |
IJ44440 |
High Importance
|
NVMe and SSD disks were put into one DA during RG creation. Because the disk sizes are almost the same and both have spin = 0, GNR treats them as the same type of disk.
Symptom |
Unexpected Results/Behavior |
Environment |
Linux |
Trigger |
NVMe and SSD disks have same size. |
Workaround |
None |
|
5.1.6.1 |
ESS/GNR |
IJ44322 |
High Importance
|
When there are more than 64 IP addresses on the node, an assert goes off when the daemon starts up.
Symptom |
Abend/Crash |
Environment |
ALL |
Trigger |
Having more than 64 IP addresses on the node |
Workaround |
Remove some IP addresses from the node |
|
5.1.6.1 |
All Scale Users |
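For IJ44322, a node can be checked against the 64-address threshold by counting the addresses in saved `ip -o addr show` output. This is a sketch under stated assumptions: the entry does not say whether IPv6 or loopback addresses count toward the limit, so every inet and inet6 line is counted here, and the function names are hypothetical.

```python
LIMIT = 64  # threshold quoted in the entry


def count_ip_addresses(ip_o_addr_output):
    """Count inet/inet6 address lines in `ip -o addr show` output.

    Assumption: all address families and scopes count toward the limit.
    """
    count = 0
    for line in ip_o_addr_output.splitlines():
        fields = line.split()
        # One-line format: "<index>: <ifname> <family> <address>/<prefix> ..."
        if len(fields) >= 4 and fields[2] in ("inet", "inet6"):
            count += 1
    return count


def exceeds_limit(ip_o_addr_output, limit=LIMIT):
    """True if the node has more addresses than the limit."""
    return count_ip_addresses(ip_o_addr_output) > limit
```

If `exceeds_limit` returns True on an affected level, the workaround in the entry applies: remove some IP addresses from the node.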
IJ44067 |
Suggested |
System monitoring collects all information about a cluster by sending it to relevant nodes. It ignores cluster boundaries while doing so, which does not work and creates spurious error messages in the logs.
Symptom |
Unexpected Results/Behavior, Erroneous Log entries |
Environment |
Linux |
Trigger |
Setup with remote cluster integration |
Workaround |
Ensure the nodes in the home cluster are at a level >= code level of the remote clusters |
|
5.1.6.1 |
System Health (mmfs.log.latest) |
IJ43790 |
High Importance |
Commands like mmcrcluster or mmaddnode may hang in the GSKIT layer on AMD EPYC family 25 processors. One family 25 model that is known to hang in the GSKIT layer is the AMD EPYC 7343.
Symptom |
Admin commands hang |
Environment |
Linux |
Trigger |
This problem affects AMD EPYC family 25 processors. |
Workaround |
Add the line "ICC_SHIFT=3" to the /usr/lpp/mmfs/lib/gsk8/Cicc/icclib/ICCSIG.txt file on problem nodes. |
|
5.1.6.1 |
Admin Commands, gskit |
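The IJ43790 workaround can be applied idempotently with a small helper such as the one below. It is a sketch only: the entry does not say where in ICCSIG.txt the line belongs, so appending it at the end is an assumption, and both function names are hypothetical. The CPU check relies on the "vendor_id" and "cpu family" fields of /proc/cpuinfo text.

```python
def is_amd_family_25(cpuinfo_text):
    """True if /proc/cpuinfo text reports an AMD processor of family 25."""
    vendor = family = None
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if not sep:
            continue
        key, value = key.strip(), value.strip()
        if key == "vendor_id":
            vendor = value
        elif key == "cpu family":
            family = value
    return vendor == "AuthenticAMD" and family == "25"


def ensure_line(path, line):
    """Append `line` to the file unless an identical line already exists.

    Returns True if the file was modified. Appending at the end of the
    file is an assumption; the entry only says to add the line.
    """
    with open(path, "r", encoding="utf-8", errors="replace") as f:
        content = f.read()
    if line in (existing.strip() for existing in content.splitlines()):
        return False
    with open(path, "a", encoding="utf-8") as f:
        if content and not content.endswith("\n"):
            f.write("\n")
        f.write(line + "\n")
    return True
```

On a problem node this would be called as `ensure_line("/usr/lpp/mmfs/lib/gsk8/Cicc/icclib/ICCSIG.txt", "ICC_SHIFT=3")`, using the path exactly as given in the entry; running it a second time leaves the file unchanged.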
IJ44219 |
Suggested |
Files are not replicated on create after failoverToSecondary.
Symptom |
Unexpected Results/Behavior |
Environment |
Linux |
Trigger |
After failoverToSecondary, creating and writing files and then running changeSecondary to sync with the old primary. |
Workaround |
None |
|
5.1.6.0 |
AFM-DRPFS |