IJ49371 |
High Importance
|
When the network is poor, an assertion may be hit when a TCP connection is established or re-established.
Symptom |
Abend/Crash |
Environment |
ALL Operating System environments |
Trigger |
Poor network conditions that cause TCP connections to reconnect |
Workaround |
No |
|
5.1.2.15 |
All Scale Users |
IJ49372 |
Suggested |
When running a workload on Windows that creates and deletes many files and directories in a short span, the inode numbers assigned to GPFS objects may be reused. If a stale inode entry persists in the GPFS cache because of in-flight hold counts, a conflict between the old and new object types can cause the stale entry to produce a file or directory not found error.
Symptom |
Unexpected Results/Behavior. |
Environment |
Windows/x86_64 only. |
Trigger |
Running a workload on Windows which continuously creates and deletes lots of files and directories quickly. |
Workaround |
None |
|
5.1.2.15 |
All Scale Users. |
IJ49543 |
High Importance
|
Spectrum Scale Erasure code edition interacts with third-party software/hardware APIs for internal disk enclosure management. If the management interface becomes degraded and starts to hang commands in the kernel, the hang may also block communication handling threads. This causes a node to fail to renew its lease, causing it to be fenced off from the rest of the cluster. This may lead to additional outages. A previous APAR was issued for this in 5.1.4, but that fix was incomplete.
Symptom |
Hang/Deadlock/Unresponsiveness/Long Waiters |
Environment |
Linux Only |
Trigger |
Degradation in back-end storage management that causes commands to hang in the kernel. |
Workaround |
The node with hardware problems will show waiters 'Until NSPDServer discovery completes.' It is recommended to reboot nodes whose waiters of this kind exceed 2 minutes if the node is also being expelled. |
|
5.1.2.15 |
ESS/GNR |
IJ49373 |
High Importance
|
The daemon assert fromNode != regP->owner in file allocM.C goes off, which crashes the daemon.
Symptom |
Daemon crash |
Environment |
All Operating Systems |
Trigger |
The mmdefragfs or mmdf command is running while there are node failures or little free space in the file system. |
Workaround |
No |
|
5.1.2.15 |
All Scale Users |
IJ49542 |
High Importance
|
The pmsensor GPFSVFSX output shows 0 for read and write stats even though there are read/write operations. The format of the data provided by mmpmon is not what Zimon expects, which causes the incorrect output.
Symptom |
Error output/message |
Environment |
ALL Operating System environments |
Trigger |
Reading GPFSVFSX stats |
Workaround |
None |
|
5.1.2.15 |
perfmon (Zimon) |
IJ49650 |
High Importance
|
Currently, fallocate is prevented on AFM caching modes because there is no guarantee that the afmctl file is present in these modes, so it cannot safely be supported.
Symptom |
Error output |
Environment |
ALL Operating System environments |
Trigger |
Perform fallocate on AFM caching mode filesets (SW/IW) |
Workaround |
None |
|
5.1.2.15 |
AFM |
IJ49472 |
Critical |
Snapshot creation cannot complete because a background file deletion runs into an infinite loop on a corrupted compressed file.
Symptom |
deadlock |
Environment |
All Operating Systems |
Trigger |
Corrupted compression file. |
Workaround |
None |
|
5.1.2.15 |
Compression |
IJ49473 |
Suggested |
mmbackup invokes the tslssnapshot command multiple times during a snapshot backup. This has a small performance impact if the file system has a large number of snapshots.
Symptom |
Performance Impact/Degradation |
Environment |
all platforms that support mmbackup. |
Trigger |
This problem could occur if a snapshot backup is executed for a file system that has many snapshots. |
Workaround |
none |
|
5.1.2.15 |
mmbackup |
IJ49825 |
Suggested |
Sometimes, the system monitor may report a warning message: 'statd_multiple WARNING The rpc.statd process is running multiple times.' This is due to a short-lived process forked from the 'statd' process.
Symptom |
sysmon may report the following warning message: 'statd_multiple WARNING The rpc.statd process is running multiple times.' |
Environment |
All Linux |
Trigger |
This might happen if 'statd' forks a process at the same time that sysmon checks for the 'statd' process. The forked 'statd' process is short-lived; hence, it should not be counted. |
Workaround |
NA |
|
5.1.2.15 |
System Health |
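The rule stated in the trigger above, that a short-lived child forked by 'statd' should not be counted, can be sketched as follows. This is a minimal illustration in Python, not the system monitor's actual implementation; the function name `count_statd_daemons` and the `(pid, ppid, name)` tuple layout are assumptions made for the example.

```python
def count_statd_daemons(processes, name="rpc.statd"):
    """Count rpc.statd daemons, ignoring children forked by rpc.statd itself.

    processes: iterable of (pid, ppid, command_name) tuples (hypothetical layout).
    """
    processes = list(processes)
    # PIDs of every process that carries the statd name.
    statd_pids = {pid for pid, _ppid, comm in processes if comm == name}
    # A statd process whose parent is also statd is a forked child, so skip it.
    return sum(
        1
        for _pid, ppid, comm in processes
        if comm == name and ppid not in statd_pids
    )
```

With this rule, one daemon plus its forked child counts as a single instance, while two independently started daemons still count as two.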
IJ49540 |
Suggested |
When an RDMA connection to a remote node has to be shut down due to network errors (e.g. the network link goes down), it can sometimes happen that the affected RDMA connection is not closed and the resources assigned to it (memory, VERBS Queue Pair, ...) are not freed.
Symptom |
Unexpected Results/Behavior |
Environment |
ALL Linux OS environments |
Trigger |
verbsRdmaSend must be enabled. Loss of a RDMA connection to a node because of network errors in the RDMA fabric. |
Workaround |
No workaround available |
|
5.1.2.15 |
RDMA |
IJ49541 |
High Importance
|
Ganesha crashes will cause "health" alerts. The "/var/log/ganesha.log" file will contain a crash backtrace that looks like: free_client_id :RW LOCK :CRIT :Error 16, Destroy mutex 0x3fff6c2fedd0 (&clientid->cid_mutex) at nfs-ganesha-3.5-ibm071.22/SAL/nfs4_clientid.c:348. It contains "Error 16" and the source code reference "nfs-ganesha-3.5-ibm071.22/SAL/nfs4_clientid.c:348".
Symptom |
Ganesha Crash |
Environment |
Linux Only |
Trigger |
The problem mostly occurs if there is a delay in processing NFSv4 client's renew request due to a resource crunch. |
Workaround |
None |
|
5.1.2.15 |
NFS |
IJ49648 |
High Importance
|
Files are not re-validated in AFM cascading relationship because of readdir optimization. This happens if the home fileset is AFM enabled with COS backend.
Symptom |
Unexpected Results |
Environment |
All Linux OS environments |
Trigger |
AFM cascading relationship with AFM+COS fileset. |
Workaround |
None |
|
5.1.2.15 |
AFM |
IJ49649 |
High Importance
|
GPFS daemon could fail unexpectedly with assert: Assert exp (nPrefetchedBuffers > 0). This could happen when DIO is used to append to a file.
Symptom |
Abend/Crash |
Environment |
ALL Operating System environments |
Trigger |
Append to a file using DIO. |
Workaround |
Set dioReentryThreshold configuration variable to 2 |
|
5.1.2.15 |
All Scale Users |
IJ49826 |
High Importance
|
A kernel crash occurs when executing programs that call the gpfs_ireadx() interface on DMAPI-disabled file systems (e.g., mmrestorefs, or a "cp" or "tar" command that detects sparse holes in source files with the lseek(2) interface).
Symptom |
Abend/Crash |
Environment |
Linux Only |
Trigger |
Executing programs that call the gpfs_ireadx() interface on DMAPI-disabled file systems (e.g., mmrestorefs, or a "cp" or "tar" command that detects sparse holes in source files with the lseek(2) interface). |
Workaround |
None |
|
5.1.2.15 |
All Scale Users |
IJ48661 |
High Importance
|
When mmbackup generates the policy rule to select backup candidates, it composes the pathname list using the snapshot name in the snapshot backup case. In a fileset backup, mmbackup unconditionally treats the snapshot as a fileset snapshot even when it is a global snapshot. Hence, the generated pathname is incorrect.
Symptom |
Component Level Outage |
Environment |
All platforms that support mmbackup |
Trigger |
Run fileset backup using global snapshot |
Workaround |
Use fileset snapshot for fileset backup |
|
5.1.2.14 |
mmbackup |
IJ46155 |
Suggested |
File creation or close is pending on the thread CloseHandlerThread with "waiting for dealloc queue flush" long waiter.
Symptom |
Small file creations wait on pending closes, which slows down file creation performance. |
Environment |
All Operating Systems |
Trigger |
Many file creations and closes while many other processes are doing space deallocations. |
Workaround |
None |
|
5.1.2.14 |
All Scale Users |
IJ48826 |
High Importance
|
Prefetch and recovery using list files call ftell on the open FILE pointer to get the size of the file. Because the value returned here is 32-bit, it can be a junk value for large files, and the split of the file across processing threads is based on that value.
Symptom |
Prefetch fails to process the list file properly and is seen looping around with a smaller subset. |
Environment |
All Linux OS environments (AFM Gateway nodes) |
Trigger |
Running prefetch with a single large list file which is > 2GB in size. |
Workaround |
Split the single large list file of > 2GB into smaller lists of < 2GB each and use those for prefetch. |
|
5.1.2.14 |
AFM |
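The workaround for IJ48826 above (splitting one large list file into lists smaller than 2GB) can be sketched as follows. This is a minimal Python illustration that assumes one path per line in the list file; the function name `split_list_file` and the `list.partNNNN` output naming are hypothetical and not part of any Scale tooling.

```python
import os


def split_list_file(path, max_bytes, out_dir):
    """Split a list file into parts of at most max_bytes each, on line boundaries.

    A single line longer than max_bytes still gets a part of its own,
    so that no entry is ever cut in half.
    """
    parts = []
    out = None
    written = 0
    with open(path, "rb") as src:
        for line in src:
            # Start a new part when this line would push the current one past the limit.
            if out is None or (written > 0 and written + len(line) > max_bytes):
                if out is not None:
                    out.close()
                part_path = os.path.join(out_dir, "list.part%04d" % len(parts))
                out = open(part_path, "wb")
                parts.append(part_path)
                written = 0
            out.write(line)
            written += len(line)
    if out is not None:
        out.close()
    return parts
```

For the 2GB case, a value such as `max_bytes=2**31 - 1` keeps every part below the limit; each resulting part would then be passed to prefetch separately.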
IJ48827 |
High Importance
|
AFM replication fails with error 22 if the remote file mode is symlink during the write or create operation.
Symptom |
Unexpected results |
Environment |
All Linux OS environments |
Trigger |
AFM cache conflict |
Workaround |
None |
|
5.1.2.14 |
AFM |
IJ49085 |
Critical |
AFM resync fails with error 9 and the queue gets stuck.
Symptom |
Unexpected results |
Environment |
All Linux OS environments |
Trigger |
AFM resync |
Workaround |
None |
|
5.1.2.14 |
AFM and AFM DR |
IJ49086 |
High Importance
|
File create performance could degrade when many small files are created concurrently in many directories, due to mutex contention. This would also lead to higher CPU usage by the GPFS daemon.
Symptom |
Performance Impact/Degradation |
Environment |
ALL Operating System environments |
Trigger |
Concurrent create of many small files in many directories |
Workaround |
Set maxInodeDeallocHistory configuration variable to 0 |
|
5.1.2.14 |
All Scale Users |
IJ49087 |
Critical |
Deleting snapshots or accessing snapshot files may fail with error code 214, and an FSSTRUCT errNo=1116 (FSErrSnapInodeModified) is logged in the system log file.
Symptom |
Operation fails with 214 error code and FSSTRUCT errNo=1116 logged in system log file |
Environment |
All Operating Systems |
Trigger |
The file system manager node fails while files are being updated and snapshots exist |
Workaround |
None |
|
5.1.2.14 |
Snapshot |
IJ49088 |
Suggested |
Sometimes a snapshot deletion could take longer than earlier snapshot deletions.
Symptom |
Slow snapshot deletion |
Environment |
All Operating Systems |
Trigger |
Snapshot deletion when LROC device is configured on a client node. |
Workaround |
None |
|
5.1.2.14 |
snapshot and LROC |
IJ48869 |
Critical |
File data loss when copying or archiving data from snapshot and clone files (e.g., using a "cp" or "tar" command that detects sparse holes in source files with the lseek(2) interface).
Symptom |
Data Loss |
Environment |
Linux Only |
Trigger |
Using copy or archive tools that detect sparse holes in the source file with the lseek(2) interface. |
Workaround |
Switch to other copy or archive tools to copy or archive the data from snapshot and clone files. |
|
5.1.2.14 |
Snapshot and clone files |
IJ42454 |
Critical |
File data loss when copying or archiving data from migrated files (e.g., using a "cp" or "tar" command that detects sparse holes in source files with the lseek(2) interface).
Symptom |
Data Loss |
Environment |
Linux Only |
Trigger |
Using copy or archive tools that detect sparse holes in the source file with the lseek(2) interface. |
Workaround |
Switch to other copy or archive tools to copy or archive the data from migrated files, or recall the file before using the copy or archive applications. |
|
5.1.2.14 |
DMAPI |
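As an illustration of the workaround in IJ48869 and IJ42454 above, the following Python sketch copies a file by reading every byte in order instead of probing for holes with lseek(2). It assumes that a plain sequential read is not affected by the problem, which the entries imply but do not state outright; the function name `copy_without_hole_detection` is hypothetical.

```python
import shutil


def copy_without_hole_detection(src_path, dst_path, chunk_size=1024 * 1024):
    """Copy src to dst with plain read/write calls.

    No SEEK_HOLE/SEEK_DATA probing is done: holes in the source are read
    back as zeros and written out, so the destination is not sparse.
    """
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        shutil.copyfileobj(src, dst, chunk_size)
```

The trade-off is that the copy occupies the full logical size of the source on disk, because sparse regions are materialized as zeros.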
IJ48629 |
High Importance
|
A race between stat()/gpfs_statlite() and an inode token revoke causes a log assert.
Symptom |
Abend/Crash |
Environment |
ALL Operating Systems |
Trigger |
A file is actively written on one node and stat() or gpfs_statlite() is called repeatedly on another node |
Workaround |
Set config parameters statliteMaxAttrAge and statMaxAttrAge to 0 to disable stat lite. |
|
5.1.2.14 |
All Scale Users |
IJ48911 |
Critical |
The assert "logAssertFailed: oldDA1Found[i].compAddr(synched1[I])" goes off, which crashes the mmfsd daemon and can ultimately leave the file system unmountable on any node.
Symptom |
Abend/Crash |
Environment |
All Operating Systems |
Trigger |
Run fsck to fix the duplicated disk address on compressed files. |
Workaround |
None |
|
5.1.2.14 |
Compression |
IJ47843 |
High Importance
|
Kernel crash with assert: nPrefetchedBuffers > 0.
This could happen when an application uses multiple threads to sequentially
read or write more than 65535 blocks on the same open file.
The problem occurs only when the starting offset of the read/write is not on a GPFS block boundary.
Symptom |
Abend/Crash |
Environment |
ALL Operating System environments |
Trigger |
Performing sequential reads/writes on the same file using
multiple threads where the starting offset of each read/write is not on
a GPFS block boundary. |
Workaround |
Close and reopen the file before performing more than
65535 sequential reads/writes on the same file using multiple threads. |
|
5.1.2.13 |
All Scale Users |
IJ48032 |
High Importance
|
GNR daemon assert dpP->dpGetBlockDevice() == pdBlockDeviceRP
goes off in response to certain pdisk device state changes, which will bring down mmfsd.
This problem was introduced in GPFS 5.1.5.1, and impacts GNR systems running the following code levels:
- 5.1.2.4+
- 5.1.5.1+
- 5.1.6+
- 5.1.7.0 but not 5.1.7.1+ (5.1.7.1+ gets a workaround patch)
Symptom |
Abend/Crash |
Environment |
Linux Only |
Trigger |
A condition occurs in which the pdisk device paths remain visible to
the operating system, but something happens such that the pdisk no longer believes it
should be associated with the block device that those paths represent.
The most common cause of this dissociation is that the pdisk descriptor labels at the
earlier LBA become overwritten or corrupted. This kind of corruption is often the result
of hardware errors, but it can occur if some external process interferes and corrupts
the disk areas that are managed by GPFS and GNR. The dissociation step had a regression
from another fix, which causes the assert. Other conditions for the dissociation are
possible, but have not been properly identified as of the time of this fix. |
Workaround |
None |
|
5.1.2.13 |
ESS/GNR |
IJ48302 |
High Importance
|
In certain cases a reference on a block device is not released,
so the reference counter grows to a large value and the block device module cannot be unloaded.
Symptom |
Increase in reference count for block device |
Environment |
ALL Operating System environments |
Trigger |
When a block device is accessed, e.g. while getting disk info,
the reference counter leak is triggered |
Workaround |
None |
|
5.1.2.13 |
File system |
IJ48287 |
HIPER |
AFM fails to replicate files with the afmFastCreate option
if the newly created file is renamed to a different directory and its original parent is deleted.
Symptom |
Unexpected results, file tree mismatch |
Environment |
All Linux OS environments |
Trigger |
Using afmFastCreate option to replicate data |
Workaround |
Disable afmFastCreate |
|
5.1.2.13 |
AFM |
IJ48288 |
HIPER |
An assert goes off when a temporary file (created with O_TMPFILE) is linked with linkat
and the inode data has to be evicted to accommodate AFM xattrs.
Symptom |
Crash |
Environment |
All Linux OS environments |
Trigger |
A temporary file (created with O_TMPFILE) with data in the inode
is linked with linkat on the AFM fileset |
Workaround |
None |
|
5.1.2.13 |
AFM |
IJ47005 |
High Importance
|
When mmbackup calculates the number of backup candidates, it counts migrated files as backup candidates. This is incorrect unless --backup-migrated is used, because migrated files will not be backed up. The incorrect calculation causes mmbackup to complete with an error due to "some skipped files".
Symptom |
Component Level Outage |
Environment |
All platforms that support mmbackup |
Trigger |
This problem can occur when mmbackup is run without the --backup-migrated option and some files are migrated. |
Workaround |
None |
|
5.1.2.12 |
mmbackup |
IJ46806 |
Suggested |
mmchfileset -t (used to set the fileset comment) cannot handle a null string.
Symptom |
- Error output/message
- Unexpected Results/Behavior |
Environment |
ALL Operating System environments |
Trigger |
mmchfileset fails to set null comment. |
Workaround |
Instead of a null string, you may want to use a single space as the input. |
|
5.1.2.12 |
Admin Commands |
IJ46804 |
Critical |
IBM Storage Scale will crash at startup if RDMA is enabled and the number of RDMA devices on a node exceeds 128.
Symptom |
Abend/Crash |
Environment |
ALL Linux OS environments |
Trigger |
RDMA is enabled and the number of RDMA devices on a node exceeds 128. |
Workaround |
Disable RDMA or reduce the number of RDMA devices to 128 or less. |
|
5.1.2.12 |
RDMA |
IJ47006 |
High Importance
|
When a reconnect happens, an error with errno 76 may be encountered, which indicates that the connection is not connected, and results in LOGSHUTDOWN.
Symptom |
Abend/Crash |
Environment |
ALL Operating System environments |
Trigger |
Poor network conditions that cause TCP connections to reconnect |
Workaround |
None |
|
5.1.2.12 |
All Scale Users |
IJ46805 |
High Importance
|
On an SW mode AFM fileset, if a new directory is created at home and a directory prefetch of this new directory is run with the --force option, the SW cache should be able to cache all data in the new directory. However, because mmafmctl validates the new directory locally (and SW cannot fetch the new directory from home), the prefetch with --force fails with a no such directory error.
Symptom |
Unexpected Behavior |
Environment |
All Linux and AIX OS environments. |
Trigger |
Performing prefetch on a new directory created at home on an SW mode AFM fileset at the cache. |
Workaround |
Run mmafmlocal rstat ${dir} before performing the --force directory prefetch on the SW fileset. |
|
5.1.2.12 |
AFM |
IJ47007 |
Suggested |
If the name of a File System contains one or more underscore characters '_' and Clustered Watch Folder is enabled on said File System then events that are supposed to be delivered to the sink are never sent.
Symptom |
Unexpected Results/Behavior |
Environment |
ALL Operating System environments |
Trigger |
A File System Name that contains one or more underscore characters. |
Workaround |
Remove underscore characters from the names of file systems where Clustered System Watch is to be used. |
|
5.1.2.12 |
Watch Folder |
IJ47001 |
High Importance
|
When any disk is added to a file system, the ill_unbalanced flag is set to indicate that the file system can be further rebalanced. mmhealth sees this flag and downgrades the file system until the mmrestripefs command is run with the -b option.
Symptom |
mmhealth reports the ill_unbalanced_fs state. |
Environment |
All Operating Systems |
Trigger |
Adding descOnly disk to a Scale file system. |
Workaround |
None |
|
5.1.2.12 |
All Scale Users |
IJ47004 |
High Importance
|
Two client nodes work on the same two regions for block deallocations. Each client node owns one of the two regions and is flushing the region it owns. Meanwhile, the DeallocHelperThread on each client node requests ownership of the region owned by the other node. The revoke ownership requests then block each other, because both regions are in the flushing state while pending an ownership request from the other node, which forms a deadlock.
Symptom |
Deadlock |
Environment |
All Operating Systems |
Trigger |
Deallocation of user file data blocks from at least two different client nodes. |
Workaround |
Restart GPFS on the client node showing long waiter on allocMsgTypeRequestOwnership RPC message from DeallocHelperThread.
|
|
5.1.2.12 |
All Scale Users |
IJ47003 |
HIPER |
An AFM gateway running on RHEL 8.8 or RHEL 9.2 fails to perform a full readdir operation at the cache, which results in only partially fetching the entries from the home.
Symptom |
Unexpected results, data mismatch. |
Environment |
RHEL 8.8 and RHEL 9.2 Linux OS environments |
Trigger |
Upgrading to newer RHEL releases 8.8 and 9.2 |
Workaround |
Downgrade RHEL to earlier versions. |
|
5.1.2.12 |
AFM and AFM DR |
IJ46764 |
High Importance
|
The Scale daemon assert Assert exp(regP->isOwnerLocal() == 0) in file allocR.C goes off, which brings down the Scale mmfsd daemon process.
Symptom |
Abend/Crash |
Environment |
All Operating Systems |
Trigger |
Heavy block space allocation and deallocation in the cluster. |
Workaround |
None |
|
5.1.2.12 |
All Scale Users |
IJ45040 |
High Importance
|
GPFS daemon assert: exp(getDeEntType() == detUnlucky) in Direct.h. This could occur when there is concurrent access to the same directory, with one node deleting a file while another node tries to create the same file.
Symptom |
Abend/Crash |
Environment |
ALL Operating System environments |
Trigger |
Concurrent delete/create of same file in a directory from multiple nodes. |
Workaround |
Avoid deleting and creating the same file in a directory from multiple nodes at the same time. |
|
5.1.2.12 |
All Scale Users |
IJ47220 |
High Importance
|
A race condition in the distributed GNR disk hospital can cause a state update from the master node to a worker node to be rejected.
When the master node wishes to release a disk from the "diagnosing" to the "ok" state, it sends a state broadcast to all worker nodes to instruct them to reflect the pdisk's new master state locally.
However, this broadcast can race with additional disk problem reports that are transmitted from the worker to the master.
The result is that the worker node can reject the master's claim that the disk is healthy, and continue holding the disk in diagnosing.
This can lead to blocked file system I/O unless another state change notification is broadcast from the master, in which case the worker gets another chance to resume I/O to the disk.
Symptom |
Stuck IO |
Environment |
Linux Only |
Trigger |
This problem can potentially occur when any local I/O error is encountered on a pdisk, but in general the race condition in that path is rare. It is more likely to occur on Spectrum Storage Scale Erasure Code edition during periods of network instability when pdisks are likely to encounter many timeout errors. |
Workaround |
Restarting the daemon on the nodes with the waiter "Until disk availability stabilizes" can clear out the waiters. |
|
5.1.2.12 |
ESS/GNR |
IJ46382 |
Suggested |
The online replica compare function could incorrectly flag a replica mismatch on certain metadata files, such as symbolic links, in an AFM-enabled file system.
Symptom |
Error output/message |
Environment |
ALL Operating System environments |
Trigger |
Run online replica compare function. |
Workaround |
Ignore replica mismatches on special metadata files such as links. |
|
5.1.2.12 |
AFM |
IJ47406 |
High Importance
|
A change to nsdRAIDDefaultIoTimeout is reset to the default after GPFS restarts.
Symptom |
Unexpected Results/Behavior |
Environment |
ALL Linux OS environments |
Trigger |
Restart gpfs daemon |
Workaround |
Use mmchconfig nsdRAIDDefaultIoTimeout=xxx -i after gpfs is restarted. |
|
5.1.2.12 |
ESS/GNR |
IJ47407 |
High Importance
|
The newer lscpu command lists the CPU family after the Model name. As a result, the code that detects and automatically applies a workaround for the GSKIT hang issue does not work as expected. Commands like mmcrcluster or mmaddnode may hang in the GSKIT layer on AMD EPYC family 23 and 25 processors.
Symptom |
Installation and admin commands hang. |
Environment |
Linux OS environments |
Trigger |
This problem affects AMD EPYC family 23 and 25 processors running a newer version of the lscpu command. |
Workaround |
Add "ICC_SHIFT=3" line in /usr/lpp/mmfs/lib/gsk8/C/icc/icclib/ICCSIG.txt file on problem nodes. |
|
5.1.2.12 |
Admin Commands, gskit |
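IJ47407 above comes down to detection code that depended on the order of lines in lscpu output. A minimal Python sketch of order-independent parsing follows. It is not the product's detection code: the function names `parse_lscpu` and `needs_gskit_workaround` are hypothetical, and the check on "Model name" and "CPU family" is an assumption based only on the processors named in the entry.

```python
def parse_lscpu(text):
    """Parse 'Key: value' lines from lscpu output into a dict, ignoring line order."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:  # skip lines without a colon
            fields[key.strip()] = value.strip()
    return fields


def needs_gskit_workaround(lscpu_text):
    """Return True for AMD EPYC family 23 or 25, wherever 'CPU family' appears."""
    fields = parse_lscpu(lscpu_text)
    model = fields.get("Model name", "")
    family = fields.get("CPU family", "")
    return "AMD EPYC" in model and family in ("23", "25")
```

Looking fields up by key rather than by position gives the same answer whether "CPU family" is printed before or after "Model name".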
IJ47408 |
High Importance
|
AFM doesn't check the state of a message when dropping it using the "mmfsadm afm msgdrop" option. It is better to leave in-flight messages alone and drop only messages in any other state. Dropping in-flight messages has long-term implications for the queue: it eventually hits either a safety assertion or a Signal 11/6, and the queue is lost.
Symptom |
Crash |
Environment |
All Linux OS Environments (Acting as AFM Gateway nodes) |
Trigger |
Dropping a message in the AFM queue that is inflight. |
Workaround |
User has to carefully put queue into suspended state and then drop messages. |
|
5.1.2.12 |
AFM |
IJ47409 |
Medium Importance |
This APAR addresses two issues related to NFS-Ganesha that can cause crashes. Here are the details:
Issue 1:
NFS-Ganesha may crash with the following stack trace:
(gdb) bt
#0 0x00003fffa73e52e8 in raise () from /lib64/libpthread.so.0
#1 0x00003fffa7954628 in crash_handler (signo=6, info=0x3ffefac4a468, ctx=0x3ffefac496f0)
at /usr/src/debug/nfs-ganesha-3.5-ibm071.21.308708/MainNFSD/nfs_init.c:247
#2 <signal handler called>
#3 0x00003fffa717fcb0 in raise () from /lib64/libc.so.6
#4 0x00003fffa718200c in abort () from /lib64/libc.so.6
#5 0x00003fffa79b9fd4 in free_client_record (record=0x3fff200ed130) at /usr/src/debug/nfs-ganesha-3.5-ibm071.21.308708/SAL/nfs4_clientid.c:1381
#6 0x00003fffa79ba3d8 in dec_client_record_ref (record=0x3fff200ed130) at /usr/src/debug/nfs-ganesha-3.5-ibm071.21.308708/SAL/nfs4_clientid.c:1461
#7 0x00003fffa79b825c in nfs_client_id_expire (clientid=0x3fff200edbd0, make_stale=false)
at /usr/src/debug/nfs-ganesha-3.5-ibm071.21.308708/SAL/nfs4_clientid.c:914
#8 0x00003fffa79c7820 in reserve_lease_or_expire (clientid=0x3fff200edbd0, update=true)
at /usr/src/debug/nfs-ganesha-3.5-ibm071.21.308708/SAL/nfs4_lease.c:181
#9 0x00003fffa7a59db4 in nfs4_op_renew (op=0x3fff029152d0, data=0x3fff0320d9c0, resp=0x3ffee960cab0)
at /usr/src/debug/nfs-ganesha-3.5-ibm071.21.308708/Protocols/NFS/nfs4_op_renew.c:91
#10 0x00003fffa7a2ed80 in process_one_op (data=0x3fff0320d9c0, status=0x3ffefac4cfd0)
at /usr/src/debug/nfs-ganesha-3.5-ibm071.21.308708/Protocols/NFS/nfs4_Compound.c:920
#11 0x00003fffa7a30010 in nfs4_Compound (arg=0x3ffeeabd84a0, req=0x3ffeeabd7c90, res=0x3ffee9854f60)
at /usr/src/debug/nfs-ganesha-3.5-ibm071.21.308708/Protocols/NFS/nfs4_Compound.c:1327
#12 0x00003fffa794dae4 in nfs_rpc_process_request (reqdata=0x3ffeeabd7c90)
Issue 2:
NFS-Ganesha may crash with the following stack trace:
#0 0x00007f27f0a984fb in raise () from /lib64/libpthread.so.0
#1 0x00007f27f2775d7b in crash_handler (signo=11, info=0x7f20e337e930, ctx=0x7f20e337e800) at /usr/src/debug/nfs-ganesha-3.5-ibm071.21/MainNFSD/nfs_init.c:247
#2 <signal handler called>
#3 0x00007f27f28a3cf5 in nlm_granted_callback (obj=0x7f2430001378, lock_entry=0x7f2204302c20) at /usr/src/debug/nfs-ganesha-3.5-ibm071.21/Protocols/NLM/nlm_util.c:609
#4 0x00007f27f27b133b in try_to_grant_lock (lock_entry=0x7f2204302c20) at /usr/src/debug/nfs-ganesha-3.5-ibm071.21/SAL/state_lock.c:1732
#5 0x00007f27f27b177b in process_blocked_lock_upcall (block_data=0x7f2204305510) at /usr/src/debug/nfs-ganesha-3.5-ibm071.21/SAL/state_lock.c:1780
#6 0x00007f27f27ac19c in state_blocked_lock_caller (ctx=0x7f21c8408650) at /usr/src/debug/nfs-ganesha-3.5-ibm071.21/SAL/state_async.c:81
#7 0x00007f27f27f62bd in fridgethr_start_routine (arg=0x7f21c8408650) at /usr/src/debug/nfs-ganesha-3.5-ibm071.21/support/fridgethr.c:556
#8 0x00007f27f0a90ea5 in start_thread () from /lib64/libpthread.so.0
#9 0x00007f27f018fb0d in clone () from /lib64/libc.so.6
Symptom |
Crash |
Environment |
Linux Only |
Trigger |
- For Issue 1, the crash is related to the NFSv4 lease period and can occur due to timing issues, such as delays in lease renewal or a heavily loaded server with multiple client requests.
- For Issue 2, the crash is related to blocking lock requests and lock upgrades on the same file by multiple threads, which can lead to timing issues. |
Workaround |
None |
|
5.1.2.12 |
NFS-Ganesha crash followed by CES-IP failover. |
IJ46628 |
Suggested |
AFM Recovery uses an external program to detect renames/removes that were done but not replicated.
This external program was seen to leak a few memory blocks, which is now addressed.
Symptom |
Unexpected Behavior |
Environment |
All Linux OS Platforms (AFM Gateway nodes) |
Trigger |
AFM recovery triggered with renames/removes that need to be recovered. |
Workaround |
None |
|
5.1.2.11 |
AFM |
IJ46269 |
High Importance
|
Adding or removing gateway node roles in the cluster while active I/O is
happening to an AFM fileset can cause deadlocks. Because of how the node
join/leave protocol is handled, one application node may consider a certain
gateway node to be the gateway for the fileset while other nodes consider
different nodes to be the fileset gateway.
Symptom |
Deadlock |
Environment |
ALL Operating System environments |
Trigger |
Running mmchnode --gateway/--nogateway when there is Active I/O happening on AFM filesets. |
Workaround |
Avoid running mmchnode --gateway/--nogateway when there is Active I/O happening on AFM filesets. |
|
5.1.2.11 |
AFM |
IJ46270 |
High Importance
|
A GPFS Windows node that has been running for a few hours may enter a state wherein, even under no load, the idle GPFS threads spin, causing 100% CPU utilization.
This is because of a potential error in time management and computation on Windows.
Symptom |
Performance Impact/Degradation. |
Environment |
Windows/x86_64 only. |
Trigger |
GPFS must be up and running on a Windows node for a few hours. |
Workaround |
A possible work-around is to bounce GPFS on the Windows node (mmshutdown followed by mmstartup). |
|
5.1.2.11 |
Windows performance. |
IJ46271 |
High Importance
|
An AFM gateway node may hit an assertion when I/O is run from an application node
to a dependent fileset inside an AFM independent fileset, or when AFM filesystem-level
replication is enabled.
Symptom |
Crash |
Environment |
All Linux OS environments (AFM Gateway nodes) |
Trigger |
Running I/O to dependent fileset inside AFM independent fileset or to an AFM enabled Filesystem. |
Workaround |
None |
|
5.1.2.11 |
AFM |
IJ46272 |
High Importance
|
With QoS throttling configured on a subset of nodes in the cluster, I/O on
the remaining client nodes without QoS throttling is unexpectedly and severely throttled.
Symptom |
I/O hang |
Environment |
All Operating Systems |
Trigger |
Configure QoS throttling for a subset of nodes in the cluster. |
Workaround |
Create a node class for the non-QoS throttled nodes and set
"unlimited" QoS throttling for that node class when configuring
QoS for a subset of nodes in the cluster. |
|
5.1.2.11 |
QoS |
IJ46273 |
High Importance
|
Unknown NFS errors were hit during recovery, and there was no way to bypass
them to let recovery go through.
Symptom |
Unexpected Behavior |
Environment |
All Linux OS Environments (AFM Gateway nodes) |
Trigger |
Recovery unable to proceed upon hitting unknown persistent
AFM Recovery errors. |
Workaround |
None |
|
5.1.2.11 |
AFM |
IJ46274 |
High Importance
|
tsapolicy adds information about each client process (agent) to agentVctr to keep track of activities.
If an agent is retrieved from agentVctr while a helper is being added, it could get a bogus agent address, which could cause tsapolicy to hang.
Adding a lock while retrieving agent info avoids this problem.
Symptom |
Component Level Outage |
Environment |
All platforms that support mmapplypolicy |
Trigger |
This problem can occur when mmapplypolicy runs with a large number
of client nodes (-N option). |
Workaround |
None |
|
5.1.2.11 |
mmapplypolicy |
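The fix described for IJ46274 amounts to taking the same lock for reads of a shared agent vector as for writes. A minimal Python sketch of that pattern follows; `AgentRegistry`, `add_agent`, and `get_agent` are hypothetical names chosen for illustration and are not tsapolicy internals.

```python
import threading


class AgentRegistry:
    """Hypothetical stand-in for a shared agent vector (not tsapolicy code)."""

    def __init__(self):
        self._lock = threading.Lock()
        self._agents = []

    def add_agent(self, info):
        # Writers hold the lock while the vector grows.
        with self._lock:
            self._agents.append(info)
            return len(self._agents) - 1

    def get_agent(self, index):
        # Readers take the same lock, so they never observe the
        # vector while a helper is in the middle of being added.
        with self._lock:
            return self._agents[index]
```

Guarding only the writer would leave the reader free to run concurrently with an update, which is the window the APAR describes.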
IJ46395 |
High Importance
|
During a file system restripe, for example mmrestripefs -R, a file's replication
setting may be changed if the file is ill-replicated, and quota is not handled
correctly after the file's data blocks are replicated or un-replicated as needed to
match the new replication settings.
As a result, some quota accounting data becomes unreliable over time.
Symptom |
Incorrect quota accounting data. |
Environment |
ALL Operating System environments |
Trigger |
Quota is not handled correctly by the logic that replicates or
un-replicates data blocks. |
Workaround |
Run mmcheckquota to correct quota values. |
|
5.1.2.11 |
Quotas |
IJ46396 |
Suggested |
getfacl may not display a POSIX default ACL that has been set on a directory.
This occurs in this situation:
- A default ACL is set on a directory in a Scale filesystem using setfacl, but not an access ACL.
- The filesystem is shared using the NFS server included with the operating system.
- The NFS client mounts the filesystem using NFS version 3.
Functionally things seem to work correctly even though getfacl is missing the
default ACL information.
Symptom |
Under certain circumstances, the getfacl command does not display
information about the default ACLs that have been set on a
directory using setfacl. |
Environment |
ALL Operating System environments |
Trigger |
getfacl may not display a POSIX default ACL that has been set on a directory.
This occurs in this situation:
- A default ACL is set on a directory in a Scale filesystem using setfacl, but not an access ACL.
- The filesystem is shared using the NFS server included with the operating system.
- The NFS client mounts the filesystem using NFS version 3. |
Workaround |
Also set the access ACL using setfacl on affected directories. |
|
5.1.2.11 |
- NFS
- POSIX default ACLs |
IJ46397 |
High Importance
|
The TCT recall process could fail or report errors while deleting a
non-resident (stub) file that is also in a snapshot.
Symptom |
Unexpected behavior and results. |
Environment |
All Operating Systems |
Trigger |
Deleting a non-resident stub file that is also in a snapshot. |
Workaround |
Delete the snapshots that contain the non-resident stub file
being deleted. |
|
5.1.2.11 |
TCT migration/LWE |
IJ46531 |
High Importance
|
A read() or write() system call on a file descriptor opened in direct I/O mode
accessing file data on a locally attached NSD may hang.
Symptom |
Hang/Deadlock/Unresponsiveness/Long Waiters |
Environment |
ALL Linux OS environments |
Trigger |
When preparing the block I/O request to the locally attached
block device, the GPFS kernel module is unable to get a handle
for the block device.
This can happen, for example, if the block device is temporarily
unavailable. |
Workaround |
Do not use direct I/O for data stored on locally attached NSDs; use direct I/O only on remotely attached NSDs. |
|
5.1.2.11 |
NSD Client/Server handling |
IJ46533 |
Critical |
A code issue could cause an AIO completion event not to be handled by the
AIO completion thread. This forms a deadlock with a long waiter for the
AcquireBRTHandlerThread or RangeRevokeWorkerThread thread, which waits for
other threads to exit the fast path. Such mishandling of AIO completion could
also prevent the file system from being quiesced and cause a memory leak.
Symptom |
Deadlock |
Environment |
Linux Only |
Trigger |
Doing AIO reads/writes from one node and then starting a normal
buffered I/O load from other nodes against the same files. |
Workaround |
No |
|
5.1.2.11 |
AIO only |
IJ46534 |
HIPER |
The syntax of mmdsh is as follows:
mmdsh -N {Node[,Node...] | NodeFile | NodeClass}
[-l LoginName] [-i] [-s] [-r RemoteShellPath]
[-v [-R ReportFile]] [-f FanOutValue] Command
In the following example, mmdsh will remove /tmp/someFile.
mmdsh -N "ls -lrt /tmp/someFile"
In this example, the intended node list {Node[,Node...] | NodeFile | NodeClass} is missing.
The command takes the next token, the string "ls -lrt /tmp/someFile", as the node list.
It calls a GPFS internal command to obtain a list of nodes in the cluster.
The call to the internal command was not done properly.
The internal command takes the file /tmp/someFile as an output file, which it removes before writing new data to it.
Symptom |
Unexpected Results/Behavior |
Environment |
ALL Operating System environments |
Trigger |
Running mmdsh with the -N option but without a node list argument. |
Workaround |
Review the manpage carefully and enter the command correctly
especially ensure the correct list of nodes before you run
mmdsh. |
|
5.1.2.11 |
Admin Commands |
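The workaround for IJ46534 is a manual review of the command line. As a hedged illustration, a wrapper script could refuse to run when the token after -N does not look like a node list. The sketch below is not part of mmdsh; `looks_like_node_list` and `check_mmdsh_args` are hypothetical helpers, and the rule that node names, node files, and node classes contain no whitespace is an assumption of this sketch.

```python
def looks_like_node_list(token):
    """Return False for tokens that are probably a command, not a node list.

    Assumption of this sketch: node names, node file paths, and node class
    names contain no whitespace and do not start with '-'.
    """
    if not token or token.startswith("-"):
        return False
    return not any(ch.isspace() for ch in token)


def check_mmdsh_args(argv):
    """Raise ValueError if the token after -N does not look like a node list."""
    if "-N" in argv:
        idx = argv.index("-N")
        if idx + 1 >= len(argv) or not looks_like_node_list(argv[idx + 1]):
            raise ValueError("-N must be followed by a node list")
    return argv
```

With this check, the argument vector from the example above, `["mmdsh", "-N", "ls -lrt /tmp/someFile"]`, is rejected before anything is executed.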
IJ46535 |
High Importance
|
When prefetch runs with a list file larger than 2GB, prefetch threads are deployed,
and one or more threads need to start operating at a list-file offset
higher than 2GB. Since the offset/length is declared as Int32, an offset
higher than 2GB is too big to hold and causes fseek errors, which return error 4 (E_INTR).
Symptom |
Unexpected results. |
Environment |
All Linux OS Platforms (AFM Gateway nodes) |
Trigger |
Running prefetch with list file larger than 2GB in size. |
Workaround |
Split the single large list file into multiple smaller list files (each
smaller than 2GB in size) and run prefetch multiple times with
the list file chunks created. |
|
5.1.2.11 |
AFM |
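The IJ46535 workaround of splitting the list file can be sketched as below. This is an illustration only: `split_list_file` and the `list.partNNNN` chunk names are hypothetical, and the sketch assumes the list file holds one entry per line so that splitting on line boundaries keeps every entry intact.

```python
import os


def split_list_file(path, max_bytes, out_dir):
    """Split a list file on line boundaries into chunks of at most max_bytes.

    A single line longer than max_bytes still gets a chunk of its own.
    Returns the chunk paths in order.
    """
    chunks = []
    out = None
    written = 0
    with open(path, "rb") as src:
        for line in src:
            # Start a new chunk for the first line, or when adding this
            # line would push the current chunk past the size limit.
            if out is None or (written and written + len(line) > max_bytes):
                if out is not None:
                    out.close()
                chunk_path = os.path.join(out_dir, "list.part%04d" % len(chunks))
                out = open(chunk_path, "wb")
                chunks.append(chunk_path)
                written = 0
            out.write(line)
            written += len(line)
    if out is not None:
        out.close()
    return chunks
```

To stay below the signed 32-bit offset limit described above, `max_bytes` would be at most `2**31 - 1`; each resulting chunk is then passed to a separate prefetch run.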
IJ46536 |
High Importance
|
When initial sync is triggered for a GPFS fileset (resync in progress) which is
converted to AFM DR, none of the files/dirs have pcache remote or pcache parent
EAs on the inode.
If the initial sync is interrupted (and workload causes some remove/rename kind
of operations on the local directories), then there are dirty directories at primary
for which dirtyDirDirents policy scan is run to list all dirty dir entries.
But for files which don't have state, the policy scan gives 5 fields as compared to
8 expected and results in recovery failing with error 22 (E_INVAL).
Symptom |
Unexpected Results |
Environment |
All Linux OS Environments (Serving as AFM Gateway nodes) |
Trigger |
Running AFM recovery on an AFM DR fileset whose initial
sync has never completed. |
Workaround |
Set afmSkipResyncRecovery to yes on the fileset and trigger
recovery. |
|
5.1.2.11 |
AFM |
IJ46649 |
High Importance
|
sendfile() call returns EINVAL for kernel >= 5.10 when the target is a GPFS file system.
Symptom |
sendfile system call failure |
Environment |
Linux with kernel >= 5.10 |
Trigger |
sendfile() call returns EINVAL for kernel >= 5.10 when the target is a
GPFS file system |
Workaround |
None |
|
5.1.2.11 |
Core GPFS |
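No workaround is listed for IJ46649. Applications that call sendfile() often guard against this class of failure with a plain read/write fallback. The Python sketch below shows that generic pattern. It is not an IBM-documented workaround, and `copy_with_fallback` is a hypothetical helper name.

```python
import errno
import os


def copy_with_fallback(src_fd, dst_fd, count, bufsize=1 << 20):
    """Copy count bytes from src_fd to dst_fd, preferring os.sendfile().

    Falls back to a pread/write loop if sendfile is unavailable or fails
    with EINVAL, ENOSYS, or ENOTSOCK before any bytes were copied.
    Returns the number of bytes copied.
    """
    sent = 0
    try:
        while sent < count:
            n = os.sendfile(dst_fd, src_fd, sent, count - sent)
            if n == 0:  # source ended early
                return sent
            sent += n
        return sent
    except AttributeError:
        pass  # os.sendfile is not available on this platform
    except OSError as exc:
        # Fall back only if nothing was copied and the error means
        # "sendfile is not supported for this target".
        if sent or exc.errno not in (errno.EINVAL, errno.ENOSYS, errno.ENOTSOCK):
            raise
    while sent < count:
        chunk = os.pread(src_fd, min(bufsize, count - sent), sent)
        if not chunk:
            break
        view = memoryview(chunk)
        while view:  # os.write may write fewer bytes than requested
            view = view[os.write(dst_fd, view):]
        sent += len(chunk)
    return sent
```

The fallback reads with an explicit offset, so it does not depend on the file position of `src_fd`.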
IJ45538 |
High Importance
|
When afmFastCreate is configured and a normal file is copied, the cache/primary mtime is set at the home/secondary at file create time.
If the write is then interrupted midway and a resync is later run on this fileset, the file is not copied over fully because it is reported as already in sync. A small race is involved in reaching this state.
Symptom |
Unexpected Behavior |
Environment |
All Linux OS platforms (AFM Gateway nodes only) |
Trigger |
Running Resync on fileset with afmFastCreate enabled with a partially copied file from cache to home. |
Workaround |
None |
|
5.1.2.10 |
AFM |
IJ44899 |
High Importance
|
When a file usage quota is in effect and some files in the file system have been migrated through a DMAPI application, mmrepquota consistently shows some files as still in use after all files in the file system have been deleted.
Symptom |
Error Output |
Environment |
Linux |
Trigger |
Migrate files to external storage through DMAPI function, and create snapshots for the file system, then delete these migrated files. |
Workaround |
This problem only happens when there are snapshots, so deleting the snapshots works around it. |
|
5.1.2.10 |
DMAPI |
IJ44889 |
Suggested |
The Attr_Expiration_Time value set in the EXPORT_DEFAULTS block of gpfs.ganesha.main.conf is not reflected in newly created export entries.
Symptom |
Check if Attr_Expiration_Time value is proper in /var/mmfs/ces/nfs-config/gpfs.ganesha.exports.conf for the export added using mmnfs |
Environment |
Linux |
Trigger |
Modify Attr_Expiration_Time to a value other than the default and add a new export. |
Workaround |
Attr_Expiration_Time can be modified in gpfs.ganesha.exports.conf using the steps below:
1. Copy /var/mmfs/ces/nfs-config/gpfs.ganesha.exports.conf to /tmp
2. Edit the Attr_Expiration_Time field in the required export in /tmp/gpfs.ganesha.exports.conf
3. Run the command below to load /tmp/gpfs.ganesha.exports.conf back.
mmnfs export load /tmp/gpfs.ganesha.exports.conf |
|
5.1.2.10 |
NFS-Ganesha |
IJ44682 |
High Importance
|
When a file system manager takeover happens as a result of node failure, GPFS tries to do log recovery as needed. Log recovery needs to fence disks to prevent further I/O to a disk from the failed node. If disk fencing fails, GPFS proceeds to mark the disk down, but marking the disk down has to wait for at least part of the file system manager takeover to finish. This results in a deadlock.
Symptom |
Hang/Deadlock/Unresponsiveness/Long Waiters |
Environment |
ALL |
Trigger |
File system manager takeover and log recovery happen with a disk fencing error, under certain timing |
Workaround |
None |
|
5.1.2.10 |
All Scale Users |
IJ45067 |
High Importance
|
Performance degradation resulting in long wait times when doing IOs from Ganesha without File Audit Logging enabled.
Symptom |
Performance Degradation |
Environment |
Linux Only |
Trigger |
Doing I/O (e.g., ls, stat, ...) from a Ganesha mount without FAL enabled. |
Workaround |
None |
|
5.1.2.10 |
File Audit Logging |
IJ45540 |
Suggested |
When a file is opened with DIO and AIO I/O requests are issued, the requests need to be aligned to the sector size. GPFS enforces this and returns an error for unaligned requests. The problem was that an internal error code (795) was returned instead of an error code that is recognized by Linux applications.
Symptom |
Unexpected Results/Behavior |
Environment |
Linux |
Trigger |
Open a file for DIO and issue I/O requests through AIO. |
Workaround |
None |
|
5.1.2.10 |
All Scale Users |
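The alignment rule behind IJ45540 can be illustrated with a small check. This is a sketch under stated assumptions: `check_dio_alignment` is a hypothetical helper, the 512-byte default sector size is an assumption, and EINVAL is the code that Linux O_DIRECT I/O conventionally reports for misaligned requests. The APAR text does not name the code that the fix returns.

```python
import errno


def check_dio_alignment(offset, length, sector_size=512):
    """Return 0 if offset and length are sector-aligned, else errno.EINVAL.

    sector_size=512 is an assumed default for illustration.
    """
    if sector_size <= 0 or offset % sector_size or length % sector_size:
        return errno.EINVAL
    return 0
```

An application can run a check like this before submitting AIO requests on a descriptor opened for direct I/O, so that misaligned requests are caught before they reach the file system.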
IJ45068 |
High Importance
|
In AFM IW mode, a file that is opened in truncate mode and then written to later gets a read that caches the file from home.
The read sees the write as a dependency and ends up deadlocking.
Symptom |
Deadlock |
Environment |
Linux |
Trigger |
Open AFM cached file in Truncate mode and Write to it. |
Workaround |
None |
|
5.1.2.10 |
AFM |
IJ45553 |
High Importance
|
'waiting for stripe group takeover' and 'waitForPendingCopyBlockRPCs: nn RPCs pending'.
These long waiters indicate a deadlock that prevents the file system from coming up.
Symptom |
Deadlock |
Environment |
ALL |
Trigger |
Sudden death of an NSD server, preventing access to some disks |
Workaround |
Bring all NSD servers up |
|
5.1.2.10 |
All Scale Users |
IJ45548 |
Suggested |
Download for MU and Object-only mode for a COS fileset is not working. If cacheBit is set for the fileset root, the download does not happen.
Symptom |
Download fails to happen. |
Environment |
Linux |
Trigger |
Download won’t happen if fileset root is cached. |
Workaround |
None |
|
5.1.2.10 |
AFM-COS |
IJ45266 |
High Importance
|
Daemon assert " logAssertFailed: !fileId.isSnaplinkDir()" going off when calling lseek against the .snapshots directory.
Symptom |
Daemon crash |
Environment |
ALL |
Trigger |
Perform lseek request against the .snapshots directory. |
Workaround |
Avoid lseek call to the .snapshots directory. |
|
5.1.2.10 |
Snapshot |
IJ45549 |
High Importance
|
This issue happens when afmFastCreate is enabled and a file is created and written to, and immediately a changeSecondary is run on the fileset (pushing the File Create through fastCreate to the Resync snapshot) on a Primary fileset.
In this case, the file is written from the psnap0 snapshot to keep this snapshot consistent across the sites and sacrifices the data consistency on live fileset.
Symptom |
Unexpected Behavior |
Environment |
All Linux OS environments (Serving as AFM Gateway nodes) |
Trigger |
Running ChangeSecondary on DR fileset with afmFastCreate enabled while still writing to a single file. |
Workaround |
None |
|
5.1.2.10 |
AFM |
IJ45403 |
Critical |
The GPFS kernel module exports an ioctl interface used by the mmfsd daemon and some of the mm* commands. The provided improvements result in a more robust functionality of the kernel module.
Symptom |
Unexpected Results/Behavior |
Environment |
ALL |
Trigger |
Not available |
Workaround |
None |
|
5.1.2.10 |
All Scale Users |
IJ45550 |
High Importance
|
When an immutable/appendonly file at primary/cache is first made non-imm/non-app using mmchattr and is then immediately removed, the file remove cannot be replicated to secondary/home because the file there is still imm/app and cannot be removed.
Symptom |
Unexpected behaviour |
Environment |
ALL |
Trigger |
Making an immutable/appendonly file at primary/cache as non-imm/non-app first using mmchattr and then immediately removing the file. |
Workaround |
Wait for the mmchattr -i no -a no to be replicated to the secondary/home first, so that the file at home/secondary also becomes non-imm/non-app, and then remove the file. |
|
5.1.2.10 |
AFM |
IJ45551 |
Suggested |
The communication port cannot be changed for CCR enabled cluster.
Symptom |
Error output/message |
Environment |
ALL |
Trigger |
mmchconfig command |
Workaround |
Change the cluster to deprecated server-based configuration then change the port.
After the change, change the cluster back to support CCR cluster. |
|
5.1.2.10 |
Admin Commands |
IJ45608 |
Critical |
Due to an issue in offline fsck (mmfsck), it can report false-positive lost blocks and can also fail to properly report genuine incorrect blocks and duplicates.
Symptom |
Will see corruptions like duplicates even after offline fsck repair and subsequent offline fsck runs will show lost blocks and incorrect blocks. |
Environment |
ALL |
Trigger |
This issue happens on a file system where the user created two or more dataOnly pools and at some point deleted the earlier data pools out of order (i.e. a dataOnly pool (n) is deleted while other data pools (n+x) are present). |
Workaround |
1) Create one or more "dummy" dataOnly pool by adding a single NSD of that "dummy" dataOnly pool to the file system. The NSD of this "dummy" data pool can be of a minimum small size as we do not need to have any data on that "dummy" data pool.
2) After that run offline fsck on the file system and now it should report and repair lost blocks/incorrect block and duplicates in the right way.
3) Once the file system is fixed you can delete the "dummy" data pool by deleting the only NSD in it.
|
|
5.1.2.10 |
FSCK |
IJ45536 |
Medium Importance |
Spectrum Scale and systemhealth monitor (sysmon) start independently after a node reboot.
During initialization, Spectrum Scale checks if all declared NFS exports are available.
The sysmon configuration has the flag "preventnfsstartuponmissingfs" enabled, so the expected behavior was that NFS is not started if a required filesystem is unmounted. But in fact, NFS was started anyway.
Symptom |
Unexpected Results/Behavior |
Environment |
ALL Linux OS environments running CES with enabled NFS protocol |
Trigger |
Spectrum Scale and systemhealth monitor (sysmon) start independently after a node reboot.
During initialization, Spectrum Scale checks if all declared NFS exports are available.
At that point in time the sysmon was still initializing and has not yet done this evaluation. So it returns "no bad configuration found" which triggers then the NFS startup.
The sysmon configuration has the flag "preventnfsstartuponmissingfs" enabled, so the expected behavior was that NFS does not come up.
Ganesha will fail later and trigger an IP address failover, which disturbs the cluster operation. |
Workaround |
N/A
Make sure that the exported filesystems have the automount feature enabled, if possible.
If the missing exported filesystem is not in use anyway, then remove it from the declared export list. |
|
5.1.2.10 |
System Health |
IJ45537 |
High Importance
|
tsapolicy evaluates each client's workload and rebalances it if some clients are overloaded.
However, the workload is sometimes calculated incorrectly, so tsapolicy tries to rebalance unnecessarily and can get into an infinite loop.
Symptom |
Component Level Outage |
Environment |
All platforms that support mmapplypolicy |
Trigger |
This problem could occur if the number of inodes in the file system or fileset is large. |
Workaround |
None |
|
5.1.2.10 |
mmapplypolicy |
IJ45777 |
Suggested |
An empty file has the cache bit set and its crtime is updated in a Readdir operation. This caused validation with home to be skipped after a truncate operation at home, and the file failed to download.
Symptom |
Failed to call truncated file at COS |
Environment |
Linux |
Trigger |
Data consistency failed on empty truncated file. |
Workaround |
None |
|
5.1.2.10 |
AFM-COS |
IJ45776 |
High Importance
|
There is a peculiar case where the local bit on the .ptrash directory inside AFM filesets gets reset. This causes the .ptrash directory to be treated like a normal directory and in Write modes, the temporary files generated for recovery/resync policy start getting replicated to the remote site. For Read modes this causes the ptrash directory to show up as a dangling entry because a normal lookup is sent to home - and since the .ptrash doesn't have remote attrs - it fails to complete this lookup successfully. This also causes errors when the user wants to empty the ptrash with rm -rf since the lookups to remote site don't succeed.
Symptom |
Unexpected Behavior |
Environment |
All Linux OS environments (AFM Gateway nodes)
All OS Platforms (Application nodes in AFM enabled clusters) |
Trigger |
ptrash local bit getting reset unintentionally and follow up operations performed on the fileset - like ls or recovery |
Workaround |
Manually set the local bit on ptrash on seeing issues. |
|
5.1.2.10 |
AFM |
IJ45797 |
High Importance
|
There is an assert being hit when performing ls -l or prefetch on a brand new RO/LU/IW fileset with data existing at home.
Symptom |
Crash |
Environment |
All Linux OS environments (AFM Gateway nodes) |
Trigger |
Running ls or prefetch on new RO/IW/LU fileset at cache with home having data already. |
Workaround |
None |
|
5.1.2.10 |
AFM |
IJ44525 |
Critical |
Due to a race condition between the RDMA software layer and
IBM Spectrum Scale, it is possible that an application running on
an IBM Spectrum Scale client may read incorrect data from files
stored on GPFS under certain conditions.
Symptom |
Unexpected Results/Behavior |
Environment |
Linux |
Trigger |
Race condition between the RDMA software layer and IBM Spectrum Scale.
|
Workaround |
Disable RDMA. |
|
5.1.2.9 |
RDMA |
IJ44527 |
Suggested |
TIP events can be hidden and then should not count
towards the overall state of the system, however they
still can cause the component rollup to show TIP instead
of Healthy.
Symptom |
Unexpected Results/Behavior |
Environment |
All |
Trigger |
Hiding Tips will prevent them from showing up in the Events
column but did not exclude them from the overall state calculation.
|
Workaround |
None |
|
5.1.2.9 |
System Health |
IJ44547 |
High Importance
|
GPFS daemon could fail unexpectedly with assert:
regP->owner!=fromNode,in allocM.C.
This could happen as a result of the file system being unmounted
on a node due to an error.
Symptom |
Abend/Crash |
Environment |
All |
Trigger |
File system unmount due to error |
Workaround |
Disable the assert via disableAssert configuration |
|
5.1.2.9 |
All Scale Users |
IJ44553 |
High Importance
|
Code to set ptrash as local was designed to be
enabled if afmRevalOpWaitTimeout remains at
its default value. But it's highly unlikely it
stays default in a customer environment.
Symptom |
Unexpected Behavior |
Environment |
Linux (AFM Gateway nodes) |
Trigger |
afmRevalOpWaitTimeout being set to a non-default value
causing ptrash local bit setting code to not take
effect.
|
Workaround |
Setting the afmRevalOpWaitTimeout to its default value
of 180 will ensure ptrash is set to local.
|
|
5.1.2.9 |
AFM |
IJ44567 |
Suggested |
System monitoring collects all information about a cluster
by sending it to relevant nodes. It ignores cluster boundaries
while doing so, which does not work and creates spurious error
messages in the logs.
Symptom |
Unexpected Results/Behavior |
Environment |
Linux |
Trigger |
Setup with remote cluster integration |
Workaround |
None |
|
5.1.2.9 |
mmfs.log.latest |
IJ44684 |
High Importance
|
Remote error 2 while replicating Link operation if
parent directory is deleted before replicating create/link
operation.
Symptom |
AFM Queue drop and Fileset goes to resync state. |
Environment |
Linux |
Trigger |
Create/Link/Parent dir remove operation in queue with Fast Create
config option enabled.
|
Workaround |
None |
|
5.1.2.9 |
AFM |
IJ44685 |
Suggested |
The mmwatch plugin to mmhealth can print or log excess error
messages if there is a filesystem that is offline for some reason.
Symptom |
Error output/message |
Environment |
Linux |
Trigger |
Running mmhealth when there is an unmountable filesystem defined. |
Workaround |
The mmwatch plugin to mmhealth can be disabled. |
|
5.1.2.9 |
Admin Commands |
IJ44691 |
High Importance
|
Spectrum Scale Erasure code edition interacts with third party
software/hardware APIs for internal disk enclosure management.
If the management interface becomes degraded and starts to hang
commands in the kernel, the hang may also block communication
handling threads. This causes a node to fail to renew its lease,
causing it to be fenced off from the rest of the cluster.
This may lead to additional outages.
Symptom |
Hang/Deadlock/Unresponsiveness/Long Waiters |
Environment |
Linux |
Trigger |
Degradation in back-end storage management that causes
commands to hang in the kernel.
|
Workaround |
The node with hardware problems will show waiters 'Until
NSPDServer discovery completes.'
It is recommended to reboot nodes with those GPFS waiters
exceeding 2 minutes if this node is also being expelled. |
|
5.1.2.9 |
ESS/GNR |
IJ44692 |
Suggested |
"mmdiag --network" is slow.
Symptom |
Slow performance in environments with lots of network entries
|
Environment |
All |
Trigger |
A large number of network entries causes
"mmdiag --network" to be noticeably slow.
|
Workaround |
None |
|
5.1.2.9 |
mmdiag |
IJ44774 |
High Importance
|
Commands like mmcrcluster or mmaddnode may hang in GSKIT
layer on AMD EPYC family 25 processors. A particular model
from family 25 that is known to hang in GSKIT layer is
AMD EPYC 7343.
Symptom |
Admin commands hang |
Environment |
Linux |
Trigger |
This problem affects AMD EPYC family 25 processors |
Workaround |
Add "ICC_SHIFT=3" line in /usr/lpp/mmfs/lib/gsk8/Cicc/icclib/ICCSIG.txt
file on problem nodes.
|
|
5.1.2.9 |
Admin Commands, gskit |
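The IJ44774 workaround adds one line to a text file on each affected node. A hedged sketch of doing that idempotently is shown below. `ensure_line` is a hypothetical helper, and appending at the end of the file is an assumption, since the workaround text does not say where in the file the line belongs.

```python
def ensure_line(path, line="ICC_SHIFT=3"):
    """Append line to the file at path unless an identical line exists.

    Returns True if the file was modified, False if the line was present.
    Appending at the end of the file is an assumption of this sketch.
    """
    with open(path, "r", encoding="utf-8", errors="replace") as fh:
        content = fh.read()
    if line in (existing.strip() for existing in content.splitlines()):
        return False
    with open(path, "a", encoding="utf-8") as fh:
        # Keep the new setting on its own line.
        if content and not content.endswith("\n"):
            fh.write("\n")
        fh.write(line + "\n")
    return True
```

On a problem node the path argument would be the ICCSIG.txt file named in the workaround. Running the helper twice leaves a single copy of the line.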
IJ44806 |
Suggested |
In the GPFS backend, cleanup takes the handlerList lock on SGPanic.
At the same time, the handler is trying to set up (setupctl) the
fileset mount path using the handler mutex, and it is waiting too.
Symptom |
Waiters |
Environment |
Linux |
Trigger |
Waiters are seen and the fileset is stuck without showing progress. |
Workaround |
None |
|
5.1.2.9 |
AFM with GPFS backend |
IJ44828 |
Suggested |
A node (kernel) crash can occur when the vinfoLockOnWrite
config option is enabled.
Symptom |
Crash |
Environment |
All |
Trigger |
Timing hole when enabling the undocumented config option
vinfoLockOnWrite, likely triggered by using snapshots
|
Workaround |
Avoided by not enabling the undocumented vinfoLockOnWrite
config option
|
|
5.1.2.9 |
Core GPFS |
IJ44829 |
High Importance
|
The special .afmctl file at home/secondary loses its Control
attribute and is treated as a normal file. A read then returns a buffer
of size 2048, overflowing the buffer of size 1100 provided for it
at the cache, which expects the file to be treated as a CTL file at the home/secondary.
Symptom |
Crash |
Environment |
Linux (AFM Gateway nodes) |
Trigger |
Invalid .afmctl control file at home. |
Workaround |
Manually disable and re-enable mmafmconfig at the
home/secondary and then stop/start the cache fileset
to pick up the new changes from home. |
|
5.1.2.9 |
AFM |
IJ44831 |
High Importance
|
After the GPFS 5.1.2 release, memory from the token management subpool may be leaked on some token manager nodes.
This can be observed in the output of mmfsadm dump malloc:
Statistics for MemoryPool id 3 ("UNPINNED_TM") at 0xF1000012C00246C8:
...
Memory subpool 'HolderList' at 0xF1000012C00258B0
objSize 16 spObjectsPerChunk 65536 expandInProgress 0
inUse 140052583 free 63385 total 140115968 limit 2147483647
The "inUse" field increases gradually.
Symptom |
Out-of-memory, Unexpected Results/Behavior |
Environment |
All |
Trigger |
During token management, one type of object is not freed
when the token is destroyed. |
Workaround |
None |
|
5.1.2.9 |
All Scale Users |
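To watch for the IJ44831 leak, the subpool counters in the dump shown above can be sampled over time. The sketch below parses a counter line of the form shown in the APAR text. `parse_subpool_usage` and `is_growing` are hypothetical helper names, and the sketch assumes the counter line keeps the `inUse ... free ... total ... limit ...` layout.

```python
import re

# Matches a counter line such as the one quoted in the APAR text.
_SUBPOOL_RE = re.compile(
    r"inUse\s+(\d+)\s+free\s+(\d+)\s+total\s+(\d+)\s+limit\s+(\d+)"
)


def parse_subpool_usage(line):
    """Return the subpool counters as a dict, or None if the line differs."""
    match = _SUBPOOL_RE.search(line)
    if match is None:
        return None
    in_use, free, total, limit = map(int, match.groups())
    return {"inUse": in_use, "free": free, "total": total, "limit": limit}


def is_growing(samples):
    """True if inUse strictly increases across successive samples."""
    values = [sample["inUse"] for sample in samples]
    return len(values) > 1 and all(a < b for a, b in zip(values, values[1:]))
```

Feeding the function the 'HolderList' counter line from several dumps taken some time apart gives a simple signal of whether "inUse" keeps rising.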
IJ43542 |
High Importance
|
When multiple threads read small files that each contain a single data block, the threads could block one another on the prefetch check even though they are reading different files; the resulting contention on the prefechListMutex causes performance degradation. A similar issue could happen on concurrent append writes to files from multiple threads.
Symptom |
Performance Impact/Degradation |
Environment |
All |
Trigger |
Concurrent reads on different small files from different threads. |
Workaround |
Downgrade the configuration parameter "prefetchAggressiveness" to 1.
|
|
5.1.2.8 |
All |
IJ44213 |
High Importance
|
The ptrash directory triggers recovery if the fileset is in dropped state. This recovery code tries to set the Ptrash as local where it tries to acquire XW lock on ptrash - which conflicts with the above operation which also holds XW lock on the ptrash directory while trying to queue the operation.
Symptom |
Deadlock |
Environment |
Linux |
Trigger |
Remove operations performed on unwanted files inside the .ptrash directory when the local bit was not set on this directory causing recovery to be triggered. |
Workaround |
Make sure that the ptrash directory is always local before performing any operations inside it.
|
|
5.1.2.8 |
AFM |
IJ41370 |
High Importance
|
Currently, there is no mechanism to clean up the subnets contact IPA caches. If the subnets configuration changes and the cached IPA no longer works, the nodes may not be able to communicate with each other.
Symptom |
Performance Impact/Degradation |
Environment |
All |
Trigger |
Normally, GPFS uses the daemon IP address for communication, but if the cluster is to use other IP addresses for communication, the "subnets" configuration must be set.
GPFS then uses the "subnets" IP addresses for daemon communication, through the following steps:
- In the cluster probing stage, a pair of nodes uses daemon IP addresses for communication.
- After the connection is established, the nodes exchange their "subnets" IP addresses.
- The connection that uses daemon IP addresses is closed.
- A new connection that uses "subnets" IP addresses is established.
So, once the "subnets" IP addresses are cached, GPFS uses these cached IP addresses for communication.
The problem occurs when the cached "subnets" IP addresses can no longer be used for communication. Even if a new "subnets" value is configured or "subnets" is removed, the original "subnets" addresses cannot be used to exchange the new IP addresses that the customer wants to use. |
Workaround |
Manually clean up the stale /var/mmfs/gen/cacache.* files.
|
|
5.1.2.8 |
subnets/remote cluster |
IJ42748 |
High Importance
|
Assertion: exp(fileId.inodeNum > 0)
Symptom |
Crash |
Environment |
Linux |
Trigger |
Overstressed file systems with AFM DR bi-directional relationships running. |
Workaround |
None.
|
|
5.1.2.8 |
AFM |
IJ43816 |
High Importance
|
The SLES 15 SP4 kernel update 5.14.21-150400.24.11 included a change that causes Spectrum Scale to crash the kernel. A fix in Spectrum Scale is necessary in order to run on this kernel.
Symptom |
Abend/Crash |
Environment |
x86_64-linux only |
Trigger |
Run Spectrum Scale with the SLES 15 SP4 kernel update 5.14.21-150400.24.11. |
Workaround |
None.
|
|
5.1.2.8 |
All |
IJ41364 |
Suggested |
With a policy rule configured, many jobs can be scheduled, and the 32-bit pitJobId can overflow over time, which causes the assert "(pitJobId >= 0 && pitJobListPP == __null)".
Symptom |
Abend/Crash |
Environment |
All |
Trigger |
Configure a policy rule to frequently migrate (or compress/decompress, etc.) the file system data. |
Workaround |
None |
|
5.1.2.8 |
Policy |
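The overflow behind IJ41364 can be reproduced in miniature. The sketch below emulates a signed 32-bit counter in Python, whose own integers do not overflow. `to_int32` and `next_job_id` are hypothetical helpers, and incrementing by one per job is an assumption about how the ID advances.

```python
def to_int32(value):
    """Interpret value modulo 2**32 as a signed 32-bit integer."""
    value &= 0xFFFFFFFF
    return value - 0x100000000 if value >= 0x80000000 else value


def next_job_id(job_id):
    """Advance a job ID the way a signed 32-bit counter would wrap."""
    return to_int32(job_id + 1)
```

Once the counter reaches 2147483647, the next value is negative, which is the condition the `pitJobId >= 0` part of the assert rejects.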
IJ44219 |
Suggested |
Files not replicated on create after failoverToSecondary.
Symptom |
Unexpected Results/Behavior |
Environment |
Linux |
Trigger |
After failovertosecondary, if you create and write files and then changesecondary to sync with old primary. |
Workaround |
None |
|
5.1.2.8 |
AFM-DR |
IJ44155 |
Suggested |
An attacker can gain sensitive information, such as which vulnerable frameworks or components are in use, if error messages are not handled properly.
Symptom |
mmvdisk throws an exception. |
Environment |
All |
Trigger |
Non-ASCII characters in the configuration file cause mmvdisk to throw an exception. |
Workaround |
None |
|
5.1.2.8 |
ESS/GNR |
IJ44144 |
Suggested |
Fixes for the retbleed vulnerability are backported to kernel updates in Linux distributions. These fixes also include checks for whether the kernel module build has properly applied the retbleed mitigations. Parts of the kdump binary built by mmbuildgpl purposely do not include these mitigations. As a result, the mmbuildgpl process emits warning messages like:
CC [M] /usr/lpp/mmfs/src/gpl-linux/kdump-kern.o
/usr/lpp/mmfs/src/gpl-linux/.tmp_kdump-kern.o: warning: objtool: GetOffset()+0x14: 'naked' return found in RETHUNK build
Symptom |
Error output/message |
Environment |
x86_64-linux only |
Trigger |
Running mmbuildgpl on a kernel that has all fixes for the retbleed vulnerability. |
Workaround |
There is no easy workaround. Without code changes, the only way forward is to ignore these warnings; they have no other ill effect. |
|
5.1.2.8 |
All |
IJ44143 |
Suggested |
CES IPs are not getting assigned to the node and keep moving around.
Symptom |
Unexpected results/behavior |
Environment |
None |
Trigger |
CES resume |
Workaround |
Assign the CES IP to the node |
|
5.1.2.8 |
None |
IJ44119 |
Suggested |
Add vinfoLockOnWrite config to hold vinfo lock for file write operation. Enabling this config can solve the write performance degradation of Ganesha/NFS found between GPFS 5.1.1 and GPFS 5.1.2.
Symptom |
Performance Impact/Degradation |
Environment |
All |
Trigger |
Ganesha/NFS write performance degradation is more likely to occur if the number of Ganesha threads is large. |
Workaround |
None |
|
5.1.2.8 |
NFS |
IJ44073 |
Suggested |
If a recovery group creation fails due to a condition in the storage hardware, such as the detection of volatile write caching on the drives, the “mmvdisk recovery group create” command will fail. Once the hardware issue is resolved, it is possible for subsequent attempts of this command to continue to fail until the mmfsd daemons are restarted on the Spectrum Scale RAID storage cluster.
Symptom |
Error output/message |
Environment |
Linux |
Trigger |
Hardware problems detected during recovery group creation. |
Workaround |
Restart the mmfsd daemons on the Spectrum Scale RAID storage cluster. |
|
5.1.2.8 |
GNR/ESS |
IJ44059 |
Suggested |
Setting "noAuthentication=yes" can cause the sysmonitor daemon to crash, which stops mmhealth from working.
Symptom |
Abend/Unexpected Results/Behavior |
Environment |
Linux |
Trigger |
Setting noAuthentication=yes |
Workaround |
None |
|
5.1.2.8 |
System Health |
IJ43755 |
Suggested |
Renaming a non-empty directory in AFM+COS local-updates mode is not allowed, which causes application failures on rename.
Symptom |
Unexpected results |
Environment |
Linux |
Trigger |
AFM caching with rename on non-empty directory in AFM+COS local-updates mode |
Workaround |
None |
|
5.1.2.8 |
AFM |
IJ42737 |
Suggested |
NFS fails to resolve the POSIX filesystem when a 'tmpfs' filesystem is mounted prior to adding any GPFS export and is then unmounted. This happens because NFS does not repopulate the POSIX filesystem list, which leads to a mismatch of the major and minor numbers of exports.
Symptom |
Unexpected results |
Environment |
All |
Trigger |
The tmpfs filesystem remains in the POSIX list maintained by NFS; this list is not repopulated for every newly added export. |
Workaround |
Restart NFS Ganesha in case there is a major/minor number mismatch of the filesystems. |
|
5.1.2.8 |
NFS-Ganesha |
IJ44054 |
Suggested |
LDAP connections are monitored using the bind password. If the password is obfuscated, the monitor may fail.
Symptom |
Unexpected results |
Environment |
Linux |
Trigger |
Using an LDAP server with an obfuscated bind password |
Workaround |
None |
|
5.1.2.8 |
System Health |
IJ42759 |
Suggested |
GPFS commands call egrep, which produces a warning on the latest Cygwin64 update.
Symptom |
Error output/message |
Environment |
Windows/x86_64 only |
Trigger |
Cygwin64 update |
Workaround |
None |
|
5.1.2.8 |
Admin Commands |
IJ43799 |
Suggested |
Node expel logic tries to avoid expelling NSD servers, but in an ECE environment it cannot determine which nodes are NSD servers.
Symptom |
Node expel/Lost Membership |
Environment |
All |
Trigger |
NSD servers |
Workaround |
None |
|
5.1.2.8 |
GNR |
IJ43806 |
Suggested |
mmafmcosctl object download prints a "Queued" number of items for metadata downloads, when it is actually processing them directly without queuing.
Symptom |
Error Message |
Environment |
Linux |
Trigger |
Running mmafmcosctl download with metadata only option. |
Workaround |
None |
|
5.1.2.8 |
AFM |
IJ41697 |
Suggested |
The "dig" command used to query the status crashed on Ubuntu and SLES when called from the sysmonitor daemon.
Symptom |
Unexpected Results/Behavior |
Environment |
x86_64-linux only (except RHEL) |
Trigger |
The issue does not affect RHEL. On SLES and Ubuntu, the standard stream handling in the sysmon daemon caused the "dig" command to crash. |
Workaround |
None |
|
5.1.2.7 |
System Health |
IJ41620 |
High Importance
|
Running mmbuildgpl on x86_64 with Linux kernels that include fixes for the retbleed vulnerability (CVE-2022-29900) results in an error. As a result, GPFS is not usable with these kernel versions. Specifically, this problem is hit with:
- SLES 15 SP3 kernel update 5.3.18-150300.59.87.1 or higher
- SLES 15 SP4 kernel update 5.14.21-150400.24.11.1
- Ubuntu 22.04 kernel update 5.15.0-45.48
It is expected that the same changes will also be backported to RHEL, but no RHEL kernel updates with retbleed fixes have been released yet.
The same applies to Ubuntu 20.04; no kernel updates have been released yet with these changes, but this should happen eventually.
The information provided by the Linux distributions is a useful reference:
https://www.suse.com/security/cve/CVE-2022-29900.html
https://ubuntu.com/security/CVE-2022-29900
https://access.redhat.com/security/cve/CVE-2022-29900
Symptom |
Component Level Outage (GPFS will be unusable on the node). |
Environment |
Linux (x86_64) |
Trigger |
This problem occurs when updating the Linux kernel to a version with retbleed patches included. |
Workaround |
The required change can also be applied manually:
- Edit the file /usr/lpp/mmfs/src/gpl-linux/Kbuild
- Around line 100 there is a line:
$(KBHOSTPROGS) := lxtrace
- Before that line, add a new one with:
CFLAGS_kdump-kern.o += -mfunction-return=keep
- Save the file and run mmbuildgpl again.
|
|
5.1.2.7 |
Core GPFS |
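The manual Kbuild edit described in the IJ41620 workaround can be sketched as a text transformation in Python. This is a minimal sketch, not an IBM-provided tool: the helper name patch_kbuild is hypothetical, it works on the file contents as a string, and it assumes the anchor line appears exactly as quoted in the workaround. The sample input is a stand-in, not the real Kbuild file.

```python
# Line quoted in the workaround as the insertion point.
ANCHOR = "$(KBHOSTPROGS) := lxtrace"
# Line the workaround says to add before the anchor.
NEW_LINE = "CFLAGS_kdump-kern.o += -mfunction-return=keep"


def patch_kbuild(text: str) -> str:
    """Return Kbuild text with NEW_LINE inserted directly before ANCHOR.

    Hypothetical helper: it leaves already-patched text unchanged and
    raises ValueError when the anchor line cannot be found.
    """
    lines = text.splitlines()
    if any(line.strip() == NEW_LINE for line in lines):
        return text  # already patched, nothing to do
    for i, line in enumerate(lines):
        if line.strip() == ANCHOR:
            lines.insert(i, NEW_LINE)
            return "\n".join(lines) + "\n"
    raise ValueError("anchor line not found; Kbuild differs from the workaround text")


# Stand-in content for illustration only.
sample = "# ... earlier Kbuild lines ...\n$(KBHOSTPROGS) := lxtrace\n"
print(patch_kbuild(sample))
```

After writing the patched text back to /usr/lpp/mmfs/src/gpl-linux/Kbuild, mmbuildgpl is run again as the workaround states. Keeping a backup copy of the original file is advisable.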
IJ41473 |
Critical |
Files or directories that are accessed through CES NFS (Ganesha) while also being accessed concurrently by other means can report wrong inode attributes. This can appear as data corruption.
Symptom |
Unexpected Results/Behavior |
Environment |
Linux |
Trigger |
Access a file (or directory) through NFS Ganesha and modify the same file from another node. |
Workaround |
Since this problem is tied to using CES NFS, not using CES NFS can avoid this problem. |
|
5.1.2.7 |
NFS-ganesha |
IJ41758 |
Suggested |
Parts of GPFS are kernel modules that are loaded upon startup and used by other components. Usage counters were not used correctly in the tracedev module, which can lead to the module being unloaded while still in use, resulting in a kernel crash. One case where this is possible is running the "mmvdisk server configure" and "mmvdisk server unconfigure" commands with the --recycle option.
Symptom |
Abend/Crash |
Environment |
Linux |
Trigger |
Run GPFS shutdown and startup. This is a rare problem, so running this or the mentioned "mmvdisk server" command in a loop will be necessary to trigger the problem. |
Workaround |
Avoid stopping GPFS immediately after starting up. |
|
5.1.2.7 |
Core GPFS |
IJ41651 |
Critical |
Linux kernel 4.2 added a new field to the Linux inode data structure. When an inode is reused under heavy workload, this field might not be initialized correctly, leading to a kernel crash when accessing the symlink.
Symptom |
Abend/Crash |
Environment |
Linux (excluding RHEL7)
|
Trigger |
This problem is highly dependent on the workload. If there is a workload creating directories, creating files underneath directories, deleting directories and also creating symlinks, there is a chance that this problem is hit. Build systems are a type of software that can exhibit this pattern.
|
Workaround |
It is possible to manually patch the GPL layer:
- Edit the file /usr/lpp/mmfs/src/gpl-linux/inode.c
- In function cxiSetOSNode after line: "case S_IFLNK:"
insert a new line with:
inodeP->i_link = NULL;
- Run mmbuildgpl again and restart GPFS on the node.
|
|
5.1.2.7 |
Core GPFS |
IJ41831 |
High Importance
|
If a policy scan, initiated from the mmbackup command, fails and the mmbackup shadowDB file contains an entry for a file that was previously backed up but is now deleted, and the inode of that file has been assigned to a newly created file, then the mmbackup shadowDB file will have duplicate records for that file.
Symptom |
Component Level Outage |
Environment |
All |
Trigger |
This problem occurs under the following conditions:
- An entry exists in the mmbackup shadowDB for a file that has been deleted.
- The inode for the file described in condition 1 has been assigned to a newly created file that needs to be backed up.
- The policy scan done by mmbackup fails.
|
Workaround |
Fix the root cause of policy scan failure and rebuild shadowDB.
|
|
5.1.2.7 |
mmbackup |
IJ42150 |
Suggested |
Running mmafmctl prefetch with the -Y option results in a segfault.
Symptom |
segfault. |
Environment |
Linux |
Trigger |
mmafmctl prefetch command with -Y option |
Workaround |
None |
|
5.1.2.7 |
AFM |
IJ42164 |
High Importance
|
The AFM gateway daemon crashes during resync due to an invalid logAssert.
Symptom |
Crash |
Environment |
Linux |
Trigger |
AFM replication |
Workaround |
None |
|
5.1.2.7 |
AFM |
IJ42165 |
High Importance
|
AFM+COS replication gets stuck with requeued messages when a file is created, deleted and recreated with the same name before the replication is started to the COS.
Symptom |
Unexpected results. |
Environment |
Linux |
Trigger |
AFM replication |
Workaround |
None |
|
5.1.2.7 |
AFM |
IJ42267 |
Suggested |
While using mmafmcosctl download --all, the download will fail if the directory contains a space in the name.
Symptom |
Failed to download files |
Environment |
Linux |
Trigger |
mmafmcosctl download --all |
Workaround |
None |
|
5.1.2.7 |
AFM |
IJ41327 |
Suggested |
When one CES node gets rebooted, NFS client lock requests might fail with a "NLM_DENIED" error.
Symptom |
Lock request will fail (NLM_DENIED or NLM_BLOCKED error can be seen in tcpdump reply frame of LOCK Request).
|
Environment |
All |
Trigger |
When one of the protocol nodes of the cluster gets rebooted or a failover happens and a lock request is attempted on the same file.
|
Workaround |
None in NFSv3. The issue is not present in NFSv4, so one workaround is to use NFSv4 instead of NFSv3.
|
|
5.1.2.7 |
NFS-ganesha |
IJ42301 |
High Importance
|
AFM recovery fails with error 80 due to incorrect checks for the inode attributes. This error causes the replication to be stuck.
Symptom |
Unexpected results. |
Environment |
Linux |
Trigger |
AFM recovery |
Workaround |
None |
|
5.1.2.7 |
AFM |
IJ42467 |
High Importance
|
The AFM gateway node deadlocks during the read operation if both prefetch and an application try to read the same file simultaneously.
Symptom |
Deadlock |
Environment |
All |
Trigger |
Read operation on AFM uncached file |
Workaround |
None |
|
5.1.2.7 |
AFM |
IJ42500 |
High Importance
|
The following assert is triggered:
logAssertFailed: totalReceived == scatteredP->scattered_total_len || (totalReceived == 0 && scatteredIndex == scatteredP->scattered_count)
Symptom |
Abend/Crash |
Environment |
All |
Trigger |
Network is not good which leads to TCP connection reconnect. |
Workaround |
None |
|
5.1.2.7 |
Core GPFS |
IJ42511 |
High Importance
|
When an NVMe device is becoming active, it is necessary for ESS to poll the device to determine if it is ready for I/O. It does this by polling the final LBA of the device to see if reads are allowed.
This is because the devices become visible to the OS prior to becoming ready to handle read/write requests.
The original implementation, however, would incorrectly claim that media errors on the final LBA mean that the device isn't ready.
As a result, it is possible that legitimate media problems on the final LBA of an NVMe will induce ESS to claim that the entire device is not available.
This problem can be identified by an NVMe pdisk going missing after seeing unrecovered read errors in the Spectrum Scale RAID recovery group event log (mmvdisk recoverygroup list --events).
Symptom |
Component Level Outage |
Environment |
Linux |
Trigger |
Corrupted physical block mapped to the final logical block within an NVMe namespace.
|
Workaround |
None |
|
5.1.2.7 |
ESS/GNR |
IJ42229 |
Critical |
If verbsRdmaSend configuration is enabled, and the verbs connection is disconnected and reconnected due to any error other than node shutdown or node failure, it may cause some RPC reply messages to be left in the internal table unintentionally.
These messages will remain in the internal table forever, as no ack message can clean them up. A deadlock will not occur immediately, because these RPC messages have been processed correctly. However, the problem may occur when the 32-bit message IDs wrap around and are reused.
Some new messages may be recognized as duplicated RPCs and be rejected by the destination node. These new messages will stay in 'pending' state and result in deadlock.
Symptom |
Hang/Deadlock/Unresponsiveness/Long Waiters |
Environment |
All |
Trigger |
For a cluster which has the verbsRdmaSend configuration enabled, this problem may occur if the verbs connection is disconnected and reconnected due to any error other than node shutdown or node failure (for example, because of a network issue). |
Workaround |
Recycle GPFS daemon. |
|
5.1.2.7 |
RDMA |
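The failure mode described for IJ42229 can be illustrated with a toy model in Python. This is a sketch under stated assumptions, not GPFS code: the class ReplyTable and the functions shown are hypothetical, and the only facts taken from the description are that message IDs are 32 bits wide, that leaked entries are never cleaned up by an ack, and that a reused ID is rejected as a duplicate.

```python
MASK32 = 0xFFFFFFFF  # 32-bit message IDs wrap modulo 2**32


def next_id(msg_id: int) -> int:
    """Advance a 32-bit message ID, wrapping to 0 after 0xFFFFFFFF."""
    return (msg_id + 1) & MASK32


class ReplyTable:
    """Toy receiver-side table of already processed RPC IDs (hypothetical)."""

    def __init__(self) -> None:
        self.seen = set()

    def receive(self, msg_id: int) -> str:
        # An ID still present in the table is treated as a duplicate RPC.
        if msg_id in self.seen:
            return "rejected-as-duplicate"
        self.seen.add(msg_id)
        return "processed"

    def ack(self, msg_id: int) -> None:
        # Normal cleanup path; the leaked entries never reach this.
        self.seen.discard(msg_id)


table = ReplyTable()
stale = 0x0000002A
print(table.receive(stale))            # processed, but never acked (leaked)

# After 2**32 further messages the sender's counter reaches the same ID.
reused = (stale + 2**32) & MASK32
print(reused == stale)                 # True
print(table.receive(reused))           # rejected-as-duplicate
```

In this model the rejected message would stay pending on the sender, which corresponds to the deadlock described above; recycling the daemon empties the table.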
IJ43167 |
High Importance
|
mmbuildgpl fails on SLES 15.3 with the new kernel 5.3.18-150300.59.90-default, reporting the following error:
“No rule to make target 'vmlinux', needed by '/usr/lpp/mmfs/src/gpl-linux/kdump-kern-dummy.ko'”
Symptom |
mmbuildgpl will fail on SLES 15.3 kernel version 5.3.18-150300.59.90-default
|
Environment |
SLES 15.3 kernel version 5.3.18-150300.59.90-default (all architectures).
|
Trigger |
mmbuildgpl will fail on SLES 15.3 when kernel is upgraded to 5.3.18-150300.59.90-default |
Workaround |
Clear KBUILD_BUILTIN macro inside /usr/lpp/mmfs/src/gpl-linux/Kbuild
KBUILD_BUILTIN :=
This can be done after the following surrounding code:
#For s390x: -pg and -fomit-frame-pointer are incompatible
ifeq ($(ARCH),s390)
ifdef CONFIG_FUNCTION_TRACER
ORIG_CFLAGS := $(KBUILD_CFLAGS)
KBUILD_CFLAGS = $(subst -pg,,$(ORIG_CFLAGS))
endif
endif
KBUILD_BUILTIN :=
|
|
5.1.2.7 |
Build |
IJ43330 |
High Importance
|
logAssertFailed: fileId.inodeNum > 0 when running AFM Recovery or Resync
Symptom |
Lost Membership |
Environment |
Linux |
Trigger |
Role Reversal to make old Primary as Secondary and the old Secondary being promoted to Primary.
|
Workaround |
None |
|
5.1.2.7 |
AFM |
IJ40659 |
Suggested |
Trace parameters set through the mmtracectl command do not keep the node classes.
Symptom |
Unexpected behavior |
Environment |
All |
Trigger |
Set trace parameters with mmtracectl command. |
Workaround |
Explicitly set the trace parameters via mmchconfig command. |
|
5.1.2.6 |
Admin |
IJ40707 |
Suggested |
The mmlsquota command reports duplicate lines when the -C option is specified.
Symptom |
Duplicate output |
Environment |
All |
Trigger |
Specify the Device argument that also belongs to the remote cluster in the -C argument. |
Workaround |
Specify the Device argument that does not belong to the -C ClusterName. |
|
5.1.2.6 |
Admin Commands |
IJ40709 |
Suggested |
GPFS fails to process the kmipServerUri field in a remote key manager stanza in the RKM.conf file if it is provided as an IPv6 address, e.g., kmipServerUri = tls://[fd9a:f0d0:1002:11::31]:5696.
Symptom |
Failure to read files from encrypted file systems/sets. |
Environment |
All |
Trigger |
None |
Workaround |
Use the hostname instead. |
|
5.1.2.6 |
Security |
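The bracketed address form quoted in IJ40709 is easy to mis-split, because the address itself contains colons. The Python sketch below shows how the example URI decomposes with the standard library parser, next to a naive colon split. It is only an illustration of the URI syntax and makes no claim about how GPFS parses RKM.conf.

```python
from urllib.parse import urlsplit

# Example value taken from the APAR description.
uri = "tls://[fd9a:f0d0:1002:11::31]:5696"

parts = urlsplit(uri)
# hostname strips the square brackets; port is the number after the last colon.
print(parts.scheme)    # tls
print(parts.hostname)  # fd9a:f0d0:1002:11::31
print(parts.port)      # 5696

# A naive "split on the first colon" treats part of the address as the host.
rest = uri.split("://", 1)[1]
naive_host = rest.split(":")[0]
print(naive_host)      # [fd9a
```

Until the fix is applied, the workaround remains to use the hostname in kmipServerUri.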
IJ40754 |
High Importance
|
Running a blocking trace when the node is low on memory and swapping can lead to a deadlock.
Symptom |
Hang/Deadlock/Unresponsiveness/Long Waiters |
Environment |
Linux |
Trigger |
Run traces in blocking mode, while the available memory is low and processes are getting swapped out. |
Workaround |
Ensure that sufficient free memory is available, so that the trace tool is not being swapped out. |
|
5.1.2.6 |
Trace |
IJ40815 |
High Importance
|
AFM Recovery procedure sometimes fails with error 112.
Symptom |
Unexpected Behavior |
Environment |
Linux (AFM gateway node) |
Trigger |
Running recovery on a fileset whose .ptrash directory has the local bit reset on it. |
Workaround |
Set the ptrash bit manually on the .ptrash directory (if it is found to be reset). |
|
5.1.2.6 |
AFM |
IJ40817 |
High Importance
|
A node delete in an ECE cluster will cause the declustered array to be stuck in critical rebuild, preventing the system from performing any data rebuild.
Symptom |
Unexpected Results/Behavior |
Environment |
Linux |
Trigger |
Remove an ECE node with mmvdisk. |
Workaround |
None |
|
5.1.2.6 |
ESS, GNR |
IJ39267 |
High Importance
|
CCR becomes slow on a quorum node when the configured firewall drops the FIN TCP/IP packets of CCR requests.
Symptom |
Performance impact/degradation. |
Environment |
Linux (x86_64) |
Trigger |
Misconfigured firewall. |
Workaround |
None |
|
5.1.2.6 |
CCR |
IJ40863 |
High Importance
|
"mmsdrrestore --ccr-repair" is not removing CCR tiebreaker disks from the cluster configuration in case those CCR tiebreaker disks aren't available when this command is executed. This happens only in case the CCR nodes file '/var/mmfs/ccr/ccr.nodes' is not available on the quorum nodes.
Symptom |
Unexpected results/behavior |
Environment |
All |
|
Trigger |
'/var/mmfs/ccr/ccr.nodes' not available on the quorum nodes in conjunction with CCR tiebreaker disks not accessible on those quorum nodes. |
Workaround |
None |
|
5.1.2.6 |
CCR, Admin command "mmsdrrestore --ccr-repair" |
IJ39112 |
High Importance
|
Mutex contention could lead to slow write performance on AIX when multiple threads try to flush the same file, which contains many blocks, at the same time.
Symptom |
Performance Impact/Degradation. |
Environment |
AIX/Power, Windows (x86_64) |
Trigger |
Multiple threads invoking sync on the same file at the same time. |
Workaround |
None |
|
5.1.2.6 |
Core GPFS |
IJ40726 |
High Importance
|
A problem was identified when running in a mixed level cluster where some nodes support msgqueue and others do not. Excessive librdkafka threads will be created for each IO event on the 5.1.2+ nodes resulting in thread exhaustion for that particular node.
Symptom |
Hang/Deadlock/Unresponsiveness/Long Waiters |
Environment |
Linux |
Trigger |
Running a cluster where msgqueue is supported. Upgrading a node to 5.1.2+ where msgqueue is no longer supported. Running IO to the 5.1.2+ node. |
Workaround |
None |
|
5.1.2.6 |
Watch Folder, File audit logging |
IJ41097 |
High Importance
|
Symlink is not fetched from home on AFM cache fileset if the gateway kernel version is ≥ 5.10. This happens because memory is not allocated for symlink target path.
Symptom |
Unexpected results |
Environment |
Linux |
Trigger |
AFM caching with symlinks. |
Workaround |
None |
|
5.1.2.6 |
AFM |
IJ41098 |
High Importance
|
AFM gateway asserts when replicating the Rmdir operation on a dependent fileset.
Symptom |
Assert |
Environment |
Linux |
Trigger |
AFM caching with dependent filesets. |
Workaround |
None |
|
5.1.2.6 |
AFM |
IJ41099 |
High Importance
|
Resync is not able to create a hardlink if the file is evicted while the link operation is in the queue.
Symptom |
Hardlink operation requeued. |
Environment |
Linux |
Trigger |
AFM caching with hardlinks |
Workaround |
None |
|
5.1.2.6 |
AFM |
IJ41100 |
High Importance
|
Lookup on hardlinks fails intermittently on AFM cache filesets. This is due to a race between multiple threads performing the lookup of the same hardlink from different directories.
Symptom |
Unexpected results |
Environment |
Linux |
Trigger |
AFM caching with hardlinks |
Workaround |
None |
|
5.1.2.6 |
AFM |
IJ41101 |
Suggested |
If a file is accessed by SMB and AFM tries to replicate the same file, AFM requeues the operation due to a lock conflict. It replicates the file later, when the file is closed by SMB.
Symptom |
Write operation requeued. |
Environment |
Linux |
Trigger |
Simultaneous access of file from SMB and AFM. |
Workaround |
None |
|
5.1.2.6 |
AFM |
IJ41105 |
High Importance
|
When rename/remove operations are performed on dependent filesets that are linked inside AFM independent filesets, and these operations are replicated to the remote site, the local removed/renamed inodes are not reclaimed, resulting in more inodes being held inUse than necessary.
Symptom |
Unexpected Behavior |
Environment |
Linux (AFM Gateway nodes) |
Trigger |
Remove/Rename being performed on the dependent fileset inodes - when this dependent fileset is linked under an AFM independent fileset. |
Workaround |
None |
|
5.1.2.6 |
AFM |
IJ41014 |
Critical |
After upgrading Scale on the exporting kNFS nodes to 5.1.3.0 (or 5.1.2.4), NFS clients mounting from Scale report stale file handles after a while.
Symptom |
IO Error |
Environment |
Linux |
Trigger |
This is triggered by the NFS client sending an NFS commit message to the Linux kernel nfsd server on a GPFS node. The exact trigger depends on the NFS client and on memory usage on the NFS client system, so it can be hard to predict. |
Workaround |
There is no direct workaround. Using the CES protocol stack, which uses NFS Ganesha could be a workaround, but is a larger config change. |
|
5.1.2.6 |
NFS |
IJ41211 |
High Importance
|
Objects are not fully prefetched at the cache on reading the 4th block when afmPrefetchThreshold is set to 0 and the I/O pattern is random.
Symptom |
Unexpected Behavior |
Environment |
Linux (AFM Gateway nodes) |
Trigger |
- In RO/LU/IW/SW mode of operation, with AFM COS as the backend, have an uncached file (in SW or IW mode, evict the file from the cache).
- Read 4 data blocks randomly on the file at the cache (make sure no 2 blocks are read sequentially).
|
Workaround |
Read the 4 blocks sequentially instead of randomly. |
|
5.1.2.6 |
AFM |
IJ41133 |
Suggested |
When recovering from a Kafka down period, an audit event is sent to indicate the number of events that were dropped, as well as a subEvent indicating what happened. This subEvent contained invalid JSON.
Symptom |
Unexpected Results/Behavior |
Environment |
Linux |
Trigger |
The external Kafka server needs to be down while audit events are being created. Once the external Kafka server is back up, it will receive this event. |
Workaround |
An exception can be caught with JSON parsers when the invalid JSON is detected. |
|
5.1.2.6 |
Watch folder |
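The IJ41133 workaround, catching the parser exception on the consumer side, can be sketched in Python as follows. This is a minimal sketch: the function name parse_event and both sample payloads are invented for illustration and do not reproduce the actual audit event format.

```python
import json


def parse_event(raw: str):
    """Return the decoded event, or None when the payload is not valid JSON."""
    try:
        return json.loads(raw)
    except json.JSONDecodeError as err:
        # Skip the malformed event instead of letting the consumer crash.
        print(f"skipping malformed audit event: {err.msg} at position {err.pos}")
        return None


# Invented payloads: one well-formed, one with a missing value.
good = '{"event": "example"}'
bad = '{"subEvent": }'

print(parse_event(good))
print(parse_event(bad))
```

Returning None keeps the consumer loop running; whether to log, count, or store the rejected payload is a choice for the consuming application.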
IJ41134 |
Suggested |
After upgrading Spectrum Scale from versions 5.1.2.0-5.1.2.3 to 5.1.2.4 or any higher version, the NFSv4 client will report "unknown error 521" and fail to access the NFS share.
Symptom |
NFSv4 clients report "unknown error 521" after the upgrade. |
Environment |
All |
Trigger |
The issue is caused by the NFSv4 file handle size change in 5.1.2.4 and higher versions. |
Workaround |
Unmount and remount the NFSv4 share on all NFS clients. |
|
5.1.2.6 |
cNFS, NFS |
IJ41254 |
Suggested |
NFS-Ganesha crashed with the stack below during a file lock request from an NFS client.
(gdb) bt
#0 0x00007f889ae809bf in raise () from /lib64/libpthread.so.0
#1 0x00000000004427b8 in crash_handler (signo=11, info=0x7f86fa1debb0, ctx=0x7f86fa1dea80) at /usr/src/debug/gpfs.nfs-ganesha-2.7.5-ibm067.04.295048.el8.x86_64/MainNFSD/nfs_init.c:239
#2 <signal handler called>
#3 lock_entry_dec_ref (lock_entry=0x7f86386bd2c0) at /usr/src/debug/gpfs.nfs-ganesha-2.7.5-ibm067.04.295048.el8.x86_64/SAL/state_lock.c:708
#4 0x00000000004ae6fa in free_cookie (cookie_entry=0x7f85c02144a0, unblock=true) at /usr/src/debug/gpfs.nfs-ganesha-2.7.5-ibm067.04.295048.el8.x86_64/SAL/state_lock.c:1371
#5 0x00000000004af31e in state_complete_grant (cookie_entry=0x7f85c02144a0) at /usr/src/debug/gpfs.nfs-ganesha-2.7.5-ibm067.04.295048.el8.x86_64/SAL/state_lock.c:1717
#6 0x0000000000499098 in nlm4_Granted_Res (args=0x7f84fc602e38, req=0x7f84fc602730, res=0x7f84fc309a00) at /usr/src/debug/gpfs.nfs-ganesha-2.7.5-ibm067.04.295048.el8.x86_64/Protocols/NLM/nlm_Granted_Res.c:101
#7 0x000000000045a0ab in nfs_rpc_process_request (reqdata=0x7f84fc602730) at /usr/src/debug/gpfs.nfs-ganesha-2.7.5-ibm067.04.295048.el8.x86_64/MainNFSD/nfs_worker_thread.c:1331
#8 0x000000000045a97e in nfs_rpc_valid_NLM (req=0x7f84fc602730) at /usr/src/debug/gpfs.nfs-ganesha-2.7.5-ibm067.04.295048.el8.x86_64/MainNFSD/nfs_worker_thread.c:1593
#9 0x00007f889c8e6538 in svc_vc_decode (req=0x7f84fc602730) at /usr/src/debug/gpfs.nfs-ganesha-2.7.5-ibm067.04.295048.el8.x86_64/libntirpc/src/svc_vc.c:834
#10 0x000000000044d1aa in nfs_rpc_decode_request (xprt=0x7f866c7d4500, xdrs=0x7f84fc557f70) at /usr/src/debug/gpfs.nfs-ganesha-2.7.5-ibm067.04.295048.el8.x86_64/MainNFSD/nfs_rpc_dispatcher_thread.c:1349
#11 0x00007f889c8e6449 in svc_vc_recv (xprt=0x7f866c7d4500) at /usr/src/debug/gpfs.nfs-ganesha-2.7.5-ibm067.04.295048.el8.x86_64/libntirpc/src/svc_vc.c:807
#12 0x00007f889c8e2b91 in svc_rqst_xprt_task (wpe=0x7f866c7d4758) at /usr/src/debug/gpfs.nfs-ganesha-2.7.5-ibm067.04.295048.el8.x86_64/libntirpc/src/svc_rqst.c:779
#13 0x00007f889c8e3050 in svc_rqst_epoll_events (sr_rec=0x5572860, n_events=1) at /usr/src/debug/gpfs.nfs-ganesha-2.7.5-ibm067.04.295048.el8.x86_64/libntirpc/src/svc_rqst.c:956
#14 0x00007f889c8e32e9 in svc_rqst_epoll_loop (sr_rec=0x5572860) at /usr/src/debug/gpfs.nfs-ganesha-2.7.5-ibm067.04.295048.el8.x86_64/libntirpc/src/svc_rqst.c:1029
#15 0x00007f889c8e339f in svc_rqst_run_task (wpe=0x5572860) at /usr/src/debug/gpfs.nfs-ganesha-2.7.5-ibm067.04.295048.el8.x86_64/libntirpc/src/svc_rqst.c:1065
#16 0x00007f889c8ebc2d in work_pool_thread (arg=0x7f84c0006270) at /usr/src/debug/gpfs.nfs-ganesha-2.7.5-ibm067.04.295048.el8.x86_64/libntirpc/src/work_pool.c:181
#17 0x00007f889ae7614a in start_thread () from /lib64/libpthread.so.0
#18 0x00007f889a783dc3 in clone () from /lib64/libc.so.6
Symptom |
Crash |
Environment |
All |
Trigger |
Users might hit the crash if the same file is accessed by multiple clients with a lot of overlapping file operations (create/delete/lock). |
Workaround |
None |
|
5.1.2.6 |
cNFS, NFS-ganesha |
IJ41328 |
High Importance
|
A synchronous on-demand read that triggers recovery on an Independent-Writer mode fileset can block recovery if the home file is in migrated state.
(show details)
Symptom |
Performance Impact |
Environment |
Linux (AFM gateway node) |
Trigger |
A sync read is performed on an IW fileset that needs recovery to be run, while the file is migrated to HSM at the home site. |
Workaround |
Trigger recovery separately on the IW fileset through ls or touch operations, and then trigger such sync reads on uncached files that might be migrated to HSM at the home site. |
|
5.1.2.6 |
AFM |
IJ41280 |
Critical |
CES cluster is showing an obscure error.
Symptom |
None |
Environment |
None |
Trigger |
CES resume |
Workaround |
Bring the CES cluster up with no errors. |
|
5.1.2.6 |
None |
IJ41281 |
High Importance
|
Cluster manager takeover thread causes deadlock when UID remapping is enabled.
Symptom |
Deadlock |
Environment |
All |
Trigger |
UID remapping with cluster manager takeover. |
Workaround |
None |
|
5.1.2.6 |
UID remapping |
IJ41282 |
High Importance
|
The daemon asserts when the number of UID remap entries is more than 8192. This issue happens due to an incorrect logAssert when UID remapping is enabled.
Symptom |
Assert |
Environment |
All |
Trigger |
UID remapping and user entries to remap are greater than 8192. |
Workaround |
None |
|
5.1.2.6 |
UID remapping |
IJ41374 |
High Importance
|
FSSTRUCT errors are logged in the system log file. After formatting these errors with the fsstructlx.awk tool, the FSSTRUCT error is FSErrValidate (108) with type=eaOverflowBlock.
Symptom |
FSSTRUCT error reported in system log file. |
Environment |
All |
Trigger |
Extended attributes exhaust the free inode space and start to allocate overflow blocks while a snapshot is also in use. |
Workaround |
None |
|
5.1.2.6 |
Snapshot and extended attribute |
IJ39624 |
Suggested |
On the latest Cygwin (versions ≥ 3.3), an attempt to uninstall GPFS on Windows might display a dialog box complaining about access denied on uninstall.lnk. The dialog box presents options to Abort, Retry, or Ignore the error. Ignoring the error bypasses the issue and results in a successful uninstall.
Symptom |
Upgrade/Install failure. |
Environment |
Windows (x86_64) |
Trigger |
Cygwin version ≥ 3.3. |
Workaround |
When presented with the dialog box complaining about uninstall.lnk, click on "Ignore" and that should let the uninstall complete.
Then from an elevated Cygwin terminal:
cd /usr/lpp/mmfs/support; chmod 777 uninstall.lnk; rm uninstall.lnk |
|
5.1.2.5 |
Install, Upgrade |
IJ39626 |
Suggested |
Not all ACL update interfaces understand and preserve the rich Windows ACL flags that only get set via native Windows ACL interfaces such as icacls or Explorer GUI. For example, mmputacl on any supported platform could clobber these flags. Hence, even if the ACL-flags are somehow blank, there still might be a valid ACL.
Symptom |
Unexpected Results/Behavior. |
Environment |
Windows (x86_64) |
Trigger |
GPFS ACL updates (like mmputacl, mmeditacl etc) that do not preserve the rich Windows ACL flags. |
Workaround |
None |
|
5.1.2.5 |
Authentication, ACLs |
IJ39945 |
Critical |
A replica mismatch could occur if the file system panics or a node fails while directIO writes are in progress. This could happen on a file system that has data replication and rapid repair enabled.
Symptom |
Unexpected Results/Behavior |
Environment |
All |
Trigger |
File system panic or node failure while a directIO write is in progress and a disk is down in one or more replicas |
Workaround |
Disable rapid repair feature on the file system. |
|
5.1.2.5 |
Core GPFS |
IJ39946 |
Suggested |
Assert respPP ≠ NULL in AFM environment.
Symptom |
Crash |
Environment |
Linux |
Trigger |
Unresponsive AFM home |
Workaround |
None |
|
5.1.2.5 |
AFM |
IJ40024 |
High Importance
|
AFM Object download does not honor refresh intervals causing performance issues. For example, the list operation is sent to the COS before the refresh interval.
Symptom |
Performance impact |
Environment |
Linux |
Trigger |
Object download on a large bucket. |
Workaround |
None |
|
5.1.2.5 |
AFM |
IJ40027 |
High Importance
|
AFM is not able to bail out stuck messages on the replication queue when afmFastCreate is enabled and the home is stuck.
Symptom |
Long Waiters |
Environment |
All |
Trigger |
Having afmFastCreate enabled on an AFM replication fileset while large writes to files get stuck on a home that is not responding. |
Workaround |
None |
|
5.1.2.5 |
AFM |
IJ40028 |
High Importance
|
GPFS daemon assert: exp(updateInProgress == 0) in file repUpdate.C
Symptom |
Abend/Crash |
Environment |
All |
Trigger |
Multiple node read/write to the same file. |
Workaround |
None |
|
5.1.2.5 |
Core GPFS |
IJ38093 |
High Importance
|
A deadlock occurs after changing the AFM gateway node using the mmchnode command, because the node change is not propagated correctly to all the nodes in the cluster.
Symptom |
Deadlock |
Environment |
Linux |
Trigger |
mmchnode --gateway/--nogateway. |
Workaround |
Restart GPFS on gateway nodes. |
|
5.1.2.5 |
AFM |
IJ40029 |
Suggested |
Collecting data about running threads on the node (e.g. from a gpfs.snap) concurrently with an mmfsd restart can crash the node.
Symptom |
Abend/Crash |
Environment |
Linux |
Trigger |
Collect debug data for kernel threads (e.g. from gpfs.snap) concurrently while mmfsd is restarting. |
Workaround |
Avoid debug data collection (e.g. gpfs.snap) while mmfsd is restarting. |
|
5.1.2.5 |
Core GPFS |
IJ40034 |
High Importance
|
AFM object replication fails on files with 64-bit inode numbers.
Symptom |
Unexpected results |
Environment |
Linux |
Trigger |
Upload on objects with 64-bit inode numbers. |
Workaround |
None |
|
5.1.2.5 |
AFM |
IJ39454 |
High Importance
|
GPFS daemon crashes with logAssertFailed: !"Trash_Domain" in file tokenclass.C.
Symptom |
Abend/Crash |
Environment |
All |
Trigger |
Unmount the file system for any reason. |
Workaround |
Disable this assert via mmchconfig disableAssert. |
|
5.1.2.5 |
Core GPFS |
IJ40064 |
High Importance
|
Network instability triggering socket reconnects can cause certain IBM Spectrum Scale messages to be lost and not re-transmitted. Additionally, if network instability provokes a node failure, these lost messages can prevent the cluster from moving forward with the cluster-wide node leave protocol. This hang can cause loss of cluster function, including file system availability.
(show details)
Symptom |
Hang/Deadlock/Unresponsiveness/Long Waiters |
Environment |
Linux |
Trigger |
Node expel during socket reconnects. |
Workaround |
Restart the cluster. |
|
5.1.2.5 |
ESS, GNR |
IJ39013 |
Suggested |
In some large-scale deployments, highly concurrent lseek(SEEK_HOLE) calls to a specific file might cause performance degradation.
(show details)
Symptom |
Performance Impact/Degradation |
Environment |
Linux |
Trigger |
Highly concurrent lseek(SEEK_HOLE) calls on the same file. |
Workaround |
If the lseek(SEEK_HOLE) is being invoked from a grep CLI, the '-a' option can bypass the lseek(SEEK_HOLE) call. |
|
5.1.2.5 |
Core GPFS |
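For readers unfamiliar with the call named in IJ39013, the following is a minimal user-space sketch of an lseek(SEEK_HOLE) query. It is an illustration only, not GPFS code: `first_hole` is a hypothetical helper name, and the sketch assumes a Linux system where Python exposes `os.SEEK_HOLE`.

```python
import os


def first_hole(path, start=0):
    """Return the offset of the first hole at or after `start`.

    Hypothetical helper for illustration. Every file has an implicit
    hole at end-of-file, so the result never exceeds the file size.
    """
    fd = os.open(path, os.O_RDONLY)
    try:
        # SEEK_HOLE asks the file system where the next hole begins.
        return os.lseek(fd, start, os.SEEK_HOLE)
    finally:
        os.close(fd)
```

Many processes issuing this query against the same file at the same time is the access pattern the APAR describes.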
IJ40410 |
High Importance
|
When adding disks, the block allocation map is extended by adding new blocks. If a block already exists at the location but is outside the current file size, then this assert is hit.
(show details)
Symptom |
Node expel/Lost Membership |
Environment |
All |
Trigger |
New disk add. |
Workaround |
None |
|
5.1.2.5 |
Core GPFS |
IJ40411 |
High Importance
|
Readdir fails on AFM+COS filesets with -gcs option as the directory entries are created with an incorrect type.
(show details)
Symptom |
Unexpected results |
Environment |
Linux |
Trigger |
AFM+COS fileset access with gcs option. |
Workaround |
None |
|
5.1.2.5 |
AFM |
IJ40414 |
High Importance
|
When the filesetdf feature is enabled without quota management, the df command on an independent fileset should return the values corresponding to the file system instead of garbage.
(show details)
Symptom |
Random output from df command. |
Environment |
All |
Trigger |
df command on filesetdf enabled and no quota management file system. |
Workaround |
Enable quota management when using the filesetdf feature. |
|
5.1.2.5 |
Quotas |
IJ40464 |
High Importance
|
The SUID and SGID bits are not cleared after a successful write or truncate to a file by a non-owner.
(show details)
Symptom |
Unexpected Results/Behavior |
Environment |
Linux |
Trigger |
Create a file with the SUID and SGID bits set. As a non-owner or non-root user, write to the file with the write() system call or truncate the file with the truncate() system call. |
Workaround |
Ensure that only owners can write to an executable binary file that has the SUID/SGID bit set. |
|
5.1.2.5 |
Core GPFS |
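The workaround for IJ40464 (only owners should be able to write to SUID/SGID binaries) can be audited with a short scan. The sketch below is an assumption-laden illustration: `writable_setid_files` is a hypothetical helper, and it inspects POSIX mode bits only, so write access granted through ACLs is not detected.

```python
import os
import stat


def writable_setid_files(root):
    """Yield (path, mode) for SUID/SGID regular files writable by group or other.

    Hypothetical audit helper; checks mode bits only, not ACLs.
    """
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            st = os.lstat(path)
            if not stat.S_ISREG(st.st_mode):
                continue
            has_setid = st.st_mode & (stat.S_ISUID | stat.S_ISGID)
            non_owner_write = st.st_mode & (stat.S_IWGRP | stat.S_IWOTH)
            if has_setid and non_owner_write:
                yield path, stat.S_IMODE(st.st_mode)
```

Any file the scan reports is a candidate for tightening permissions until the fix is applied.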
IJ40563 |
Suggested |
Create operation hitting error 2.
(show details)
Symptom |
Operation queue gets dropped. |
Environment |
Linux |
Trigger |
Error 2 hits and queue gets dropped. |
Workaround |
None |
|
5.1.2.5 |
AFM COS |
IJ40564 |
Suggested |
When mapping is configured, AFM COS does not use the non-MDS node (mapping) to replicate the write and create operations as part of queue execution.
(show details)
Symptom |
Replication is happening from MDS node only in mapping. |
Environment |
Linux |
Trigger |
NON-MDS node is not being used in mapping for create and write operations. |
Workaround |
None |
|
5.1.2.5 |
AFM COS |
IJ40565 |
High Importance
|
AFM gateway daemon assert with (handlerListLock.isLocked() or DaemonShuttingDown)
(show details)
Symptom |
Crash |
Environment |
Linux |
Trigger |
AFM gateway node leaving the cluster. |
Workaround |
None |
|
5.1.2.5 |
AFM |
IJ40280 |
Critical |
Potential for data integrity issues on all clusters using RDMA.
(show details)
Symptom |
Unexpected Results/Behavior |
Environment |
Linux |
Trigger |
Race condition between the RDMA software layer and IBM Spectrum Scale when reading data. |
Workaround |
Disable RDMA or set nsdCksumTraditional configuration parameter to "yes". |
|
5.1.2.5 |
RDMA |
IJ38554 |
Critical |
Deadlock during AFM queue flush.
(show details)
Symptom |
Deadlock |
Environment |
Linux |
Trigger |
Stress testing |
Workaround |
None |
|
5.1.2.4 |
AFM |
IJ38784 |
Critical |
While updating the symlink target path on an AFM enabled fileset, the inode is not copied to the previous snapshot causing the assert.
(show details)
Symptom |
Crash |
Environment |
Linux |
Trigger |
AFM caching with symlinks and snapshots. |
Workaround |
None |
|
5.1.2.4 |
AFM |
IJ38785 |
High Importance
|
The SUID and SGID bits are not cleared after a successful write/truncate to a file by a non-owner.
(show details)
Symptom |
Unexpected Results/Behavior |
Environment |
Linux |
Trigger |
Create a file with the SUID and SGID bits set. As a non-root user or a non-group member user, write to the file with the write() system call or truncate the file with the truncate() system call. |
Workaround |
Ensure that only owners can write to an executable binary file that has the SUID/SGID bit set. |
|
5.1.2.4 |
Core GPFS |
IJ38786 |
High Importance
|
Given a parent directory with the SGID bit set, a file created with the SGID bit specified by a user who does not belong to the same group as the directory can still have the SGID bit set.
(show details)
Symptom |
Unexpected Results/Behavior |
Environment |
All |
Trigger |
Create a file with the SGID bit specified as a non-member group user in a directory with the SGID bit set. |
Workaround |
Remove the SGID bit from the directory. |
|
5.1.2.4 |
Core GPFS |
IJ38807 |
Suggested |
Issuing io_uring IORING_OP_READ_FIXED requests to read data into preallocated buffers fails with an error.
(show details)
Symptom |
I/O error |
Environment |
Linux |
Trigger |
No pre-conditions are necessary. |
Workaround |
When using io_uring, use IORING_OP_READ instead of IORING_OP_READ_FIXED. This would require changing the application issuing the requests and might come at a performance penalty. |
|
5.1.2.4 |
Core GPFS |
IJ38808 |
Critical |
Lookup fails on AFM NSD backend fileset root path if afmSyncNFSv4ACL option is set. AFM incorrectly tries to get NFSv4 ACLs on the remote cluster mount causing the failure.
(show details)
Symptom |
Unexpected results |
Environment |
Linux |
Trigger |
Using the afmSyncNFSv4ACL option where NSD backend filesets exist. |
Workaround |
Unset the afmSyncNFSv4ACL option. |
|
5.1.2.4 |
AFM |
IJ38874 |
Suggested |
Today there is no command to bring an AFM Inactive fileset to active.
(show details)
Symptom |
AFM fileset moving to Inactive/Dropped states. |
Environment |
All |
Trigger |
Fileset moving to Inactive state and needing recovery for any reason. |
Workaround |
Wait for an I/O operation on the fileset or touch a file inside the fileset to simulate an incoming I/O and trigger recovery on the fileset in question. |
|
5.1.2.4 |
AFM |
IJ38901 |
Suggested |
When the handler for AFM replication is created on the gateway node, the handler create time, the last replay time and the last sync time are all initialized to now time. If for some reason the handler couldn't go mounted and replicate to Home, this leads to AFM printing the last replay time as the same time as handler create time and gives a misconception that replication has actually happened.
(show details)
Symptom |
Error output |
Environment |
Linux |
Trigger |
Checking AFM replication handler for last replay and sync time, when there's a recovery pending and not happening on the fileset. |
Workaround |
None |
|
5.1.2.4 |
AFM |
IJ37068 |
High Importance
|
The codepath for flushing file data to disk did not properly check for a stale file system, resulting in a crash.
(show details)
Symptom |
Abend/Crash |
Environment |
Linux |
Trigger |
With file descriptor open and kept open, have file system go stale (e.g. restart daemon). Then issue a request to flush the data to a file (or implicit flushOnClose). |
Workaround |
None |
|
5.1.2.4 |
Core GPFS |
IJ38963 |
High Importance
|
AFM fileset resync failed with EINVAL error (22).
(show details)
Symptom |
I/O error |
Environment |
Linux |
Trigger |
AFM fileset resync operation (mmafmctl command with resync subcommand). |
Workaround |
None |
|
5.1.2.4 |
AFM |
IJ38964 |
High Importance
|
AFM Prefetch with --dir-list-file option where the list contains encoded directory names is not being processed and queued.
(show details)
Symptom |
Unexpected behavior. |
Environment |
Linux (AFM gateway node) |
Trigger |
Running prefetch (with or without --metadata-only option) using a list of encoded directory names, like the one generated from checkUncached (during an mmchfileset command run). |
Workaround |
Decode the directory list by hand and feed it to prefetch. |
|
5.1.2.4 |
AFM |
IJ38966 |
High Importance
|
When running IO through KNFS and file audit logging enabled, an invalid pointer might be accessed.
(show details)
Symptom |
Abend/Crash |
Environment |
Linux |
Trigger |
Certain patterns of KNFS IO with file audit logging enabled. |
Workaround |
None |
|
5.1.2.4 |
File audit logging |
IJ38286 |
Suggested |
If a list file's first entry is a directory, then all operations are terminated because the start marker failed to set up.
(show details)
Symptom |
Command failed with invalid entries. |
Environment |
Linux |
Trigger |
First entry of a list file is a directory in the --list-file option. |
Workaround |
None |
|
5.1.2.4 |
AFM |
IJ38986 |
Critical |
Kernel crash with kernel stack that shows the pemsIpmi functions. The RIP of the kernel crash shows RIP: 0010:kmem_cache_alloc_trace+0x7f/0x1c0.
(show details)
Symptom |
Kernel crash |
Environment |
Linux (x86_64) |
Trigger |
No special trigger. |
Workaround |
None |
|
5.1.2.4 |
ESS, GNR |
IJ38307 |
Suggested |
mmafmcosaccess doesn't check whether the given path belongs to the same fileset. It also needs to check the file system and fileset consistency for the given command.
(show details)
Symptom |
Unexpected Results/Behavior |
Environment |
Linux |
Trigger |
The given path for the mmafmcosaccess command doesn't belong to the same fileset but it is a valid path. |
Workaround |
None |
|
5.1.2.4 |
AFM |
IJ38997 |
Critical |
Cached file is not revalidated in AFM local-updates mode. If the file is modified at home, these changes might not get pulled back into the cache.
(show details)
Symptom |
Unexpected results |
Environment |
Linux |
Trigger |
File read on AFM cached file in LU mode. |
Workaround |
Use AFM prefetch with --force option to cache the file again. |
|
5.1.2.4 |
AFM |
IJ38998 |
High Importance
|
When afmSyncNFSv4ACL is set, ACL buffer size is not verified during the cache refresh. This causes the kernel to crash if the returned buffer length is zero.
(show details)
Symptom |
Crash |
Environment |
Linux |
Trigger |
AFM caching with afmSyncNFSv4ACL option. |
Workaround |
None |
|
5.1.2.4 |
AFM |
IJ39015 |
Suggested |
32bit GPFS API library not available in default path on Ubuntu.
(show details)
Symptom |
Error output/message |
Environment |
Linux (x86_64) |
Trigger |
Build an application with 32bit GPFS API library on Ubuntu. |
Workaround |
Modify the build process of the application to search for the 32bit GPFS API library in a different directory. |
|
5.1.2.4 |
GPFS API |
IJ39016 |
Suggested |
mmperfmon delete --expiredkeys fails with a timeout or exception.
(show details)
Symptom |
Error output/message |
Environment |
Linux |
Trigger |
Remote mounted filesystem with a slow or overloaded remote system. |
Workaround |
None |
|
5.1.2.4 |
Performance monitoring |
IJ39017 |
High Importance
|
Daemon assert going off: endBufOffset >= 0 && endBufOffset < codeP-> getBufMaxPayload(endBuf).
(show details)
Symptom |
Abend/Crash |
Environment |
Linux |
Trigger |
A media error is discovered and fixed on an IBM ESS 3200 system that is using Flash Core Module NVMe drives on a specific virtual track boundary. Not all media errors will cause this crash. |
Workaround |
None |
|
5.1.2.4 |
ESS, GNR |
IJ39019 |
Suggested |
Kernel crash when required mount options are missing.
(show details)
Symptom |
Abend/Crash |
Environment |
Linux |
Trigger |
Issue a mount request where the dev= option is missing. Either remove that from /etc/fstab, or issue a mount command that does not read options from /etc/fstab, e.g.: mount -t gpfs /gpfs/fs1 /gpfs/fs1 |
Workaround |
Always have the required dev= mount option available. This is the default in /etc/fstab. |
|
5.1.2.4 |
Core GPFS |
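The workaround for IJ39019 depends on every gpfs mount entry carrying a dev= option. As a rough sketch of how that could be checked, the hypothetical helper below scans fstab-formatted text and reports gpfs entries that lack it. The function name is an assumption for illustration, and the helper only parses text; it does not mount anything.

```python
def gpfs_entries_missing_dev(fstab_text):
    """Return mount points of gpfs entries whose options lack dev=.

    Hypothetical checker; expects standard fstab columns:
    device, mount point, type, options, dump, pass.
    """
    missing = []
    for line in fstab_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        fields = line.split()
        if len(fields) < 4 or fields[2] != "gpfs":
            continue
        options = fields[3].split(",")
        if not any(opt.startswith("dev=") for opt in options):
            missing.append(fields[1])
    return missing
```

For example, `gpfs_entries_missing_dev(open("/etc/fstab").read())` would list the affected mount points, if any.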
IJ39048 |
Suggested |
mmvdisk recovery group conversion may conflict with settings for nsdRAIDSmallBufferSize from the previous deployment scripts. mmvdisk will apply a value of -1 to this setting, which conflicts with the original value of 256KiB. The result is that the Daemon will print a warning message on start up, warning the user that nsdRAIDSmallBufferSize has been reduced to a value of 4KiB. This might impact performance.
(show details)
Symptom |
Error output/message, Performance Impact/Degradation |
Environment |
Linux |
Trigger |
mmvdisk recovery group conversion from the pre-2020 server config settings. |
Workaround |
Delete the old nsdRAIDSmallBufferSize setting of 256K in SDRFS, or delete any -1 values that were part of the mmvdisk rg conversion override. |
|
5.1.2.4 |
ESS, GNR |
IJ39049 |
Suggested |
When running mmhealth config monitor pause, followed by a mmhealth config monitor resume, the threshold component will stay in disabled state.
(show details)
Symptom |
Error output/message |
Environment |
All |
Trigger |
The issue occurs only if the node health monitoring was paused and resumed again. |
Workaround |
Execute the command "mmsysmonc enable thresholds". |
|
5.1.2.4 |
System health |
IJ39050 |
Critical |
On Linux (two instances), kernel crash may occur after open() with O_CREAT flag is used and file has been opened already.
(show details)
Symptom |
Kernel crash |
Environment |
Linux |
Trigger |
Using open() with O_CREAT flag on system with Linux kernel 3.10 or higher. |
Workaround |
Avoid using open() with O_CREAT flag. |
|
5.1.2.4 |
Core GPFS |
IJ39057 |
HIPER |
Files are not fully cached on AFM COS filesets.
(show details)
Symptom |
Unexpected results |
Environment |
Linux |
Trigger |
File read on AFM uncached files. |
Workaround |
Use AFM prefetch to cache the files again. |
|
5.1.2.4 |
AFM |
IJ39058 |
Suggested |
Certain filenames that contained control characters were not properly escaped when logged in the File audit logging / Watch folder JSON format.
(show details)
Symptom |
Unexpected Results/Behavior |
Environment |
Linux |
Trigger |
Creating a file with control characters in the name. |
Workaround |
None |
|
5.1.2.4 |
File audit logging, Watch folder |
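To illustrate what properly escaped means for IJ39058, the sketch below serializes a file name containing control characters with Python's standard `json` module. The record layout (a single `path` field) and the helper name are assumptions for illustration and do not reflect the actual audit record format.

```python
import json


def audit_record(path):
    """Serialize a path into a JSON object, escaping control characters.

    Hypothetical record layout with a single "path" field.
    """
    # json.dumps writes control characters as \n, \t or \uXXXX escapes,
    # so the record stays on one line and parses back to the same name.
    return json.dumps({"path": path})
```

A consumer that parses the record with `json.loads` recovers the original file name unchanged.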
IJ39059 |
Suggested |
The '-' char is incorrectly used for a range between two values.
(show details)
Symptom |
It doesn't report issue. |
Environment |
Linux |
Trigger |
When invalid char like ';' is also accepted. |
Workaround |
None |
|
5.1.2.4 |
AFM |
IJ39060 |
Critical |
NFS status shown as 'unknown'. This might interfere with NFS fail over capabilities.
(show details)
Symptom |
Unexpected Results/Behavior |
Environment |
Linux |
Trigger |
None |
Workaround |
None |
|
5.1.2.4 |
NFS, System health |
IJ39117 |
High Importance
|
An error 22 is hit when trying to get the valid data blocks on a file in resync.
(show details)
Symptom |
Unexpected Behavior |
Environment |
Linux (AFM gateway node) |
Trigger |
Running resync with uncached (possibly evicted) files at the SW cache site. |
Workaround |
None |
|
5.1.2.4 |
AFM |
IJ39089 |
Suggested |
Ganesha crashed with the stack below:
#5 0x00007f65b55a5e4e state_wipe_file (libganesha_nfsd.so.3.5)
#6 0x00007f65b567787c _mdcache_lru_unref (libganesha_nfsd.so.3.5)
#7 0x00007f65b56568e2 mdcache_put (libganesha_nfsd.so.3.5)
#8 0x00007f65b565adea mdcache_put_ref (libganesha_nfsd.so.3.5)
#9 0x00007f65b5619d73 open4_create_fh (libganesha_nfsd.so.3.5)
#10 0x00007f65b561c451 open4_ex (libganesha_nfsd.so.3.5)
#11 0x00007f65b561d6c0 nfs4_op_open (libganesha_nfsd.so.3.5)
#12 0x00007f65b5604cca process_one_op (libganesha_nfsd.so.3.5)
#13 0x00007f65b5605d46 nfs4_Compound (libganesha_nfsd.so.3.5)
#14 0x00007f65b555b99c nfs_rpc_process_request (libganesha_nfsd.so.3.5)
(show details)
Symptom |
Ganesha Crash |
Environment |
All |
Trigger |
The problem might occur if there are a lot of small files with the same filename created/deleted from nfs clients at the same time. |
Workaround |
None |
|
5.1.2.4 |
cNFS, CES NFS (All instances in feature tags) |
IJ39119 |
Suggested |
Ganesha logs the messages below:
2022-03-11 14:28:22 : epoch 0009016d : protocol2b : gpfs.ganesha.nfsd-14806[svc_37] GPFSFSAL_lookup : FSAL :CRIT :DOTDOT error, inode: 4308074499
2022-03-11 14:28:32 : epoch 0009016d : protocol2b : gpfs.ganesha.nfsd-14806[svc_48] GPFSFSAL_lookup : FSAL :CRIT :DOTDOT error, inode: 4308074499
(show details)
Symptom |
DOTDOT error message in ganesha.log |
Environment |
All |
Trigger |
The problem might occur if a snapshot directory and its parent directory have the same inode number. |
Workaround |
None |
|
5.1.2.4 |
cNFS, CES NFS (All instances in feature tags) |
IJ39148 |
Suggested |
NFS mount point is not getting killed if home fileset is unresponsive or hung. This is causing multiple nfsmount to be created for the same fileset.
(show details)
Symptom |
Too much memory consumption on the NFS mount point. |
Environment |
Linux |
Trigger |
Gateway node is getting more memory consumption on the nfsmount due to existing multiple mount points of the fileset. |
Workaround |
None |
|
5.1.2.4 |
AFM DR |
IJ39201 |
Suggested |
Watch folder events could show an old path to a file if a directory in its path had recently been renamed.
(show details)
Symptom |
Unexpected Results/Behavior |
Environment |
Linux |
Trigger |
Rename directories being watched. |
Workaround |
None |
|
5.1.2.4 |
Watch folder |
IJ36899 |
High Importance
|
If the /etc/passwd file has multiple entries for the same UID, readdir fails while downloading the objects due to incorrect parsing of the UID.
(show details)
Symptom |
Unexpected results |
Environment |
Linux |
Trigger |
AFM+COS caching with duplicate entries in the passwd file |
Workaround |
Remove duplicate entries from /etc/passwd |
|
5.1.2.4 |
AFM |
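The workaround for IJ36899 requires finding duplicate UID entries first. A minimal sketch, assuming the standard colon-separated passwd layout (name, password, UID, ...); `duplicate_uids` is a hypothetical helper name and works on text so that it can be tried on a copy of the file.

```python
from collections import defaultdict


def duplicate_uids(passwd_text):
    """Map each UID that appears more than once to its login names.

    Hypothetical helper; expects passwd-formatted text.
    """
    names_by_uid = defaultdict(list)
    for line in passwd_text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.split(":")
        if len(fields) < 3:
            continue
        # fields[0] is the login name, fields[2] is the UID.
        names_by_uid[fields[2]].append(fields[0])
    return {uid: names for uid, names in names_by_uid.items() if len(names) > 1}
```

An empty result means no UID is shared between entries.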
IJ39203 |
Suggested |
mmafmcoskeys failed to set access and secret keys.
(show details)
Symptom |
Access and secret keys fail to set. |
Environment |
Linux |
Trigger |
Trying to set access and secret keys. |
Workaround |
None |
|
5.1.2.4 |
AFM COS |
IJ39274 |
High Importance
|
In huge clusters (a lot of performance data) and on systems with high load on the pmcollector / GUI node, performance queries might run into a 5s timeout. This could lead to missing data in the GUI.
(show details)
Symptom |
Component Level Outage |
Environment |
Linux |
Trigger |
Huge clusters (a lot of performance data) and systems with high load on the pmcollector / GUI node. |
Workaround |
None |
|
5.1.2.4 |
Performance monitoring, GUI |
IJ39280 |
Suggested |
After refresh interval, cache bit is getting reset while getobjmetats is triggered on cached file and finding ETAG mismatches.
(show details)
Symptom |
Files get evicted. |
Environment |
All |
Trigger |
Files get evicted because cache bit gets reset. |
Workaround |
None |
|
5.1.2.4 |
AFM COS |
IJ39282 |
Critical |
AFM fails to upload the object if the name starts with a '-' character.
(show details)
Symptom |
Deadlock |
Environment |
Linux |
Trigger |
AFM+COS caching with special file names. |
Workaround |
None |
|
5.1.2.4 |
AFM |
IJ39283 |
Suggested |
If the system pool is also used for data, auto recovery miscalculates the available metadata failure group count and may trigger tsrestripefs -r wrongly.
(show details)
Symptom |
Unexpected Results/Behavior |
Environment |
Linux |
Trigger |
If the system pool is used for both data and metadata in a FPO cluster and if a disk/node failure causes the good failure group count to become less than the default metadata replication. |
Workaround |
Do not use system pool for data. |
|
5.1.2.4 |
FPO |
IJ39284 |
Critical |
Deadlock might happen when the AFM gateway node leaves the cluster.
(show details)
Symptom |
Deadlock |
Environment |
Linux |
Trigger |
AFM gateway node leaving the cluster. |
Workaround |
None |
|
5.1.2.4 |
AFM |
IJ39316 |
Suggested |
Disk quota error is not reported when a readdir is happening at fileset root.
(show details)
Symptom |
Unexpected Results/Behavior |
Environment |
Linux |
Trigger |
readdir on a fileset. |
Workaround |
None |
|
5.1.2.4 |
AFM COS |
IJ39371 |
Critical |
Stack corruption due to possible buffer overflow.
(show details)
Symptom |
mmfsd restart |
Environment |
Linux |
Trigger |
mmfsd restart at AFM gateway node. |
Workaround |
None |
|
5.1.2.4 |
AFM |
IJ39011 |
High Importance
|
Online replica compare function could incorrectly flag mismatch on the last block of a file when the block was preallocated as a full block and reduced to fragment later.
(show details)
Symptom |
Unexpected Results/Behavior |
Environment |
All |
Trigger |
Run online replica compare on files with preallocated blocks. |
Workaround |
Avoid running online replica compare. |
|
5.1.2.4 |
Core GPFS |
IJ39400 |
Suggested |
The IBM Spectrum Scale admin commands and handling of file system encryption keys require the use of more robust settings.
(show details)
Symptom |
None |
Environment |
All |
Trigger |
None |
Workaround |
None |
|
5.1.2.4 |
Admin commands |
IJ39415 |
High Importance
|
GPFS recovery is blocked after cables are pulled and put back, due to an RPC being sent while taking GPFS dumps.
(show details)
Symptom |
Hang |
Environment |
Linux (ESS systems) |
Trigger |
Pull cables and then put the cables back. |
Workaround |
None |
|
5.1.2.4 |
Core GPFS |
IJ39437 |
Suggested |
Command mmlspdisk produces printf arithmetic syntax under non-US locale.
(show details)
Symptom |
Error output/message |
Environment |
All |
Trigger |
Run mmlspdisk under locale that uses decimal comma. |
Workaround |
Run mmlspdisk in C or en_US locale. |
|
5.1.2.4 |
Admin commands |
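The workaround for IJ39437 (run the command in the C or en_US locale) can also be applied from a script. The sketch below is a generic wrapper, not part of IBM Spectrum Scale: `run_in_c_locale` is a hypothetical helper that sets `LC_ALL=C` for a child process so that numbers are formatted with a decimal point.

```python
import os
import subprocess


def run_in_c_locale(argv):
    """Run a command with LC_ALL=C and return the CompletedProcess.

    Hypothetical wrapper; `argv` is the command line as a list.
    """
    # Copy the current environment and override only the locale.
    env = dict(os.environ, LC_ALL="C")
    return subprocess.run(argv, env=env, capture_output=True, text=True, check=False)
```

Pass the mmlspdisk command line you would normally run as the `argv` list; the wrapper changes nothing except the locale seen by the child process.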
IJ39438 |
High Importance
|
In huge clusters (a lot of performance data) and on systems with a high load on the pmcollector / GUI node, performance queries might run into a 5s timeout. This could lead to missing data in the GUI.
(show details)
Symptom |
Component Level Outage |
Environment |
Linux |
Trigger |
Huge clusters (a lot of performance data) and systems with high load on the pmcollector / GUI node. |
Workaround |
None |
|
5.1.2.4 |
Performance monitoring, GUI |
IJ39440 |
High Importance
|
Signal 11 in fetch_and_add() on nsdHoldCount.
(show details)
Symptom |
Abend/Crash |
Environment |
All |
Trigger |
Disk scan. |
Workaround |
None |
|
5.1.2.4 |
NSD |
IJ39449 |
High Importance
|
Pems hangs because no ipmi recv slots are available, so no new ipmi command will be sent to the BMC. When pems hangs, it generates these lines in dmesg or /var/log/messages:
ERROR: no open ipmi recv slots
pems_mod:[E]:0136:0581:failed to enq cmd rc=0xfffffff0
pemsIpmiEnqueueCmd failed to enq setting QUEUE_FULL
pems_mod:[E]:0136:0315:failed to send cmd to backend interface rc=-16
pems_mod:[E]:0136:0581:failed to enq cmd rc=0xfffffff0
You will see the last 2 prints over and over.
(show details)
Symptom |
pems hang generating a lot of messages in dmesg. |
Environment |
Linux (x86_64) |
Trigger |
There is a small hole in the pems ipmi receive handler, so this can happen at any time on ESS3200. |
Workaround |
Restart pems module and restart the ess3200_pemscfg service. |
|
5.1.2.4 |
ESS, GNR |
IJ39455 |
Suggested |
Remove from displaying and prevent adding un-supported ciphers to cipherList. The following ciphers are affected:
AES128-SHA
AES256-SHA
(show details)
Symptom |
None |
Environment |
All |
Trigger |
Use un-supported ciphers. |
Workaround |
Don't use unsupported ciphers. |
|
5.1.2.4 |
Admin commands |
IJ37100 |
Suggested |
The output of "mmperfmon query" gives incomplete data if the names contain a blank.
(show details)
Symptom |
Error output/messages |
Environment |
All (with performance monitoring installed) |
Trigger |
The broken text appears for entries containing blanks. |
Workaround |
None |
|
5.1.2.3 |
System health |
IJ37227 |
High Importance
|
Daemon assert going off when generating DMAPI event: addr.isReserved() || addr.getClusterIdx() == clusterIdx in file cfgmgr.h, resulting in a daemon crash.
(show details)
Symptom |
Abend/Crash |
Environment |
All |
Trigger |
DMAPI is enabled and a remote cluster is used while a DMAPI event is being generated after a remote client node left the cluster. |
Workaround |
None |
|
5.1.2.3 |
DMAPI |
IJ37231 |
Suggested |
If NFSv4 client holds a file lock for read/write operations, then client may report I/O error after CES-IP failover.
(show details)
Symptom |
I/O error |
Environment |
All |
Trigger |
If an NFSv4 client holds a file lock for a write operation, then CES-IP failover from the current active NFS server (for example, protocol node1) to another server (protocol node2) may cause an I/O failure on the client. |
Workaround |
None |
|
5.1.2.3 |
|
IJ37235 |
High Importance
|
Missing sqlite-3 packages on IBM Spectrum Scale Erasure Code Edition environments can cause admin command hangs.
(show details)
Symptom |
Hang |
Environment |
All |
Trigger |
Problem occurs in an IBM Spectrum Scale Erasure Code Edition environment when the sqlite-3 package is installed on some nodes but not others. |
Workaround |
None |
|
5.1.2.3 |
Admin commands |
IJ37246 |
Suggested |
EPERM is incorrectly returned for non-existing ioctl requests.
(show details)
Symptom |
Unexpected Results/Behavior |
Environment |
Linux |
Trigger |
Issuing an invalid ioctl request to a file in GPFS. |
Workaround |
NA |
|
5.1.2.3 |
Core GPFS |
IJ37256 |
Critical |
There is a chance of a kernel crash with kernel stack with pemsIpmi functions. The RIP of the kernel crash may show RIP: 0010:kmem_cache_alloc_trace+0x7f/0x1c0.
(show details)
Symptom |
Kernel crash |
Environment |
Linux (x86_64) |
Trigger |
No specific trigger; the issue occurred in a normal good path run. |
Workaround |
None |
|
5.1.2.3 |
ESS, GNR |
IJ36533 |
High Importance
|
Quota usage discrepancy from a fileset-based quota check.
(show details)
Symptom |
Unexpected Results/Behavior |
Environment |
All |
Trigger |
Fileset level quota check |
Workaround |
Switch to file system level quota check. |
|
5.1.2.3 |
Quotas |
IJ37260 |
High Importance
|
Running workloads with many lookups done to GPFS in a highly concurrent way has a performance impact.
(show details)
Symptom |
Performance Impact/Degradation |
Environment |
Linux |
Trigger |
Lots of lookups to the file system. One known case is setting LD_LIBRARY_PATH to directories on a GPFS file system on zLinux. The zLinux dynamic linker issues a much higher number of lookups for each entry in LD_LIBRARY_PATH, making this scenario more likely to occur. |
Workaround |
Reduce the number of concurrent lookups. |
|
5.1.2.3 |
Core GPFS |
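To judge whether a node is exposed to the LD_LIBRARY_PATH scenario in IJ37260, it can help to list which search-path entries live on the file system in question. The sketch below is a hypothetical helper; the mount prefix is supplied by the caller, and nothing here queries GPFS itself.

```python
import os


def library_dirs_under(prefix, ld_library_path=None):
    """List LD_LIBRARY_PATH entries located at or below `prefix`.

    Hypothetical helper; reads the environment when no value is passed.
    """
    if ld_library_path is None:
        ld_library_path = os.environ.get("LD_LIBRARY_PATH", "")
    prefix = os.path.normpath(prefix)
    hits = []
    for entry in ld_library_path.split(":"):
        if not entry:
            continue
        norm = os.path.normpath(entry)
        # Match the mount point itself or any directory beneath it.
        if norm == prefix or norm.startswith(prefix + os.sep):
            hits.append(entry)
    return hits
```

Each entry returned adds lookups against that file system every time the dynamic linker resolves a library, so trimming the list is one way to reduce concurrent lookups.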
IJ36554 |
Suggested |
In fileset level mmcheckquota, if no free inode is left in the fileset (all inodes are allocated), the last inode number is miscounted when calculating the max inode number for the fileset, which causes a usage discrepancy of 1 inode for the fileset quota.
(show details)
Symptom |
Unexpected Results/Behavior |
Environment |
All |
Trigger |
Fileset level quota check |
Workaround |
Switch to a file system level quota check. |
|
5.1.2.3 |
Quotas |
IJ37280 |
High Importance
|
The assert goes off and the following message is shown in the mmfs.log: Assert exp(isUnlinked() || DaemonShuttingDown)
(show details)
Symptom |
Abend/Crash |
Environment |
All |
Trigger |
Sender timeout while sending an RPC |
Workaround |
None |
|
5.1.2.3 |
Core GPFS |
IJ36532 |
High Importance
|
When there are multiple threads trying to flush the same file and the file is large with many blocks, there could be mutex contention which can lead to performance degradation.
(show details)
Symptom |
Performance Impact/Degradation |
Environment |
All |
Trigger |
Multiple threads trying to flush the same large file. |
Workaround |
Reduce the number of worker threads. |
|
5.1.2.3 |
Core GPFS |
IJ37350 |
High Importance
|
AFM prefetch might get stuck during the queuing phase if the list file has duplicate entries. This happens because a waiting thread is not notified after the read completion.
(show details)
Symptom |
Deadlock |
Environment |
Linux |
Trigger |
AFM prefetch |
Workaround |
Remove duplicate entries from the list file. |
|
5.1.2.3 |
AFM |
IJ37356 |
High Importance
|
Inodes are not reclaimed after the hardlinks are corrected during the AFM prefetch. This causes more inodes to be in-use than actual number of files present in the fileset.
(show details)
Symptom |
Unexpected results |
Environment |
Linux |
Trigger |
AFM prefetch |
Workaround |
None |
|
5.1.2.3 |
AFM |
IJ37511 |
Suggested |
An error message "Could not retrieve minReleaseVersion" is logged in the systemhealth monitor log file (mmsysmonitor.log).
(show details)
Symptom |
Error output/messages |
Environment |
All (with performance monitoring installed) |
Trigger |
The error message is logged whenever a mmperfmon query is executed. |
Workaround |
None; The error message can be ignored. |
|
5.1.2.3 |
System health |
IJ37542 |
Critical |
On Linux kernel 3.10 or later, if the O_TRUNC flag is used and the file has been opened already, the O_TRUNC flag might be incorrectly ignored.
(show details)
Symptom |
Unexpected Results/Behavior |
Environment |
Linux |
Trigger |
Using open() with O_CREAT and O_TRUNC flags on a system with Linux kernel 3.10 or later. |
Workaround |
Avoid using open() with O_CREAT and O_TRUNC flags. |
|
5.1.2.3 |
Core GPFS |
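The expected O_TRUNC behavior described in IJ37542 can be checked from user space. The following sketch is an illustration under stated assumptions: `truncates_while_open` is a hypothetical helper that keeps one descriptor open, reopens the same file with O_CREAT and O_TRUNC, and reports whether the file was emptied. It overwrites the file at `path`, so point it at a scratch file.

```python
import os


def truncates_while_open(path):
    """Return True if O_CREAT|O_TRUNC empties a file that is already open.

    Hypothetical check; writes to and truncates the file at `path`.
    """
    first = os.open(path, os.O_WRONLY | os.O_CREAT, 0o600)
    try:
        os.write(first, b"payload")
        # Second open of the same, already-open file with O_TRUNC.
        second = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
        os.close(second)
        return os.fstat(first).st_size == 0
    finally:
        os.close(first)
```

A result of False on an affected level would be consistent with the flag being ignored as the APAR describes.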
IJ37679 |
High Importance
|
mmbackup uses IBM Spectrum Protect BA client command 'dsmc' to communicate with the IBM Spectrum Protect Server. If the -server option is not given to dsmc, dsmc gets a default server name from the Protect client configuration. If --tsm-servers <servername> to mmbackup is different from the default server name and the default server is not functional, mmbackup could show unexpected behavior because mmbackup does not provide the -server option in one of the dsmc query calls.
(show details)
Symptom |
Component Level Outage |
Environment |
All |
Trigger |
Run mmbackup when --tsm-servers <server> is not the same as the default servername in dsm.opt and default server is not functional. |
Workaround |
Make sure that the --tsm-servers <server> is the same as the default servername in dsm.opt. |
|
5.1.2.3 |
mmbackup |
IJ37747 |
Suggested |
When adding a new disk to a file system, health monitoring will raise an ill_unbalanced_fs degraded health event as the file system will be unbalanced. This degraded health event does not reflect the current recommendation of when to use the mmrestripefs command, so its severity is too severe and is being changed from a degraded severity level to a TIP severity level.
(show details)
Symptom |
Error output/messages |
Environment |
All |
Trigger |
Adding a new disk to an existing file system. |
Workaround |
The ill_unbalanced_fs event can be added to the "ignore events" list in the mmsysmonitor.conf file. After mmsysmon is restarted, this event will be ignored by mmhealth and will not cause any unbalanced file systems to show as being degraded. |
|
5.1.2.3 |
System health |
IJ37784 |
Suggested |
When a fileset is in chmodAndUpdateAcl permission change mode, creating a file with the open() system call under a parent directory with inherit entries and changing permissions of the newly created file with NFS results in duplicated and incorrect entries in the file's NFSv4 ACL.
(show details)
Symptom |
Unexpected Results/Behavior |
Environment |
All |
Trigger |
Have a fileset in chmodAndUpdateAcl permission change mode and a parent directory with inherit entries. Using NFS, create a file with the open() system call and change the permissions of the file with chmod. |
Workaround |
Use chmodAndSetAcl permission change mode for filesets and avoid having inherit entries in the parent directory. |
|
5.1.2.3 |
NFS |
IJ37493 |
High Importance
|
Rename fails with error 766 on AFM+COS fileset if the file is moved from a local directory to a non-local directory.
Symptom |
Unexpected results |
Environment |
Linux |
Trigger |
Renaming local file to non-local directory. |
Workaround |
None |
|
5.1.2.3 |
AFM COS |
IJ37870 |
High Importance
|
With afmFastCreate enabled on an IW fileset, AFM recovery fails.
Symptom |
Unexpected Behavior |
Environment |
Linux (AFM gateway nodes) |
Trigger |
Running Recovery on AFM IW mode filesets with afmFastCreate enabled and changes being made at cache and home simultaneously. |
Workaround |
None |
|
5.1.2.3 |
AFM |
IJ37787 |
High Importance
|
SGNotQuiesced assertion in dbshLockInode during file system quiesce.
Symptom |
Abend/Crash |
Environment |
All |
Trigger |
Operations which do file system quiesce |
Workaround |
None |
|
5.1.2.3 |
Snapshots |
IJ37104 |
High Importance
|
POSIX permission denied program error
Symptom |
Permission denied on open for new file |
Environment |
Linux |
Trigger |
A file mode for creation which is not correctly translated by newer kernels. |
Workaround |
None |
|
5.1.2.3 |
API |
IJ37790 |
Suggested |
Adding the characters '=' and '-' to an akey/skey fails with an invalid key error.
Symptom |
Failed with invalid key |
Environment |
Linux |
Trigger |
Setting up the skey/akey in the mmafmcoskey command. |
Workaround |
None |
|
5.1.2.3 |
AFM |
IJ37838 |
High Importance
|
mmap reads from lots of threads may cause a deadlock in DeclareResourceUsage.
Symptom |
Hang/Deadlock/Unresponsiveness/Long Waiters |
Environment |
All |
Trigger |
mmap reads from lots of threads |
Workaround |
Disable mmap pagepool resource usage declaration by using the "mmchconfig mmapDeclarePageUsage=false" command. |
|
5.1.2.3 |
Core GPFS |
IJ37854 |
Suggested |
When SGPanic occurs, the dealloc queue subblocks count could be wrong and cause "(deallocHighSeqNum - deallocFlushedSeqNum) >= deallocQueueSubblocks" assertion failure.
Symptom |
Abend/Crash |
Environment |
All |
Trigger |
In rare cases, block deallocation around the time of an SGPanic might cause this assertion. |
Workaround |
None |
|
5.1.2.3 |
Core GPFS |
IJ37882 |
High Importance
|
Due to a change in procps output in Cygwin version 3.3, IBM Spectrum Scale fails to start.
Symptom |
Unexpected Results/Behavior |
Environment |
Windows (x86_64) |
Trigger |
IBM Spectrum Scale startup |
Workaround |
Downgrade Cygwin. |
|
5.1.2.3 |
Core GPFS |
IJ35881 |
High Importance
|
While trying to set extended attributes, SetXAttrHandlerThread could deadlock with itself trying to obtain a WW lock on the buffer while holding XW lock.
Symptom |
Hang/Deadlock/Unresponsiveness/Long Waiters |
Environment |
All |
Trigger |
Changing extended attributes on a file or directory |
Workaround |
None |
|
5.1.2.2 |
Core GPFS |
IJ36110 |
High Importance
|
AFM does not allow the character '=' as part of a secret key.
Symptom |
Error message |
Environment |
Linux |
Trigger |
Using special characters as part of a secret key |
Workaround |
None |
|
5.1.2.2 |
AFM |
IJ36246 |
Suggested |
When running file audit logging, signal 11 is possible at FileMetadata::set_mtimeUpdate(unsigned int)
Symptom |
Signal 11 |
Environment |
Linux |
Trigger |
Daemon crash |
Workaround |
None |
|
5.1.2.2 |
File audit logging |
IJ36250 |
High Importance
|
On HAWC enabled file systems, a deadlock could occur when a data block is being modified at the same time as log wrap is working on log records for the same block.
Symptom |
Hang/Deadlock/Unresponsiveness/Long Waiters |
Environment |
All |
Trigger |
Multiple writes to the same block on a HAWC enabled file system. |
Workaround |
Disable HAWC feature on the file system. |
|
5.1.2.2 |
HAWC |
IJ36299 |
Suggested |
If the number of quorum nodes in the cluster is not greater than the minQuorumNodes configuration setting, the mmchconfig command fails without a clear message.
Symptom |
Error message |
Environment |
All |
Trigger |
The problem arises when the minQuorumNodes configuration value is greater than or equal to the number of quorum nodes in the cluster. |
Workaround |
If setting the tiebreakerDisks parameter fails because the number of quorum nodes in the cluster is not greater than minQuorumNodes, use the mmchconfig command to set minQuorumNodes to the default value or a value lower than the number of quorum nodes in the cluster. |
|
5.1.2.2 |
Admin commands |
IJ36462 |
Suggested |
Failed to create the RG.
Symptom |
Unexpected Results/Behavior |
Environment |
All |
Trigger |
Using mmvdisk to create RG, and there is an SSD SATA disk. |
Workaround |
1. Before the creation of the RG, run 'chmod a-x /usr/lpp/mmfs/bin/gems/tscompattr' 2. Once the RG is created, run 'chmod a+x /usr/lpp/mmfs/bin/gems/tscompattr' |
|
5.1.2.2 |
ESS, GNR |
IJ36511 |
Suggested |
Certain characters, such as newline (\n) or backslash (\), were not escaped correctly, resulting in invalid JSON. JSON parsers are not able to read the event correctly.
Symptom |
Unexpected Results/Behavior |
Environment |
Linux |
Trigger |
Filenames, acls, or xattrs with escape characters |
Workaround |
You can programmatically escape existing events to create valid JSON before the parser tries to ingest the event. |
|
5.1.2.2 |
File audit logging, Watch folder |
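For IJ36511, the workaround of programmatically escaping events could look like the following minimal Python sketch. It assumes each event arrives as one compact JSON object whose only control characters are the unescaped ones inside string values. The function name repair_event and the sample field names are hypothetical and are not part of IBM Spectrum Scale. A backslash that happens to be followed by a valid escape letter (for example "\n" in a Windows-style path) cannot be told apart from a real escape, so that case is left unchanged.

```python
import json
import re

# Matches either a valid JSON escape sequence or a lone backslash.
_BACKSLASH = re.compile(r'\\(?:["\\/bfnrt]|u[0-9a-fA-F]{4})|\\')
# Raw control characters are not allowed inside JSON strings.
_CONTROL = re.compile(r'[\x00-\x1f]')


def repair_event(raw: str) -> str:
    """Return a copy of `raw` that a strict JSON parser can read.

    Assumption: `raw` is a single compact event, so every control
    character in it belongs to a string value, not to formatting.
    """
    text = raw.rstrip("\r\n")  # drop the record terminator, if any

    def fix_backslash(match):
        seq = match.group(0)
        # Keep valid escapes; double a backslash that starts none.
        return seq if len(seq) > 1 else "\\\\"

    text = _BACKSLASH.sub(fix_backslash, text)
    # Rewrite raw control characters as \uXXXX escapes.
    return _CONTROL.sub(lambda m: "\\u%04x" % ord(m.group(0)), text)


# Hypothetical event with a raw newline and an unescaped backslash.
raw = '{"path": "/gpfs/fs1/a\nb", "name": "c\\d"}'
event = json.loads(repair_event(raw))
```

Running the repair step before ingestion keeps the downstream parser unchanged, which matters when the parser belongs to a third-party log pipeline.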
IJ36512 |
Suggested |
If a workload involves opening and creating lots of files concurrently under the same directory, some of the open operations may suffer high open times.
Symptom |
Performance Impact/Degradation |
Environment |
Windows (x86_64) |
Trigger |
A workload that creates and opens many files concurrently in the same directory path. |
Workaround |
None |
|
5.1.2.2 |
Core GPFS |
IJ36513 |
Suggested |
Assert exp(ecDataBuffersPerTrack+ecParityBuffersPerTrack == ecParityBufferIndexByStrip[ecNPdisks])
Symptom |
Abend/Crash |
Environment |
Linux |
Trigger |
LG resign |
Workaround |
None |
|
5.1.2.2 |
ESS, GNR |
IJ36529 |
Suggested |
When files are downloaded without afmObjectACL enabled, they take the default permission 700, which does not match the default permission 770 of the fileset root.
Symptom |
Default permission for files does not match with fileset root. |
Environment |
Linux |
Trigger |
Inconsistent default permission value across the fileset. |
Workaround |
None |
|
5.1.2.2 |
AFM |
IJ36531 |
Suggested |
The position of the preventSnapshotRestore value is incorrectly read while loading the mmbackupconfig file. The position is off by four values. The correct information is saved from mmbackupconfig.
Symptom |
Unexpected Results/Behavior |
Environment |
Linux |
Trigger |
mmbackupconfig was run on 5.1.2 with watch folder or file audit logging enabled, and mmrestoreconfig is then run to restore the associated filesets of watch folder or file audit logging. |
Workaround |
Run mmrestoreconfig again once the system is updated to a release with the fix. |
|
5.1.2.2 |
mmbackupconfig, file audit logging, watch folder |
IJ36709 |
High Importance
|
AFM directory prefetch fails to populate hardlinks, which causes hardlinks to be created as different files at the cache.
Symptom |
Unexpected results |
Environment |
Linux |
Trigger |
AFM prefetch |
Workaround |
None |
|
5.1.2.2 |
AFM |
IJ36563 |
High Importance
|
When AFM COS replication is happening on any one of the filesets in the file systems, if there is any other fileset that is attempting to link/unlink or create/delete a snapshot, then there can be a deadlock.
Symptom |
Deadlock |
Environment |
Linux (AFM gateway node) |
Trigger |
Create/delete snapshot on a fileset or link/unlink a fileset on a file system where one or more AFM COS filesets are replicating to the remote COS site. |
Workaround |
None |
|
5.1.2.2 |
AFM |
IJ36818 |
High Importance
|
A vulnerability has been found in the Apache Log4j2 library v2.16.0 used by the Scale/ESS GUI. Apache Log4j2 versions 2.0-alpha1 through 2.16.0 (excluding 2.12.3) did not protect from uncontrolled recursion from self-referential lookups.
Symptom |
Unexpected Results/Behavior |
Environment |
All |
Trigger |
Third Party Advisory released by Apache |
Workaround |
None |
|
5.1.2.2 |
Core GPFS |
IJ36855 |
High Importance
|
The IBM Spectrum Scale HDFS Transparency connector versions 3.1.0-9, 3.1.1.7 and 3.3.0-0 contain Apache Log4j libraries that are affected by the security vulnerabilities CVE-2019-17571 and CVE-2021-4104.
Symptom |
NA |
Environment |
All |
Trigger |
The IBM Spectrum Scale HDFS Transparency connector is not vulnerable in default configurations. |
Workaround |
Manually patch affected log4j libraries. |
|
5.1.2.2 |
HDFS Connector |
IJ36349 |
Critical |
GPFS daemon could assert while running mmadddisk. This can only happen if a new storage pool is being created as a result of running mmadddisk and a storage pool had been deleted in the past via mmdeldisk.
Symptom |
Abend/Crash |
Environment |
All |
Trigger |
Creating a new storage pool with the mmadddisk command. |
Workaround |
Increase number of disks being added with the mmadddisk command or avoid creating a new storage pool. |
|
5.1.2.2 |
Core GPFS |
IJ36895 |
High Importance
|
"More than 22 minutes searching for a free buffer in the pagepool" assertion failure.
Symptom |
Abend/Long Waiters |
Environment |
All |
Trigger |
This problem is more likely to occur in a cluster which has file systems with both large block size and small block size (compared to scatter buffer size) |
Workaround |
Change 'scatterBufferSize' config to a smaller size. |
|
5.1.2.2 |
Core GPFS |
IJ35443 |
Suggested |
There are regular error messages '/sbin/ibportstate: Failed to open' in the systemhealth monitor log.
Symptom |
Error output/message |
Environment |
All |
Trigger |
There are regular error messages '/sbin/ibportstate: Failed to open' in the systemhealth monitor log. |
Workaround |
None |
|
5.1.2.1 |
System health |
IJ35444 |
High Importance
|
AFM Independent filesets with dependent filesets linked inside them have a chance of hitting a deadlock.
Symptom |
Deadlock |
Environment |
Linux (AFM gateway nodes) |
Trigger |
Trigger relationship initialization on an AFM independent fileset with dependent filesets inside it while, at the same time, a problem at the remote site causes the relationship to be put into a bad state. |
Workaround |
None |
|
5.1.2.1 |
AFM |
IJ35448 |
Suggested |
Ubuntu machines are reported with a network health issue of "ib_rdma_libs_wrong_path", even when the required libraries are installed.
Symptom |
Error output/message |
Environment |
Ubuntu (using Infiniband/RDMA) |
Trigger |
Since Debian/Ubuntu introduced Multiarch Architecture Specifiers, most libraries live in a /usr/lib/XXXX-linux-gnu/ directory (where XXXX describes the architecture). The initial check procedure considered only the usual library paths, like /usr/lib64 and /usr/lib. |
Workaround |
None |
|
5.1.2.1 |
System health |
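For IJ35448, the difference between the usual library paths and the multiarch layout can be illustrated with a short Python sketch. This is not the system health monitor's actual check; the function names and the library pattern libibverbs.so* are assumptions chosen for illustration. The sketch searches /usr/lib64 and /usr/lib and, in addition, every /usr/lib/XXXX-linux-gnu/ directory.

```python
import glob
import os


def candidate_lib_dirs(root="/"):
    """List the usual library directories plus multiarch directories."""
    base = os.path.join(root, "usr")
    dirs = [os.path.join(base, "lib64"), os.path.join(base, "lib")]
    # Debian/Ubuntu multiarch: /usr/lib/<arch>-linux-gnu/
    dirs += sorted(glob.glob(os.path.join(base, "lib", "*-linux-gnu")))
    return [d for d in dirs if os.path.isdir(d)]


def find_library(pattern, root="/"):
    """Return all files matching `pattern` in the candidate directories."""
    hits = []
    for directory in candidate_lib_dirs(root):
        hits += sorted(glob.glob(os.path.join(directory, pattern)))
    return hits


if __name__ == "__main__":
    # Hypothetical library name; the APAR text does not list the libraries.
    for path in find_library("libibverbs.so*"):
        print(path)
```

A check limited to the first two directories would return no hits on a multiarch system even when the library is installed, which matches the false "ib_rdma_libs_wrong_path" report described in this entry.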
IJ35449 |
Suggested |
When running tail -f on an audit log from a node that is not the writing node, tail -f will not show newly written events.
Symptom |
Unexpected Results/Behavior |
Environment |
Linux |
Trigger |
Running tail -f on an audit log. |
Workaround |
None |
|
5.1.2.1 |
File audit logging |
IJ35466 |
Suggested |
When a subfolder or audit log is created under the File Audit Logging fileset, it inherits a default SELinux security context. This default value does not allow applications such as rsyslog to read the audit log contents.
Symptom |
Unexpected Results/Behavior |
Environment |
Linux |
Trigger |
NA |
Workaround |
NA |
|
5.1.2.1 |
File audit logging |
IJ35486 |
Suggested |
logAssertFailed: exp(vrsP->index == index)
Symptom |
Abend/Crash |
Environment |
Linux |
Trigger |
This assertion may occur when GPFS detects and breaks a hung RDMA request. |
Workaround |
None |
|
5.1.2.1 |
RDMA |
IJ35487 |
High Importance
|
mmlsquota is reporting wrong results with: 1. extra output lines with "no limits" for users or groups that don't have usage on the fileset 2. extra output lines, all showing "no limits" when no limits (quotas) are set for a user or group in the fileset
Symptom |
Unexpected Results/Behavior |
Environment |
All |
Trigger |
Issuing mmlsquota when perfileset-quota is enabled. |
Workaround |
None |
|
5.1.2.1 |
Quotas |
IJ35318 |
Suggested |
IBM Spectrum Scale ships several ILM samples. One of them is the mmfind tool, and to use the tool, mmfindUtil_processOutputFile.c needs to be compiled. However, the compilation of mmfindUtil_processOutputFile.c fails on some Linux distributions.
Symptom |
Unexpected Results/Behavior |
Environment |
Linux |
Trigger |
Compiling mmfindUtil_processOutputFile.c |
Workaround |
Modify mmfindUtil_processOutputFile.c before compiling it. |
|
5.1.2.1 |
Admin commands |
IJ35537 |
High Importance
|
A newly mounting node, either due to a user mount or an expelled node rejoining the cluster, can fail with assert 'llfP->lockRangeNode != NodeAddr(-1U, 0, NodeAddr::naNormal)' if it happens in the middle of an mmrestripefs, mmadddisk, mmdeldisk, or mmfsck operation.
Symptom |
Node expel/Lost Membership |
Environment |
All |
Trigger |
Mounted node failure in the middle of mmrestripefs. |
Workaround |
None |
|
5.1.2.1 |
Core GPFS |
IJ35567 |
Suggested |
When using RDMA via RoCE, there are certain network error scenarios where not all possible RDMA connections from an NSD client to an NSD server are established.
Symptom |
Network Performance |
Environment |
Linux |
Trigger |
- the NSD server port has no IP address assigned. - RDMA Connection Manager address or route resolution fails. - RDMA Connection Manager connection request fails. |
Workaround |
None |
|
5.1.2.1 |
RDMA |
IJ35578 |
Suggested |
When GDS is disabled, the RDMA subsystem may post GDS related error messages even though everything is working correctly.
Symptom |
Documentation Problem |
Environment |
Linux |
Trigger |
- GPU Direct Storage support is disabled. - libmlx5.so is not installed on the system or libmlx5.so is downlevel. |
Workaround |
None |
|
5.1.2.1 |
RDMA |
IJ35598 |
Critical |
GPFS API calls from 32-bit application fail on SLES 15 SP3.
Symptom |
Error output/message |
Environment |
Linux (x86_64 and s390x) |
Trigger |
Running on SLES 15 SP3 and an application trying to issue 32-bit GPFS API calls. |
Workaround |
Apply the fix manually by editing the file /usr/lpp/mmfs/src/gpl-linux/ss.c to remove the checks for HAVE_COMPAT_IOCTL, then run mmbuildgpl again, and restart GPFS. |
|
5.1.2.1 |
GPFS API |
IJ35140 |
High Importance
|
Daemon crash getting AFM statistics from the mmdiag command.
Symptom |
mmfsd daemon crash |
Environment |
Linux |
Trigger |
AFM stats collection using the mmdiag command |
Workaround |
Reset AFM stat counter frequently using the mmdiag command. |
|
5.1.2.1 |
AFM |
IJ35622 |
Critical |
If there are node failures during a burst of file create or delete activity, then it is possible for the cached free inode counters on the file system manager to become out of date.
Symptom |
Error output/message |
Environment |
All |
Trigger |
Node failures in the middle of a large number of file creates or deletes |
Workaround |
Run 'mmfsadm test imapWork <fs> inodeManager' or 'mmchmgr <fs> <another node>'. |
|
5.1.2.1 |
Core GPFS |
IJ35686 |
Suggested |
When the afmObjectXattr flag is enabled, getattr fails to perform file validation with home. As a result, lookup in LU mode cannot refresh the metadata of files at the cache when the files change at home.
Symptom |
Unexpected Results/Behavior |
Environment |
Linux |
Trigger |
Metadata mismatches on an afmObjectXattr enabled fileset |
Workaround |
None |
|
5.1.2.1 |
AFM |
IJ35789 |
High Importance
|
When a single node is unavailable during 'mmauth genkey new', it results in GPFS (mmfsd) not starting on this node.
Symptom |
GPFS does not start |
Environment |
All |
Trigger |
Issuing 'mmauth genkey new' |
Workaround |
To update the node which was unavailable during 'mmauth genkey new' with the latest key files, run the following command on a node which was available during 'mmauth genkey new': mmauth genkey propagate -N <NODE_UNAVAILABLE_DURING_MMAUTH_GENKEY_NEW> |
|
5.1.2.1 |
GPFS startup, Admin commands, CCR |
IJ35792 |
Suggested |
When using 'mmqos class delete', there is a check to prevent deletion of a class that is referenced or used by a throttle object. The current error message does not make this clear.
Symptom |
Error output/message |
Environment |
Linux |
Trigger |
Throttle objects that use the class you want to delete |
Workaround |
Remove any throttle objects that use the class you want to delete. |
|
5.1.2.1 |
QoS |
IJ35751 |
Critical |
AFM gateway node crashes during fileset recovery because invalid file handles are used to get inodes in the kernel.
Symptom |
Crash |
Environment |
Linux |
Trigger |
AFM fileset recovery |
Workaround |
None |
|
5.1.2.1 |
AFM |
IJ35795 |
High Importance
|
Triggering a ChangeSecondary for a DR Primary mode fileset to the same target inband or triggering a Resync on a SW fileset which is in unmounted state, with resyncV2 feature enabled can cause the resync/changeSecondary to fail and not proceed.
Symptom |
Unexpected Behavior |
Environment |
Linux (AFM gateway nodes) |
Trigger |
Triggering a ChangeSecondary for a DR Primary mode fileset to the same target inband or triggering a Resync on a SW fileset which is in unmounted state, with resyncV2 feature enabled. |
Workaround |
Disable ResyncV2 and run ResyncV1 to get changeSecondary or Resync commands to work on the DR/SW filesets. |
|
5.1.2.1 |
AFM |
IJ35791 |
High Importance
|
When IO workload is running, such as NSD read on a general GPFS, ECE, or ESS, the TCP connection may be incorrectly reset. If all the connections to the peer node are reset, this will trigger the node to be expelled.
Symptom |
Node expel/Lost Membership |
Environment |
All |
Trigger |
Large IO read in progress |
Workaround |
None |
|
5.1.2.1 |
Core GPFS |
IJ35796 |
Critical |
Slow readdir and lookup performance on AFM caching mode filesets under heavy workload
Symptom |
Slow IO |
Environment |
Linux |
Trigger |
AFM caching with heavy workload |
Workaround |
Restart AFM gateway node. |
|
5.1.2.1 |
AFM |
IJ35797 |
Suggested |
Sometimes stealing threads are not started in time to steal buffers for I/O threads which may degrade performance.
Symptom |
|
Environment |
Linux |
Trigger |
The problem may be triggered with heavy I/O workload. |
Workaround |
Remove any throttle objects that use the class you want to delete. |
|
5.1.2.1 |
ESS, GNR |
IJ35809 |
Suggested |
When there is no mmqos configuration and the command 'mmqos report list -Y' is run, it shows mmlsqos instead of mmqos in the output.
Symptom |
Error output/message |
Environment |
Linux |
Trigger |
No mmqos data configured |
Workaround |
NA |
|
5.1.2.1 |
QoS |
IJ35838 |
Suggested |
When the last block of a file is not a full GPFS block, the replica compare function could report a false replica mismatch.
Symptom |
Error output/message |
Environment |
All |
Trigger |
Running replica compare with mmrestripefs or mmrestripefile. |
Workaround |
None |
|
5.1.2.1 |
Core GPFS |
IJ35851 |
Suggested |
Customers may run the cluster with an unsupported quorum/tiebreaker disk configuration.
Symptom |
Cluster runs with unsupported quorum/tiebreaker disk configuration |
Environment |
Linux |
Trigger |
Unsupported quorum/tiebreaker disk configuration |
Workaround |
None |
|
5.1.2.1 |
Core GPFS |
IJ35941 |
High Importance
|
When IO workload is running, such as NSD read on a general GPFS, ECE, or ESS, the TCP connection may be incorrectly reset. If all the connections to the peer node are reset, this will trigger the node to be expelled.
Symptom |
Node expel/Lost Membership |
Environment |
All |
Trigger |
Large IO read in progress |
Workaround |
None |
|
5.1.2.1 |
Core GPFS |
IJ36065 |
High Importance
|
In a mixed cluster which contains 5.1.2.0 and pre-5.1.2.0 nodes, if a quota function is enabled on a file system with a format version that is lower than 4.1.1.0, the GPFS daemon on the quota client node may crash with signal 11. The following stack dump is shown in mmfs.log: 2021-11-04_14:43:19.968+0100: [E] Signal 11 at location 0x55D8E19220A5 in process 28081, link reg 0xFFFFFFFFFFFFFFFF. 2021-11-04_14:43:20.867+0100: [D] Traceback: 2021-11-04_14:43:20.868+0100: [D] #0: 0x000055D8E19220A5 QuotaClient::sendQuotaShareRequest(QuotaEntryClt*, QuotaShare*, unsigned int, unsigned int, unsigned int*, int, long long, long long) + 0x6D5 at ??:0
Symptom |
Abend/Crash |
Environment |
All |
Trigger |
Code bug in 5.1.2 GA build |
Workaround |
Disable the quota function; or upgrade the file system version to a value larger than or equal to 4.1.1.0; or upgrade the pre-5.1.2.0 node to 5.1.2.0. |
|
5.1.2.1 |
Quotas |
IJ35924 |
Suggested |
If a proxy is configured for the CALLHOME component of IBM Spectrum Scale, the system health component CALLHOME (mmhealth node show) will become DEGRADED causing the PTF_Updates check to fail.
Symptom |
Unexpected Results/Behavior |
Environment |
Linux |
Trigger |
A proxy is configured for CALLHOME |
Workaround |
Either do not use a proxy setup for callhome,
or disable the PTF_update check as follows:
1. On the callhome master node, edit the /var/mmfs/mmsysmon/mmsysmonitor.conf
file in the "[ptfupdates]" section to set the value "monitors_enabled = false".
2. Restart the system health monitoring with: mmsysmoncontrol restart |
|
5.1.2.1 |
Call home |
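For IJ35924, step 1 of the workaround amounts to the configuration change sketched below in Python. The section name "[ptfupdates]" and the key "monitors_enabled = false" come from the workaround text. The function name, the sample file content, and the initial value "true" are assumptions. The sketch works on a string only, because configparser drops comments and may reorder formatting, so the real /var/mmfs/mmsysmon/mmsysmonitor.conf file is better edited by hand.

```python
import configparser
import io

# Hypothetical excerpt; the real file contains more sections.
SAMPLE = """\
[ptfupdates]
monitors_enabled = true
"""


def disable_ptf_check(conf_text: str) -> str:
    """Return `conf_text` with monitors_enabled set to false in [ptfupdates]."""
    parser = configparser.ConfigParser()
    parser.read_string(conf_text)
    if not parser.has_section("ptfupdates"):
        parser.add_section("ptfupdates")
    parser.set("ptfupdates", "monitors_enabled", "false")
    out = io.StringIO()
    parser.write(out)
    return out.getvalue()


updated = disable_ptf_check(SAMPLE)
```

After the file is changed on the callhome master node, the system health monitor still has to be restarted with mmsysmoncontrol restart, as described in step 2 of the workaround.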
INFO001 |
Suggested |
The release of v5.1.2.0 aligned with the release of the v5.1.1.4 PTF. Please refer to v5.1.1.x for the list of APARs through v5.1.1.4.
|
5.1.2.0 |
INFO |
IJ35469 |
Suggested |
Upgrading from a version prior to 5.1.1 and then backing out may cause CCR to stop working.
Symptom |
Error output/message; Cluster/File System Outage; Upgrade/Install failure |
Environment |
All |
Trigger |
The problem occurs when a cluster is upgraded to GPFS version 5.1.1 or later and the upgrade is then backed out. This only happens if the cluster has an authorized cluster or remote cluster defined. |
Workaround |
Restore node by using mmsdrrestore. |
|
5.1.2.0 |
Core GPFS |
IJ34398 |
Suggested |
ACL changed when running AFM failover in SW
Symptom |
Fileset root ACL gets changed |
Environment |
Linux |
Trigger |
A fileset is being recreated or the fileset root metadata is changed while running failover. |
Workaround |
None |
|
5.1.2.0 |
AFM |