IBM Spectrum Scale APARs Resolved in 5.1.2.x

When the network is unreliable, an assertion may be hit when a TCP connection is established or re-established.

Symptom	Abend/Crash
Environment	ALL Operating System environments
Trigger	An unreliable network that causes TCP connections to reconnect
Workaround	No

5.1.2.15

All Scale Users

IJ49372

When a workload on Windows creates and deletes many files and directories in a short span, the inode numbers assigned to GPFS objects may be reused. If a stale inode entry persists in the GPFS cache because of in-flight hold counts, a conflict between the old and new object types can cause this stale entry to produce a file or directory not found error.

Symptom	Unexpected Results/Behavior.
Environment	Windows/x86_64 only.
Trigger	Running a workload on Windows which continuously creates and deletes lots of files and directories quickly.
Workaround	None

5.1.2.15

All Scale Users.

IJ49543

Spectrum Scale Erasure Code Edition interacts with third-party software/hardware APIs for internal disk enclosure management. If the management interface becomes degraded and commands start to hang in the kernel, the hang may also block communication handling threads. This causes a node to fail to renew its lease, so it is fenced off from the rest of the cluster, which may lead to additional outages. A previous APAR was issued for this in 5.1.4, but that fix was incomplete.

Symptom	Hang/Deadlock/Unresponsiveness/Long Waiters
Environment	Linux Only
Trigger	Degradation in back-end storage management that causes commands to hang in the kernel.
Workaround	The node with hardware problems shows the waiter 'Until NSPDServer discovery completes'. If that node is also being expelled and these GPFS waiters exceed 2 minutes, rebooting the node is recommended.

5.1.2.15

ESS/GNR

IJ49373

The daemon assert fromNode != regP->owner in file allocM.C goes off, which crashes the daemon.

Symptom	Daemon crash
Environment	All Operating Systems
Trigger	The mmdefragfs or mmdf command is running while there are node failures or the file system is low on free space.
Workaround	No

5.1.2.15

All Scale Users

IJ49542

The pmsensor GPFSVFSX reports 0 for read and write statistics even though read/write operations are taking place. The data format provided by mmpmon is not the one Zimon expects, which makes the output wrong.

Symptom	Error output/message
Environment	ALL Operating System environments
Trigger	Reading GPFSVFSX stats
Workaround	None

5.1.2.15

perfmon (Zimon)

IJ49650

fallocate is currently blocked on AFM caching modes because the afmctl file is not guaranteed to be present in these modes, so the operation cannot safely be supported.

Symptom	Error output
Environment	ALL Operating System environments
Trigger	Perform fallocate on AFM caching mode filesets (SW/IW)
Workaround	None

5.1.2.15

AFM

IJ49472

Snapshot creation cannot complete because a background file deletion runs into an infinite loop on a corrupted compressed file.

Symptom	Deadlock
Environment	All Operating Systems
Trigger	Corrupted compression file.
Workaround	None

5.1.2.15

Compression

IJ49473

mmbackup invokes the tslssnapshot command multiple times during a snapshot backup. This has a small performance impact if the file system has a large number of snapshots.

Symptom	Performance Impact/Degradation
Environment	all platforms that support mmbackup.
Trigger	This problem could occur if snapshot backup is executed for the file system that has lots of snapshots.
Workaround	none

5.1.2.15

mmbackup

IJ49825

Sometimes, the system monitor may report a warning message: 'statd_multiple WARNING The rpc.statd process is running multiple times.' This is caused by a short-lived process forked from the 'statd' process.

Symptom	sysmon may report the following warning message: 'statd_multiple WARNING The rpc.statd process is running multiple times.'
Environment	All Linux
Trigger	This might happen if 'statd' forks a child process at the same moment that sysmon checks for the 'statd' process. The forked process is short-lived and should therefore not be counted.
Workaround	NA

5.1.2.15

System Health
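
The idea that a forked child should not be counted can be sketched as follows. This is only an illustration, not the actual system monitor logic: the ProcInfo tuple, the count_top_level helper, and the sample PIDs are all hypothetical.

```python
from typing import NamedTuple


class ProcInfo(NamedTuple):
    """Hypothetical process record: PID, parent PID, command name."""
    pid: int
    ppid: int
    name: str


def count_top_level(procs, name):
    """Count processes called `name` whose parent is not itself called `name`.

    A child forked by the daemon has the daemon as its parent, so it is
    skipped and only independently started instances are counted.
    """
    same_name_pids = {p.pid for p in procs if p.name == name}
    return sum(
        1 for p in procs
        if p.name == name and p.ppid not in same_name_pids
    )


# Hypothetical snapshot: rpc.statd (pid 900) has just forked pid 1234.
snapshot = [
    ProcInfo(1, 0, "systemd"),
    ProcInfo(900, 1, "rpc.statd"),
    ProcInfo(1234, 900, "rpc.statd"),
]
print(count_top_level(snapshot, "rpc.statd"))  # prints 1
```

A naive count by name alone would report two rpc.statd processes for this snapshot, which matches the spurious warning described above.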

IJ49540

When an RDMA connection to a remote node has to be shut down due to network errors (e.g. a network link goes down), it can sometimes happen that the affected RDMA connection is not closed and the resources assigned to it (memory, VERBS Queue Pair, ...) are not freed.

Symptom	Unexpected Results/Behavior
Environment	ALL Linux OS environments
Trigger	verbsRdmaSend must be enabled. Loss of an RDMA connection to a node because of network errors in the RDMA fabric.
Workaround	No work around available

5.1.2.15

RDMA

IJ49541

Ganesha crashes cause "health" alerts. The "/var/log/ganesha.log" file contains a crash backtrace similar to the following: free_client_id :RW LOCK :CRIT :Error 16, Destroy mutex 0x3fff6c2fedd0 (&clientid->cid_mutex) at nfs-ganesha-3.5-ibm071.22/SAL/nfs4_clientid.c:348. The entry contains "Error 16" and the source code reference "nfs-ganesha-3.5-ibm071.22/SAL/nfs4_clientid.c:348".

Symptom	Ganesha Crash
Environment	Linux Only
Trigger	The problem mostly occurs if there is a delay in processing NFSv4 client's renew request due to a resource crunch.
Workaround	None

5.1.2.15

NFS

IJ49648

Files are not re-validated in an AFM cascading relationship because of a readdir optimization. This happens if the home fileset is AFM enabled with a COS backend.

Symptom	Unexpected Results
Environment	All Linux OS environments
Trigger	AFM cascading relationship with AFM+COS fileset.
Workaround	None

5.1.2.15

AFM

IJ49649

GPFS daemon could fail unexpectedly with assert: Assert exp (nPrefetchedBuffers > 0). This could happen when DIO is used to append to a file.

Symptom	Abend/Crash
Environment	ALL Operating System environments
Trigger	Append to a file using DIO.
Workaround	Set dioReentryThreshold configuration variable to 2

5.1.2.15

All Scale Users

IJ49826

The kernel crashes when executing programs that call the gpfs_ireadx() interface on DMAPI-disabled file systems (e.g., mmrestorefs, or a "cp" or "tar" command that detects sparse holes in source files with the lseek(2) interface).

Symptom	Abend/Crash
Environment	Linux Only
Trigger	Executing programs that call the gpfs_ireadx() interface on DMAPI-disabled file systems (e.g., mmrestorefs, or a "cp" or "tar" command that detects sparse holes in source files with the lseek(2) interface).
Workaround	None

5.1.2.15

All Scale Users

IJ48661

When mmbackup generates the policy rule to select backup candidates, it composes the pathname list using the snapshot name in the snapshot backup case. In a fileset backup, mmbackup unconditionally treats the snapshot as a fileset snapshot even if it is a global snapshot, so the generated pathname is incorrect.

Symptom	Component Level Outage
Environment	All platforms that support mmbackup
Trigger	Run fileset backup using global snapshot
Workaround	Use fileset snapshot for fileset backup

5.1.2.14

mmbackup

IJ46155

File creation or close stalls in the CloseHandlerThread thread, which shows the long waiter "waiting for dealloc queue flush".

Symptom	Small-file creations wait on pending closes, which slows down file creation performance.
Environment	All Operating Systems
Trigger	Many file creations and closes while many other processes are performing space deallocations.
Workaround	None

5.1.2.14

All Scale Users

IJ48826

Prefetch and recovery with list files use ftell on the open FILE pointer to get the size of the file. Because this value is 32-bit in nature, it can end up as a junk value, and the list file is then split incorrectly across the processing threads.

Symptom	Prefetch fails to process the list file properly and is seen looping over a smaller subset.
Environment	All Linux OS environments (AFM Gateway nodes)
Trigger	Running prefetch with a single large list file that is > 2 GB in size.
Workaround	Split the single list file of > 2 GB into smaller lists of < 2 GB each and use those for prefetch.

5.1.2.14

AFM
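
The workaround of splitting an oversized list file could be scripted roughly as below. This is a minimal sketch, not an IBM-provided tool: the split_list_file helper, the part-file naming, and the paths in the usage note are hypothetical, and it assumes the list file holds one entry per line.

```python
import os


def split_list_file(path, max_bytes, out_dir):
    """Split `path` on line boundaries into parts of at most `max_bytes` each.

    A single line longer than `max_bytes` is still written whole, to a part
    of its own. Returns the part file paths in order.
    """
    parts = []
    out = None
    written = 0
    base = os.path.basename(path)
    with open(path, "rb") as src:
        for line in src:
            # Start a new part for the first line, or when the current
            # non-empty part would exceed the limit.
            if out is None or (written and written + len(line) > max_bytes):
                if out is not None:
                    out.close()
                part = os.path.join(out_dir, f"{base}.part{len(parts):04d}")
                out = open(part, "wb")
                parts.append(part)
                written = 0
            out.write(line)
            written += len(line)
    if out is not None:
        out.close()
    return parts
```

For example, split_list_file("/tmp/prefetch.list", 2**31 - 1, "/tmp/parts") would keep every part just under 2 GiB; each part can then be passed to prefetch separately.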

IJ48827

AFM replication fails with error 22 if the remote file is a symlink during the write or create operation.

Symptom	Unexpected results
Environment	All Linux OS environments
Trigger	AFM cache conflict
Workaround	None

5.1.2.14

AFM

IJ49085

AFM resync fails with error 9 and the queue gets stuck.

Symptom	Unexpected results
Environment	All Linux OS environments
Trigger	AFM resync
Workaround	None

5.1.2.14

AFM and AFM DR

IJ49086

File create performance could degrade when many small files are created concurrently in many directories, because of mutex contention. This also leads to higher CPU usage by the GPFS daemon.

Symptom	Performance Impact/Degradation
Environment	ALL Operating System environments
Trigger	Concurrent create of many small files in many directories
Workaround	Set maxInodeDeallocHistory configuration variable to 0

5.1.2.14

All Scale Users

IJ49087

Deleting snapshots or accessing snapshot files may fail with error code 214, and an FSSTRUCT errNo=1116 (FSErrSnapInodeModified) is logged in the system log file.

Symptom	Operation fails with error code 214 and FSSTRUCT errNo=1116 is logged in the system log file
Environment	All Operating Systems
Trigger	The file system manager node fails while files are being updated and a snapshot exists
Workaround	None

5.1.2.14

Snapshot

IJ49088

A snapshot deletion can sometimes take longer than earlier snapshot deletions.

Symptom	Slow snapshot deletion
Environment	All Operating Systems
Trigger	Snapshot deletion when LROC device is configured on a client node.
Workaround	None

5.1.2.14

snapshot and LROC

IJ48869

File data is lost when copying or archiving data from snapshot and clone files (e.g., using a "cp" or "tar" command that detects sparse holes in source files with the lseek(2) interface).

Symptom	Data Loss
Environment	Linux Only
Trigger	Using copy or archive tools that detect sparse holes in the source file with the lseek(2) interface.
Workaround	Switch to other copy or archive tools to copy or archive the data from snapshot and clone files.

5.1.2.14

Snapshot and clone files
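
For context, hole detection with the lseek(2) interface generally looks like the sketch below, which walks a file with SEEK_DATA and SEEK_HOLE and copies only the ranges reported as data. It illustrates the general technique, not the code of cp, tar, or GPFS; data_segments and sparse_copy are hypothetical helper names, and the sketch assumes Linux, where Python exposes os.SEEK_DATA and os.SEEK_HOLE.

```python
import errno
import os


def data_segments(path):
    """Return (start, end) byte ranges that the file system reports as data."""
    segments = []
    fd = os.open(path, os.O_RDONLY)
    try:
        size = os.fstat(fd).st_size
        offset = 0
        while offset < size:
            try:
                start = os.lseek(fd, offset, os.SEEK_DATA)
            except OSError as exc:
                if exc.errno == errno.ENXIO:  # only a trailing hole remains
                    break
                raise
            end = os.lseek(fd, start, os.SEEK_HOLE)
            segments.append((start, end))
            offset = end
    finally:
        os.close(fd)
    return segments


def sparse_copy(src, dst):
    """Copy only the reported data ranges, leaving holes unwritten in dst."""
    size = os.path.getsize(src)
    with open(src, "rb") as fin, open(dst, "wb") as fout:
        for start, end in data_segments(src):
            fin.seek(start)
            fout.seek(start)
            remaining = end - start
            while remaining > 0:
                chunk = fin.read(min(remaining, 1 << 20))
                if not chunk:
                    break
                fout.write(chunk)
                remaining -= len(chunk)
        fout.truncate(size)  # preserve the length when the file ends in a hole
```

A copy of this kind trusts the file system completely: any range wrongly reported as a hole is never read, so it arrives as zeros in the destination. That is consistent with the data loss symptom described above.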

IJ42454

File data is lost when copying or archiving data from migrated files (e.g., using a "cp" or "tar" command that detects sparse holes in source files with the lseek(2) interface).

Symptom	Data Loss
Environment	Linux Only
Trigger	Using copy or archive tools that detect sparse holes in the source file with the lseek(2) interface.
Workaround	Switch to other copy or archive tools to copy or archive the data from migrated files, or recall the file before using the copy or archive applications.

5.1.2.14

DMAPI

IJ48629

A race between stat()/gpfs_statlite() and an inode token revoke causes a log assert.

Symptom	Abend/Crash
Environment	ALL Operating Systems
Trigger	A file is actively written on one node and stat() or gpfs_statlite() is called repeatedly on another node
Workaround	Set config parameters statliteMaxAttrAge and statMaxAttrAge to 0 to disable stat lite.

5.1.2.14

All Scale Users

IJ48911

The assert "logAssertFailed: oldDA1Found[i].compAddr(synched1[I])" goes off, which crashes the mmfsd daemon and can ultimately leave the file system unmountable on any node.

Symptom	Abend/Crash
Environment	All Operating Systems
Trigger	Run fsck to fix the duplicated disk address on compressed files.
Workaround	None

5.1.2.14

Compression

IJ47843

Kernel crash with assert: nPrefetchedBuffers > 0. This could happen when an application uses multiple threads to perform sequential reads or writes of more than 65535 blocks on the same open file. The starting offset of the read/write must not be on a GPFS block boundary.

Symptom	Abend/Crash
Environment	ALL Operating System environments
Trigger	Performing sequential reads/writes on the same file using multiple threads where the starting offset of each read/write is not on a GPFS block boundary.
Workaround	Close and reopen the file before performing more than 65535 sequential reads/writes on the same file using multiple threads.

5.1.2.13

All Scale Users

IJ48032

GNR daemon assert dpP->dpGetBlockDevice() == pdBlockDeviceRP goes off in response to certain pdisk device state changes, which will bring down mmfsd. This problem was introduced in GPFS 5.1.5.1, and impacts GNR systems running the following code levels:
- 5.1.2.4+
- 5.1.5.1+
- 5.1.6+
- 5.1.7.0 but not 5.1.7.1+ (5.1.7.1+ gets a workaround patch)

Symptom	Abend/Crash
Environment	Linux Only
Trigger	A condition occurs in which the pdisk device paths remain visible to the operating system, but the pdisk no longer believes it should be associated with the block device that those paths represent. The most common cause of this dissociation is that the pdisk descriptor labels at the earlier LBA become overwritten or corrupted. This kind of corruption is often the result of hardware errors, but it can also occur if some external process interferes with and corrupts the disk areas that are managed by GPFS and GNR. The dissociation step had a regression from another fix, which causes the assert. Other conditions for the dissociation are possible, but had not been properly identified at the time of this fix.
Workaround	None

5.1.2.13

ESS/GNR

IJ48302

In certain cases a reference on a block device is not released. The reference counter then grows to a large value and the block device module cannot be unloaded.

Symptom	Increase in reference count for block device
Environment	ALL Operating System environments
Trigger	Accessing the block device, for example while querying disk info, triggers the reference counter leak
Workaround	None

5.1.2.13

File system

IJ48287

AFM fails to replicate files with the afmFastCreate option if a newly created file is renamed to a different directory and its original parent is deleted.

Symptom	Unexpected results, file tree mismatch
Environment	All Linux OS environments
Trigger	Using afmFastCreate option to replicate data
Workaround	Disable afmFastCreate

5.1.2.13

AFM

IJ48288

An assert goes off when a temporary file (created with O_TMPFILE) is linked with linkat and the in-inode data has to be evicted to accommodate AFM xattrs.

Symptom	Crash
Environment	All Linux OS environments
Trigger	A temporary file (created with O_TMPFILE) with data in the inode is linked with linkat on the AFM fileset
Workaround	None

5.1.2.13

AFM

IJ47005

When mmbackup calculates the number of backup candidates, it counts migrated files as backup candidates. This is incorrect unless --backup-migrated is used, because migrated files will not be backed up. The incorrect count causes mmbackup to complete with an error due to "some skipped files".

Symptom	Component Level Outage
Environment	All platforms that support mmbackup
Trigger	This problem could occur when mmbackup is run without the --backup-migrated option and some files are migrated.
Workaround	None

5.1.2.12

mmbackup

IJ46806

mmchfileset -t (used to set the fileset comment) cannot handle a null string.

Symptom	Error output/message; Unexpected Results/Behavior
Environment	ALL Operating System environments
Trigger	mmchfileset fails to set null comment.
Workaround	Instead of a null string, you may want to use a single space as the input.

5.1.2.12

Admin Commands

IJ46804

IBM Storage Scale will crash at startup if RDMA is enabled and the number of RDMA devices on a node exceeds 128.

Symptom	Abend/Crash
Environment	ALL Linux OS environments
Trigger	RDMA is enabled and the number of RDMA devices on a node exceeds 128.
Workaround	Disable RDMA or reduce the number of RDMA devices to 128 or fewer.

5.1.2.12

RDMA

IJ47006

When a reconnect happens, an error with errno 76 may be encountered, which indicates that the connection is not connected and results in LOGSHUTDOWN.

Symptom	Abend/Crash
Environment	ALL Operating System environments
Trigger	An unreliable network that causes TCP connections to reconnect
Workaround	None

5.1.2.12

All Scale Users

IJ46805

On an SW mode AFM fileset, if a new directory is created at home and a directory prefetch of this new directory is run with the --force option, the SW cache should be able to cache all data in the new directory. However, mmafmctl validates the new directory locally, and SW cannot fetch it from home, so the prefetch with --force fails with a no such directory error.

Symptom	Unexpected Behavior
Environment	All Linux and AIX OS environments.
Trigger	Performing prefetch on a new directory created at home on an SW mode AFM fileset at the cache.
Workaround	Run mmafmlocal rstat ${dir} before performing the --force directory prefetch on the SW fileset.

5.1.2.12

AFM

IJ47007

If the name of a File System contains one or more underscore characters '_' and Clustered Watch Folder is enabled on said File System then events that are supposed to be delivered to the sink are never sent.

Symptom	Unexpected Results/Behavior
Environment	ALL Operating System environments
Trigger	A File System Name that contains one or more underscore characters.
Workaround	Remove underscore characters from the names of File Systems where Clustered Watch Folder is to be used.

5.1.2.12

Watch Folder

IJ47001

When any disk is added to a file system, the ill_unbalanced flag is set to indicate that the file system can be further rebalanced. mmhealth sees this flag and downgrades the file system until the mmrestripefs command is run with the -b option.

Symptom	mmhealth reports the ill_unbalanced_fs state.
Environment	All Operating Systems
Trigger	Adding a descOnly disk to a Scale file system.
Workaround	None

5.1.2.12

All Scale Users

IJ47004

Two client nodes work on the same two regions for block deallocation. Each node owns one of the two regions and is flushing it, while the DeallocHelperThread on each node also requests ownership of the region owned by the other node. Each ownership revoke request is then blocked by the other: both regions are in the flushing state, yet each flush is waiting on an ownership request to the other node. This forms a deadlock.

Symptom	Deadlock
Environment	All Operating Systems
Trigger	Data block deallocations for user files from at least two different client nodes.
Workaround	Restart GPFS on the client node showing long waiter on allocMsgTypeRequestOwnership RPC message from DeallocHelperThread.

5.1.2.12

All Scale Users
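
The circular wait in this entry can be modelled as a wait-for graph, where a cycle means deadlock. The sketch below is a generic illustration only: find_wait_cycle and the clientA/clientB names are hypothetical and are not part of GPFS.

```python
def find_wait_cycle(waits_for):
    """Return the participants forming a circular wait, or None.

    `waits_for` maps each participant to the single participant it is
    currently blocked on.
    """
    for start in waits_for:
        path = []
        seen = set()
        node = start
        # Follow the chain of waits until it ends or revisits a participant.
        while node in waits_for and node not in seen:
            seen.add(node)
            path.append(node)
            node = waits_for[node]
        if node in seen:
            return path[path.index(node):]
    return None


# clientA is flushing its own region and requests clientB's region;
# clientB is flushing its own region and requests clientA's region.
waits = {"clientA": "clientB", "clientB": "clientA"}
print(find_wait_cycle(waits))
```

Removing either edge, for example by letting one flush finish before its node requests the other region, breaks the cycle and the function returns None.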

IJ47003

An AFM gateway running on RHEL 8.8 or RHEL 9.2 fails to perform a full readdir operation at the cache, so entries are only partially fetched from home.

Symptom	Unexpected results, data mismatch.
Environment	RHEL 8.8 and RHEL 9.2 Linux OS environments
Trigger	Upgrading to newer RHEL releases 8.8 and 9.2
Workaround	Downgrade RHEL to earlier versions.

5.1.2.12

AFM and AFM DR

IJ46764

The Scale daemon assert Assert exp(regP->isOwnerLocal() == 0) in file allocR.C goes off, which brings down the Scale mmfsd daemon process.

Symptom	Abend/Crash
Environment	All Operating Systems
Trigger	Heavy block space allocation and deallocation in the cluster.
Workaround	None

5.1.2.12

All Scale Users

IJ45040

GPFS daemon assert: exp(getDeEntType() == detUnlucky) in Direct.h. This could occur when there is concurrent access to the same directory, with one node deleting a file while another node tries to create the same file.

Symptom	Abend/Crash
Environment	ALL Operating System environments
Trigger	Concurrent delete/create of the same file in a directory from multiple nodes.
Workaround	Avoid deleting and creating the same file in a directory from multiple nodes at the same time.

5.1.2.12

All Scale Users

IJ47220

A race condition in the distributed GNR disk hospital can cause a state update from the master node to a worker node to be rejected.
When the master node wishes to release a disk from the "diagnosing" to the "ok" state, it sends a state broadcast to all worker nodes to instruct them to reflect the pdisk's new master state locally.
However, this broadcast can race with additional disk problem reports that are transmitted from the worker to the master.
The result is that the worker node can reject the master's claim that the disk is healthy, and continue holding the disk in diagnosing.
This can lead to blocked file system I/O unless another state change notification is broadcast from the master, in which case the worker gets another chance to resume I/O to the disk.

Symptom	Stuck IO
Environment	Linux Only
Trigger	This problem can potentially occur when any local I/O error is encountered on a pdisk, but in general the race condition in that path is rare. It is more likely to occur on Storage Scale Erasure Code Edition during periods of network instability, when pdisks are likely to encounter many timeout errors.
Workaround	Restarting the daemon on the nodes with the waiter "Until disk availability stabilizes" can clear out the waiters.

5.1.2.12

ESS/GNR

IJ46382

The online replica compare function could incorrectly flag a replica mismatch on certain metadata files, such as symbolic links, in an AFM-enabled file system.

Symptom	Error output/message
Environment	ALL Operating System environments
Trigger	Run online replica compare function.
Workaround	Ignore replica mismatches on special metadata files such as links.

5.1.2.12

AFM

IJ47406

A change to nsdRAIDDefaultIoTimeout is reset to the default after GPFS is restarted.

Symptom	Unexpected Results/Behavior
Environment	ALL Linux OS environments
Trigger	Restart gpfs daemon
Workaround	Use mmchconfig nsdRAIDDefaultIoTimeout=xxx -i after gpfs is restarted.

5.1.2.12

ESS/GNR

IJ47407

Newer versions of the lscpu command list the CPU family after the Model name. As a result, the code that detects the affected processors and automatically applies a workaround for the GSKIT hang issue does not work as expected. Commands like mmcrcluster or mmaddnode may hang in the GSKIT layer on AMD EPYC family 23 and 25 processors.

Symptom	Installation and admin commands hang.
Environment	Linux OS environments
Trigger	This problem affects AMD EPYC family 23 and 25 processors running with newer version of lscpu command.
Workaround	Add "ICC_SHIFT=3" line in /usr/lpp/mmfs/lib/gsk8/C/icc/icclib/ICCSIG.txt file on problem nodes.

5.1.2.12

Admin Commands, gskit
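
A detection routine can be made independent of field order by looking fields up by name rather than by position. The sketch below illustrates that idea only and is not the detection code shipped with Scale: parse_lscpu and is_amd_epyc_family are hypothetical helpers, and the sample text is an abbreviated, illustrative imitation of lscpu output.

```python
def parse_lscpu(text):
    """Parse lscpu-style "Key: value" lines into a dict keyed by field name."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    return fields


def is_amd_epyc_family(text, families=("23", "25")):
    """True if the output describes an AMD EPYC CPU of one of `families`."""
    info = parse_lscpu(text)
    return ("EPYC" in info.get("Model name", "")
            and info.get("CPU family") in families)


# Illustrative samples: the same fields in the older and the newer order.
older = (
    "Vendor ID: AuthenticAMD\n"
    "CPU family: 23\n"
    "Model name: AMD EPYC 7742 64-Core Processor\n"
)
newer = (
    "Vendor ID: AuthenticAMD\n"
    "  Model name: AMD EPYC 7742 64-Core Processor\n"
    "    CPU family: 23\n"
)
print(is_amd_epyc_family(older), is_amd_epyc_family(newer))
```

Both orderings yield the same result because nothing depends on which line comes first or on how the lines are indented.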

IJ47408

AFM does not check the state of a message when dropping it with the "mmfsadm afm msgdrop" option. It is better to leave in-flight messages alone and drop only messages in any other state. Dropping in-flight messages has long-term implications for the queue: it eventually hits either a safety assertion or a signal 11/6, and the queue is lost.

Symptom	Crash
Environment	All Linux OS Environments (Acting as AFM Gateway nodes)
Trigger	Dropping a message in the AFM queue that is in flight.
Workaround	Carefully put the queue into the suspended state and then drop messages.

5.1.2.12

AFM

IJ47409

This APAR addresses two issues related to NFS-Ganesha that can cause crashes. Here are the details:
Issue 1:
NFS-Ganesha may crash with the following stack trace:
(gdb) bt
#0 0x00003fffa73e52e8 in raise () from /lib64/libpthread.so.0
#1 0x00003fffa7954628 in crash_handler (signo=6, info=0x3ffefac4a468, ctx=0x3ffefac496f0)
at /usr/src/debug/nfs-ganesha-3.5-ibm071.21.308708/MainNFSD/nfs_init.c:247
#2 <signal handler called>
#3 0x00003fffa717fcb0 in raise () from /lib64/libc.so.6
#4 0x00003fffa718200c in abort () from /lib64/libc.so.6
#5 0x00003fffa79b9fd4 in free_client_record (record=0x3fff200ed130) at /usr/src/debug/nfs-ganesha-3.5-ibm071.21.308708/SAL/nfs4_clientid.c:1381
#6 0x00003fffa79ba3d8 in dec_client_record_ref (record=0x3fff200ed130) at /usr/src/debug/nfs-ganesha-3.5-ibm071.21.308708/SAL/nfs4_clientid.c:1461
#7 0x00003fffa79b825c in nfs_client_id_expire (clientid=0x3fff200edbd0, make_stale=false)
at /usr/src/debug/nfs-ganesha-3.5-ibm071.21.308708/SAL/nfs4_clientid.c:914
#8 0x00003fffa79c7820 in reserve_lease_or_expire (clientid=0x3fff200edbd0, update=true)
at /usr/src/debug/nfs-ganesha-3.5-ibm071.21.308708/SAL/nfs4_lease.c:181
#9 0x00003fffa7a59db4 in nfs4_op_renew (op=0x3fff029152d0, data=0x3fff0320d9c0, resp=0x3ffee960cab0)
at /usr/src/debug/nfs-ganesha-3.5-ibm071.21.308708/Protocols/NFS/nfs4_op_renew.c:91
#10 0x00003fffa7a2ed80 in process_one_op (data=0x3fff0320d9c0, status=0x3ffefac4cfd0)
at /usr/src/debug/nfs-ganesha-3.5-ibm071.21.308708/Protocols/NFS/nfs4_Compound.c:920
#11 0x00003fffa7a30010 in nfs4_Compound (arg=0x3ffeeabd84a0, req=0x3ffeeabd7c90, res=0x3ffee9854f60)
at /usr/src/debug/nfs-ganesha-3.5-ibm071.21.308708/Protocols/NFS/nfs4_Compound.c:1327
#12 0x00003fffa794dae4 in nfs_rpc_process_request (reqdata=0x3ffeeabd7c90)
Issue 2:
NFS-Ganesha may crash with the following stack trace:
#0 0x00007f27f0a984fb in raise () from /lib64/libpthread.so.0
#1 0x00007f27f2775d7b in crash_handler (signo=11, info=0x7f20e337e930, ctx=0x7f20e337e800) at /usr/src/debug/nfs-ganesha-3.5-ibm071.21/MainNFSD/nfs_init.c:247
#2 <signal handler called>
#3 0x00007f27f28a3cf5 in nlm_granted_callback (obj=0x7f2430001378, lock_entry=0x7f2204302c20) at /usr/src/debug/nfs-ganesha-3.5-ibm071.21/Protocols/NLM/nlm_util.c:609
#4 0x00007f27f27b133b in try_to_grant_lock (lock_entry=0x7f2204302c20) at /usr/src/debug/nfs-ganesha-3.5-ibm071.21/SAL/state_lock.c:1732
#5 0x00007f27f27b177b in process_blocked_lock_upcall (block_data=0x7f2204305510) at /usr/src/debug/nfs-ganesha-3.5-ibm071.21/SAL/state_lock.c:1780
#6 0x00007f27f27ac19c in state_blocked_lock_caller (ctx=0x7f21c8408650) at /usr/src/debug/nfs-ganesha-3.5-ibm071.21/SAL/state_async.c:81
#7 0x00007f27f27f62bd in fridgethr_start_routine (arg=0x7f21c8408650) at /usr/src/debug/nfs-ganesha-3.5-ibm071.21/support/fridgethr.c:556
#8 0x00007f27f0a90ea5 in start_thread () from /lib64/libpthread.so.0
#9 0x00007f27f018fb0d in clone () from /lib64/libc.so.6

Symptom	Crash
Environment	Linux Only
Trigger	- For Issue 1, the crash is related to the NFSv4 lease period and can occur due to timing issues, such as delays in lease renewal or a heavily loaded server with multiple client requests. - For Issue 2, the crash is related to blocking lock requests and lock upgrades on the same file by multiple threads, which can lead to timing issues.
Workaround	None

5.1.2.12

NFS-Ganesha crash followed by CES-IP failover.

IJ46628

AFM recovery uses an external program to detect renames and removes that were not replicated.
This external program was seen to leak a few memory blocks; the leak is now fixed.

Symptom	Unexpected Behavior
Environment	All Linux OS Platforms (AFM Gateway nodes)
Trigger	AFM recovery triggered with renames/removes that need to be recovered.
Workaround	None

5.1.2.11

AFM

IJ46269

Adding or removing gateway node roles in the cluster while active I/O is running against an AFM fileset can cause deadlocks. Because of how the node join/leave protocol is handled, one application node can consider a certain gateway node to be the gateway for the fileset while other nodes consider different nodes to be the fileset gateway.

Symptom	Deadlock
Environment	ALL Operating System environments
Trigger	Running mmchnode --gateway/--nogateway when there is Active I/O happening on AFM filesets.
Workaround	Avoid running mmchnode --gateway/--nogateway when there is Active I/O happening on AFM filesets.

5.1.2.11

AFM

IJ46270

A GPFS Windows node that has been running for a few hours may enter a state in which, even under no load, idle GPFS threads spin and cause 100% CPU utilization.
This is caused by a potential error in time management and computation on Windows.

Symptom	Performance Impact/Degradation.
Environment	Windows/x86_64 only.
Trigger	GPFS must be up and running on a Windows node for a few hours.
Workaround	A possible work-around is to bounce GPFS on the Windows node (mmshutdown followed by mmstartup).

5.1.2.11

Windows performance.

IJ46271

An AFM gateway node hits an assertion when I/O runs from an application node to a dependent fileset inside an AFM independent fileset, or when AFM file system level replication is enabled.

Symptom	Crash
Environment	All Linux OS environments (AFM Gateway nodes)
Trigger	Running I/O to dependent fileset inside AFM independent fileset or to an AFM enabled Filesystem.
Workaround	None

5.1.2.11

AFM

IJ46272

With QoS throttling configured on a subset of nodes in the cluster, I/O on the remaining client nodes, which have no QoS throttling, is unexpectedly and severely throttled.

Symptom	I/O hang
Environment	All Operating Systems
Trigger	Configuring QoS throttling for a subset of nodes in the cluster.
Workaround	Create a node class for the non-throttled nodes and set "unlimited" QoS throttling for that node class when configuring QoS for a subset of nodes in the cluster.

5.1.2.11

QoS

IJ46273

Unknown NFS errors were hit during recovery, and there was no way to bypass them to let recovery go through.

Symptom	Unexpected Behavior
Environment	All Linux OS Environments (AFM Gateway nodes)
Trigger	Recovery unable to proceed upon hitting unknown persistent AFM Recovery errors.
Workaround	None

5.1.2.11

AFM

IJ46274

tsapolicy adds information about each client process (agent) to agentVctr to keep track of activities.
If an agent is retrieved from agentVctr while a helper is being added, a bogus agent address can be returned, which can cause tsapolicy to hang.
Taking a lock while retrieving agent information avoids this problem.

Symptom	Component Level Outage
Environment	All platforms that support mmapplypolicy
Trigger	This problem could occur when mmapplypolicy is run with a large number of client nodes (-N option)
Workaround	None

5.1.2.11

mmapplypolicy
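
The general pattern behind this fix, guarding lookups with the same lock that guards insertions, can be sketched as below. It is not the tsapolicy source: the AgentRegistry class is hypothetical, and it is only an assumption that agentVctr is a growable array, where growth can move the storage while an unlocked reader is still using an old address.

```python
import threading


class AgentRegistry:
    """A list of agents guarded by one lock for both appends and lookups."""

    def __init__(self):
        self._lock = threading.Lock()
        self._agents = []

    def add(self, agent):
        """Append an agent and return its index."""
        with self._lock:
            self._agents.append(agent)
            return len(self._agents) - 1

    def get(self, index):
        """Look up an agent under the same lock that add() holds."""
        with self._lock:
            return self._agents[index]

    def __len__(self):
        with self._lock:
            return len(self._agents)
```

In Python the interpreter already makes a single list append safe, so the lock here mainly documents the pattern. In a language with manually managed memory, the lock is what prevents a reader from seeing a half-updated container.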

IJ46395

During a file system restripe, for example mmrestripefs -R, the replication setting of a file may be changed if the file is ill-replicated. Quota is not handled correctly after the file's data blocks are replicated or un-replicated to match the new replication settings.
As a result, some quota accounting data becomes unreliable over time.

Symptom	Incorrect quota accounting data.
Environment	ALL Operating System environments
Trigger	Quota is not handled correctly by the logic that replicates or un-replicates data blocks.
Workaround	Run mmcheckquota to correct quota values.

5.1.2.11

Quotas

IJ46396

getfacl may not display a POSIX default ACL that has been set on a directory.
This occurs in this situation:
- A default ACL is set on a directory in a Scale filesystem using setfacl, but not an access ACL.
- The filesystem is shared using the NFS server included with the operating system.
- The NFS client mounts the filesystem using NFS version 3.
Functionally, things seem to work correctly even though getfacl is missing the default ACL information.

Symptom	Under certain circumstances, the getfacl command does not display information about the default ACLs that have been set on a directory using setfacl.
Environment	ALL Operating System environments
Trigger	getfacl may not display a POSIX default ACL that has been set on a directory. This occurs in this situation: - A default ACL is set on a directory in a Scale filesystem using setfacl, but not an access ACL. - The filesystem is shared using the NFS server included with the operating system. - The NFS client mounts the filesystem using NFS version 3.
Workaround	Also set the access ACL using setfacl on affected directories.

5.1.2.11

- NFS
- POSIX default ACLs

IJ46397

The TCT recall process could fail or report errors when deleting a non-resident (stub) file that is also in a snapshot. (show details)

Symptom	Unexpected behavior and results.
Environment	All Operating Systems
Trigger	Deleting a non-resident stub file that is also in a snapshot.
Workaround	Delete the snapshots that contain the non-resident stub file being deleted.

5.1.2.11

TCT migration/LWE

IJ46531

A read() or write() system call on a file descriptor opened in direct I/O mode accessing file data on a locally attached NSD may hang. (show details)

Symptom	Hang/Deadlock/Unresponsiveness/Long Waiters
Environment	ALL Linux OS environments
Trigger	When preparing the block I/O request to the locally attached block device, the GPFS kernel module is unable to get a handle for the block device. This can happen, for example, if the block device is temporarily unavailable.
Workaround	Do not use direct I/O for data stored on locally attached NSDs, or use direct I/O only on remotely attached NSDs.

5.1.2.11

NSD Client/Server handling

IJ46533

A code issue could cause an AIO completion event to go unhandled by the AIO completion thread. This forms a deadlock, with a long waiter for the AcquireBRTHandlerThread or RangeRevokeWorkerThread thread waiting for other threads to exit the fast path. In addition, such mishandling of AIO completion could prevent the file system from being quiesced and could cause a memory leak. (show details)

Symptom	Deadlock
Environment	Linux Only
Trigger	Doing AIO reads/writes from one node and then starting a normal buffered I/O load from other nodes against the same files.
Workaround	No

5.1.2.11

AIO only

IJ46534

The syntax of mmdsh is as follows:
mmdsh -N {Node[,Node...] | NodeFile | NodeClass}
[-l LoginName] [-i] [-s] [-r RemoteShellPath]
[-v [-R ReportFile]] [-f FanOutValue] Command

In the following example, mmdsh will remove /tmp/someFile.
mmdsh -N "ls -lrt /tmp/someFile"

In this example, the intended nodelist {Node[,Node...] | NodeFile | NodeClass} is missing.
The command takes the next token, the string "ls -lrt /tmp/someFile" as a node list.
It calls a GPFS internal command to obtain a list of nodes in the cluster.
The internal command was not called properly.
The internal command takes the file /tmp/someFile as an output file, which it removes before writing new data to it. (show details)

Symptom	Unexpected Results/Behavior
Environment	ALL Operating System environments
Trigger	Running mmdsh with the -N option but without a node list.
Workaround	Review the man page carefully and enter the command correctly; in particular, make sure the node list is present and correct before you run mmdsh.
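
The failure above comes from the node list being absent, so the command string is consumed as the -N argument. A minimal Python sketch of a guard that a wrapper script could apply before invoking mmdsh follows. The helper name `build_mmdsh_argv` and its validation rule are hypothetical and not part of Spectrum Scale; only the `mmdsh -N <nodes> <command>` shape is taken from the syntax shown in this entry.

```python
import re

# Hypothetical rule: a node spec (node list, node file, or node class)
# is one or more comma-separated tokens that contain no whitespace.
_NODE_SPEC = re.compile(r"[^\s,]+(,[^\s,]+)*")


def build_mmdsh_argv(node_spec, command):
    """Return an argv list for mmdsh, refusing suspicious node specs."""
    if not node_spec or not _NODE_SPEC.fullmatch(node_spec):
        raise ValueError(f"not a valid node spec: {node_spec!r}")
    if not command or not command.strip():
        raise ValueError("remote command is empty")
    return ["mmdsh", "-N", node_spec, command]
```

With this check, a call that passes the command string in the node position is rejected before anything is executed, because the string contains whitespace.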

5.1.2.11

Admin Commands

IJ46535

When prefetch is run with a list file larger than 2GB, prefetch threads are deployed, and one or more of them must start operating at a list-file offset beyond 2GB. Because the offset/length is declared as Int32, an offset beyond 2GB is too large to hold and causes fseek errors, which return error 4 (E_INTR). (show details)

Symptom	Unexpected results.
Environment	All Linux OS Platforms (AFM Gateway nodes)
Trigger	Running prefetch with list file larger than 2GB in size.
Workaround	Split single large list file into multiple smaller list files (each smaller than 2GB in size) and run prefetch multiple times with list file chunks created.
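
The workaround can be sketched in Python as follows. This is an illustration only: the function name `split_list_file` and the chunk naming scheme are hypothetical, and the limit is assumed to be the largest value a signed 32-bit integer can hold, in line with the Int32 offset described above. The file is split on line boundaries so that no list entry is cut in half.

```python
INT32_MAX = 2**31 - 1  # largest offset a signed 32-bit integer can hold


def split_list_file(path, max_bytes=INT32_MAX, prefix=None):
    """Split `path` on line boundaries into chunks of at most max_bytes.

    Returns the names of the chunk files that were written.
    """
    prefix = prefix or path + ".part"
    chunks, out, written = [], None, 0
    try:
        with open(path, "rb") as src:
            for line in src:
                if len(line) > max_bytes:
                    raise ValueError("a single line exceeds the chunk size")
                # Start a new chunk when this line would not fit.
                if out is None or written + len(line) > max_bytes:
                    if out is not None:
                        out.close()
                    name = f"{prefix}{len(chunks):04d}"
                    out = open(name, "wb")
                    chunks.append(name)
                    written = 0
                out.write(line)
                written += len(line)
    finally:
        if out is not None:
            out.close()
    return chunks
```

Each returned chunk could then be passed to a separate prefetch run.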

5.1.2.11

AFM

IJ46536

When initial sync is triggered for a GPFS fileset (resync in progress) which is converted to AFM DR, none of the files/dirs have pcache remote or pcache parent EAs on the inode.
If the initial sync is interrupted (and workload causes some remove/rename kind of operations on the local directories), then there are dirty directories at primary for which dirtyDirDirents policy scan is run to list all dirty dir entries.
But for files which don't have state, the policy scan gives 5 fields as compared to 8 expected and results in recovery failing with error 22 (E_INVAL). (show details)

Symptom	Unexpected Results
Environment	All Linux OS Environments (Serving as AFM Gateway nodes)
Trigger	Running AFM recovery on an AFM DR fileset whose initial sync has never completed.
Workaround	Set afmSkipResyncRecovery to yes on the fileset and trigger recovery.

5.1.2.11

AFM

IJ46649

The sendfile() call returns EINVAL on kernels >= 5.10 when the target is a GPFS file system. (show details)

Symptom	sendfile system call failure
Environment	Linux with kernel >= 5.10
Trigger	sendfile() call returns EINVAL for kernel >= 5.10 when target is GPFS file system
Workaround	None

5.1.2.11

Core GPFS

IJ45538

When afmFastCreate is configured and a normal file is copied, the cache/primary mtime is set at the home/secondary at file create time.
If the write is later interrupted midway and a resync is then run on this fileset, the file is not copied over fully because it is reported as already in sync. A small race is involved in reaching this state. (show details)

Symptom	Unexpected Behavior
Environment	All Linux OS platforms (AFM Gateway nodes only)
Trigger	Running Resync on fileset with afmFastCreate enabled with a partially copied file from cache to home.
Workaround	None

5.1.2.10

AFM

IJ44899

When file usage quota is in effect and some files in the file system have been migrated through a DMAPI application, deleting all files in the file system still leaves mmrepquota consistently showing some files in use. (show details)

Symptom	Error Output
Environment	Linux
Trigger	Migrate files to external storage through DMAPI function, and create snapshots for the file system, then delete these migrated files.
Workaround	This problem only happens when there are snapshots, so deleting the snapshots works around it.

5.1.2.10

DMAPI

IJ44889

The Attr_Expiration_Time value set in the EXPORT_DEFAULTS block of gpfs.ganesha.main.conf is not reflected in newly created export entries. (show details)

Symptom	Check if Attr_Expiration_Time value is proper in /var/mmfs/ces/nfs-config/gpfs.ganesha.exports.conf for the export added using mmnfs
Environment	Linux
Trigger	Modify Attr_Expiration_Time to a non-default value and add a new export.
Workaround	Attr_Expiration_Time can be modified in gpfs.ganesha.exports.conf using the following steps: 1. Copy /var/mmfs/ces/nfs-config/gpfs.ganesha.exports.conf to /tmp. 2. Edit the Attr_Expiration_Time field of the required export in /tmp/gpfs.ganesha.exports.conf. 3. Run the following command to load /tmp/gpfs.ganesha.exports.conf back: mmnfs export load /tmp/gpfs.ganesha.exports.conf

5.1.2.10

NFS-Ganesha

IJ44682

When a file system manager takeover happens as a result of a node failure, GPFS tries to do log recovery as needed. Log recovery needs to do disk fencing to prevent further I/O on a disk from the failed node. If disk fencing fails, it proceeds to mark the disk down, but marking the disk down needs to wait for at least part of the file system manager takeover to finish. This results in a deadlock. (show details)

Symptom	Hang/Deadlock/Unresponsiveness/Long Waiters
Environment	ALL
Trigger	File system manager takeover and log recovery happens with disk fencing error condition, under some timing
Workaround	None

5.1.2.10

All Scale Users

IJ45067

Performance degradation resulting in long wait times when doing IOs from Ganesha without File Audit Logging enabled. (show details)

Symptom	Performance Degradation
Environment	Linux Only
Trigger	Doing I/O (e.g. ls, stat, ...) from a Ganesha mount without FAL enabled.
Workaround	None

5.1.2.10

File Audit Logging

IJ45540

When opening a file with DIO and issuing AIO I/O requests, the requests need to be aligned to the sector size. GPFS enforces this and returns an error for unaligned requests. The problem was that an internal error code (795) was returned instead of an error code that is recognized by Linux applications. (show details)

Symptom	Unexpected Results/Behavior
Environment	Linux
Trigger	Open a file for DIO and issue I/O requests through AIO.
Workaround	None
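
The alignment rule can be illustrated with a small Python sketch. The names are hypothetical, and the 512-byte sector size is an assumption for illustration; real devices may use a different size such as 4096. Only the file offset and the request length are checked here.

```python
SECTOR_SIZE = 512  # assumed sector size; query the device for the real value


def is_sector_aligned(offset, length, sector_size=SECTOR_SIZE):
    """True when both offset and length are multiples of sector_size."""
    return offset % sector_size == 0 and length % sector_size == 0


def round_up(length, sector_size=SECTOR_SIZE):
    """Smallest multiple of sector_size that is >= length."""
    return -(-length // sector_size) * sector_size
```

An application could use a check like this before submitting a request, so that an unaligned request is caught in user space rather than surfacing as an unexpected error code.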

5.1.2.10

All Scale Users

IJ45068

In AFM IW mode, a file that is opened in truncate mode and written to later gets a read that caches the file from home.
The read sees the write as a dependency and ends up deadlocking. (show details)

Symptom	Deadlock
Environment	Linux
Trigger	Open AFM cached file in Truncate mode and Write to it.
Workaround	None

5.1.2.10

AFM

IJ45553

'waiting for stripe group takeover' and 'waitForPendingCopyBlockRPCs: nn RPCs pending'.
These long waiters indicate a deadlock that prevents the file system from coming up. (show details)

Symptom	Deadlock
Environment	ALL
Trigger	Sudden death of an NSD server, preventing access to some disks
Workaround	Bring all NSD servers up

5.1.2.10

All Scale Users

IJ45548

Download in MU and Object-only mode for a COS fileset does not work. If cacheBit is set for the fileset root, the download does not happen. (show details)

Symptom	Download fails to happen.
Environment	Linux
Trigger	Download won’t happen if fileset root is cached.
Workaround	None

5.1.2.10

AFM-COS

IJ45266

Daemon assert " logAssertFailed: !fileId.isSnaplinkDir()" going off when calling lseek against the .snapshots directory. (show details)

Symptom	Daemon crash
Environment	ALL
Trigger	Perform lseek request against the .snapshots directory.
Workaround	Avoid lseek call to the .snapshots directory.

5.1.2.10

Snapshot

IJ45549

This issue happens when afmFastCreate is enabled and a file is created and written to, and immediately a changeSecondary is run on the fileset (pushing the File Create through fastCreate to the Resync snapshot) on a Primary fileset.
In this case, the file is written from the psnap0 snapshot to keep this snapshot consistent across the sites and sacrifices the data consistency on live fileset. (show details)

Symptom	Unexpected Behavior
Environment	All Linux OS environments (Serving as AFM Gateway nodes)
Trigger	Running ChangeSecondary on DR fileset with afmFastCreate enabled while still writing to a single file.
Workaround	None

5.1.2.10

AFM

IJ45403

The GPFS kernel module exports an ioctl interface used by the mmfsd daemon and some of the mm* commands. The provided improvements result in a more robust functionality of the kernel module. (show details)

Symptom	Unexpected Results/Behavior
Environment	ALL
Trigger	Not available
Workaround	None

5.1.2.10

All Scale Users

IJ45550

When an immutable/appendonly file at primary/cache is made non-imm/non-app first using mmchattr and then immediately the file is removed - then AFM has an issue where the file remove cannot replicate to secondary/home because the file is still imm/app and not allowed to be removed. (show details)

Symptom	Unexpected behaviour
Environment	ALL
Trigger	Making an immutable/appendonly file at primary/cache as non-imm/non-app first using mmchattr and then immediately removing the file.
Workaround	Wait for the mmchattr -i no -a no change to be replicated to the secondary/home, so that the file there also becomes non-imm/non-app, and only then remove the file.

5.1.2.10

AFM

IJ45551

The communication port cannot be changed for a CCR-enabled cluster. (show details)

Symptom	Error output/message
Environment	ALL
Trigger	mmchconfig command
Workaround	Change the cluster to the deprecated server-based configuration, then change the port. Afterwards, change the cluster back to CCR.

5.1.2.10

Admin Commands

IJ45608

Due to an issue in offline fsck, mmfsck can report false-positive lost blocks and can fail to properly report genuine incorrect blocks and duplicates. (show details)

Symptom	Will see corruptions like duplicates even after offline fsck repair and subsequent offline fsck runs will show lost blocks and incorrect blocks.
Environment	ALL
Trigger	This issue happens on a file system where the user created two or more dataOnly pools and then at some point deleted the earlier data pool(s) out of order (i.e. a dataOnly pool (n) is deleted while other data pools (n+x) are present).
Workaround	1) Create one or more "dummy" dataOnly pools by adding a single NSD of each "dummy" dataOnly pool to the file system. The NSD of a "dummy" data pool can be of minimal size, since no data needs to be stored in that pool. 2) Run offline fsck on the file system; it should now report and repair lost blocks, incorrect blocks, and duplicates correctly. 3) Once the file system is fixed, delete the "dummy" data pool by deleting the only NSD in it.

5.1.2.10

FSCK

IJ45536

Spectrum Scale and systemhealth monitor (sysmon) start independently after a node reboot.
During initialization, Spectrum Scale checks if all declared NFS exports are available.
The sysmon configuration has the flag "preventnfsstartuponmissingfs" enabled, so the expected behavior was that NFS is not started if a required filesystem is unmounted. But in fact, NFS was started anyway. (show details)

Symptom	Unexpected Results/Behavior
Environment	ALL Linux OS environments running CES with enabled NFS protocol
Trigger	Spectrum Scale and systemhealth monitor (sysmon) start independently after a node reboot. During initialization, Spectrum Scale checks if all declared NFS exports are available. At that point in time the sysmon was still initializing and has not yet done this evaluation. So it returns "no bad configuration found" which triggers then the NFS startup. The sysmon configuration has the flag "preventnfsstartuponmissingfs" enabled, so the expected behavior was that NFS does not come up. Ganesha will fail later and trigger an IP address failover, which disturbs the cluster operation.
Workaround	Make sure that the exported filesystems have the automount feature enabled, if possible. If the missing exported filesystem is not in use anyway, then remove it from the declared export list.

5.1.2.10

System Health

IJ45537

tsapolicy evaluates each client's workload and rebalances it if some clients are overloaded.
However, the workload is sometimes calculated incorrectly, so tsapolicy tries to rebalance unnecessarily and could get into an infinite loop. (show details)

Symptom	Component Level Outage
Environment	All platforms that support mmapplypolicy
Trigger	This problem could occur if the number of inodes in the file system or fileset is large.
Workaround	None

5.1.2.10

mmapplypolicy

IJ45777

An empty file has its cache bit set and its crtime updated during a readdir operation, which causes validation with home to be skipped after a truncate operation at home, so the file fails to download. (show details)

Symptom	Failed to call truncated file at COS
Environment	Linux
Trigger	Data consistency failed on empty truncated file.
Workaround	None

5.1.2.10

AFM-COS

IJ45776

There is a peculiar case where the local bit on the .ptrash directory inside AFM filesets gets reset. This causes the .ptrash directory to be treated like a normal directory and in Write modes, the temporary files generated for recovery/resync policy start getting replicated to the remote site. For Read modes this causes the ptrash directory to show up as a dangling entry because a normal lookup is sent to home - and since the .ptrash doesn't have remote attrs - it fails to complete this lookup successfully. This also causes errors when the user wants to empty the ptrash with rm -rf since the lookups to remote site don't succeed. (show details)

Symptom	Unexpected Behavior
Environment	All Linux OS environments (AFM Gateway nodes) All OS Platforms (Application nodes in AFM enabled clusters)
Trigger	ptrash local bit getting reset unintentionally and follow up operations performed on the fileset - like ls or recovery
Workaround	Manually set the local bit on ptrash on seeing issues.

5.1.2.10

AFM

IJ45797

There is an assert being hit when performing ls -l or prefetch on a brand new RO/LU/IW fileset with data existing at home. (show details)

Symptom	Crash
Environment	All Linux OS environments (AFM Gateway nodes)
Trigger	Running ls or prefetch on new RO/IW/LU fileset at cache with home having data already.
Workaround	None

5.1.2.10

AFM

IJ44525

Due to a race condition between the RDMA software layer and IBM Spectrum Scale, it is possible that an application running on an IBM Spectrum Scale client may read incorrect data from files stored on GPFS under certain conditions. (show details)

Symptom	Unexpected Results/Behavior
Environment	Linux
Trigger	Race condition between the RDMA software layer and IBM Spectrum Scale.
Workaround	Disable RDMA.

5.1.2.9

RDMA

IJ44527

TIP events can be hidden and then should not count towards the overall state of the system, however they still can cause the component rollup to show TIP instead of Healthy. (show details)

Symptom	Unexpected Results/Behavior
Environment	All
Trigger	Hiding TIPs prevents them from showing up in the Events column but did not exclude them from the overall state calculation.
Workaround	None
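
The intended behavior can be sketched as a rollup that skips hidden TIP events. This is a hypothetical model for illustration only: the state names, their ordering, and the event structure are assumptions and do not reflect the actual system health implementation.

```python
# Hypothetical severity order; a higher number means a worse state.
SEVERITY = {"HEALTHY": 0, "TIP": 1, "DEGRADED": 2, "FAILED": 3}


def rollup(events):
    """Return the worst state among events, skipping hidden TIP events.

    Each event is a dict with a "state" key and an optional "hidden" flag.
    """
    worst = "HEALTHY"
    for ev in events:
        if ev["state"] == "TIP" and ev.get("hidden", False):
            continue  # hidden tips must not affect the component state
        if SEVERITY[ev["state"]] > SEVERITY[worst]:
            worst = ev["state"]
    return worst
```

In this model, a component whose only active event is a hidden TIP rolls up as HEALTHY, which is the outcome the description says was expected.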

5.1.2.9

System Health

IJ44547

The GPFS daemon could fail unexpectedly with the assert regP->owner!=fromNode, in allocM.C. This could happen as a result of the file system being unmounted on a node due to an error. (show details)

Symptom	Abend/Crash
Environment	All
Trigger	File system unmount due to error
Workaround	Disable the assert via disableAssert configuration

5.1.2.9

All Scale Users

IJ44553

Code to set ptrash as local was designed to be enabled if afmRevalOpWaitTimeout remains at its default value. But it's highly unlikely it stays default in a customer environment. (show details)

Symptom	Unexpected Behavior
Environment	Linux (AFM Gateway nodes)
Trigger	afmRevalOpWaitTimeout being set to a non-default value causing ptrash local bit setting code to not take effect.
Workaround	Setting the afmRevalOpWaitTimeout to its default value of 180 will ensure ptrash is set to local.

5.1.2.9

AFM

IJ44567

System monitoring collects all information about a cluster by sending it to relevant nodes. It ignores cluster boundaries while doing so, which does not work and creates spurious error messages in the logs. (show details)

Symptom	Unexpected Results/Behavior
Environment	Linux
Trigger	Setup with remote cluster integration
Workaround	None

5.1.2.9

mmfs.log.latest

IJ44684

Remote error 2 while replicating Link operation if parent directory is deleted before replicating create/link operation. (show details)

Symptom	AFM Queue drop and Fileset goes to resync state.
Environment	Linux
Trigger	Create/Link/Parent dir remove operation in queue with Fast Create config option enabled.
Workaround	None

5.1.2.9

AFM

IJ44685

The mmwatch plugin to mmhealth can print or log excess error messages if there is a filesystem that is offline for some reason. (show details)

Symptom	Error output/message
Environment	Linux
Trigger	Running mmhealth when there is an unmountable filesystem defined.
Workaround	The mmwatch plugin to mmhealth can be disabled.

5.1.2.9

Admin Commands

IJ44691

Spectrum Scale Erasure code edition interacts with third party software/hardware APIs for internal disk enclosure management. If the management interface becomes degraded and starts to hang commands in the kernel, the hang may also block communication handling threads. This causes a node to fail to renew its lease, causing it to be fenced off from the rest of the cluster. This may lead to additional outages. (show details)

Symptom	Hang/Deadlock/Unresponsiveness/Long Waiters
Environment	Linux
Trigger	Degradation in back-end storage management that causes commands to hang in the kernel.
Workaround	The node with hardware problems will show waiters 'Until NSPDServer discovery completes.' It is recommended to reboot nodes with those GPFS waiters exceeding 2 minutes if this node is also being expelled.

5.1.2.9

ESS/GNR

IJ44692

"mmdiag --network" is slow. (show details)

Symptom	Slow performance in environments with lots of network entries
Environment	All
Trigger	A large number of network entries causes "mmdiag --network" to be noticeably slow.
Workaround	None

5.1.2.9

mmdiag

IJ44774

Commands like mmcrcluster or mmaddnode may hang in GSKIT layer on AMD EPYC family 25 processors. A particular model from family 25 that is known to hang in GSKIT layer is AMD EPYC 7343. (show details)

Symptom	Admin commands hang
Environment	Linux
Trigger	This problem affects AMD EPYC family 25 processors
Workaround	Add the line "ICC_SHIFT=3" to the /usr/lpp/mmfs/lib/gsk8/Cicc/icclib/ICCSIG.txt file on the affected nodes.
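
A minimal Python sketch of the two checks involved follows, operating on text rather than on the files directly. Both function names are hypothetical. The sketch assumes that /proc/cpuinfo on the node reports `vendor_id` and `cpu family` fields, and that a CPU of the affected family shows `AuthenticAMD` and `25` there.

```python
def is_amd_family_25(cpuinfo_text):
    """True if /proc/cpuinfo text describes an AMD CPU of family 25."""
    vendor = family = None
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if not sep:
            continue
        key, value = key.strip(), value.strip()
        if key == "vendor_id" and vendor is None:
            vendor = value
        elif key == "cpu family" and family is None:
            family = value
    return vendor == "AuthenticAMD" and family == "25"


def ensure_line(text, line="ICC_SHIFT=3"):
    """Return text with `line` appended unless it is already present."""
    if line in text.splitlines():
        return text
    if text and not text.endswith("\n"):
        text += "\n"
    return text + line + "\n"
```

Because `ensure_line` is idempotent, applying it repeatedly to the contents of ICCSIG.txt does not add the line twice.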

5.1.2.9

Admin Commands, gskit

IJ44806

With the GPFS backend, cleanup takes the handlerList lock on SGPanic while, at the same time, a handler tries to set up (setupctl) the fileset mount path using the handler mutex, and both end up waiting. (show details)

Symptom	Waiters
Environment	Linux
Trigger	Waiters are seen and the fileset stops making progress.
Workaround	None

5.1.2.9

AFM with GPFS backend

IJ44828

A node (kernel) crash can occur when the vinfoLockOnWrite config option is enabled. (show details)

Symptom	Crash
Environment	All
Trigger	Timing hole when enabling the undocumented config option vinfoLockOnWrite, likely triggered by using snapshots
Workaround	Avoided by not enabling the undocumented vinfoLockOnWrite config option

5.1.2.9

Core GPFS

IJ44829

The special .afmctl file at home/secondary loses its Control attribute and is treated as a normal file. Reading it then returns a buffer of size 2048, which overflows the buffer of size 1100 that the cache provides in expectation of control-file treatment at the home/secondary. (show details)

Symptom	Crash
Environment	Linux (AFM Gateway nodes)
Trigger	Invalid .afmctl control file at home.
Workaround	Manually disable and re-enable mmafmconfig at the home/secondary and then stop/start the cache fileset to pick up the new changes from home.

5.1.2.9

AFM

IJ44831

After the GPFS 5.1.2 release, memory from a token management subpool may leak on some token manager nodes.
This can be observed from output of mmfsadm dump malloc:
Statistics for MemoryPool id 3 ("UNPINNED_TM") at 0xF1000012C00246C8:
...
Memory subpool 'HolderList' at 0xF1000012C00258B0
objSize 16 spObjectsPerChunk 65536 expandInProgress 0
inUse 140052583 free 63385 total 140115968 limit 2147483647
The "inUse" field increases gradually.
(show details)

Symptom	Out-of-memory, Unexpected Results/Behavior
Environment	All
Trigger	During token management, one type of object is not freed when the token is destroyed.
Workaround	None
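
To watch for this growth, the counters can be extracted from the dump text. The following Python sketch parses the output format shown in this entry; the function name `parse_subpool_counters` is hypothetical, and the layout of the dump is assumed to match the excerpt above.

```python
import re

_COUNTERS = re.compile(r"inUse (\d+) free (\d+) total (\d+) limit (\d+)")


def parse_subpool_counters(dump_text, subpool="HolderList"):
    """Return the counters reported for `subpool`, or None if not found."""
    lines = dump_text.splitlines()
    for i, line in enumerate(lines):
        if f"Memory subpool '{subpool}'" in line:
            # The counter line follows within the next few lines.
            for follow in lines[i + 1 : i + 4]:
                m = _COUNTERS.search(follow)
                if m:
                    keys = ("inUse", "free", "total", "limit")
                    return dict(zip(keys, map(int, m.groups())))
    return None
```

Sampling the dump at intervals and comparing the `inUse` values would show whether the subpool keeps growing.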

5.1.2.9

All Scale Users

IJ43542

When small files that each contain a single data block are read from multiple threads, the threads can block each other on the prefetch check even though they are reading different files, and the resulting contention on prefechListMutex degrades performance. A similar issue could happen with concurrent append writes to files from multiple threads. (show details)

Symptom	Performance Impact/Degradation
Environment	All
Trigger	Concurrent reads on different small files from different threads.
Workaround	Lower the configuration parameter "prefetchAggressiveness" to 1.

5.1.2.8

All

IJ44213

The ptrash directory triggers recovery if the fileset is in dropped state. This recovery code tries to set the Ptrash as local where it tries to acquire XW lock on ptrash - which conflicts with the above operation which also holds XW lock on the ptrash directory while trying to queue the operation. (show details)

Symptom	Deadlock
Environment	Linux
Trigger	Remove operations performed on unwanted files inside the .ptrash directory when the local bit was not set on this directory causing recovery to be triggered.
Workaround	Make sure that the ptrash directory is always local before performing any operations inside it.

5.1.2.8

AFM

IJ41370

Currently, there is no mechanism to clean up the subnets contact IPA caches. If the subnets configuration changes and the cached IPA no longer works, the nodes may not be able to communicate with each other. (show details)

Symptom	Performance Impact/Degradation
Environment	All
Trigger	Normally, GPFS uses the daemon IP address for communication. If the cluster needs to use other IP addresses for communication, the "subnets" configuration must be set, and GPFS then uses the "subnets" IP addresses for daemon communication. This works as follows: - In the cluster probing stage, a pair of nodes uses daemon IP addresses for communication. - After the connection is established, the nodes exchange their "subnets" IP addresses. - The connection that uses daemon IP addresses is closed. - A new connection that uses the "subnets" IP addresses is established. Once the "subnets" IP addresses are cached, GPFS uses these cached addresses for communication. The problem occurs when the cached "subnets" IP addresses are no longer reachable. Even if a new "subnets" value is configured or "subnets" is removed, the original "subnets" addresses cannot be used to exchange the new IP addresses that the customer wants to use.
Workaround	Manually clean up the stale /var/mmfs/gen/cacache.* files.

5.1.2.8

subnets/remote cluster

IJ42748

Assertion: exp(fileId.inodeNum > 0) (show details)

Symptom	Crash
Environment	Linux
Trigger	Over stressed Filesystems with AFM DR bi-directional relationships running.
Workaround	None.

5.1.2.8

AFM

IJ43816

The SLES 15 SP4 kernel update 5.14.21-150400.24.11 included a change that causes Spectrum Scale to crash the kernel. A fix in Spectrum Scale is necessary in order to run on this kernel. (show details)

Symptom	Abend/Crash
Environment	x86_64-linux only
Trigger	Run Spectrum Scale with the SLES 15 SP4 kernel update 5.14.21-150400.24.11.
Workaround	None.

5.1.2.8

All

IJ41364

With a policy rule configured, many jobs can be scheduled, and the 32-bit pitJobId can overflow over time, which triggers the assert "(pitJobId >= 0 && pitJobListPP == __null)". (show details)

Symptom	Abend/Crash
Environment	All
Trigger	Configure a policy rule to frequently migrate (or compress/decompress, etc.) the file system data.
Workaround	None
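
The overflow can be illustrated with a short Python sketch that mimics a signed 32-bit counter. The function name `next_job_id` is hypothetical, and the sketch assumes the usual two's-complement wraparound; it is not taken from the GPFS source.

```python
import ctypes


def next_job_id(current):
    """Increment a job id the way a signed 32-bit counter would wrap."""
    return ctypes.c_int32(current + 1).value
```

Once the counter reaches 2147483647, the next increment yields a negative value, which is the kind of value that would violate the `pitJobId >= 0` condition in the assert.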

5.1.2.8

Policy

IJ44219

Files not replicated on create after failoverToSecondary. (show details)

Symptom	Unexpected Results/Behavior
Environment	Linux
Trigger	After failoverToSecondary, creating and writing files and then running changeSecondary to sync with the old primary.
Workaround	None

5.1.2.8

AFM-DR

IJ44155

An attacker can gain sensitive information, such as the vulnerable framework and components in use, if error messages are not handled properly. (show details)

Symptom	mmvdisk throws an exception.
Environment	All
Trigger	Non-ASCII characters in the configuration file cause mmvdisk to throw an exception.
Workaround	None

5.1.2.8

ESS/GNR

IJ44144

Fixes for the retbleed vulnerability are backported to kernel updates in Linux distributions. These fixes also include checks of whether the kernel module build has properly applied the retbleed mitigations. Parts of the kdump binary built by mmbuildgpl purposely do not include these mitigations. As a result, the mmbuildgpl process emits warning messages like:
CC [M] /usr/lpp/mmfs/src/gpl-linux/kdump-kern.o
/usr/lpp/mmfs/src/gpl-linux/.tmp_kdump-kern.o: warning: objtool: GetOffset()+0x14: 'naked' return found in RETHUNK build
(show details)

Symptom	Error output/message
Environment	x86_64-linux only
Trigger	Running mmbuildgpl on a kernel that has all fixes for the retbleed vulnerability.
Workaround	There is no easy workaround. Without code changes, the only way forward is to ignore those warnings; they have no other ill effect.

5.1.2.8

All

IJ44143

CES IPs are not assigned to the node and keep moving around. (show details)

Symptom	Unexpected results/behavior
Environment	None
Trigger	CES resume
Workaround	Assign the CES IP to the node

5.1.2.8

None

IJ44119

Add vinfoLockOnWrite config to hold vinfo lock for file write operation. Enabling this config can solve the write performance degradation of Ganesha/NFS found between GPFS 5.1.1 and GPFS 5.1.2. (show details)

Symptom	Performance Impact/Degradation
Environment	All
Trigger	Ganesha/NFS write performance degradation is more likely to occur if the number of Ganesha threads is large.
Workaround	None

5.1.2.8

NFS

IJ44073

If a recovery group creation fails due to a condition in the storage hardware, such as the detection of volatile write caching on the drives, the “mmvdisk recovery group create” command will fail. Once the hardware issue is resolved, it is possible for subsequent attempts of this command to continue to fail until the mmfsd daemons are restarted on the Spectrum Scale RAID storage cluster. (show details)

Symptom	Error output/message
Environment	Linux
Trigger	Hardware problems detected during recovery group creation.
Workaround	Restart the mmfsd daemons on the Spectrum Scale RAID storage cluster.

5.1.2.8

GNR/ESS

IJ44059

"noAuthentication=yes" can cause sysmonitor daemon to crash which stops mmhealth from working. (show details)

Symptom	Abend/Unexpected Results/Behavior
Environment	Linux
Trigger	Setting noAuthentication=yes
Workaround	None

5.1.2.8

System Health

IJ43755

Rename of a non-empty directory in AFM+COS local-updates mode is not allowed, causing application failures on rename. (show details)

Symptom	Unexpected results
Environment	Linux
Trigger	AFM caching with rename on non-empty directory in AFM+COS local-updates mode
Workaround	None

5.1.2.8

AFM

IJ42737

NFS fails to resolve the POSIX filesystem when a 'tmpfs' type filesystem is mounted prior to adding any GPFS export and then unmounted. This happens because NFS does not repopulate the POSIX filesystem list, which leads to a mismatch of the major and minor numbers of exports. (show details)

Symptom	Unexpected results
Environment	All
Trigger	The tmpfs filesystem remains in the POSIX list maintained by NFS; this list is not repopulated each time a new export is added.
Workaround	Restart NFS Ganesha in case there is a major/minor number mismatch of the filesystems

5.1.2.8

NFS-Ganesha

IJ44054

LDAP connections are monitored using the bind password. If that is obfuscated, the monitor may fail. (show details)

Symptom	Unexpected results
Environment	Linux
Trigger	Using LDAP server with obfuscated bind PW
Workaround	None

5.1.2.8

System Health

IJ42759

GPFS commands are calling egrep which produces warning on latest Cygwin64 update. (show details)

Symptom	Error output/message
Environment	Windows/x86_64 only
Trigger	Cygwin64 update
Workaround	None

5.1.2.8

Admin Commands

IJ43799

Node expel logic tries to avoid expelling nsd servers but in ECE environment it cannot determine this. (show details)

Symptom	Node expel/Lost Membership
Environment	All
Trigger	NSD servers
Workaround	None

5.1.2.8

GNR

IJ43806

mmafmcosctl object download prints the number of items as Queued for metadata downloads, when it is actually processing them directly without queuing. (show details)

Symptom	Error Message
Environment	Linux
Trigger	Running mmafmcosctl download with metadata only option.
Workaround	None

5.1.2.8

AFM

IJ41697

The "dig" command used to query the status crashed on Ubuntu and SLES when called from the sysmonitor daemon. (show details)

Symptom	Unexpected Results/Behavior
Environment	x86_64-linux only (except RHEL)
Trigger	The issue does not affect RHEL. On SLES and Ubuntu, the standard stream handling in the sysmon daemon caused the "dig" command to crash.
Workaround	None

5.1.2.7

System Health

IJ41620

Running mmbuildgpl on x86_64 with Linux kernels that include fixes for the retbleed vulnerability (CVE-2022-29900) results in an error. As a result, GPFS is not usable with these kernel versions. Specifically, this problem is hit with:

SLES 15 SP3 kernel update 5.3.18-150300.59.87.1 or higher
SLES 15 SP4 kernel update 5.14.21-150400.24.11.1
Ubuntu 22.04 kernel update 5.15.0-45.48

It is expected that the same changes will also be backported to RHEL, but no RHEL kernel updates with retbleed fixes have been released yet. The same applies to Ubuntu 20.04; no kernel updates have been released yet with these changes, but this should happen eventually.

The information provided by the Linux distributions are useful references: https://www.suse.com/security/cve/CVE-2022-29900.html https://ubuntu.com/security/CVE-2022-29900 https://access.redhat.com/security/cve/CVE-2022-29900

(show details)

Symptom	Component Level Outage (GPFS will be unusable on the node).
Environment	Linux (x86_64)
Trigger	This problem occurs when updating the Linux kernel to a version with retbleed patches included.
Workaround	The required change can also be applied manually: Edit the file /usr/lpp/mmfs/src/gpl-linux/Kbuild. Around line 100 there is the line "$(KBHOSTPROGS) := lxtrace". Before that line, add a new line with "CFLAGS_kdump-kern.o += -mfunction-return=keep". Save the file and run mmbuildgpl again.

5.1.2.7

Core GPFS
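The manual Kbuild change described in the IJ41620 workaround can be sketched as a small text transformation. This is only a minimal sketch, not an IBM-provided tool: the function name add_retbleed_flag is hypothetical, the sample text is a placeholder, and it assumes the anchor line and the flag line appear exactly as quoted in the workaround.

```python
ANCHOR = "$(KBHOSTPROGS) := lxtrace"
FLAG_LINE = "CFLAGS_kdump-kern.o += -mfunction-return=keep"


def add_retbleed_flag(kbuild_text: str) -> str:
    """Return Kbuild text with FLAG_LINE inserted before the ANCHOR line.

    The text is returned unchanged if the flag is already present or if
    the anchor line cannot be found (hypothetical helper, for illustration).
    """
    lines = kbuild_text.splitlines()
    if any(line.strip() == FLAG_LINE for line in lines):
        return kbuild_text
    for index, line in enumerate(lines):
        if line.strip() == ANCHOR:
            lines.insert(index, FLAG_LINE)
            return "\n".join(lines) + "\n"
    return kbuild_text


# Placeholder content standing in for the real Kbuild file.
sample = "# ... earlier Kbuild content ...\n$(KBHOSTPROGS) := lxtrace\n"
patched = add_retbleed_flag(sample)
```

After writing the result back to /usr/lpp/mmfs/src/gpl-linux/Kbuild, mmbuildgpl still has to be run again, as the workaround states.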

IJ41473

Files or directories that are accessed through CES NFS (Ganesha) and are also accessed concurrently can report wrong inode attributes. This can appear as data corruption. (show details)

Symptom	Unexpected Results/Behavior
Environment	Linux
Trigger	Access a file (or directory) through NFS Ganesha and modify the same file from another node.
Workaround	Since this problem is tied to using CES NFS, not using CES NFS can avoid this problem.

5.1.2.7

NFS-ganesha

IJ41758

Part of GPFS are kernel modules that are loaded upon startup and used by other components. Usage counters were not used correctly in the tracedev module, which can lead to the module being unloaded while still in use, resulting in a kernel crash. One case where this is possible is running the "mmvdisk server configure" and "mmvdisk server unconfigure" commands with the --recycle option. (show details)

Symptom	Abend/Crash
Environment	Linux
Trigger	Run GPFS shutdown and startup. This is a rare problem, so running this or the mentioned "mmvdisk server" command in a loop will be necessary to trigger the problem.
Workaround	Avoid stopping GPFS immediately after starting up.

5.1.2.7

Core GPFS

IJ41651

Linux kernel 4.2 added a new field to the Linux inode data structure. When an inode is reused under heavy workload, this field might not be initialized correctly, leading to a kernel crash when accessing the symlink. (show details)

Symptom	Abend/Crash
Environment	Linux (excluding RHEL7)
Trigger	This problem is highly dependent on the workload. If there is a workload creating directories, creating files underneath directories, deleting directories and also creating symlinks, there is a chance that this problem is hit. Build systems are a type of software that can exhibit this pattern.
Workaround	It is possible to manually patch the GPL layer: Edit the file /usr/lpp/mmfs/src/gpl-linux/inode.c. In function cxiSetOSNode, after the line "case S_IFLNK:", insert a new line with "inodeP->i_link = NULL;". Then run mmbuildgpl again and restart GPFS on the node.

5.1.2.7

Core GPFS

IJ41831

If a policy scan, initiated from the mmbackup command, fails and the mmbackup shadowDB file contains an entry for a file that was previously backed up but is now deleted, and the inode of that file has been assigned to a newly created file, then the mmbackup shadowDB file will have duplicate records for that file. (show details)

Symptom	Component Level Outage
Environment	All
Trigger	This problem occurs under the following conditions: (1) An entry exists in the mmbackup shadowDB for a file that has been deleted. (2) The inode for the file described in condition 1 has been assigned to a newly created file that needs to be backed up. (3) The policy scan done by mmbackup fails.
Workaround	Fix the root cause of policy scan failure and rebuild shadowDB.

5.1.2.7

mmbackup

IJ42150

mmafmctl prefetch -Y hits segfault (show details)

Symptom	segfault.
Environment	Linux
Trigger	mmafmctl prefetch command with -Y option
Workaround	None

5.1.2.7

AFM

IJ42164

AFM gateway daemon crashes during the resync due to invalid logAssert. (show details)

Symptom	Crash
Environment	Linux
Trigger	AFM replication
Workaround	None

5.1.2.7

AFM

IJ42165

AFM+COS replication gets stuck with requeued messages when a file is created, deleted and recreated with the same name before the replication is started to the COS. (show details)

Symptom	Unexpected results.
Environment	Linux
Trigger	AFM replication
Workaround	None

5.1.2.7

AFM

IJ42267

While using mmafmcosctl download --all, the download will fail if the directory contains a space in the name. (show details)

Symptom	Failed to download files
Environment	Linux
Trigger	mmafmcosctl download --all
Workaround	None

5.1.2.7

AFM

IJ41327

When one CES node gets rebooted, NFS client lock requests might fail with a "NLM_DENIED" error. (show details)

Symptom	Lock request will fail (NLM_DENIED or NLM_BLOCKED error can be seen in tcpdump reply frame of LOCK Request).
Environment	All
Trigger	When one of the protocol nodes of the cluster gets rebooted or a failover happens and a lock request is attempted on the same file.
Workaround	None in NFSv3. Issue not present in NFSv4. So one work around can be using NFSv4 instead of NFSv3.

5.1.2.7

NFS-ganesha

IJ42301

AFM recovery fails with error 80 due to incorrect checks for the inode attributes. This error causes the replication to be stuck. (show details)

Symptom	Unexpected results.
Environment	Linux
Trigger	AFM recovery
Workaround	None

5.1.2.7

AFM

IJ42467

AFM gateway node deadlocks during the read operation if both prefetch and the application try to read the same file simultaneously. (show details)

Symptom	Deadlock
Environment	All
Trigger	Read operation on AFM uncached file
Workaround	None

5.1.2.7

AFM

IJ42500

The following assert goes off:

logAssertFailed: totalReceived == scatteredP->scattered_total_len || (totalReceived == 0 && scatteredIndex == scatteredP->scattered_count)

(show details)

Symptom	Abend/Crash
Environment	All
Trigger	Network is not good which leads to TCP connection reconnect.
Workaround	None

5.1.2.7

Core GPFS

IJ42511

When an NVMe device is becoming active, it is necessary for ESS to poll the device to determine if it is ready for I/O. It does this by polling the final LBA of the device to see if reads are allowed. This is because the devices become visible to the OS prior to becoming ready to handle read/write requests.

The original implementation, however, would incorrectly claim that media errors on the final LBA mean that the device isn't ready. As a result, it is possible that legitimate media problems on the final LBA of an NVMe will induce ESS to claim that the entire device is not available.

This problem can be identified by an NVMe pdisk going missing after seeing unrecovered read errors in the Spectrum Scale RAID recovery group event log (mmvdisk recoverygroup list --events).

(show details)

Symptom	Component Level Outage
Environment	Linux
Trigger	Corrupted physical block mapped to the final logical block within an NVMe namespace.
Workaround	None

5.1.2.7

ESS/GNR

IJ42229

If verbsRdmaSend configuration is enabled, and the verbs connection is disconnected and reconnected due to any error other than node shutdown or node failure, it may cause some RPC reply messages to be left in the internal table unintentionally.

These messages will remain in the internal table forever, as none of ack messages can clean them up. Deadlock will not occur immediately, because these RPC messages have been processed correctly. However, the problem may occur when the 32-bit message IDs are wrapped and reused.

Some new messages may be recognized as duplicated RPCs and be rejected by the destination node. These new messages will stay in 'pending' state and result in deadlock.

(show details)

Symptom	Hang/Deadlock/Unresponsiveness/Long Waiters
Environment	All
Trigger	For a cluster which has the verbsRdmaSend configuration enabled, this problem may occur if the verbs connection is disconnected and reconnected due to any error other than node shutdown or node failure (for example because of network issue).
Workaround	Recycle GPFS daemon.

5.1.2.7

RDMA
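The failure mode described for IJ42229 can be illustrated with a toy model. This is only a sketch of the general idea (a stale table entry colliding with a reused 32-bit message ID); the class ReplyTable and its methods are hypothetical and do not reflect the actual GPFS data structures.

```python
ID_MASK = 0xFFFFFFFF  # message IDs are 32 bits wide and wrap around


class ReplyTable:
    """Toy model of a table that remembers already-handled RPC message IDs."""

    def __init__(self):
        self.entries = set()

    def accept(self, msg_id):
        """Return False if msg_id looks like a duplicate of a stored entry."""
        if msg_id in self.entries:
            return False
        self.entries.add(msg_id)
        return True

    def ack(self, msg_id):
        """An acknowledgement normally removes the stored entry."""
        self.entries.discard(msg_id)


table = ReplyTable()
first_id = 7
accepted_first = table.accept(first_id)    # processed normally

# Assume the ack never cleans up the entry after the reconnect,
# so table.ack(first_id) is never called and the entry stays behind.

reused_id = (first_id + 2**32) & ID_MASK   # the ID counter wraps to the same value
accepted_reused = table.accept(reused_id)  # rejected as a duplicate
```

In this model the second message is never processed, which corresponds to the 'pending' state described above; recycling the daemon empties the table.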

IJ43167

mmbuildgpl fails on SLES 15.3, new kernel 5.3.18-150300.59.90-default with error as below:

"No rule to make target 'vmlinux', needed by '/usr/lpp/mmfs/src/gpl-linux/kdump-kern-dummy.ko'"

(show details)

Symptom	mmbuildgpl will fail on SLES 15.3 kernel version 5.3.18-150300.59.90-default
Environment	SLES 15.3 kernel version 5.3.18-150300.59.90-default (all architectures).
Trigger	mmbuildgpl will fail on SLES 15.3 when kernel is upgraded to 5.3.18-150300.59.90-default
Workaround	Clear the KBUILD_BUILTIN macro inside /usr/lpp/mmfs/src/gpl-linux/Kbuild by adding the line "KBUILD_BUILTIN :=" after the existing s390x code, as shown in the last line below:
#For s390x: -pg and -fomit-frame-pointer are incompatible
ifeq ($(ARCH),s390)
ifdef CONFIG_FUNCTION_TRACER
ORIG_CFLAGS := $(KBUILD_CFLAGS)
KBUILD_CFLAGS = $(subst -pg,,$(ORIG_CFLAGS))
endif
endif
KBUILD_BUILTIN :=

5.1.2.7

Build

IJ43330

logAssertFailed: fileId.inodeNum > 0 when running AFM Recovery or Resync (show details)

Symptom	Lost Membership
Environment	Linux
Trigger	Role Reversal to make old Primary as Secondary and the old Secondary being promoted to Primary.
Workaround	None

5.1.2.7

AFM

IJ40659

Trace parameters set through the mmtracectl command do not keep the node classes. (show details)

Symptom	Unexpected behavior
Environment	All
Trigger	Set trace parameters with mmtracectl command.
Workaround	Explicitly set the trace parameters via mmchconfig command.

5.1.2.6

Admin

IJ40707

The mmlsquota reports duplicate lines when issuing the -C option. (show details)

Symptom	Duplicate output
Environment	All
Trigger	Specify the Device argument that also belongs to the remote cluster in the -C argument.
Workaround	Specify the Device argument that does not belong to the -C ClusterName.

5.1.2.6

Admin Commands

IJ40709

GPFS fails to process the kmipServerUri field in a remote key manager stanza in the RKM.conf file if it is provided as an IPv6 address, e.g., kmipServerUri = tls://[fd9a:f0d0:1002:11::31]:5696. (show details)

Symptom	Failure to read files from encrypted file systems/sets.
Environment	All
Trigger	None
Workaround	Use the hostname instead.

5.1.2.6

Security
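For IJ40709, the source does not say how GPFS parses the kmipServerUri field. The following sketch only illustrates, using Python's standard urllib.parse module, why a bracketed IPv6 literal needs dedicated handling: the address itself contains colons, so the port cannot be found by a simple split. The helper name split_kmip_uri is hypothetical.

```python
from urllib.parse import urlsplit

URI = "tls://[fd9a:f0d0:1002:11::31]:5696"  # example value from the APAR text


def split_kmip_uri(uri):
    """Return (scheme, host, port), with brackets removed from IPv6 hosts."""
    parts = urlsplit(uri)
    return parts.scheme, parts.hostname, parts.port


scheme, host, port = split_kmip_uri(URI)

# A naive split on ":" cannot tell the address colons from the port separator.
naive_fields = URI.split(":")
```

Here host holds the bare IPv6 address and port holds the integer 5696, whereas naive_fields breaks the address into unusable fragments.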

IJ40754

Running a blocking trace when the node is low on memory and swapping, can lead to a deadlock. (show details)

Symptom	Hang/Deadlock/Unresponsiveness/Long Waiters
Environment	Linux
Trigger	Run traces in blocking mode, while the available memory is low and processes are getting swapped out.
Workaround	Ensure that sufficient free memory is available, so that the trace tool is not being swapped out.

5.1.2.6

Trace

IJ40815

AFM Recovery procedure sometimes fails with error 112. (show details)

Symptom	Unexpected Behavior
Environment	Linux (AFM gateway node)
Trigger	Running recovery on a fileset whose .ptrash directory has the local bit reset on it.
Workaround	Setting the ptrash bit manually on the .ptrash directory (if it is found to be reset)

5.1.2.6

AFM

IJ40817

A node delete in an ECE cluster will cause the declustered array to be stuck in critical rebuild, preventing the system from doing any data rebuild function. (show details)

Symptom	Unexpected Results/Behavior
Environment	Linux
Trigger	Remove an ECE node with mmvdisk.
Workaround	None

5.1.2.6

ESS, GNR

IJ39267

CCR becomes slow on a quorum node when the configured firewall drops the FIN TCP/IP packets of CCR requests. (show details)

Symptom	Performance impact/degradation.
Environment	Linux (x86_64)
Trigger	Misconfigured firewall.
Workaround	None

5.1.2.6

CCR

IJ40863

"mmsdrrestore --ccr-repair" is not removing CCR tiebreaker disks from the cluster configuration in case those CCR tiebreaker disks aren't available when this command is executed. This happens only in case the CCR nodes file '/var/mmfs/ccr/ccr.nodes' is not available on the quorum nodes. (show details)

Symptom	Unexpected results/behavior
Environment	All
Trigger	'/var/mmfs/ccr/ccr.nodes' not available on the quorum nodes in conjunction with CCR tiebreaker disks not accessible on those quorum nodes.
Workaround	None

5.1.2.6

CCR, Admin command "mmsdrrestore --ccr-repair"

IJ39112

Mutex contention could lead to slow write performance on AIX when multiple threads try to flush the same file, which contains many blocks, at the same time. (show details)

Symptom	Performance Impact/Degradation.
Environment	AIX/Power, Windows (x86_64)
Trigger	Multiple threads invoking sync on the same file at the same time.
Workaround	None

5.1.2.6

Core GPFS

IJ40726

A problem was identified when running in a mixed level cluster where some nodes support msgqueue and others do not. Excessive librdkafka threads will be created for each IO event on the 5.1.2+ nodes resulting in thread exhaustion for that particular node. (show details)

Symptom	Hang/Deadlock/Unresponsiveness/Long Waiters
Environment	Linux
Trigger	Running a cluster where msgqueue is supported. Upgrading a node to 5.1.2+ where msgqueue is no longer supported. Running IO to the 5.1.2+ node.
Workaround	None

5.1.2.6

Watch Folder, File audit logging

IJ41097

Symlink is not fetched from home on AFM cache fileset if the gateway kernel version is ≥ 5.10. This happens because memory is not allocated for symlink target path. (show details)

Symptom	Unexpected results
Environment	Linux
Trigger	AFM caching with symlinks.
Workaround	None

5.1.2.6

AFM

IJ41098

AFM gateway asserts when replicating the Rmdir operation on a dependent fileset. (show details)

Symptom	Assert
Environment	Linux
Trigger	AFM caching with dependent filesets.
Workaround	None

5.1.2.6

AFM

IJ41099

Resync is not able to create hardlink if the file is evicted while the link op is in the queue. (show details)

Symptom	Hardlink operation requeued.
Environment	Linux
Trigger	AFM caching with hardlinks
Workaround	None

5.1.2.6

AFM

IJ41100

Lookup on hardlinks fails intermittently on AFM cache filesets. This is due to a race between multiple threads performing the lookup of the same hardlink from different directories. (show details)

Symptom	Unexpected results
Environment	Linux
Trigger	AFM caching with hardlinks
Workaround	None

5.1.2.6

AFM

IJ41101

If file is accessed by SMB and AFM tries to replicate the same file, it requeues the operation due to lock conflict. It replicates it later when the file is closed by SMB. (show details)

Symptom	Write operation requeued.
Environment	Linux
Trigger	Simultaneous access of file from SMB and AFM.
Workaround	None

5.1.2.6

AFM

IJ41105

When rename/remove operations are performed on dependent filesets which are linked inside AFM independent filesets, and these operations get replicated to the remote site - the local removed/renamed inodes are not reclaimed resulting in extra inodes being held inUse than actually necessary. (show details)

Symptom	Unexpected Behavior
Environment	Linux (AFM Gateway nodes)
Trigger	Remove/Rename being performed on the dependent fileset inodes - when this dependent fileset is linked under an AFM independent fileset.
Workaround	None

5.1.2.6

AFM

IJ41014

After upgrading Scale on the exporting kNFS nodes to 5.1.3.0 (or 5.1.2.4) NFS clients mounting from Scale report stale file handles after a while. (show details)

Symptom	IO Error
Environment	Linux
Trigger	This is triggered by the NFS client sending a NFS commit message to the Linux kernel nfsd server on a GPFS node. The exact trigger depends on the NFS client, and memory usage on the NFS client system, so can be hard to predict.
Workaround	There is no direct workaround. Using the CES protocol stack, which uses NFS Ganesha could be a workaround, but is a larger config change.

5.1.2.6

NFS

IJ41211

Objects are not fully prefetched at the cache on reading the 4th block when afmPrefetchThreshold is set to 0 and the I/O pattern is random. (show details)

Symptom	Unexpected Behavior
Environment	Linux (AFM Gateway nodes)
Trigger	In RO/LU/IW/SW mode of operation, with AFM COS as the backend, have an uncached file (evict the file from the cache in the case of SW or IW). Read 4 data blocks randomly on the file at the cache (make sure no 2 blocks are read sequentially).
Workaround	Read 4 blocks sequentially as compared to random.

5.1.2.6

AFM

IJ41133

When recovering from a kafka down period, an audit event is sent to indicate the number of events that were dropped as well as a subEvent indicating what happened. This subEvent contained invalid json. (show details)

Symptom	Unexpected Results/Behavior
Environment	Linux
Trigger	External kafka server needs to be down while audit events are being created. Once the external kafka server is back up, it will receive this event.
Workaround	An exception can be caught with JSON parsers when the invalid JSON is detected.

5.1.2.6

Watch folder
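The IJ41133 workaround of catching the parser exception can be sketched as follows with Python's standard json module. This is a minimal sketch for a consumer of the audit events; the function name parse_audit_event and the field names in the sample payloads are hypothetical and are not taken from the actual event format.

```python
import json


def parse_audit_event(raw):
    """Return the decoded event, or None if the payload is not valid JSON."""
    try:
        return json.loads(raw)
    except json.JSONDecodeError:
        # Skip the malformed event instead of letting the consumer fail.
        return None


# Hypothetical payloads: one well-formed, one with a malformed subEvent value.
good = parse_audit_event('{"event": "CREATE", "path": "/gpfs/fs1/file"}')
bad = parse_audit_event('{"event": "DROPPED", "subEvent": }')
```

Returning None lets the consumer log and skip the malformed event while continuing to process the events that follow it.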

IJ41134

After upgrading the Spectrum Scale version from 5.1.2.0-5.1.2.3 to 5.1.2.4 or any higher version, NFSv4 clients throw "unknown error 521" and fail to access the NFS share. (show details)

Symptom	NFSv4 clients throw "unknown error 521" after upgrade.
Environment	All
Trigger	The issue is caused by the NFSv4 file handle size change in 5.1.2.4 or any higher version.
Workaround	Unmount and remount the NFSv4 share on all NFS clients.

5.1.2.6

cNFS, NFS
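As background for the "unknown error 521" message in IJ41134: in the Linux kernel sources (include/linux/errno.h), 521 is the kernel-internal code EBADHANDLE, "Illegal NFS file handle", which user-space C libraries have no message text for. The lookup below is only an illustrative sketch; the table name and the helper describe_errno are hypothetical, and only a few of the kernel-internal NFS codes are listed.

```python
import os

# Kernel-internal NFS error codes as defined in the Linux kernel's
# include/linux/errno.h. They lie outside the user-space errno range.
KERNEL_NFS_ERRNOS = {
    521: ("EBADHANDLE", "Illegal NFS file handle"),
    522: ("ENOTSYNC", "Update synchronization mismatch"),
    523: ("EBADCOOKIE", "Cookie is stale"),
    524: ("ENOTSUPP", "Operation is not supported"),
}


def describe_errno(code):
    """Describe an error number, preferring the kernel-internal NFS names."""
    if code in KERNEL_NFS_ERRNOS:
        name, text = KERNEL_NFS_ERRNOS[code]
        return f"{name}: {text}"
    return os.strerror(code)


message = describe_errno(521)
```

A bad-file-handle code is consistent with the trigger above, a change in the NFSv4 file handle size, and with the workaround of remounting so that clients obtain fresh handles.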

IJ41254

NFS-ganesha crashed with the stack below during a file lock request from an NFS client.

(gdb) bt
#0 0x00007f889ae809bf in raise () from /lib64/libpthread.so.0
#1 0x00000000004427b8 in crash_handler (signo=11, info=0x7f86fa1debb0, ctx=0x7f86fa1dea80) at /usr/src/debug/gpfs.nfs-ganesha-2.7.5-ibm067.04.295048.el8.x86_64/MainNFSD/nfs_init.c:239
#2 <signal handler called>
#3 lock_entry_dec_ref (lock_entry=0x7f86386bd2c0) at /usr/src/debug/gpfs.nfs-ganesha-2.7.5-ibm067.04.295048.el8.x86_64/SAL/state_lock.c:708
#4 0x00000000004ae6fa in free_cookie (cookie_entry=0x7f85c02144a0, unblock=true) at /usr/src/debug/gpfs.nfs-ganesha-2.7.5-ibm067.04.295048.el8.x86_64/SAL/state_lock.c:1371
#5 0x00000000004af31e in state_complete_grant (cookie_entry=0x7f85c02144a0) at /usr/src/debug/gpfs.nfs-ganesha-2.7.5-ibm067.04.295048.el8.x86_64/SAL/state_lock.c:1717
#6 0x0000000000499098 in nlm4_Granted_Res (args=0x7f84fc602e38, req=0x7f84fc602730, res=0x7f84fc309a00) at /usr/src/debug/gpfs.nfs-ganesha-2.7.5-ibm067.04.295048.el8.x86_64/Protocols/NLM/nlm_Granted_Res.c:101
#7 0x000000000045a0ab in nfs_rpc_process_request (reqdata=0x7f84fc602730) at /usr/src/debug/gpfs.nfs-ganesha-2.7.5-ibm067.04.295048.el8.x86_64/MainNFSD/nfs_worker_thread.c:1331
#8 0x000000000045a97e in nfs_rpc_valid_NLM (req=0x7f84fc602730) at /usr/src/debug/gpfs.nfs-ganesha-2.7.5-ibm067.04.295048.el8.x86_64/MainNFSD/nfs_worker_thread.c:1593
#9 0x00007f889c8e6538 in svc_vc_decode (req=0x7f84fc602730) at /usr/src/debug/gpfs.nfs-ganesha-2.7.5-ibm067.04.295048.el8.x86_64/libntirpc/src/svc_vc.c:834
#10 0x000000000044d1aa in nfs_rpc_decode_request (xprt=0x7f866c7d4500, xdrs=0x7f84fc557f70) at /usr/src/debug/gpfs.nfs-ganesha-2.7.5-ibm067.04.295048.el8.x86_64/MainNFSD/nfs_rpc_dispatcher_thread.c:1349
#11 0x00007f889c8e6449 in svc_vc_recv (xprt=0x7f866c7d4500) at /usr/src/debug/gpfs.nfs-ganesha-2.7.5-ibm067.04.295048.el8.x86_64/libntirpc/src/svc_vc.c:807
#12 0x00007f889c8e2b91 in svc_rqst_xprt_task (wpe=0x7f866c7d4758) at /usr/src/debug/gpfs.nfs-ganesha-2.7.5-ibm067.04.295048.el8.x86_64/libntirpc/src/svc_rqst.c:779
#13 0x00007f889c8e3050 in svc_rqst_epoll_events (sr_rec=0x5572860, n_events=1) at /usr/src/debug/gpfs.nfs-ganesha-2.7.5-ibm067.04.295048.el8.x86_64/libntirpc/src/svc_rqst.c:956
#14 0x00007f889c8e32e9 in svc_rqst_epoll_loop (sr_rec=0x5572860) at /usr/src/debug/gpfs.nfs-ganesha-2.7.5-ibm067.04.295048.el8.x86_64/libntirpc/src/svc_rqst.c:1029
#15 0x00007f889c8e339f in svc_rqst_run_task (wpe=0x5572860) at /usr/src/debug/gpfs.nfs-ganesha-2.7.5-ibm067.04.295048.el8.x86_64/libntirpc/src/svc_rqst.c:1065
#16 0x00007f889c8ebc2d in work_pool_thread (arg=0x7f84c0006270) at /usr/src/debug/gpfs.nfs-ganesha-2.7.5-ibm067.04.295048.el8.x86_64/libntirpc/src/work_pool.c:181
#17 0x00007f889ae7614a in start_thread () from /lib64/libpthread.so.0
#18 0x00007f889a783dc3 in clone () from /lib64/libc.so.6

(show details)

Symptom	Crash
Environment	All
Trigger	Users might hit the crash if the same file is accessed by multiple clients with many overlapping file operations (create/delete/lock).
Workaround	None

5.1.2.6

cNFS, NFS-ganesha

IJ41328

Synchronous on-demand Read which triggers recovery on Independent-Writer mode fileset can block recovery if the home file is in migrated state. (show details)

Symptom	Performance Impact
Environment	Linux (AFM gateway node)
Trigger	Sync read being performed on IW fileset which needs recovery to be run and the file being migrated to HSM at the home site.
Workaround	Trigger recovery separately on IW fileset through ls or touch operations and then trigger such sync reads on uncached file which might be migrated to HSM at the home site.

5.1.2.6

AFM

IJ41280

CES cluster is showing an obscure error. (show details)

Symptom	None
Environment	None
Trigger	CES resume
Workaround	Bring the CES cluster up with no errors.

5.1.2.6

None

IJ41281

Cluster manager takeover thread causes deadlock when UID remapping is enabled. (show details)

Symptom	Deadlock
Environment	All
Trigger	UID remapping with cluster manager takeover.
Workaround	None

5.1.2.6

UID remapping

IJ41282

Daemon asserts when the number of UID remap entries is more than 8192. This issue happens due to an incorrect logAssert when UID remapping is enabled. (show details)

Symptom	Assert
Environment	All
Trigger	UID remapping and user entries to remap are greater than 8192.
Workaround	None

5.1.2.6

UID remapping

IJ41374

FSSTRUCT errors are logged in the system log file, and after formatting these errors with the fsstructlx.awk tool, the FSSTRUCT error is FSErrValidate (108) with type=eaOverflowBlock. (show details)

Symptom	FSSTRUCT error reported in system log file.
Environment	All
Trigger	Extended attributes exhaust the free inode space and start to allocate overflow blocks while a snapshot is also in use.
Workaround	None

5.1.2.6

Snapshot and extended attribute

IJ39624

On latest Cygwin (versions ≥ 3.3), an attempt to uninstall GPFS on Windows might display a dialog box complaining about access denied on uninstall.lnk. The dialog box presents options to Abort, Retry, or Ignore the error. Ignoring the error bypasses the issue and results in a successful uninstall. (show details)

Symptom	Upgrade/Install failure.
Environment	Windows (x86_64)
Trigger	Cygwin version ≥ 3.3.
Workaround	When presented with the dialog box complaining about uninstall.lnk, click on "Ignore" and that should let the uninstall complete. Then from an elevated Cygwin terminal: cd /usr/lpp/mmfs/support; chmod 777 uninstall.lnk; rm uninstall.lnk

5.1.2.5

Install, Upgrade

IJ39626

Not all ACL update interfaces understand and preserve the rich Windows ACL flags that only get set via native Windows ACL interfaces such as icacls or Explorer GUI. For example, mmputacl on any supported platform could clobber these flags. Hence, even if the ACL-flags are somehow blank, there still might be a valid ACL. (show details)

Symptom	Unexpected Results/Behavior.
Environment	Windows (x86_64)
Trigger	GPFS ACL updates (like mmputacl, mmeditacl etc) that do not preserve the rich Windows ACL flags.
Workaround	None

5.1.2.5

Authentication, ACLs

IJ39945

Replica mismatch could occur if the file system panics or a node fails while there are directIO writes in progress. This could happen on a file system with data replication and rapid repair enabled. (show details)

Symptom	Unexpected Results/Behavior
Environment	All
Trigger	File system panic or node failure while a directIO write is in progress, with a down disk in one or more replicas.
Workaround	Disable rapid repair feature on the file system.

5.1.2.5

Core GPFS

IJ39946

Assert respPP ≠ NULL in AFM environment. (show details)

Symptom	Crash
Environment	Linux
Trigger	Unresponsive AFM home
Workaround	None

5.1.2.5

AFM

IJ40024

AFM Object download does not honor refresh intervals causing performance issues. For example, the list operation is sent to the COS before the refresh interval. (show details)

Symptom	Performance impact
Environment	Linux
Trigger	Object download on a large bucket.
Workaround	None

5.1.2.5

AFM

IJ40027

AFM is not able to bail out stuck messages on the replication queue when afmFastCreate is enabled and the home is stuck (show details)

Symptom	Long Waiters
Environment	All
Trigger	Having afmFastCreate enabled on AFM replication fileset and having huge Writes on files which might get stuck on a home which is not responding.
Workaround	None

5.1.2.5

AFM

IJ40028

GPFS daemon assert: exp(updateInProgress == 0) in file repUpdate.C (show details)

Symptom	Abend/Crash
Environment	All
Trigger	Multiple node read/write to the same file.
Workaround	None

5.1.2.5

Core GPFS

IJ38093

Deadlock after changing the AFM gateway node using mmchnode command as the node change is not propagated correctly to all the nodes in the cluster. (show details)

Symptom	Deadlock
Environment	Linux
Trigger	mmchnode --gateway/--nogateway.
Workaround	Restart GPFS on gateway nodes.

5.1.2.5

AFM

IJ40029

Collecting data about running threads on the node (e.g. from a gpfs.snap) concurrently with an mmfsd restart can crash the node. (show details)

Symptom	Abend/Crash
Environment	Linux
Trigger	Collect debug data for kernel threads (e.g. from gpfs.snap) concurrently while mmfsd is restarting.
Workaround	Avoid debug data collection (e.g. gpfs.snap) while mmfsd is restarting.

5.1.2.5

Core GPFS

IJ40034

AFM object replication fails on files with 64-bit inode numbers. (show details)

Symptom	Unexpected results
Environment	Linux
Trigger	Upload on objects with 64-bit inode numbers.
Workaround	None

5.1.2.5

AFM

IJ39454

GPFS daemon crashes with logAssertFailed: !"Trash_Domain" in file tokenclass.C. (show details)

Symptom	Abend/Crash
Environment	All
Trigger	Unmount the file system for any reason.
Workaround	Disable this assert via mmchconfig disableAssert.

5.1.2.5

Core GPFS

IJ40064

Network instability triggering socket reconnects can cause certain IBM Spectrum Scale messages to be lost and not re-transmitted. Additionally, if the network instability provokes a node failure, these lost messages can prevent the cluster from moving forward with the cluster-wide node leave protocol. This hang can cause loss of cluster function, including file system availability. (show details)

Symptom	Hang/Deadlock/Unresponsiveness/Long Waiters
Environment	Linux
Trigger	Node expel during socket reconnects.
Workaround	Restart the cluster.

5.1.2.5

ESS, GNR

IJ39013

In some large-scale deployments, highly concurrent lseek(SEEK_HOLE) calls to a specific file might cause performance degradation. (show details)

Symptom	Performance Impact/Degradation
Environment	Linux
Trigger	High concurrent lseek(SEEK_HOLE) calls on the same file.
Workaround	If the lseek(SEEK_HOLE) is being invoked from a grep CLI, the '-a' option can bypass the lseek(SEEK_HOLE) call.

5.1.2.5

Core GPFS
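For readers unfamiliar with the call named in IJ39013, the sketch below issues a single lseek(SEEK_HOLE) from Python. It is only an illustration of the system call, assuming a platform where os.SEEK_HOLE is available (such as Linux); the helper name first_hole_offset is hypothetical, and the temporary file stands in for a file on a GPFS file system.

```python
import os
import tempfile


def first_hole_offset(path):
    """Return the offset of the first hole at or after offset 0 in the file."""
    fd = os.open(path, os.O_RDONLY)
    try:
        # SEEK_HOLE moves to the next hole; the end of the file counts
        # as an implicit hole when the file has no holes before it.
        return os.lseek(fd, 0, os.SEEK_HOLE)
    finally:
        os.close(fd)


# Create a small temporary file purely to have something to query.
with tempfile.NamedTemporaryFile(delete=False) as handle:
    handle.write(b"x" * 4096)
    sample_path = handle.name

hole = first_hole_offset(sample_path)
size = os.path.getsize(sample_path)
os.unlink(sample_path)
```

The degradation described above arises when many such calls target the same file concurrently, not from a single call like this one.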

IJ40410

When adding disks, the block allocation map is extended by adding new blocks. If a block already exists at the location but is outside the current file size, then this assert is hit. (show details)

Symptom	Node expel/Lost Membership
Environment	All
Trigger	New disk add.
Workaround	None

5.1.2.5

Core GPFS

IJ40411

Readdir fails on AFM+COS filesets with -gcs option as the directory entries are created with an incorrect type. (show details)

Symptom	Unexpected results
Environment	Linux
Trigger	AFM+COS fileset access with gcs option.
Workaround	None

5.1.2.5

AFM

IJ40414

When the filesetdf feature is enabled without quota management, the df command on an independent fileset should return the values corresponding to the file system instead of garbage. (show details)

Symptom	Random output from df command.
Environment	All
Trigger	df command on filesetdf enabled and no quota management file system.
Workaround	Enable quota management when using the filesetdf feature.

5.1.2.5

Quotas

IJ40464

The SUID and SGID bits are not cleared after a successful write or truncate to a file by a non-owner. (show details)

Symptom	Unexpected Results/Behavior
Environment	Linux
Trigger	Create a file with the SUID and SGID bits set. As a non-owner or non-root user, write to the file with the write() system call or truncate the file with the truncate() system call.
Workaround	Ensure that only owners can write to an executable binary file that has the SUID/SGID bit set.

5.1.2.5

Core GPFS
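The IJ40464 workaround (only owners should be able to write to SUID/SGID binaries) can be checked from the permission bits. This is a minimal sketch using Python's standard stat module; the function name setid_and_writable_by_others is hypothetical, and the check looks only at the classic mode bits, so it assumes no ACLs grant additional write access.

```python
import stat


def setid_and_writable_by_others(mode):
    """True if the mode has SUID or SGID set and group or other may write."""
    has_setid = bool(mode & (stat.S_ISUID | stat.S_ISGID))
    non_owner_write = bool(mode & (stat.S_IWGRP | stat.S_IWOTH))
    return has_setid and non_owner_write


risky = setid_and_writable_by_others(0o4775)  # SUID set, group-writable
safe = setid_and_writable_by_others(0o4755)   # SUID set, only owner may write
plain = setid_and_writable_by_others(0o0664)  # no SUID/SGID bit at all
```

For a real file, the mode can be obtained with os.stat(path).st_mode and passed to the same function.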

IJ40563

Create operation hitting error 2. (show details)

Symptom	Operation queue gets dropped.
Environment	Linux
Trigger	Error 2 hits and queue gets dropped.
Workaround	None

5.1.2.5

AFM COS

IJ40564

When mapping is configured, AFM COS does not use the NON-MDS node (mapping) to replicate the write and create operations as part of queue execution. (show details)

Symptom	Replication is happening from MDS node only in mapping.
Environment	Linux
Trigger	NON-MDS node is not being used in mapping for create and write operations.
Workaround	None

5.1.2.5

AFM COS

IJ40565

AFM gateway daemon assert with (handlerListLock.isLocked() or DaemonShuttingDown) (show details)

Symptom	Crash
Environment	Linux
Trigger	AFM gateway node leaving the cluster.
Workaround	None

5.1.2.5

AFM

IJ40280

Potential for data integrity issues on all clusters using RDMA. (show details)

Symptom	Unexpected Results/Behavior
Environment	Linux
Trigger	Race condition between the RDMA software layer and IBM Spectrum Scale when reading data.
Workaround	Disable RDMA or set nsdCksumTraditional configuration parameter to "yes".

5.1.2.5

RDMA

IJ38554

Deadlock during AFM queue flush. (show details)

Symptom	Deadlock
Environment	Linux
Trigger	Stress testing
Workaround	None

5.1.2.4

AFM

IJ38784

While updating the symlink target path on an AFM enabled fileset, the inode is not copied to the previous snapshot causing the assert. (show details)

Symptom	Crash
Environment	Linux
Trigger	AFM caching with symlinks and snapshots.
Workaround	None

5.1.2.4

AFM

IJ38785

The SUID and SGID bits are not cleared after a successful write/truncate to a file by a non-owner. (show details)

Symptom	Unexpected Results/Behavior
Environment	Linux
Trigger	Create a file with the SUID and SGID bits set. As a non-root user or a non-group member user, write to the file with the write() system call or truncate the file with the truncate() system call.
Workaround	Ensure that only owners can write to an executable binary file that has the SUID/SGID bit set.

5.1.2.4

Core GPFS

IJ38786

Given a parent directory with the SGID bit set, a file created with the SGID bit specified by a user who does not belong to the same group as the directory can still have the SGID bit set. (show details)

Symptom	Unexpected Results/Behavior
Environment	All
Trigger	Create a file with the SGID bit specified as a non-member group user in a directory with the SGID bit set.
Workaround	Remove the SGID bit from the directory.

5.1.2.4

Core GPFS

IJ38807

Issuing io_uring IORING_OP_READ_FIXED requests to read data into preallocated buffers fails with an error. (show details)

Symptom	I/O error
Environment	Linux
Trigger	No pre-conditions are necessary.
Workaround	When using io_uring, use IORING_OP_READ instead of IORING_OP_READ_FIXED. This would require changing the application issuing the requests and might come at a performance penalty.

5.1.2.4

Core GPFS

IJ38808

Lookup fails on AFM NSD backend fileset root path if afmSyncNFSv4ACL option is set. AFM incorrectly tries to get NFSv4 ACLs on the remote cluster mount causing the failure. (show details)

Symptom	Unexpected results
Environment	Linux
Trigger	Using the afmSyncNFSv4ACL option where there exists NSD backend filesets.
Workaround	Unset the afmSyncNFSv4ACL option.

5.1.2.4

AFM

IJ38874

Today there is no command to bring an AFM Inactive fileset to active. (show details)

Symptom	AFM fileset moving to Inactive/Dropped states.
Environment	All
Trigger	Fileset moving to Inactive state and needing recovery for any reason.
Workaround	Wait for an I/O operation on the fileset or touch a file inside the fileset to simulate an incoming I/O and trigger recovery on the fileset in question.

5.1.2.4

AFM

IJ38901

When the handler for AFM replication is created on the gateway node, the handler create time, the last replay time and the last sync time are all initialized to the current time. If for some reason the handler cannot be mounted and replicate to Home, AFM prints the last replay time as the same time as the handler create time, which gives the false impression that replication has actually happened. (show details)

Symptom	Error output
Environment	Linux
Trigger	Checking AFM replication handler for last replay and sync time, when there's a recovery pending and not happening on the fileset.
Workaround	None

5.1.2.4

AFM

IJ37068

The codepath for flushing file data to disk did not properly check for a stale file system, resulting in a crash. (show details)

Symptom	Abend/Crash
Environment	Linux
Trigger	With file descriptor open and kept open, have file system go stale (e.g. restart daemon). Then issue a request to flush the data to a file (or implicit flushOnClose).
Workaround	None

5.1.2.4

Core GPFS

IJ38963

AFM fileset resync failed with EINVAL error (22). (show details)

Symptom	I/O error
Environment	Linux
Trigger	AFM fileset resync operation (mmafmctl command with resync subcommand).
Workaround	None

5.1.2.4

AFM

IJ38964

AFM Prefetch with --dir-list-file option where the list contains encoded directory names is not being processed and queued. (show details)

Symptom	Unexpected behavior.
Environment	Linux (AFM gateway node)
Trigger	Running prefetch (with or without --metadata-only option) using a list of encoded directory names - like the one generated from checkUncached (during mmchfileset command run).
Workaround	Decode the directory list by hand and feed it to prefetch.

5.1.2.4

AFM

IJ38966

When running I/O through KNFS with file audit logging enabled, an invalid pointer might be accessed. (show details)

Symptom	Abend/Crash
Environment	Linux
Trigger	Certain patterns of KNFS IO with file audit logging enabled.
Workaround	None

5.1.2.4

File audit logging

IJ38286

If a list file's first entry is a directory, then all operations are terminated because startmarker failed to set up. (show details)

Symptom	Command failed with invalid entries.
Environment	Linux
Trigger	First entry of a list file is a directory in the --list-file option.
Workaround	None

5.1.2.4

AFM

IJ38986

Kernel crash with kernel stack that shows the pemsIpmi functions. The RIP of the kernel crash shows RIP: 0010:kmem_cache_alloc_trace+0x7f/0x1c0. (show details)

Symptom	Kernel crash
Environment	Linux (x86_64)
Trigger	No special trigger.
Workaround	None

5.1.2.4

ESS, GNR

IJ38307

The mmafmcosaccess command doesn't check whether the given path belongs to the same fileset. It also needs to check the file system and fileset consistency for the given command. (show details)

Symptom	Unexpected Results/Behavior
Environment	Linux
Trigger	The given path for the mmafmcosaccess command doesn't belong to the same fileset but it is a valid path.
Workaround	None

5.1.2.4

AFM

IJ38997

Cached file is not revalidated in AFM local-updates mode. If the file is modified at home, these changes might not get pulled back into the cache. (show details)

Symptom	Unexpected results
Environment	Linux
Trigger	File read on AFM cached file in LU mode.
Workaround	Use AFM prefetch with --force option to cache the file again.

5.1.2.4

AFM

IJ38998

When afmSyncNFSv4ACL is set, ACL buffer size is not verified during the cache refresh. This causes the kernel to crash if the returned buffer length is zero. (show details)

Symptom	Crash
Environment	Linux
Trigger	AFM caching with afmSyncNFSv4ACL option.
Workaround	None

5.1.2.4

AFM

IJ39015

32bit GPFS API library not available in default path on Ubuntu. (show details)

Symptom	Error output/message
Environment	Linux (x86_64)
Trigger	Build an application with 32bit GPFS API library on Ubuntu.
Workaround	Modify the build process of the application to search for the 32bit GPFS API library in a different directory.

5.1.2.4

GPFS API

IJ39016

mmperfmon delete --expiredkeys fails with a timeout or exception. (show details)

Symptom	Error output/message
Environment	Linux
Trigger	Remote mounted filesystem with a slow or overloaded remote system.
Workaround	None

5.1.2.4

Performance monitoring

IJ39017

Daemon assert going off: endBufOffset >= 0 && endBufOffset < codeP-> getBufMaxPayload(endBuf). (show details)

Symptom	Abend/Crash
Environment	Linux
Trigger	A media error is discovered and fixed on an IBM ESS 3200 system that is using Flash Core Module NVMe drives on a specific virtual track boundary. Not all media errors will cause this crash.
Workaround	None

5.1.2.4

ESS, GNR

IJ39019

Kernel crash when required mount options are missing. (show details)

Symptom	Abend/Crash
Environment	Linux
Trigger	Issue a mount request where the dev= option is missing. Either remove that from /etc/fstab, or issue a mount command that does not read options from /etc/fstab, e.g.: mount -t gpfs /gpfs/fs1 /gpfs/fs1
Workaround	Always have the required dev= mount option available. This is the default in /etc/fstab.

5.1.2.4

Core GPFS
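
Until the fix is applied, it can help to confirm that every GPFS entry in /etc/fstab carries the dev= option. The sketch below is an illustration only: the function name and the sample fstab content are hypothetical, and it assumes the standard whitespace-separated fstab layout (device, mount point, type, options).

```python
def gpfs_entries_missing_dev(fstab_text: str) -> list[str]:
    """Return mount points of gpfs-type fstab entries that lack a dev= option."""
    missing = []
    for line in fstab_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        fields = line.split()
        if len(fields) < 4 or fields[2] != "gpfs":
            continue  # not a complete gpfs entry
        options = fields[3].split(",")
        if not any(opt.startswith("dev=") for opt in options):
            missing.append(fields[1])
    return missing


# Hypothetical fstab content used only for illustration.
SAMPLE = """\
# sample entries
fs1 /gpfs/fs1 gpfs rw,mtime,atime,dev=fs1,noauto 0 0
fs2 /gpfs/fs2 gpfs rw,noauto 0 0
/dev/sda1 / ext4 defaults 0 1
"""

print(gpfs_entries_missing_dev(SAMPLE))  # ['/gpfs/fs2']
```

On a real node the function would be fed the contents of /etc/fstab instead of `SAMPLE`.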

IJ39048

mmvdisk recovery group conversion may conflict with settings for nsdRAIDSmallBufferSize from the previous deployment scripts. mmvdisk will apply a value of -1 to this setting, which conflicts with the original value of 256KiB. The result is that the Daemon will print a warning message on start up, warning the user that nsdRAIDSmallBufferSize has been reduced to a value of 4KiB. This might impact performance. (show details)

Symptom	Error output/message, Performance Impact/Degradation
Environment	Linux
Trigger	mmvdisk recovery group conversion from the pre-2020 server config settings.
Workaround	Delete the old nsdRAIDSmallBufferSize setting of 256K in SDRFS, or delete any -1 values that were part of the mmvdisk rg conversion override.

5.1.2.4

ESS, GNR

IJ39049

When running mmhealth config monitor pause, followed by a mmhealth config monitor resume, the threshold component will stay in disabled state. (show details)

Symptom	Error output/message
Environment	All
Trigger	The issue occurs only if the node health monitoring was paused and resumed again.
Workaround	Execute the command "mmsysmonc enable thresholds".

5.1.2.4

System health

IJ39050

On Linux (two instances), kernel crash may occur after open() with O_CREAT flag is used and file has been opened already. (show details)

Symptom	Kernel crash
Environment	Linux
Trigger	Using open() with O_CREAT flag on system with Linux kernel 3.10 or higher.
Workaround	Avoid using open() with O_CREAT flag.

5.1.2.4

Core GPFS

IJ39057

Files are not fully cached on AFM COS filesets. (show details)

Symptom	Unexpected results
Environment	Linux
Trigger	File read on AFM uncached files.
Workaround	Use AFM prefetch to cache the files again.

5.1.2.4

AFM

IJ39058

Certain filenames that contained control characters were not properly escaped when logged by file audit logging / watch folder in JSON format. (show details)

Symptom	Unexpected Results/Behavior
Environment	Linux
Trigger	Creating a file with control characters in the name.
Workaround	None

5.1.2.4

File audit logging, Watch folder

IJ39059

The '-' char is incorrectly used for a range between two values. (show details)

Symptom	It doesn't report issue.
Environment	Linux
Trigger	When invalid char like ';' is also accepted.
Workaround	None

5.1.2.4

AFM

IJ39060

NFS status shown as 'unknown'. This might interfere with NFS fail over capabilities. (show details)

Symptom	Unexpected Results/Behavior
Environment	Linux
Trigger	None
Workaround	None

5.1.2.4

NFS, System health

IJ39117

An error 22 is hit when trying to get the valid data blocks on a file in resync. (show details)

Symptom	Unexpected Behavior
Environment	Linux (AFM gateway node)
Trigger	Running resync with uncached (possibly evicted) files at the SW cache site.
Workaround	None

5.1.2.4

AFM

IJ39089

Ganesha crashed with the following stack:
#5 0x00007f65b55a5e4e state_wipe_file (libganesha_nfsd.so.3.5)
#6 0x00007f65b567787c _mdcache_lru_unref (libganesha_nfsd.so.3.5)
#7 0x00007f65b56568e2 mdcache_put (libganesha_nfsd.so.3.5)
#8 0x00007f65b565adea mdcache_put_ref (libganesha_nfsd.so.3.5)
#9 0x00007f65b5619d73 open4_create_fh (libganesha_nfsd.so.3.5)
#10 0x00007f65b561c451 open4_ex (libganesha_nfsd.so.3.5)
#11 0x00007f65b561d6c0 nfs4_op_open (libganesha_nfsd.so.3.5)
#12 0x00007f65b5604cca process_one_op (libganesha_nfsd.so.3.5)
#13 0x00007f65b5605d46 nfs4_Compound (libganesha_nfsd.so.3.5)
#14 0x00007f65b555b99c nfs_rpc_process_request (libganesha_nfsd.so.3.5) (show details)

Symptom	Ganesha Crash
Environment	All
Trigger	The problem might occur if there are a lot of small files with the same filename created/deleted from nfs clients at the same time.
Workaround	None

5.1.2.4

cNFS, CES NFS (All instances in feature tags)

IJ39119

Ganesha logs the following messages:
2022-03-11 14:28:22 : epoch 0009016d : protocol2b : gpfs.ganesha.nfsd-14806[svc_37] GPFSFSAL_lookup : FSAL :CRIT :DOTDOT error, inode: 4308074499
2022-03-11 14:28:32 : epoch 0009016d : protocol2b : gpfs.ganesha.nfsd-14806[svc_48] GPFSFSAL_lookup : FSAL :CRIT :DOTDOT error, inode: 4308074499 (show details)

Symptom	DOTDOT error message in ganesha.log
Environment	All
Trigger	The problem might trigger if a snapshot directory exists and it has the same inode number as its parent directory.
Workaround	None

5.1.2.4

cNFS, CES NFS (All instances in feature tags)

IJ39148

NFS mount point is not getting killed if home fileset is unresponsive or hung. This is causing multiple nfsmount to be created for the same fileset. (show details)

Symptom	Too much memory consumption on the NFS mount point.
Environment	Linux
Trigger	Memory consumption on the gateway node grows because multiple NFS mount points exist for the same fileset.
Workaround	None

5.1.2.4

AFM DR

IJ39201

Watch folder events could show an old path to a file if a directory in its path had recently been renamed. (show details)

Symptom	Unexpected Results/Behavior
Environment	Linux
Trigger	Rename directories being watched.
Workaround	None

5.1.2.4

Watch folder

IJ36899

If the /etc/passwd file has multiple entries for the same UID, readdir fails while downloading the objects due to incorrect parsing of the UID. (show details)

Symptom	Unexpected results
Environment	Linux
Trigger	AFM+COS caching with duplicate entries in the passwd file
Workaround	Remove duplicate entries from /etc/passwd

5.1.2.4

AFM
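
The workaround above requires finding the duplicate UID entries first. A minimal Python sketch, assuming the standard colon-separated passwd layout (name:password:UID:...); the function name and sample data are hypothetical.

```python
from collections import defaultdict


def duplicate_uids(passwd_text: str) -> dict[int, list[str]]:
    """Map each UID that appears more than once to the login names using it."""
    names_by_uid = defaultdict(list)
    for line in passwd_text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comments
        fields = line.split(":")
        if len(fields) < 3 or not fields[2].isdigit():
            continue  # skip malformed entries
        names_by_uid[int(fields[2])].append(fields[0])
    return {uid: names for uid, names in names_by_uid.items() if len(names) > 1}


# Hypothetical passwd content used only for illustration.
SAMPLE = """\
alice:x:1001:1001::/home/alice:/bin/bash
bob:x:1002:1002::/home/bob:/bin/bash
alice2:x:1001:1001::/home/alice2:/bin/bash
"""

print(duplicate_uids(SAMPLE))  # {1001: ['alice', 'alice2']}
```

Which of the duplicate entries to remove is a site decision; the sketch only reports them.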

IJ39203

mmafmcoskeys failed to set access and secret keys. (show details)

Symptom	Access and secret keys fail to set.
Environment	Linux
Trigger	Trying to set access and secret keys.
Workaround	None

5.1.2.4

AFM COS

IJ39274

In huge clusters (lots of performance data) and on systems with high load on the pmcollector / GUI node, performance queries might run into a 5s timeout. This could lead to missing data in the GUI. (show details)

Symptom	Component Level Outage
Environment	Linux
Trigger	Huge clusters (lots of performance data) and systems with high load on the pmcollector / GUI node
Workaround	None

5.1.2.4

Performance monitoring, GUI

IJ39280

After refresh interval, cache bit is getting reset while getobjmetats is triggered on cached file and finding ETAG mismatches. (show details)

Symptom	Files get evicted.
Environment	All
Trigger	Files get evicted because cache bit gets reset.
Workaround	None

5.1.2.4

AFM COS

IJ39282

AFM fails to upload the object if the name starts with a '-' character. (show details)

Symptom	Deadlock
Environment	Linux
Trigger	AFM+COS caching with special file names.
Workaround	None

5.1.2.4

AFM

IJ39283

If the system pool is also used for data, auto recovery miscalculates the available metadata failure group count and may trigger tsrestripefs -r wrongly. (show details)

Symptom	Unexpected Results/Behavior
Environment	Linux
Trigger	If the system pool is used for both data and metadata in a FPO cluster and if a disk/node failure causes the good failure group count to become less than the default metadata replication.
Workaround	Do not use system pool for data.

5.1.2.4

FPO

IJ39284

Deadlock might happen when the AFM gateway node leaves the cluster. (show details)

Symptom	Deadlock
Environment	Linux
Trigger	AFM gateway node leaving the cluster.
Workaround	None

5.1.2.4

AFM

IJ39316

Disk quota error is not reported when a readdir is happening at fileset root. (show details)

Symptom	Unexpected Results/Behavior
Environment	Linux
Trigger	readdir on a fileset.
Workaround	None

5.1.2.4

AFM COS

IJ39371

Stack corruption due to possible buffer overflow. (show details)

Symptom	mmfsd restart
Environment	Linux
Trigger	mmfsd restart at AFM gateway node.
Workaround	None

5.1.2.4

AFM

IJ39011

Online replica compare function could incorrectly flag mismatch on the last block of a file when the block was preallocated as a full block and reduced to fragment later. (show details)

Symptom	Unexpected Results/Behavior
Environment	All
Trigger	Run online replica compare on files with preallocated blocks.
Workaround	Avoid running online replica compare.

5.1.2.4

Core GPFS

IJ39400

The IBM Spectrum Scale admin commands and handling of file system encryption keys require the use of more robust settings. (show details)

Symptom	None
Environment	All
Trigger	None
Workaround	None

5.1.2.4

Admin commands

IJ39415

GPFS recovery is blocked after cables are pulled and put back, due to an RPC being sent while taking GPFS dumps. (show details)

Symptom	Hang
Environment	Linux (ESS systems)
Trigger	Pull cables and then put the cables back.
Workaround	None

5.1.2.4

Core GPFS

IJ39437

Command mmlspdisk produces a printf arithmetic syntax error under a non-US locale. (show details)

Symptom	Error output/message
Environment	All
Trigger	Run mmlspdisk under locale that uses decimal comma.
Workaround	Run mmlspdisk in C or en_US locale.

5.1.2.4

Admin commands
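
When the command is driven from a script, the locale workaround above can be applied per invocation instead of changing the session locale. The following Python sketch assumes the caller supplies the command line; the helper names `c_locale_env` and `run_in_c_locale` are hypothetical.

```python
import os
import subprocess


def c_locale_env(base=None):
    """Return a copy of the environment with LC_ALL forced to the C locale.

    LC_ALL overrides LANG and the other LC_* variables, so numbers are
    formatted with a decimal point regardless of the session locale.
    """
    env = dict(os.environ if base is None else base)
    env["LC_ALL"] = "C"
    return env


def run_in_c_locale(argv):
    """Run a command (list of arguments) under the C locale and capture its output."""
    return subprocess.run(argv, env=c_locale_env(), capture_output=True, text=True)
```

`run_in_c_locale()` would be called with the same mmlspdisk argument list that is already in use.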

IJ39438

In huge clusters (lots of performance data) and on systems with a high load on the pmcollector / GUI node, performance queries might run into a 5s timeout. This could lead to missing data in the GUI. (show details)

Symptom	Component Level Outage
Environment	Linux
Trigger	Huge clusters (lots of performance data) and systems with high load on the pmcollector / GUI node
Workaround	None

5.1.2.4

Performance monitoring, GUI

IJ39440

Signal 11 in fetch_and_add() on nsdHoldCount. (show details)

Symptom	Abend/Crash
Environment	All
Trigger	Disk scan.
Workaround	None

5.1.2.4

NSD

IJ39449

Pems hang due to no ipmi recv slots, so no new ipmi command will be sent to BMC. When pems hangs, it will generate these lines in dmesg or /var/log/messages:
ERROR: no open ipmi recv slots
pems_mod:[E]:0136:0581:failed to enq cmd rc=0xfffffff0
pemsIpmiEnqueueCmd failed to enq setting QUEUE_FULL
pems_mod:[E]:0136:0315:failed to send cmd to backend interface rc=-16
pems_mod:[E]:0136:0581:failed to enq cmd rc=0xfffffff0
You will see the last 2 prints over and over. (show details)

Symptom	pems hang generating a lot of messages in dmesg.
Environment	Linux (x86_64)
Trigger	A small hole in the pems IPMI receive handler that can be hit at any time on ESS 3200.
Workaround	Restart pems module and restart the ess3200_pemscfg service.

5.1.2.4

ESS, GNR

IJ39455

Stop displaying unsupported ciphers and prevent them from being added to cipherList. The following ciphers are affected: AES128-SHA, AES256-SHA. (show details)

Symptom	None
Environment	All
Trigger	Use un-supported ciphers.
Workaround	Don't use unsupported ciphers.

5.1.2.4

Admin commands

IJ37100

The output of "mmperfmon query" gives incomplete data if the names contain a blank. (show details)

Symptom	Error output/messages
Environment	All (with performance monitoring installed)
Trigger	The broken text appears for entries containing blanks.
Workaround	None

5.1.2.3

System health

IJ37227

Daemon assert going off when generating DMAPI event: addr.isReserved() || addr.getClusterIdx() == clusterIdx in file cfgmgr.h, resulting in a daemon crash. (show details)

Symptom	Abend/Crash
Environment	All
Trigger	DMAPI is enabled and a remote cluster is used while a DMAPI event is being generated after a remote client node left the cluster.
Workaround	None

5.1.2.3

DMAPI

IJ37231

If NFSv4 client holds a file lock for read/write operations, then client may report I/O error after CES-IP failover. (show details)

Symptom	I/O error
Environment	All
Trigger	If an NFSv4 client holds a file lock for a write operation, then CES-IP failover from the current active NFS server (for example, protocol node1) to another server (protocol node2) may cause an I/O failure on the client.
Workaround	None

5.1.2.3

IJ37235

Missing sqlite-3 packages on IBM Spectrum Scale Erasure Code Edition environments can cause admin command hangs. (show details)

Symptom	Hang
Environment	All
Trigger	Problem occurs in an IBM Spectrum Scale Erasure Code Edition environment when the sqlite-3 package is installed on some nodes but not others.
Workaround	None

5.1.2.3

Admin commands

IJ37246

EPERM is incorrectly returned for non-existing ioctl requests. (show details)

Symptom	Unexpected Results/Behavior
Environment	Linux
Trigger	Issuing an invalid ioctl request to a file in GPFS.
Workaround	NA

5.1.2.3

Core GPFS

IJ37256

There is a chance of a kernel crash with kernel stack with pemsIpmi functions. The RIP of the kernel crash may show RIP: 0010:kmem_cache_alloc_trace+0x7f/0x1c0. (show details)

Symptom	Kernel crash
Environment	Linux (x86_64)
Trigger	No specific trigger; issue occurred in normal good path run.
Workaround	None

5.1.2.3

ESS, GNR

IJ36533

Quota usage discrepancy from fileset-based quota check. (show details)

Symptom	Unexpected Results/Behavior
Environment	All
Trigger	Fileset level quota check
Workaround	Switch to file system level quota check.

5.1.2.3

Quotas

IJ37260

Running workloads with many lookups done to GPFS in a highly concurrent way has a performance impact. (show details)

Symptom	Performance Impact/Degradation
Environment	Linux
Trigger	Lots of lookups to the file system. One known case is setting LD_LIBRARY_PATH to directories on a GPFS file system on zLinux. The zLinux dynamic linker issues a much higher number of lookups for each entry in LD_LIBRARY_PATH, making this scenario more likely to occur.
Workaround	Reduce the number of concurrent lookups.

5.1.2.3

Core GPFS

IJ36554

In fileset level mmcheckquota, if no free inode is left in the fileset (all inodes are allocated), the last inode number is miscounted when calculating the max inode number for the fileset, which causes a discrepancy of 1 inode in the fileset quota usage. (show details)

Symptom	Unexpected Results/Behavior
Environment	All
Trigger	Fileset level quota check
Workaround	Switch to a file system level quota check.

5.1.2.3

Quotas

IJ37280

The assert goes off and the following message is shown in the mmfs.log: Assert exp(isUnlinked() || DaemonShuttingDown) (show details)

Symptom	Abend/Crash
Environment	All
Trigger	Sender timeout while sending an RPC
Workaround	None

5.1.2.3

Core GPFS

IJ36532

When there are multiple threads trying to flush the same file and the file is large with many blocks, there could be mutex contention which can lead to performance degradation. (show details)

Symptom	Performance Impact/Degradation
Environment	All
Trigger	Multiple threads trying to flush the same large file.
Workaround	Reduce the number of worker threads.

5.1.2.3

Core GPFS

IJ37350

AFM prefetch might get stuck during the queuing phase if the list file has duplicate entries. This happens because a waiting thread is not notified after the read completion. (show details)

Symptom	Deadlock
Environment	Linux
Trigger	AFM prefetch
Workaround	Remove duplicate entries from the list file.

5.1.2.3

AFM
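
The workaround above amounts to removing repeated lines from the list file while keeping the original order. A minimal Python sketch, assuming one entry per line (the list-file format is not described in this entry); the function name is hypothetical.

```python
def dedupe_lines(lines):
    """Return entries in first-seen order with duplicates and blank lines dropped."""
    seen = set()
    unique = []
    for line in lines:
        entry = line.rstrip("\n")
        if not entry or entry in seen:
            continue  # skip blanks and entries already emitted
        seen.add(entry)
        unique.append(entry)
    return unique


# Hypothetical list-file content used only for illustration.
print(dedupe_lines(["/gpfs/fs1/a\n", "/gpfs/fs1/b\n", "/gpfs/fs1/a\n"]))
# ['/gpfs/fs1/a', '/gpfs/fs1/b']
```

The result can be written to a new list file, leaving the original untouched.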

IJ37356

Inodes are not reclaimed after the hardlinks are corrected during the AFM prefetch. This causes more inodes to be in-use than actual number of files present in the fileset. (show details)

Symptom	Unexpected results
Environment	Linux
Trigger	AFM prefetch
Workaround	None

5.1.2.3

AFM

IJ37511

An error message "Could not retrieve minReleaseVersion" is logged in the systemhealth monitor log file (mmsysmonitor.log). (show details)

Symptom	Error output/messages
Environment	All (with performance monitoring installed)
Trigger	The error message is logged whenever a mmperfmon query is executed.
Workaround	None; The error message can be ignored.

5.1.2.3

System health

IJ37542

On Linux kernel 3.10 or later, if the O_TRUNC flag is used and the file has been opened already, the O_TRUNC flag might be incorrectly ignored. (show details)

Symptom	Unexpected Results/Behavior
Environment	Linux
Trigger	Using open() with O_CREAT and O_TRUNC flags on a system with Linux kernel 3.10 or later.
Workaround	Avoid using open() with O_CREAT and O_TRUNC flags.

5.1.2.3

Core GPFS
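
To check whether a given file system honors O_TRUNC on a file that is already open, a small probe can be run in a scratch directory. This is a sketch under assumptions, not an IBM-provided test: the function name and the probe file name are hypothetical.

```python
import os


def truncates_on_open(directory: str) -> bool:
    """Return True if open() with O_CREAT | O_TRUNC empties an already-open file."""
    path = os.path.join(directory, "otrunc_probe.tmp")  # hypothetical probe file
    first = os.open(path, os.O_CREAT | os.O_WRONLY, 0o600)
    try:
        os.write(first, b"payload")
        # Second open while the first descriptor is still held.
        second = os.open(path, os.O_CREAT | os.O_TRUNC | os.O_WRONLY, 0o600)
        try:
            return os.fstat(second).st_size == 0
        finally:
            os.close(second)
    finally:
        os.close(first)
        os.unlink(path)
```

A result of False would indicate that the O_TRUNC flag was ignored for the second open.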

IJ37679

mmbackup uses IBM Spectrum Protect BA client command 'dsmc' to communicate with the IBM Spectrum Protect Server. If the -server option is not given to dsmc, dsmc gets a default server name from the Protect client configuration. If --tsm-servers <servername> to mmbackup is different from the default server name and the default server is not functional, mmbackup could show unexpected behavior because mmbackup does not provide the -server option in one of the dsmc query calls. (show details)

Symptom	Component Level Outage
Environment	All
Trigger	Run mmbackup when --tsm-servers <server> is not the same as the default servername in dsm.opt and default server is not functional.
Workaround	Make sure that the --tsm-servers <server> is the same as the default servername in dsm.opt.

5.1.2.3

mmbackup

IJ37747

When adding a new disk to a file system, health monitoring will raise an ill_unbalanced_fs degraded health event as the file system will be unbalanced. This degraded health event does not reflect the current recommendation of when to use the mmrestripefs command, so its severity is too severe and is being changed from a degraded severity level to a TIP severity level. (show details)

Symptom	Error output/messages
Environment	All
Trigger	Adding a new disk to an existing file system.
Workaround	The ill_unbalanced_fs event can be added to the "ignore events" list in the mmsysmonitor.conf file. After mmsysmon is restarted, this event will be ignored by mmhealth and will not cause any unbalanced file systems to show as being degraded.

5.1.2.3

System health

IJ37784

When a fileset is in chmodAndUpdateAcl permission change mode, creating a file with the open() system call under a parent directory with inherit entries and changing permissions of the newly created file with NFS results in duplicated and incorrect entries in the file's NFSv4 ACL. (show details)

Symptom	Unexpected Results/Behavior
Environment	All
Trigger	Have a fileset in chmodAndUpdateAcl permission change mode and a parent directory with inherit entries. Using NFS, create a file with the open() system call and change the permissions of the file with chmod.
Workaround	Use chmodAndSetAcl permission change mode for filesets and avoid having inherit entries in the parent directory.

5.1.2.3

NFS

IJ37493

Rename fails with error 766 on AFM+COS fileset if the file is moved from a local directory to a non-local directory. (show details)

Symptom	Unexpected results
Environment	Linux
Trigger	Renaming local file to non-local directory.
Workaround	None

5.1.2.3

AFM COS

IJ37870

With afmFastCreate enabled on IW fileset, AFM recovery fails. (show details)

Symptom	Unexpected Behavior
Environment	Linux (AFM gateway nodes)
Trigger	Running Recovery on AFM IW mode filesets with afmFastCreate enabled and changes being made at cache and home simultaneously.
Workaround	None

5.1.2.3

AFM

IJ37787

SGNotQuiesced assertion in dbshLockInode during file system quiesce. (show details)

Symptom	Abend/Crash
Environment	All
Trigger	Operations which do file system quiesce
Workaround	None

5.1.2.3

Snapshots

IJ37104

POSIX permission denied program error (show details)

Symptom	Permission denied on open for new file
Environment	Linux
Trigger	A file mode for creation which is not correctly translated by newer kernels.
Workaround	None

5.1.2.3

API

IJ37790

Trying to add char '=' and '-' in akey/skey is failing with invalid key. (show details)

Symptom	Failed with invalid key
Environment	Linux
Trigger	Setting up the skey/akey in the mmafmcoskeys command.
Workaround	None

5.1.2.3

AFM

IJ37838

mmap reads from lots of threads may cause a deadlock in DeclareResourceUsage. (show details)

Symptom	Hang/Deadlock/Unresponsiveness/Long Waiters
Environment	All
Trigger	mmap reads from lots of threads
Workaround	Disable mmap pagepool resource usage declaration with the "mmchconfig mmapDeclarePageUsage=false" command.

5.1.2.3

Core GPFS

IJ37854

When SGPanic occurs, the dealloc queue subblocks count could be wrong and cause "(deallocHighSeqNum - deallocFlushedSeqNum) >= deallocQueueSubblocks" assertion failure. (show details)

Symptom	Abend/Crash
Environment	All
Trigger	In rare case, the block deallocation around SGPanic time might cause this assertion.
Workaround	None

5.1.2.3

Core GPFS

IJ37882

Due to a change in procps output in Cygwin version 3.3, IBM Spectrum Scale fails to start. (show details)

Symptom	Unexpected Results/Behavior
Environment	Windows (x86_64)
Trigger	IBM Spectrum Scale startup
Workaround	Downgrade Cygwin.

5.1.2.3

Core GPFS

IJ35881

While trying to set extended attributes, SetXAttrHandlerThread could deadlock with itself trying to obtain a WW lock on the buffer while holding XW lock. (show details)

Symptom	Hang/Deadlock/Unresponsiveness/Long Waiters
Environment	All
Trigger	Changing extended attributes on a file or directory
Workaround	None

5.1.2.2

Core GPFS

IJ36110

AFM does not allow the character '=' as part of a secret key. (show details)

Symptom	Error message
Environment	Linux
Trigger	Using special characters as part of a secret key
Workaround	None

5.1.2.2

AFM

IJ36246

When running file audit logging, signal 11 is possible at FileMetadata::set_mtimeUpdate(unsigned int) (show details)

Symptom	Signal 11
Environment	Linux
Trigger	Daemon crash
Workaround	None

5.1.2.2

File audit logging

IJ36250

On HAWC enabled file systems, a deadlock could occur when a data block is being modified at the same time as log wrap is working on log records for the same block. (show details)

Symptom	Hang/Deadlock/Unresponsiveness/Long Waiters
Environment	All
Trigger	Multiple writes to the same block on a HAWC enabled file system.
Workaround	Disable HAWC feature on the file system.

5.1.2.2

HAWC

IJ36299

If the number of quorum nodes in the cluster is not greater than the minQuorumNodes configuration setting, the mmchconfig command fails without a clear message. (show details)

Symptom	Error message
Environment	All
Trigger	Problem arises when the minQuorumNodes configuration value is greater than or equal to the number of quorum nodes in the cluster.
Workaround	If setting the tiebreakerDisks parameter fails because the number of quorum nodes in the cluster is not greater than minQuorumNodes, use the mmchconfig command to set minQuorumNodes to the default value or a value lower than the number of quorum nodes in the cluster.

5.1.2.2

Admin commands

IJ36462

Failed to create the RG. (show details)

Symptom	Unexpected Results/Behavior
Environment	All
Trigger	Using mmvdisk to create RG, and there is an SSD SATA disk.
Workaround	1. Before the creation of the RG, run 'chmod a-x /usr/lpp/mmfs/bin/gems/tscompattr' 2. Once the RG is created, run 'chmod a+x /usr/lpp/mmfs/bin/gems/tscompattr'

5.1.2.2

ESS, GNR

IJ36511

Certain characters such as newline (\n) or backslash (\) were not escaped correctly, resulting in invalid JSON. JSON parsers are not able to read the event correctly. (show details)

Symptom	Unexpected Results/Behavior
Environment	Linux
Trigger	Filenames, acls, or xattrs with escape characters
Workaround	You can programmatically escape existing events to create valid JSON before the parser tries to ingest the event.

5.1.2.2

File audit logging, Watch folder
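
For consumers written in Python, one partial way to read such events is the lenient mode of the standard json module, which accepts raw control characters inside strings. The sketch below is an assumption-laden illustration: the function names and the "path" field are hypothetical, and unescaped backslashes are not repaired by this approach.

```python
import json


def parse_event_leniently(raw: str) -> dict:
    """Parse an event, tolerating raw control characters inside JSON strings."""
    try:
        return json.loads(raw)
    except json.JSONDecodeError:
        # strict=False allows characters such as a literal newline in a string.
        return json.loads(raw, strict=False)


def emit_event(event: dict) -> str:
    """Serialize an event with newline and backslash correctly escaped."""
    return json.dumps(event)


# Hypothetical event with a raw newline inside the file name.
broken = '{"path": "dir/a\nb"}'
print(parse_event_leniently(broken))  # {'path': 'dir/a\nb'}
```

Re-serializing the parsed event with `emit_event()` yields valid JSON that strict parsers accept.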

IJ36512

If a workload involves opening and creating lots of files concurrently under the same directory, some of the open operations may suffer high open times. (show details)

Symptom	Performance Impact/Degradation
Environment	Windows (x86_64)
Trigger	Workload that creates and opens many files concurrently in the same directory path.
Workaround	None

5.1.2.2

Core GPFS

IJ36513

Assert exp(ecDataBuffersPerTrack+ecParityBuffersPerTrack == ecParityBufferIndexByStrip[ecNPdisks]) (show details)

Symptom	Abend/Crash
Environment	Linux
Trigger	LG resign
Workaround	None

5.1.2.2

ESS, GNR

IJ36529

When files are downloaded without afmObjectACL enabled, they get the default permission 700, which does not match the default permission 770 of the fileset root. (show details)

Symptom	Default permission for files does not match with fileset root.
Environment	Linux
Trigger	Inconsistent default permission value across the fileset.
Workaround	None

5.1.2.2

AFM

IJ36531

The position of the preventSnapshotRestore value is incorrectly read while loading the mmbackupconfig file. The position is off by four values. The correct information is saved from mmbackupconfig. (show details)

Symptom	Unexpected Results/Behavior
Environment	Linux
Trigger	mmbackupconfig was run on 5.1.2 with watch folder or file audit logging enabled and mmrestoreconfig is being run that would restore the associated filesets of watch folder / file audit logging.
Workaround	Run mmrestoreconfig again once the system is updated to a release with the fix.

5.1.2.2

mmbackupconfig, file audit logging, watch folder

IJ36709

AFM directory prefetch fails to populate hardlinks. This causes hardlinks to be created as different files at the cache. (show details)

Symptom	Unexpected results
Environment	Linux
Trigger	AFM prefetch
Workaround	None

5.1.2.2

AFM

IJ36563

When AFM COS replication is happening on any one of the filesets in the file systems, if there is any other fileset that is attempting to link/unlink or create/delete a snapshot, then there can be a deadlock. (show details)

Symptom	Deadlock
Environment	Linux (AFM gateway node)
Trigger	Create/delete snapshot on a fileset or link/unlink a fileset on a file system where one or more AFM COS filesets are replicating to the remote COS site.
Workaround	None

5.1.2.2

AFM

IJ36818

There has been a vulnerability found in Apache Log4j2 library v2.16.0 used by Scale/ESS GUI. Apache Log4j2 versions 2.0-alpha1 through 2.16.0 (excluding 2.12.3) did not protect from uncontrolled recursion from self-referential lookups. (show details)

Symptom	Unexpected Results/Behavior
Environment	All
Trigger	Third Party Advisory released by Apache
Workaround	None

5.1.2.2

Core GPFS

IJ36855

The IBM Spectrum Scale HDFS Transparency connector version 3.1.0-9, 3.1.1.7 and 3.3.0-0 contain Apache Log4j libraries that are affected by the security vulnerabilities CVE-2019-17571 and CVE-2021-4104. (show details)

Symptom	NA
Environment	All
Trigger	The IBM Spectrum Scale HDFS Transparency connector is not vulnerable in default configurations.
Workaround	Manually patch affected log4j libraries.

5.1.2.2

HDFS Connector

IJ36349

GPFS daemon could assert while running mmadddisk. This can only happen if a new storage pool is being created as a result of running mmadddisk and a storage pool had been deleted in the past via mmdeldisk. (show details)

Symptom	Abend/Crash
Environment	All
Trigger	Creating a new storage pool with the mmadddisk command.
Workaround	Increase number of disks being added with the mmadddisk command or avoid creating a new storage pool.

5.1.2.2

Core GPFS

IJ36895

"More than 22 minutes searching for a free buffer in the pagepool" assertion failure. (show details)

Symptom	Abend/Long Waiters
Environment	All
Trigger	This problem is more likely to occur in a cluster which has file systems with both large block size and small block size (compared to scatter buffer size)
Workaround	Change 'scatterBufferSize' config to a smaller size.

5.1.2.2

Core GPFS

IJ35443

There are regular error messages '/sbin/ibportstate: Failed to open' in the systemhealth monitor log. (show details)

Symptom	Error output/message
Environment	All
Trigger	There are regular error messages '/sbin/ibportstate: Failed to open' in the systemhealth monitor log.
Workaround	None

5.1.2.1

System health

IJ35444

AFM Independent filesets with dependent filesets linked inside them have a chance of hitting a deadlock. (show details)

Symptom	Deadlock
Environment	Linux (AFM gateway nodes)
Trigger	Relationship initialization is triggered on an AFM independent fileset that has dependent filesets linked inside it while the remote site is unhealthy, which puts the relationship into a bad state.
Workaround	None

5.1.2.1

AFM

IJ35448

Ubuntu machines are reported with a network health issue of "ib_rdma_libs_wrong_path", even when the required libraries are installed. (show details)

Symptom	Error output/message
Environment	Ubuntu (using Infiniband/RDMA)
Trigger	Since Debian/Ubuntu introduced Multiarch Architecture Specifiers, most libraries live in a /usr/lib/XXXX-linux-gnu/ directory (where XXXX describes the architecture). The initial check procedure considered only the usual library paths, like /usr/lib64 and /usr/lib.
Workaround	None

5.1.2.1

System health
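
The path problem described in this entry can be illustrated with a short sketch that searches the multiarch directories in addition to /usr/lib64 and /usr/lib. It is not the system health monitor's actual check: the function names, the `root` parameter (used here only so the search can be exercised against a temporary directory), and the library name `libexample.so.1` are all hypothetical.

```python
import os
import tempfile


def candidate_lib_dirs(root="/"):
    """List library directories, including multiarch ones under usr/lib."""
    usr_lib = os.path.join(root, "usr", "lib")
    dirs = [os.path.join(root, "usr", "lib64"), usr_lib]
    if os.path.isdir(usr_lib):
        for name in sorted(os.listdir(usr_lib)):
            path = os.path.join(usr_lib, name)
            # Multiarch directories look like /usr/lib/<arch>-linux-gnu.
            if name.endswith("-linux-gnu") and os.path.isdir(path):
                dirs.append(path)
    return dirs


def find_library(filename, root="/"):
    """Return the first path where the library file exists, else None."""
    for directory in candidate_lib_dirs(root):
        path = os.path.join(directory, filename)
        if os.path.exists(path):
            return path
    return None


# Exercise the search against a throwaway directory tree.
with tempfile.TemporaryDirectory() as root:
    multiarch = os.path.join(root, "usr", "lib", "x86_64-linux-gnu")
    os.makedirs(multiarch)
    open(os.path.join(multiarch, "libexample.so.1"), "w").close()
    print(find_library("libexample.so.1", root) is not None)
```

A check that looks only at the first two entries returned by `candidate_lib_dirs` would miss the file in this layout, which matches the false report described above.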

IJ35449

When running tail -f on an audit log from a node that is not the writing node, tail -f will not show newly written events. (show details)

Symptom	Unexpected Results/Behavior
Environment	Linux
Trigger	Running tail -f on an audit log.
Workaround	None

5.1.2.1

File audit logging

IJ35466

When a subfolder or audit log is created under the File Audit Logging fileset, it inherits a default SELinux security context. This default value does not allow applications such as rsyslog to read the audit log contents. (show details)

Symptom	Unexpected Results/Behavior
Environment	Linux
Trigger	NA
Workaround	NA

5.1.2.1

File audit logging

IJ35486

logAssertFailed: exp(vrsP->index == index) (show details)

Symptom	Abend/Crash
Environment	Linux
Trigger	This assertion may occur when GPFS detects and breaks a hung RDMA request.
Workaround	None

5.1.2.1

RDMA

IJ35487

mmlsquota is reporting wrong results with:

1. extra output lines with "no limits" for users or groups that don't have usage on the fileset

2. extra output lines, all showing "no limits" when no limits (quotas) are set for a user or group in the fileset

(show details)

Symptom	Unexpected Results/Behavior
Environment	All
Trigger	Issuing mmlsquota when perfileset-quota is enabled.
Workaround	None

5.1.2.1

Quotas

IJ35318

IBM Spectrum Scale ships several ILM samples. One of them is the mmfind tool; to use it, mmfindUtil_processOutputFile.c must be compiled. However, the compilation of mmfindUtil_processOutputFile.c fails on some Linux distributions. (show details)

Symptom	Unexpected Results/Behavior
Environment	Linux
Trigger	Compiling mmfindUtil_processOutputFile.c
Workaround	Modify mmfindUtil_processOutputFile.c before compiling it.

5.1.2.1

Admin commands

IJ35537

A newly mounting node, whether due to a user mount or an expelled node rejoining the cluster, can fail the assert 'llfP->lockRangeNode != NodeAddr(-1U, 0, NodeAddr::naNormal)' if the mount happens in the middle of an mmrestripefs, mmadddisk, mmdeldisk, or mmfsck operation. (show details)

Symptom	Node expel/Lost Membership
Environment	All
Trigger	Mounted node failure in the middle of mmrestripefs.
Workaround	None

5.1.2.1

Core GPFS

IJ35567

When using RDMA via RoCE, there are certain network error scenarios in which not all possible RDMA connections from an NSD client to an NSD server are established. (show details)

Symptom	Network Performance
Environment	Linux
Trigger	The NSD server port has no IP address assigned; RDMA Connection Manager address or route resolution fails; RDMA Connection Manager connection request fails.
Workaround	None

5.1.2.1

RDMA

IJ35578

When GDS is disabled, the RDMA subsystem may post GDS-related error messages even though everything is working correctly. (show details)

Symptom	Documentation Problem
Environment	Linux
Trigger	GPU Direct Storage support is disabled; libmlx5.so is not installed on the system or libmlx5.so is downlevel.
Workaround	None

5.1.2.1

RDMA

IJ35598

GPFS API calls from 32-bit application fail on SLES 15 SP3. (show details)

Symptom	Error output/message
Environment	Linux (x86_64 and s390x)
Trigger	Running on SLES 15 SP3 and an application trying to issue 32-bit GPFS API calls.
Workaround	Apply the fix manually by editing the file /usr/lpp/mmfs/src/gpl-linux/ss.c to remove the checks for HAVE_COMPAT_IOCTL, then run mmbuildgpl again, and restart GPFS.

5.1.2.1

GPFS API

IJ35140

Daemon crash getting AFM statistics from the mmdiag command. (show details)

Symptom	mmfsd daemon crash
Environment	Linux
Trigger	AFM stats collection using the mmdiag command
Workaround	Reset AFM stat counter frequently using the mmdiag command.

5.1.2.1

AFM

IJ35622

If there are node failures during a burst of file create or delete activity, it is possible for the cached free inode counters on the file system manager to become out of date. (show details)

Symptom	Error output/message
Environment	All
Trigger	Node failures in the middle of large number of file creates or deletes
Workaround	Run 'mmfsadm test imapWork <fs> inodeManager' or 'mmchmgr <fs> <another node>'.

5.1.2.1

Core GPFS

IJ35686

Getattr fails to validate files against home if the afmObjectXattr flag is enabled. As a result, the metadata of files at the cache is not refreshed when files change at home, as part of lookup in LU mode. (show details)

Symptom	Unexpected Results/Behavior
Environment	Linux
Trigger	Metadata mismatches on an afmObjectXattr-enabled fileset
Workaround	None

5.1.2.1

AFM

IJ35789

When a single node is unavailable during 'mmauth genkey new', it results in GPFS (mmfsd) not starting on this node. (show details)

Symptom	GPFS does not start
Environment	All
Trigger	Issuing 'mmauth genkey new'
Workaround	To update the node that was unavailable during 'mmauth genkey new' with the latest key files, run the following command on a node that was available during 'mmauth genkey new': mmauth genkey propagate -N <NODE_UNAVAILABLE_DURING_MMAUTH_GENKEY_NEW>

5.1.2.1

GPFS startup, Admin commands, CCR

IJ35792

When using 'mmqos class delete', there is a check to prevent deleting of a class that is referenced or used by a throttle object. The current error does not make this clear. (show details)

Symptom	Error output/message
Environment	Linux
Trigger	Throttle objects that use the class you want to delete
Workaround	Remove any throttle objects that use the class you want to delete.

5.1.2.1

QoS

IJ35751

AFM gateway node crashes during fileset recovery because invalid file handles are used to get inodes in the kernel. (show details)

Symptom	Crash
Environment	Linux
Trigger	AFM fileset recovery
Workaround	None

5.1.2.1

AFM

IJ35795

Triggering a ChangeSecondary for a DR Primary mode fileset to the same target inband, or triggering a Resync on a SW fileset that is in unmounted state, with the resyncV2 feature enabled can cause the resync/changeSecondary to fail and not proceed. (show details)

Symptom	Unexpected Behavior
Environment	Linux (AFM gateway nodes)
Trigger	Triggering a ChangeSecondary for a DR Primary mode fileset to the same target inband, or triggering a Resync on a SW fileset that is in unmounted state, with the resyncV2 feature enabled.
Workaround	Disable ResyncV2 and run ResyncV1 to get changeSecondary or Resync commands to work on the DR/SW filesets.

5.1.2.1

AFM

IJ35791

When IO workload is running, such as NSD read on a general GPFS, ECE, or ESS, the TCP connection may be incorrectly reset. If all the connections to the peer node are reset, this will trigger the node to be expelled. (show details)

Symptom	Node expel/Lost Membership
Environment	All
Trigger	Large IO read in progress
Workaround	None

5.1.2.1

Core GPFS

IJ35796

Slow readdir and lookup performance on AFM caching mode filesets under heavy workload (show details)

Symptom	Slow IO
Environment	Linux
Trigger	AFM caching with heavy workload
Workaround	Restart AFM gateway node.

5.1.2.1

AFM

IJ35797

Sometimes stealing threads are not started in time to steal buffers for I/O threads, which may degrade performance. (show details)

Symptom
Environment	Linux
Trigger	The problem may be triggered with heavy I/O workload.
Workaround	Remove any throttle objects that use the class you want to delete.

5.1.2.1

ESS, GNR

IJ35809

When there is no mmqos configuration and the command 'mmqos report list -Y' is run, it shows mmlsqos instead of mmqos in the output. (show details)

Symptom	Error output/message
Environment	Linux
Trigger	No mmqos data configured
Workaround	NA

5.1.2.1

QoS

IJ35838

When the last block of a file is not a full GPFS block, replica compare function could report false replica mismatch. (show details)

Symptom	Error output/message
Environment	All
Trigger	Running replica compare with mmrestripefs or mmrestripefile.
Workaround	None

5.1.2.1

Core GPFS

IJ35851

Customer may run the cluster with unsupported quorum/tiebreaker disk configuration. (show details)

Symptom	Cluster runs with unsupported quorum/tiebreaker disk configuration
Environment	Linux
Trigger	Unsupported quorum/tiebreaker disk configuration
Workaround	None

5.1.2.1

Core GPFS

IJ35941

When IO workload is running, such as NSD read on a general GPFS, ECE, or ESS, the TCP connection may be incorrectly reset. If all the connections to the peer node are reset, this will trigger the node to be expelled. (show details)

Symptom	Node expel/Lost Membership
Environment	All
Trigger	Large IO read in progress
Workaround	None

5.1.2.1

Core GPFS

IJ36065

In a mixed cluster that contains 5.1.2.0 and pre-5.1.2.0 nodes, if a quota function is enabled on a file system with a format version that is lower than 4.1.1.0, the GPFS daemon on the quota client node may crash with signal 11. The following dump stack is shown in mmfs.log:

2021-11-04_14:43:19.968+0100: [E] Signal 11 at location 0x55D8E19220A5 in process 28081, link reg 0xFFFFFFFFFFFFFFFF.
2021-11-04_14:43:20.867+0100: [D] Traceback:
2021-11-04_14:43:20.868+0100: [D] #0: 0x000055D8E19220A5 QuotaClient::sendQuotaShareRequest(QuotaEntryClt*, QuotaShare*, unsigned int, unsigned int, unsigned int*, int, long long, long long) + 0x6D5 at ??:0

(show details)

Symptom	Abend/Crash
Environment	All
Trigger	Code bug in 5.1.2 GA build
Workaround	Disable the quota function; or upgrade the file system version to a value larger than or equal to 4.1.1.0; or upgrade the pre-5.1.2.0 node to 5.1.2.0.

5.1.2.1

Quotas
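
When searching mmfs.log for the signature quoted in this entry, it can help to split the text into individual records. The sketch below assumes that every record starts with a timestamp of the form shown above, followed by a bracketed one-letter level; the names `STAMP`, `Record`, and `parse_records` are hypothetical, and the sample text is copied from this entry. It works whether or not the records are separated by newlines.

```python
import re
from datetime import datetime
from typing import List, NamedTuple

# Timestamp shape taken from the log lines quoted in this entry.
STAMP = r"\d{4}-\d{2}-\d{2}_\d{2}:\d{2}:\d{2}\.\d{3}[+-]\d{4}"
RECORD = re.compile(rf"^({STAMP}): \[([A-Z])\] (.*)$", re.DOTALL)


class Record(NamedTuple):
    timestamp: datetime
    level: str
    message: str


def parse_records(text: str) -> List[Record]:
    records = []
    # Split before every "<timestamp>: [" so run-together entries come apart.
    for piece in re.split(rf"(?={STAMP}: \[)", text):
        match = RECORD.match(piece.strip())
        if match is None:
            continue  # skip empty or unrecognized fragments
        stamp, level, message = match.groups()
        when = datetime.strptime(stamp, "%Y-%m-%d_%H:%M:%S.%f%z")
        records.append(Record(when, level, message))
    return records


SAMPLE = (
    "2021-11-04_14:43:19.968+0100: [E] Signal 11 at location 0x55D8E19220A5 "
    "in process 28081, link reg 0xFFFFFFFFFFFFFFFF."
    "2021-11-04_14:43:20.867+0100: [D] Traceback: "
    "2021-11-04_14:43:20.868+0100: [D] #0: 0x000055D8E19220A5 "
    "QuotaClient::sendQuotaShareRequest(QuotaEntryClt*, QuotaShare*, "
    "unsigned int, unsigned int, unsigned int*, int, long long, long long) "
    "+ 0x6D5 at ??:0"
)

for record in parse_records(SAMPLE):
    print(record.level, record.timestamp.isoformat(), record.message[:40])
```

Filtering the result for records whose message contains `QuotaClient::sendQuotaShareRequest` is then a one-line list comprehension.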

IJ35924

If a proxy is configured for the CALLHOME component of IBM Spectrum Scale, the system health component CALLHOME (mmhealth node show) will become DEGRADED causing the PTF_Updates check to fail. (show details)

Symptom	Unexpected Results/Behavior
Environment	Linux
Trigger	A proxy is configured for CALLHOME
Workaround	Either do not use a proxy setup for callhome, or disable the PTF_update check as follows: 1. On the callhome master node, edit the /var/mmfs/mmsysmon/mmsysmonitor.conf file in the "[ptfupdates]" section to set the value "monitors_enabled = false". 2. Restart the system health monitoring with: mmsysmoncontrol restart

5.1.2.1

Call home
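
Step 1 of the workaround in this entry can be scripted. The sketch below is an illustration only: it assumes the configuration file is plain INI text that Python's `configparser` can read, the function name `disable_ptf_check` is hypothetical, and the demo runs against a temporary file with made-up contents rather than the real /var/mmfs/mmsysmon/mmsysmonitor.conf.

```python
import configparser
import os
import tempfile


def disable_ptf_check(path):
    """Set monitors_enabled = false in the [ptfupdates] section of an INI file."""
    parser = configparser.ConfigParser()
    with open(path) as handle:
        parser.read_file(handle)
    if not parser.has_section("ptfupdates"):
        parser.add_section("ptfupdates")
    parser.set("ptfupdates", "monitors_enabled", "false")
    with open(path, "w") as handle:
        parser.write(handle)


# Demo on a throwaway file; the initial contents are hypothetical.
with tempfile.TemporaryDirectory() as workdir:
    conf = os.path.join(workdir, "mmsysmonitor.conf")
    with open(conf, "w") as handle:
        handle.write("[ptfupdates]\nmonitors_enabled = true\n")
    disable_ptf_check(conf)
    check = configparser.ConfigParser()
    check.read(conf)
    print(check.get("ptfupdates", "monitors_enabled"))
```

Note that `configparser` rewrites the whole file and drops comments, so back up the original before trying anything like this on a real node. Step 2 of the workaround (mmsysmoncontrol restart) is still required afterwards.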

INFO001

The release of v5.1.2.0 aligned with the release of the v5.1.1.4 PTF. Please refer to v5.1.1.x for the list of APARs through v5.1.1.4.

5.1.2.0

INFO

IJ35469

Upgrading from a version prior to 5.1.1 and then backing out of the upgrade may cause CCR to stop working. (show details)

Symptom	Error output/message; Cluster/File System Outage; Upgrade/Install failure
Environment	All
Trigger	The problem occurs when a cluster is upgraded to GPFS version 5.1.1 or later and the upgrade is then backed out. This only happens if the cluster has an authorized cluster or a remote cluster defined.
Workaround	Restore node by using mmsdrrestore.

5.1.2.0

Core GPFS

IJ34398

ACL changed when running AFM failover in SW (show details)

Symptom	Fileset root ACL gets changed
Environment	Linux
Trigger	Some fileset is being recreated or the fileset root metadata is changed while running failover.
Workaround	None

5.1.2.0

AFM