IJ47794 |
High Importance
|
Assert at Secondary/Home due to a mismatch between the inode in ofP and
the dentry while performing an operation on the AFM control file.
Symptom |
Scale restart on Secondary/Home. |
Environment |
Linux Only |
Trigger |
AFM replication and SG_PANIC at the Gateway node due to unresponsive home. |
Workaround |
None |
|
5.1.8.2 |
AFM |
IJ47854 |
High Importance
|
Unable to join the AD domain during authentication setup.
Symptom |
Error output/message |
Environment |
Linux Only |
Trigger |
Fresh/new auth creation for AD. |
Workaround |
None |
|
5.1.8.2 |
Authentication |
IJ47857 |
Suggested |
mmsysmon: exception: 'NoneType' object has no attribute 'addEvent' occurs in /var/adm/ras/mmsysmonitor.log.
No functionality is affected by this.
Symptom |
Error output/message |
Environment |
Linux Only |
Trigger |
Race conditions when restarting the mmsysmon daemon |
Workaround |
None |
|
5.1.8.2 |
System Health |
IJ47917 |
High Importance
|
If an NVMe card is not probing, you see this message in dmesg: pems_mod:[I]:0125:0648:Failed to open /sys/devices/pci0000:00/0000:00:01.5/0000:05:00.0/nvme/ rc=0
The rc=0 is wrong; an error rc should be passed so that, when a new hotplug event arrives, the sysfs files are read again and the card becomes available to pems.
As a result, after the hotplug the card is available again in nvme --list but does not show up in tslsenclslot output.
Symptom |
The NVMe card will not be available via tslsenclslot, but it is available in nvme --list. |
Environment |
ESS3200/ESS3500 |
Trigger |
The nvme card was not probing at the time pems tried to read the sysfs files. |
Workaround |
Restart the pemsmod module so that it reads the sysfs files again when it loads (see the sketch after this entry). |
|
5.1.8.2 |
ESS Platform |
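A minimal sketch of the workaround above, assuming pemsmod is a regular kernel module that can be reloaded with modprobe (the exact module name and reload procedure are taken from the workaround text and may differ on your ESS level):

    # Reload the pems kernel module so it re-reads the sysfs files on load
    modprobe -r pemsmod
    modprobe pemsmod
    # The card should then appear in both 'nvme list' and the tslsenclslot output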
IJ47858 |
High Importance
|
When trying to access files that have been created inside a directory, access may be denied on certain nodes
because a negative dentry is cached in the system for the entry, even though the dentry should be positive.
Symptom |
Unexpected Results/Behavior / IO error |
Environment |
ALL Linux OS Environments |
Trigger |
Race condition with unlink and dentry revalidation |
Workaround |
When the issue is hit, for the client node to correctly recognize files
that exist but have a negative dentry associated with them, the dentry cache must be dropped,
or the file system unmounted and remounted (see the sketch after this entry). |
|
5.1.8.2 |
All Scale Users |
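A minimal sketch of the workaround on the affected client node; gpfs1 is a placeholder file system name, and drop_caches value 2 is the standard Linux setting for reclaiming dentries and inodes:

    # Option 1: drop the cached dentries/inodes
    sync
    echo 2 > /proc/sys/vm/drop_caches
    # Option 2: unmount and remount the file system
    mmumount gpfs1
    mmmount gpfs1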
IJ47918 |
High Importance
|
Kernel assert at function pemsBackEndWork while calling pemsListDequeueFirst
due to NULL pointer resulting in a kernel panic.
Symptom |
Crash |
Environment |
ESS3200/ESS3500 |
Trigger |
This is a timing issue; simply running workloads on ESS3200/ESS3500 can trigger it. |
Workaround |
None |
|
5.1.8.2 |
ESS Platform |
IJ47853 |
Suggested |
Among the cluster nodes, at least one node-to-node network connection is established internally.
One or more of these connections may fail for any reason, but they are reconnected automatically by the backend.
These reconnect events are detected by the systemhealth monitor, which also does some bookkeeping about the connection states.
Symptom |
Unexpected Results/Behavior |
Environment |
ALL Operating System environments in a larger cluster |
Trigger |
There seem to be no clear steps to reproduce the issue. It was observed during system setup, but also at runtime.
Basically it occurs in situations where node-to-node connections are dropped and recreated by either one of the affected nodes.
But this is not always the case. The created connection state events should not show an issue if the connections are fully restored. |
Workaround |
Restart the systemhealth monitor (mmsysmoncontrol restart). This clears the internal connection state bookkeeping. |
|
5.1.8.2 |
System Health |
IJ47855 |
Suggested |
A larger SNC cluster showed massive network traffic caused by the disk monitor module of the systemhealth monitor.
Symptom |
Unexpected Results/Behavior |
Environment |
ALL Operating System environments in a larger cluster |
Trigger |
The systemhealth monitor was using the mmlsdisk command with a large number of disk names as arguments, which caused the internal traffic. |
Workaround |
None |
|
5.1.8.2 |
System Health |
IJ47856 |
Medium Importance |
Unable to start Samba services after joining a node to an existing CES cluster.
Symptom |
Error output/message |
Environment |
Linux Only |
Trigger |
Fresh/new addition of a node to an existing CES cluster with Samba enabled. |
Workaround |
Start Samba services manually after adding the node to the cluster. |
|
5.1.8.2 |
Samba |
IJ47859 |
Medium Importance |
The current state of the PSUs is not shown by RAS events for the Utility Node.
Symptom |
No information about the Utility Node. |
Environment |
ESS Utility Node Only |
Trigger |
Occurs always on Utility Node. |
Workaround |
None |
|
5.1.8.2 |
RAS Events / Health monitoring |
IJ47860 |
High Importance
|
With ESS6000 and Utility Node it is possible that the same event is raised from two different sources.
Symptom |
RAS Events are changing their status frequently. |
Environment |
ESS 6000 |
Trigger |
It may occur on ESS6000. |
Workaround |
None |
|
5.1.8.2 |
RAS Events / Health monitoring |
IJ47861 |
High Importance
|
The pair canister status is shown twice.
Symptom |
RAS Events for pair canisters are shown twice. Once as top / bottom, once as left/right. |
Environment |
ESS 6000 |
Trigger |
It may occur on Utility Node. |
Workaround |
None |
|
5.1.8.2 |
RAS Events / Health Monitoring
|
IJ47862 |
High Importance
|
Hardware monitoring is not working.
Symptom |
No RAS Events from Hardware monitoring. |
Environment |
All ESS since ESS3500 |
Trigger |
If hardware monitoring is started and health monitoring is then started within 60 seconds, health monitoring will never receive any data. |
Workaround |
mmsysmoncontrol restart |
|
5.1.8.2 |
RAS Events / Health Monitoring |
IJ47920 |
Critical |
Under some workloads, it is possible that the writebehind thread does not get woken up to handle buffers placed on the writebehind list.
This can happen when all workloads are detected as random.
Symptom |
Hang/Deadlock/Unresponsiveness/Long Waiters |
Environment |
ALL Operating System environments |
Trigger |
Run a workload that performs exclusively random IO with some small random writes. |
Workaround |
Run a mix of workloads that perform both random and sequential IO to files. |
|
5.1.8.2 |
All Scale Users |
IJ47557 |
Suggested |
mmapplypolicy with LIST rules erroneously shows the LIST name as both the RULE name and the LIST name.
Symptom |
Unexpected Results/Behavior |
Environment |
ALL |
Trigger |
Run mmapplypolicy with LIST rules (see the sketch after this entry). |
Workaround |
None |
|
5.1.8.2 |
Policy |
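A minimal sketch of a policy that exercises the defect, assuming a file system named gpfs1; the RULE name ('myRule') and LIST name ('allfiles') are distinct placeholders, but with this defect the reports showed the LIST name for both:

    cat > /tmp/list.pol <<'EOF'
    RULE EXTERNAL LIST 'allfiles' EXEC ''
    RULE 'myRule' LIST 'allfiles'
    EOF
    mmapplypolicy gpfs1 -P /tmp/list.pol -I defer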
IJ48017 |
Medium Importance |
CES IPs move around because of an "RTNETLINK answers: File exists" rc=2 error
Symptom |
CES going into DEGRADED state frequently with IP moving around. |
Environment |
Linux Only |
Trigger |
CES IP failover |
Workaround |
None |
|
5.1.8.2 |
CES |
IJ48244 |
High Importance
|
When running mmfsckx on a file system at an older version that does not support 64-bit inode numbers,
the file system manager node can hit the assert "Assert exp(recType == FilePtrChange1LogRec || recType == FilePtrChange2LogRec
|| recType == FilePtrChange1FsckxLogRec)"
Symptom |
Node crash |
Environment |
All |
Trigger |
Running mmfsckx on a file system below version 3.4.0.0 |
Workaround |
Upgrade the file system to version 3.4.0.0 or above before running mmfsckx (see the sketch after this entry), or else use offline mmfsck |
|
5.1.8.2 |
MMFSCKX |
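A sketch of the workaround, with gpfs1 as a placeholder file system name:

    mmlsfs gpfs1 -V       # show the current file system format version
    mmchfs gpfs1 -V full  # upgrade the format to the latest supported version
                          # (note: the format upgrade is permanent)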
IJ48279 |
HIPER |
AFM fails to replicate files with the afmFastCreate option if a newly created file is renamed to a different directory and its original parent is deleted
Symptom |
Unexpected results, file tree mismatch |
Environment |
All Linux OS environments |
Trigger |
Using afmFastCreate option to replicate data |
Workaround |
Disable afmFastCreate (see the sketch after this entry) |
|
5.1.8.2 |
AFM |
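A sketch of disabling afmFastCreate on an AFM fileset; the file system and fileset names are placeholders, and the exact parameter value syntax is an assumption — check the mmchfileset documentation for your level:

    mmunlinkfileset gpfs1 fset1                    # AFM parameters are changed on an unlinked fileset
    mmchfileset gpfs1 fset1 -p afmFastCreate=no    # value syntax is an assumption
    mmlinkfileset gpfs1 fset1 -J /gpfs/gpfs1/fset1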
IJ48280 |
Suggested |
In a Rename operation, the old file can be deleted while the renamed file already exists at the target.
In this case, the Rename and setXattr operations get requeued because the target file has not been deleted yet.
Symptom |
Replication stuck |
Environment |
All |
Trigger |
Rename operation is stuck |
Workaround |
None |
|
5.1.8.2 |
AFMCOS |
IJ48285 |
HIPER |
An assert goes off when a temporary file is linked (created with O_TMPFILE and linkat)
and the inode data has to be evicted to accommodate AFM xattrs.
Symptom |
Crash |
Environment |
All Linux OS environments |
Trigger |
Temporary file is linked (created with O_TMPFILE and linkat) with data in inode on the AFM fileset |
Workaround |
None |
|
5.1.8.2 |
AFM |
IJ48286 |
Suggested |
The Recovery operation doesn't queue setXAttr to push ACL changes, which causes ACL mismatches.
As part of Recovery, a setAttr operation is queued where a setXAttr should have been queued.
Symptom |
ACL mismatches |
Environment |
All |
Trigger |
ACL mismatches |
Workaround |
None |
|
5.1.8.2 |
AFMCOS |
IJ47116 |
Suggested |
The online replica compare function could incorrectly flag a replica mismatch on certain metadata files, such as symbolic links, in an AFM-enabled file system.
Symptom |
Error output/message |
Environment |
ALL Operating System environments |
Trigger |
Run online replica compare function. |
Workaround |
Ignore replica mismatches on special metadata files such as links. |
|
5.1.8.1 |
AFM |
IJ47118 |
High Importance
|
GPFS daemon assert: exp(getDeEntType() == detUnlucky) in Direct.h. This can occur when there is concurrent access to the same directory, with one node performing a delete on a file while another node tries to create the same file.
Symptom |
Abend/Crash |
Environment |
ALL Operating System environments |
Trigger |
Concurrent delete/create of same file in a directory from multiple nodes. |
Workaround |
Avoid deleting and creating the same file in a directory from multiple nodes at the same time. |
|
5.1.8.1 |
All Scale Users |
IJ47490 |
Suggested |
An Rmdir is queued for a file instead of a directory; it tries to delete the directory at the remote, which is actually a file, and fails with err 20.
Symptom |
Replication stuck |
Environment |
Supported platform for AFM COS |
Trigger |
Replication is stuck and fails to progress |
Workaround |
Drop the requeued message from the queue |
|
5.1.8.1 |
AFMCOS |
IJ47229 |
High Importance
|
Daemon asserts (logAssertFailed: !ofP->destroyOnLastClose) during the AFM prefetch. This is due to incorrectly setting the destroyOnLastClose flag when the file is removed.
Symptom |
Crash |
Environment |
All Linux OS environments |
Trigger |
AFM prefetch |
Workaround |
None |
|
5.1.8.1 |
AFM |
IJ47419 |
High Importance
|
An FSSTRUCT error for a bad disk address can be incorrectly logged for a newly added disk. This can happen if a client node was in the process of unmounting the file system while the new disk was being added.
Symptom |
Error output/message |
Environment |
ALL Operating System environments |
Trigger |
Unmounting the file system on a client node while adding a new disk to the file system. |
Workaround |
Don't unmount the file system on any client node while a new disk is being added to the file system. |
|
5.1.8.1 |
All Scale Users |
IJ47420 |
Critical |
Node fails with kernel panic: logAssertFailed: OWNED_BY_CALLER(ownerField, ownerField). This can happen while the GPFS daemon is in the process of shutting down.
Symptom |
Abend/Crash |
Environment |
ALL Operating System environments |
Trigger |
GPFS daemon shutdown while user application is still accessing the file system. |
Workaround |
Unmount all file systems before shutting down the GPFS daemon |
|
5.1.8.1 |
All Scale Users |
IJ47301 |
High Importance
|
A race condition in the distributed GNR disk hospital can cause a state update from the master node to a worker node to be rejected.
When the master node wishes to release a disk from the "diagnosing" state to the "ok" state, it sends a state broadcast to all worker nodes to instruct them to reflect the pdisk's new master state locally.
However, this broadcast can race with additional disk problem reports that are transmitted from the worker to the master.
The result is that the worker node can reject the master's claim that the disk is healthy and continue holding the disk in diagnosing.
This can lead to blocked file system I/O unless another state change notification is broadcast from the master, in which case the worker gets another chance to resume I/O to the disk.
Symptom |
Stuck IO |
Environment |
Linux Only |
Trigger |
This problem can potentially occur when any local I/O error is encountered on a pdisk, but in general the race condition in that path is rare.
It is more likely to occur on Spectrum Scale Erasure Code Edition during periods of network instability, when pdisks are likely to encounter many timeout errors. |
Workaround |
Restarting the daemon on the nodes with the waiter "Until disk availability stabilizes" can clear out the waiters (see the sketch after this entry). |
|
5.1.8.1 |
ESS/GNR |
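A sketch of the workaround; nodeA is a placeholder for an affected node:

    # On the affected node, confirm the stuck disk-hospital waiter
    mmdiag --waiters | grep 'Until disk availability stabilizes'
    # Then restart the daemon on that node
    mmshutdown -N nodeA
    mmstartup -N nodeA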
IJ46923 |
High Importance
|
Spectrum Scale 5.1.4 changed the way NFS handles are generated. This resulted in larger NFS file handles in some cases. NFS exports through Ganesha and the Linux kernel NFS server using NFSv4 are not affected. It turns out that the generated NFS file handles are too large for the Linux kernel NFS server exporting through NFSv3.
Symptom |
IO error |
Environment |
ALL Linux OS environments |
Trigger |
Exporting a Spectrum Scale file system through the Linux kernel NFS server using NFSv3 with Spectrum Scale 5.1.4 or newer will hit this problem. |
Workaround |
Change the NFS export to use NFSv4; in that case larger NFS file handles can be used (see the sketch after this entry). |
|
5.1.8.1 |
NFS |
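A sketch of the client-side change; the server name and paths are placeholders. Forcing version 4 on the mount avoids the NFSv3 file handle size limit of the kernel NFS server:

    # Affected: NFSv3 mount of the kernel-NFS-exported Scale file system
    #   mount -t nfs -o vers=3 server:/gpfs/fs1 /mnt/fs1
    # Workaround: mount with NFSv4, which allows larger file handles
    mount -t nfs -o vers=4 server:/gpfs/fs1 /mnt/fs1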
IJ47119 |
Suggested |
If the name of a file system contains one or more underscore characters ('_') and Clustered Watch Folder is enabled on that file system, then events that are supposed to be delivered to the sink are never sent.
Symptom |
Unexpected Results/Behavior |
Environment |
ALL Operating System environments |
Trigger |
A file system name that contains one or more underscore characters. |
Workaround |
Remove underscore characters from file system names where Clustered Watch Folder is to be used. |
|
5.1.8.1 |
Watch Folder |
IJ47299 |
High Importance
|
Scale daemon assert going off: Assert exp(regP->isOwnerLocal() == 0) in file allocR.C, resulting in the Scale mmfsd daemon process going down.
Symptom |
Abend/Crash |
Environment |
All Operating Systems |
Trigger |
Heavy block space allocation and deallocation in the cluster. |
Workaround |
None |
|
5.1.8.1 |
All Scale Users |
IJ45892 |
High Importance
|
Two client nodes are working on the same two regions for block deallocations; each client node owns one of the two regions and is flushing the region it owns. Meanwhile, the DeallocHelperThread on each client node is also requesting ownership of the region owned by the other client node. The revoke-ownership requests then block on each other, because the two regions are in the flushing state while each has an ownership request pending from the other node, thus forming a deadlock.
Symptom |
Deadlock |
Environment |
All Operating Systems |
Trigger |
User file data block deallocations from at least two different client nodes. |
Workaround |
Restart GPFS on the client node showing a long waiter on the allocMsgTypeRequestOwnership RPC message from DeallocHelperThread. |
|
5.1.8.1 |
All Scale Users |
IJ47044 |
High Importance
|
If a message needs to be queued and handled later, the receiver thread will just put it into a queue. However, there is a chance that the handler is faster than the receiver, which results in a Signal 11 if the receiver references the message data buffer after it has been freed by the handler.
Symptom |
Abend/Crash |
Environment |
ALL Operating System environments |
Trigger |
Busy workload |
Workaround |
None |
|
5.1.8.1 |
All Scale Users |
IJ47296 |
High Importance
|
A change to nsdRAIDDefaultIoTimeout is reset to the default after a GPFS restart.
Symptom |
Unexpected Results/Behavior |
Environment |
ALL Linux OS environments |
Trigger |
Restart gpfs daemon |
Workaround |
Use mmchconfig nsdRAIDDefaultIoTimeout=xxx -i after gpfs is restarted. |
|
5.1.8.1 |
ESS/GNR |
IJ47297 |
High Importance
|
AFM fails to transfer EAs and ACLs if home is running RHEL 9.2 with kernel NFS server. This is due to an error copying the user space attributes to kernel space.
Symptom |
Unexpected results |
Environment |
Linux Only |
Trigger |
AFM replication with RHEL 9.2 kernel NFS server |
Workaround |
None |
|
5.1.8.1 |
AFM |
IJ47298 |
High Importance
|
AFM object replication does not progress due to open files. If a file is opened by the application, AFM does not replicate it to object storage until it is closed.
Symptom |
Unexpected results |
Environment |
Linux Only |
Trigger |
AFM+Object replication. |
Workaround |
None |
|
5.1.8.1 |
AFM |
IJ47423 |
High Importance
|
FCM3: Unable to create a RecoveryGroup in a hybrid system
Symptom |
Unexpected Results/Behavior |
Environment |
x86_64-linux only |
Trigger |
It occurs only when the FCM device size is larger than the HDD device size.
|
Workaround |
None |
|
5.1.8.1 |
ESS/GNR |
IJ47277 |
Suggested |
Rename is getting requeued because it tries to replace an object in COS that has already been replaced. This happens when a resync operation is queued first and the rename operation for the same inode is later executed from the normal queue.
Symptom |
Replication stuck |
Environment |
Supported platform for AFM COS |
Trigger |
Replication is stuck and fails to progress |
Workaround |
Drop the requeued message from the queue |
|
5.1.8.1 |
AFMCOS |
IJ47422 |
High Importance
|
Log assert thisTrailerP->isInUse() going off in kMalloc-kx.C, resulting in a kernel panic. This is caused by a call to getxattr to retrieve the system.nfs4_acl extended attribute with too small a buffer size. The getxattr call can be invoked by running ls or ls -l.
Symptom |
Abend/Crash |
Environment |
Linux Only |
Trigger |
- Have a file with an NFSv4 ACL and invoke getxattr() with a small buffer size, e.g. buffer size = 1. Or - Perform an ls -l on a directory that contains files with NFSv4 ACLs |
Workaround |
Avoid calling getxattr to retrieve the system.nfs4_acl extended attribute. By extension, avoid invoking ls or ls -l on directories that contain files with NFSv4 ACLs. |
|
5.1.8.1 |
All Scale Users
|
IJ47424 |
High Importance
|
AFM doesn't check the state of a message when dropping it using the "mmfsadm afm msgdrop" option. It is better to leave inflight messages alone and drop only messages in any other state. Dropping inflight messages has long-term implications for the queue: it either hits a safety assertion or a Signal 11/6 somewhere and loses the queue.
Symptom |
Crash |
Environment |
All Linux OS Environments (Acting as AFM Gateway nodes) |
Trigger |
Dropping a message in the AFM queue that is inflight. |
Workaround |
The user has to carefully put the queue into a suspended state and then drop messages. |
|
5.1.8.1 |
AFM |
IJ47421 |
Suggested |
CES warning messages generated by mmcesnetworkmonitor are flooding the log on a multi-network cluster
Symptom |
Log flooding with messages like the following -
2023-03-21_03:56:33.785-0400: [W] mmcesnetworkmonitor: handleNetworkProblem with lock held: disableIP 10.20.12.25 1 interface not active
2023-03-21_03:56:33.792-0400: [W] mmcesnetworkmonitor: Taking down 10.20.12.25 because it is assigned to another node or has no interface
2023-03-21_03:56:33.803-0400: [I] mmcesnetworkmonitor: disableCesIP: 10.20.12.25 has no interface |
Environment |
Linux Only |
Trigger |
Multi-network CES IPs where some CES IPs cannot be hosted on some nodes. |
Workaround |
Ignore the log warnings. No functional impact. |
|
5.1.8.1 |
No functional impact. Only log flooding. |
IJ47120 |
High Importance
|
The newer lscpu command lists the CPU family after the model name. This causes the code that detects and automatically applies a workaround for the GSKIT hang issue to not work as expected. Commands like mmcrcluster or mmaddnode may hang in the GSKIT layer on AMD EPYC family 23 and 25 processors.
Symptom |
Installation and admin commands hang. |
Environment |
Linux OS environments |
Trigger |
This problem affects AMD EPYC family 23 and 25 processors running with a newer version of the lscpu command. |
Workaround |
Add "ICC_SHIFT=3" line in /usr/lpp/mmfs/lib/gsk8/C/icc/icclib/ICCSIG.txt file on problem nodes. |
|
5.1.8.1 |
Admin Commands, gskit |
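A minimal sketch of the workaround above, run on each problem node:

    # Append the workaround line if it is not already present
    grep -q '^ICC_SHIFT=3' /usr/lpp/mmfs/lib/gsk8/C/icc/icclib/ICCSIG.txt || \
        echo 'ICC_SHIFT=3' >> /usr/lpp/mmfs/lib/gsk8/C/icc/icclib/ICCSIG.txt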
IJ47544 |
High Importance
|
GCP now supports the listv2 operation; use listv2 during readdir even though the --gcs option is used. This improves performance of the readdir operation.
Symptom |
Unexpected results. |
Environment |
All Linux OS environments |
Trigger |
Readdir on AFM+S3+GCP filesets |
Workaround |
None |
|
5.1.8.1 |
AFM |
IJ47546 |
High Importance
|
Readdir and lookups are sent to the COS during eviction on the AFM+S3 fileset. This impacts eviction performance and also pulls entries from the COS unnecessarily.
Symptom |
Unexpected results. |
Environment |
All Linux OS environments |
Trigger |
Eviction on AFM+S3 filesets |
Workaround |
None |
|
5.1.8.1 |
AFM |
IJ47545 |
High Importance
|
If a file is truncated to zero and rewritten continuously within a very short interval, AFM will not replicate data to home until the file is closed.
Symptom |
Unexpected results. |
Environment |
All Linux OS environments |
Trigger |
AFM caching with continuous rewriting to the files. |
Workaround |
None |
|
5.1.8.1 |
AFM |
IJ46715 |
High Importance
|
Kernel crash with assert: nPrefetchedBuffers > 0. This can happen when an application uses multiple threads to perform sequential reads or writes of more than 65535 blocks on the same open file. The starting offset of the reads/writes must not be on a GPFS block boundary.
Symptom |
Abend/Crash |
Environment |
ALL Operating System environments |
Trigger |
Performing sequential reads/writes on the same file using multiple threads, where the starting offset of each read/write is not on a GPFS block boundary. |
Workaround |
Close and reopen the file before performing more than 65535 sequential reads/writes on the same file using multiple threads. |
|
5.1.8.1 |
All Scale Users |
IJ47464 |
High Importance
|
On an error path a resource leak can happen, leading to the reported waiter
Symptom |
Deadlock |
Environment |
All platforms |
Trigger |
A race condition can reach the error path |
Workaround |
Kill the GPFS daemon on the node with the waiter |
|
5.1.8.1 |
Core |
IJ47466 |
High Importance
|
ESS 3K customers using ESS 6.1.5 and above are seeing pems going offline, causing pems to hang. If the customer runs scsi-rescan, there is a potential exposure to a crash at pemsSlaveDestroy+0x46.
Symptom |
If pems goes offline, some commands such as tsplatformstat, sginfo, and others will fail. dmesg will show:
scsi 11:0:0:0: Device offlined - not ready after error recovery
scsi 11:0:0:0: rejecting I/O to offline device
INFO: task pemsCliCmdQueue:11928 blocked for more than 120 seconds.
and if a scsi-rescan is done, the crash stack trace will contain pemsSlaveDestroy+0x46. |
Environment |
ESS 3K |
Trigger |
It is not clear why this issue is triggered by ESS 6.1.5 and above; it could not be recreated in the lab. |
Workaround |
If pems goes offline, avoid scsi-rescan to avoid the crash. To recover from the offline state, restart the pemsmod module until the fix is applied |
|
5.1.8.1 |
ESS 3K |
IJ47438 |
Suggested |
Using the mmcachectl command to display information about files and directories in the local page pool cache does not work if the argument of the --inode-num option is greater than the maximum integer value.
Symptom |
Unexpected Output |
Environment |
ALL Operating System environments |
Trigger |
Using the mmcachectl command with an --inode-num argument larger than the maximum integer value. |
Workaround |
Run mmcachectl show --show-filename, then use grep to display information for a particular inode number. Ex: mmcachectl show --show-filename | grep -w 4529741824 |
|
5.1.8.1 |
Admin Commands |
IJ47425 |
Medium Importance |
This APAR addresses two issues related to NFS-Ganesha that can cause crashes. Here are the details:
Issue 1:
NFS-Ganesha may crash with the following stack trace:
(gdb) bt
#0 0x00003fffa73e52e8 in raise () from /lib64/libpthread.so.0
#1 0x00003fffa7954628 in crash_handler (signo=6, info=0x3ffefac4a468, ctx=0x3ffefac496f0)
at /usr/src/debug/nfs-ganesha-3.5-ibm071.21.308708/MainNFSD/nfs_init.c:247
#2 <signal handler called>
#3 0x00003fffa717fcb0 in raise () from /lib64/libc.so.6
#4 0x00003fffa718200c in abort () from /lib64/libc.so.6
#5 0x00003fffa79b9fd4 in free_client_record (record=0x3fff200ed130) at /usr/src/debug/nfs-ganesha-3.5-ibm071.21.308708/SAL/nfs4_clientid.c:1381
#6 0x00003fffa79ba3d8 in dec_client_record_ref (record=0x3fff200ed130) at /usr/src/debug/nfs-ganesha-3.5-ibm071.21.308708/SAL/nfs4_clientid.c:1461
#7 0x00003fffa79b825c in nfs_client_id_expire (clientid=0x3fff200edbd0, make_stale=false)
at /usr/src/debug/nfs-ganesha-3.5-ibm071.21.308708/SAL/nfs4_clientid.c:914
#8 0x00003fffa79c7820 in reserve_lease_or_expire (clientid=0x3fff200edbd0, update=true)
at /usr/src/debug/nfs-ganesha-3.5-ibm071.21.308708/SAL/nfs4_lease.c:181
#9 0x00003fffa7a59db4 in nfs4_op_renew (op=0x3fff029152d0, data=0x3fff0320d9c0, resp=0x3ffee960cab0)
at /usr/src/debug/nfs-ganesha-3.5-ibm071.21.308708/Protocols/NFS/nfs4_op_renew.c:91
#10 0x00003fffa7a2ed80 in process_one_op (data=0x3fff0320d9c0, status=0x3ffefac4cfd0)
at /usr/src/debug/nfs-ganesha-3.5-ibm071.21.308708/Protocols/NFS/nfs4_Compound.c:920
#11 0x00003fffa7a30010 in nfs4_Compound (arg=0x3ffeeabd84a0, req=0x3ffeeabd7c90, res=0x3ffee9854f60)
at /usr/src/debug/nfs-ganesha-3.5-ibm071.21.308708/Protocols/NFS/nfs4_Compound.c:1327
#12 0x00003fffa794dae4 in nfs_rpc_process_request (reqdata=0x3ffeeabd7c90)
Issue 2:
NFS-Ganesha may crash with the following stack trace:
#0 0x00007f27f0a984fb in raise () from /lib64/libpthread.so.0
#1 0x00007f27f2775d7b in crash_handler (signo=11, info=0x7f20e337e930, ctx=0x7f20e337e800) at /usr/src/debug/nfs-ganesha-3.5-ibm071.21/MainNFSD/nfs_init.c:247
#2 <signal handler called>
#3 0x00007f27f28a3cf5 in nlm_granted_callback (obj=0x7f2430001378, lock_entry=0x7f2204302c20) at /usr/src/debug/nfs-ganesha-3.5-ibm071.21/Protocols/NLM/nlm_util.c:609
#4 0x00007f27f27b133b in try_to_grant_lock (lock_entry=0x7f2204302c20) at /usr/src/debug/nfs-ganesha-3.5-ibm071.21/SAL/state_lock.c:1732
#5 0x00007f27f27b177b in process_blocked_lock_upcall (block_data=0x7f2204305510) at /usr/src/debug/nfs-ganesha-3.5-ibm071.21/SAL/state_lock.c:1780
#6 0x00007f27f27ac19c in state_blocked_lock_caller (ctx=0x7f21c8408650) at /usr/src/debug/nfs-ganesha-3.5-ibm071.21/SAL/state_async.c:81
#7 0x00007f27f27f62bd in fridgethr_start_routine (arg=0x7f21c8408650) at /usr/src/debug/nfs-ganesha-3.5-ibm071.21/support/fridgethr.c:556
#8 0x00007f27f0a90ea5 in start_thread () from /lib64/libpthread.so.0
#9 0x00007f27f018fb0d in clone () from /lib64/libc.so.6
Symptom |
Crash |
Environment |
Linux Only |
Trigger |
- For Issue 1, the crash is related to the NFSv4 lease period and can occur due to timing issues, such as delays in lease renewal or a heavily loaded server with multiple client requests.
- For Issue 2, the crash is related to blocking lock requests and lock upgrades on the same file by multiple threads, which can lead to timing issues. |
Workaround |
None |
|
5.1.8.1 |
NFS-Ganesha crash followed by CES-IP failover. |
IJ47445 |
Suggested |
The systemhealth monitor reports ib_rdma_port_speed_low even when the physical configuration was correctly set up for high-speed data transfer.
Symptom |
Error output/message |
Environment |
ALL Operating System environments with RDMA |
Trigger |
The new InfiniBand controller Mellanox Technologies MT2910 Family [ConnectX-7] driver produces slightly different status output about its data transfer speed configuration when running the ibportstate command. This modified text was not correctly interpreted by the systemhealth monitor, so it assumed a low-speed configuration and gave a warning. |
Workaround |
None |
|
5.1.8.1 |
System Health |
IJ47577 |
High Importance
|
mmsdrfs contains '03_COMMENT' entries for executed config-changing mm-commands. If such commands include characters that cannot be decoded in the customer's default locale (expected to be UTF8, but not guaranteed), this breaks the mmsdrfs parser in the mmsysmon.py daemon.
Symptom |
- Error output/message
- Component Level Outage |
Environment |
ALL Linux OS environments + AIX |
Trigger |
Choosing an exotic system locale that is incompatible with UTF8 and then running mmchconfig with at least one symbol that does not map to UTF8. |
Workaround |
None |
|
5.1.8.1 |
- System Health - Callhome - perfmon (Zimon) |
IJ47491 |
Suggested |
In MU mode, download doesn't work if the file is cached: the file is not validated against COS for recent changes, and the download fails.
Symptom |
Download is not working |
Environment |
Supported platform for AFM COS |
Trigger |
Download doesn't work on a cached file |
Workaround |
Don't download the file's content from COS |
|
5.1.8.1 |
AFMCOS |
IJ47492 |
Critical |
If the locale LC_ALL is not set, an exception is triggered that prevents the start of the daemon, as the Python process tries to use the setting.
Symptom |
-Error output/message -Abend/Crash |
Environment |
AIX/Power only |
Trigger |
LC_ALL or lang setting not set on the server (or potentially set to a missing language package) |
Workaround |
Set the LC_ALL locale setting and also lang (see the sketch after this entry), or delete the line containing 'locale.setlocale(locale.LC_ALL, "")' from mmsysmon.py |
|
5.1.8.1 |
System Health |
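A sketch of the workaround; en_US.UTF-8 is an example value — any locale installed on the server works:

    # Verify the locale is installed, then set it for the daemon environment
    locale -a | grep -i 'en_US'
    export LC_ALL=en_US.UTF-8
    export LANG=en_US.UTF-8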
IJ47578 |
Suggested |
There was a longwaiters_found systemhealth event for one cycle, but no deadlock detection was logged in mmfs.log
Symptom |
Unexpected Results/Behavior |
Environment |
ALL Operating System environments |
Trigger |
The monitoring executes mmdiag --waiters, and if some entries match "^waiting.*thread.*", the event is raised (see the sketch after this entry). This state might be volatile and go away in the next monitoring cycle. |
Workaround |
None |
|
5.1.8.1 |
System Health |
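A sketch of the check the monitor performs, per the trigger description above; any matching entries raise the longwaiters_found event:

    mmdiag --waiters | grep -E '^waiting.*thread.*'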
IJ47579 |
Suggested |
"mmcallhome run SendFile" does not allow to upload files with a size over 50GB, outputting the following message: Not enough free space to begin a transfer Failed while executing a query to sysmonitor
(show details)
Symptom |
Error output/message |
Environment |
Linux Only |
Trigger |
Whenever the customer attempts to upload a file with a size over 50GB via the mmcallhome command (see the sketch after this entry) |
Workaround |
None |
|
5.1.8.1 |
Callhome |
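A sketch of the affected command; the file path is a placeholder:

    # Fails with the above message for files over 50GB before this fix
    mmcallhome run SendFile --file /tmp/debug-data.tar.gz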
IJ47580 |
High Importance
|
For some ESS5000 systems with bootdrive errors, the iprconfig output changed, so the errors are not detected and service tickets for drive replacement are not created as expected.
Symptom |
Unexpected Results/Behavior |
Environment |
Linux Only |
Trigger |
"iprconfig -c show-config" listing the bootdrives not as "RAID 10 Array Member" but as "RAID 10 4K Array Member" |
Workaround |
None |
|
5.1.8.1 |
- System Health - Callhome |
IJ47547 |
Suggested |
On non-ESS nodes, mmsysmonitor.log has many error messages "HAL request failed"
Symptom |
Misleading information in mmsysmonitor.log |
Environment |
Linux Only |
Trigger |
Persistent problem. |
Workaround |
None |
|
5.1.8.1 |
Health monitoring. |
IJ47548 |
High Importance
|
If the GPFS daemon mmfsd is down, i.e. GPFS is down, "mmhealth cluster show" does not work. In this case, the customer has no support for debugging problems.
Symptom |
- Abend/Crash - mmfsd is down because of maintenance or problem determination. |
Environment |
Linux Only |
Trigger |
A remount failure or unmounts on a busy system. |
Workaround |
None |
|
5.1.8.1 |
Health monitoring |
IJ47549 |
Suggested |
If a GPFS snap is created, HAL data is missing.
Symptom |
Missing data in GPFS snap. |
Environment |
Linux Only |
Trigger |
Any time a GPFS snap is created. |
Workaround |
None |
|
5.1.8.1 |
All Scale Users |
IJ47551 |
Medium Importance |
On utility node, the command mmhealth node show does not provide data under the component "Native Raid".
Symptom |
Some system information not available. |
Environment |
Linux Only |
Trigger |
Persistent problem. It becomes visible if the command "mmhealth node show" is executed. |
Workaround |
None |
|
5.1.8.1 |
Health monitoring. |
IJ47553 |
High Importance
|
Starting with 5.1.4, under massive file deletion and fileset deletion workloads, the system may suffer an outage or low performance because the file system manager node becomes almost unresponsive, caused by a tremendous number of quota share relinquish RPCs.
Symptom |
Cluster/File System Outage |
Environment |
All |
Trigger |
The 5.1.4 release changed the quota relinquish RPC mode |
Workaround |
None |
|
5.1.8.1 |
Quotas |
IJ47585 |
High Importance
|
System monitoring on the CNSA sidecar is broken.
Symptom |
Install CNSA 5.1.9 |
Environment |
Linux Only |
Trigger |
mmsysmonc noderoles list output |
Workaround |
None |
|
5.1.8.1 |
All |
IJ47586 |
Suggested |
Node role for ESSUtilityNode is shown with a trailing colon.
Symptom |
mmsysmonc noderoles list shows ESSUtilityNode: instead of ESSUtilityNode |
Environment |
Linux Only |
Trigger |
mmsysmonc noderoles list output |
Workaround |
None |
|
5.1.8.1 |
All |
IJ47633 |
High Importance
|
Kernel soft lockup potential on negative dentry lookup
Symptom |
Abend/Crash |
Environment |
Linux Only |
Trigger |
The issue can be seen when running workloads that try to access files that do not exist in the file system. Looking up files that don't exist results in negative dentries being added to the lookup/dentry cache. The problem is triggered when a negative dentry is added to the dentry cache while the operation is not in lookup mode. |
Workaround |
None |
|
5.1.8.1 |
All |
IJ47639 |
Suggested |
Enable collecting data by callhome and gpfs.snap for analysis of FCM drives.
Symptom |
No Data in a snap, FTDC or callhome upload. |
Environment |
Linux Only |
Trigger |
Automated data collection won't work without this fix. |
Workaround |
None |
|
5.1.8.1 |
MAPS data collection |
INFO001 |
Suggested |
The release of v5.1.8.0 aligned with the release of the v5.1.7.1 PTF. Please refer to v5.1.7.1 for the list of APARs.
Symptom |
|
Environment |
|
Trigger |
|
Workaround |
|
|
5.1.8.0 |
INFO |