IJ47794 |
High Importance
|
Assert at Secondary/Home due to a mismatch between the inode in ofP and
the dentry while performing an operation on the AFM control file.
Symptom |
Scale restart on Secondary/Home. |
Environment |
Linux Only |
Trigger |
AFM replication and SG_PANIC at the Gateway node due to unresponsive home. |
Workaround |
None |
|
5.1.8.2 |
AFM |
IJ47854 |
High Importance
|
Unable to join the AD domain during authentication setup.
Symptom |
Error output/message |
Environment |
Linux Only |
Trigger |
Fresh/new auth creation for AD. |
Workaround |
None |
|
5.1.8.2 |
Authentication |
IJ47857 |
Suggested |
mmsysmon: exception: 'NoneType' object has no attribute 'addEvent' occurs in /var/adm/ras/mmsysmonitor.log.
No functionality is affected by this.
Symptom |
Error output/message |
Environment |
Linux Only |
Trigger |
Race conditions when restarting the mmsysmon daemon |
Workaround |
None |
|
5.1.8.2 |
System Health |
IJ47917 |
High Importance
|
If an NVMe card is not probing, you see this message in dmesg: pems_mod:[I]:0125:0648:Failed to open /sys/devices/pci0000:00/0000:00:01.5/0000:05:00.0/nvme/ rc=0
The rc=0 is wrong; an error rc should be passed so that, when a new hotplug event arrives, the sysfs files are read again and the card becomes available to pems.
As a result, after the hotplug the card is available again in nvme --list but does not show up in tslsenclslot output.
Symptom |
The NVMe card will not be available via tslsenclslot, but it is available in nvme --list. |
Environment |
ESS3200/ESS3500 |
Trigger |
The nvme card was not probing at the time pems tried to read the sysfs files. |
Workaround |
Restart the pemsmod module so that it reads the sysfs files again when it loads (see the sketch after this entry). |
|
5.1.8.2 |
ESS Platform |
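A minimal sketch of the workaround above, assuming pemsmod is a regular kernel module that can be reloaded with modprobe (the exact module name and reload procedure are taken from the workaround text and may differ on your ESS level):

    # Reload the pems kernel module so it re-reads the sysfs files on load
    modprobe -r pemsmod
    modprobe pemsmod
    # The card should then appear in both 'nvme list' and the tslsenclslot output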
IJ47858 |
High Importance
|
When trying to access files that have been created inside a directory, access may be denied on certain nodes
because a negative dentry is cached in the system for the entry, even though the dentry should be positive.
Symptom |
Unexpected Results/Behavior / IO error |
Environment |
ALL Linux OS Environments |
Trigger |
Race condition with unlink and dentry revalidation |
Workaround |
When the issue is hit, for the client node to correctly recognize files
that exist but have a negative dentry associated with them, the dentry cache must be dropped,
or the file system unmounted and remounted (see the sketch after this entry). |
|
5.1.8.2 |
All Scale Users |
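A minimal sketch of the workaround on the affected client node; gpfs1 is a placeholder file system name, and drop_caches value 2 is the standard Linux setting for reclaiming dentries and inodes:

    # Option 1: drop the cached dentries/inodes
    sync
    echo 2 > /proc/sys/vm/drop_caches
    # Option 2: unmount and remount the file system
    mmumount gpfs1
    mmmount gpfs1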
IJ47918 |
High Importance
|
Kernel assert at function pemsBackEndWork while calling pemsListDequeueFirst
due to NULL pointer resulting in a kernel panic.
Symptom |
Crash |
Environment |
ESS3200/ESS3500 |
Trigger |
This is a timing issue; simply running workloads on ESS3200/ESS3500 can trigger it. |
Workaround |
None |
|
5.1.8.2 |
ESS Platform |
IJ47853 |
Suggested |
Among the cluster nodes, at least one node-to-node network connection is established internally.
One or more of these connections may fail for any reason, but they are reconnected automatically by the backend.
These reconnect events are detected by the systemhealth monitor, which also does some bookkeeping about the connection states.
Symptom |
Unexpected Results/Behavior |
Environment |
ALL Operating System environments in a larger cluster |
Trigger |
There seem to be no clear steps to reproduce the issue. It was observed during system setup, but also at runtime.
Basically it occurs in situations where node-to-node connections are dropped and recreated by either one of the affected nodes.
But this is not always the case. The created connection state events should not show an issue if the connections are fully restored. |
Workaround |
Restart the systemhealth monitor (mmsysmoncontrol restart). This clears the internal connection state bookkeeping. |
|
5.1.8.2 |
System Health |
IJ47855 |
Suggested |
A larger SNC cluster showed massive network traffic caused by the disk monitor module of the systemhealth monitor.
Symptom |
Unexpected Results/Behavior |
Environment |
ALL Operating System environments in a larger cluster |
Trigger |
The systemhealth monitor was using the mmlsdisk command with a large number of disk names as arguments, which caused the internal traffic. |
Workaround |
None |
|
5.1.8.2 |
System Health |
IJ47856 |
Medium Importance |
Unable to start Samba services after joining a node to an existing CES cluster.
Symptom |
Error output/message |
Environment |
Linux Only |
Trigger |
Fresh/new addition of a node to an existing CES cluster with Samba enabled. |
Workaround |
Start Samba services manually after adding the node to the cluster. |
|
5.1.8.2 |
Samba |
IJ47859 |
Medium Importance |
The current state of the PSUs is not shown by RAS events for the Utility Node.
Symptom |
No information about the Utility Node. |
Environment |
ESS Utility Node Only |
Trigger |
Occurs always on Utility Node. |
Workaround |
None |
|
5.1.8.2 |
RAS Events / Health monitoring |
IJ47860 |
High Importance
|
With ESS6000 and Utility Node it is possible that the same event is raised from two different sources.
Symptom |
RAS Events are changing their status frequently. |
Environment |
ESS 6000 |
Trigger |
It may occur on ESS6000. |
Workaround |
None |
|
5.1.8.2 |
RAS Events / Health monitoring |
IJ47861 |
High Importance
|
The pair canister status is shown twice.
Symptom |
RAS Events for pair canisters are shown twice. Once as top / bottom, once as left/right. |
Environment |
ESS 6000 |
Trigger |
It may occur on Utility Node. |
Workaround |
None |
|
5.1.8.2 |
RAS Events / Health Monitoring
|
IJ47862 |
High Importance
|
Hardware monitoring is not working.
Symptom |
No RAS Events from Hardware monitoring. |
Environment |
All ESS since ESS3500 |
Trigger |
If hardware monitoring is started and health monitoring is then started within 60 seconds, health monitoring will never receive any data. |
Workaround |
mmsysmoncontrol restart |
|
5.1.8.2 |
RAS Events / Health Monitoring |
IJ47920 |
Critical |
Under some workloads, it is possible that the writebehind thread does not get woken up to handle buffers placed on the writebehind list.
This can happen when all workloads are detected as random.
Symptom |
Hang/Deadlock/Unresponsiveness/Long Waiters |
Environment |
ALL Operating System environments |
Trigger |
Run a workload that performs exclusively random IO with some small random writes. |
Workaround |
Run a mix of workloads that perform both random and sequential IO to files. |
|
5.1.8.2 |
All Scale Users |
IJ47557 |
Suggested |
mmapplypolicy with LIST rules erroneously shows the LIST name as both the RULE name and the LIST name.
Symptom |
Unexpected Results/Behavior |
Environment |
ALL |
Trigger |
Run mmapplypolicy with LIST rules (see the sketch after this entry). |
Workaround |
None |
|
5.1.8.2 |
Policy |
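A minimal sketch of a policy that exercises the defect, assuming a file system named gpfs1; the RULE name ('myRule') and LIST name ('allfiles') are distinct placeholders, but with this defect the reports showed the LIST name for both:

    cat > /tmp/list.pol <<'EOF'
    RULE EXTERNAL LIST 'allfiles' EXEC ''
    RULE 'myRule' LIST 'allfiles'
    EOF
    mmapplypolicy gpfs1 -P /tmp/list.pol -I defer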
IJ48017 |
Medium Importance |
CES IPs move around because of an "RTNETLINK answers: File exists" rc=2 error
Symptom |
CES going into DEGRADED state frequently with IP moving around. |
Environment |
Linux Only |
Trigger |
CES IP failover |
Workaround |
None |
|
5.1.8.2 |
CES |
IJ48244 |
High Importance
|
When running mmfsckx on a file system at an older version that does not support 64-bit inode numbers,
the file system manager node can hit the assert "Assert exp(recType == FilePtrChange1LogRec || recType == FilePtrChange2LogRec
|| recType == FilePtrChange1FsckxLogRec)"
Symptom |
Node crash |
Environment |
All |
Trigger |
Running mmfsckx on a file system below version 3.4.0.0 |
Workaround |
Upgrade the file system to version 3.4.0.0 or above before running mmfsckx (see the sketch after this entry), or else use offline mmfsck |
|
5.1.8.2 |
MMFSCKX |
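A sketch of the workaround, with gpfs1 as a placeholder file system name:

    mmlsfs gpfs1 -V       # show the current file system format version
    mmchfs gpfs1 -V full  # upgrade the format to the latest supported version
                          # (note: the format upgrade is permanent)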
IJ48279 |
HIPER |
AFM fails to replicate files with the afmFastCreate option if a newly created file is renamed to a different directory and its original parent is deleted
Symptom |
Unexpected results, file tree mismatch |
Environment |
All Linux OS environments |
Trigger |
Using afmFastCreate option to replicate data |
Workaround |
Disable afmFastCreate (see the sketch after this entry) |
|
5.1.8.2 |
AFM |
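A sketch of disabling afmFastCreate on an AFM fileset; the file system and fileset names are placeholders, and the exact parameter value syntax is an assumption — check the mmchfileset documentation for your level:

    mmunlinkfileset gpfs1 fset1                    # AFM parameters are changed on an unlinked fileset
    mmchfileset gpfs1 fset1 -p afmFastCreate=no    # value syntax is an assumption
    mmlinkfileset gpfs1 fset1 -J /gpfs/gpfs1/fset1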
IJ48280 |
Suggested |
In a Rename operation, the old file can be deleted while the renamed file already exists at the target.
In this case, the Rename and setXattr operations get requeued because the target file has not been deleted yet.
Symptom |
Replication stuck |
Environment |
All |
Trigger |
Rename operation is stuck |
Workaround |
None |
|
5.1.8.2 |
AFMCOS |
IJ48285 |
HIPER |
An assert goes off when a temporary file is linked (created with O_TMPFILE and linkat)
and the inode data has to be evicted to accommodate AFM xattrs.
Symptom |
Crash |
Environment |
All Linux OS environments |
Trigger |
Temporary file is linked (created with O_TMPFILE and linkat) with data in inode on the AFM fileset |
Workaround |
None |
|
5.1.8.2 |
AFM |
IJ48286 |
Suggested |
The Recovery operation doesn't queue setXAttr to push ACL changes, which causes ACL mismatches.
As part of Recovery, a setAttr operation is queued where a setXAttr should have been queued.
Symptom |
ACL mismatches |
Environment |
All |
Trigger |
ACL mismatches |
Workaround |
None |
|
5.1.8.2 |
AFMCOS |
IJ47116 |
Suggested |
The online replica compare function could incorrectly flag a replica mismatch on certain metadata files, such as symbolic links, in an AFM-enabled file system.
Symptom |
Error output/message |
Environment |
ALL Operating System environments |
Trigger |
Run online replica compare function. |
Workaround |
Ignore replica mismatches on special metadata files such as links. |
|
5.1.8.1 |
AFM |
IJ47118 |
High Importance
|
GPFS daemon assert: exp(getDeEntType() == detUnlucky) in Direct.h. This can occur when there is concurrent access to the same directory, with one node performing a delete on a file while another node tries to create the same file.
Symptom |
Abend/Crash |
Environment |
ALL Operating System environments |
Trigger |
Concurrent delete/create of same file in a directory from multiple nodes. |
Workaround |
Avoid deleting and creating the same file in a directory from multiple nodes at the same time. |
|
5.1.8.1 |
All Scale Users |
IJ47490 |
Suggested |
An Rmdir is queued for a file instead of a directory; it tries to delete the directory at the remote, which is actually a file, and fails with err 20.
Symptom |
Replication stuck |
Environment |
Supported platform for AFM COS |
Trigger |
Replication is stuck and fails to progress |
Workaround |
Drop the requeued message from the queue |
|
5.1.8.1 |
AFMCOS |
IJ47229 |
High Importance
|
Daemon asserts (logAssertFailed: !ofP->destroyOnLastClose) during the AFM prefetch. This is due to incorrectly setting the destroyOnLastClose flag when the file is removed.
Symptom |
Crash |
Environment |
All Linux OS environments |
Trigger |
AFM prefetch |
Workaround |
None |
|
5.1.8.1 |
AFM |
IJ47419 |
High Importance
|
An FSSTRUCT error for a bad disk address can be incorrectly logged for a newly added disk. This can happen if a client node was in the process of unmounting the file system while the new disk was being added.
Symptom |
Error output/message |
Environment |
ALL Operating System environments |
Trigger |
Unmounting the file system on a client node while adding a new disk to the file system. |
Workaround |
Don't unmount the file system on any client node while a new disk is being added to the file system. |
|
5.1.8.1 |
All Scale Users |
IJ47420 |
Critical |
Node fails with kernel panic: logAssertFailed: OWNED_BY_CALLER(ownerField, ownerField). This can happen while the GPFS daemon is in the process of shutting down.
Symptom |
Abend/Crash |
Environment |
ALL Operating System environments |
Trigger |
GPFS daemon shutdown while user application is still accessing the file system. |
Workaround |
Unmount all file systems before shutting down the GPFS daemon |
|
5.1.8.1 |
All Scale Users |
IJ47301 |
High Importance
|
A race condition in the distributed GNR disk hospital can cause a state update from the master node to a worker node to be rejected.
When the master node wishes to release a disk from the "diagnosing" state to the "ok" state, it sends a state broadcast to all worker nodes to instruct them to reflect the pdisk's new master state locally.
However, this broadcast can race with additional disk problem reports that are transmitted from the worker to the master.
The result is that the worker node can reject the master's claim that the disk is healthy and continue holding the disk in diagnosing.
This can lead to blocked file system I/O unless another state change notification is broadcast from the master, in which case the worker gets another chance to resume I/O to the disk.
Symptom |
Stuck IO |
Environment |
Linux Only |
Trigger |
This problem can potentially occur when any local I/O error is encountered on a pdisk, but in general the race condition in that path is rare.
It is more likely to occur on Spectrum Scale Erasure Code Edition during periods of network instability, when pdisks are likely to encounter many timeout errors. |
Workaround |
Restarting the daemon on the nodes with the waiter "Until disk availability stabilizes" can clear out the waiters (see the sketch after this entry). |
|
5.1.8.1 |
ESS/GNR |
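A sketch of the workaround; nodeA is a placeholder for an affected node:

    # On the affected node, confirm the stuck disk-hospital waiter
    mmdiag --waiters | grep 'Until disk availability stabilizes'
    # Then restart the daemon on that node
    mmshutdown -N nodeA
    mmstartup -N nodeA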
IJ46923 |
High Importance
|
Spectrum Scale 5.1.4 changed the way NFS handles are generated. This resulted in larger NFS file handles in some cases. NFS exports through Ganesha and the Linux kernel NFS server using NFSv4 are not affected. It turns out that the generated NFS file handles are too large for the Linux kernel NFS server exporting through NFSv3.
Symptom |
IO error |
Environment |
ALL Linux OS environments |
Trigger |
Exporting a Spectrum Scale file system through the Linux kernel NFS server using NFSv3 with Spectrum Scale 5.1.4 or newer will hit this problem. |
Workaround |
Change the NFS export to use NFSv4; in that case larger NFS file handles can be used (see the sketch after this entry). |
|
5.1.8.1 |
NFS |
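A sketch of the client-side change; the server name and paths are placeholders. Forcing version 4 on the mount avoids the NFSv3 file handle size limit of the kernel NFS server:

    # Affected: NFSv3 mount of the kernel-NFS-exported Scale file system
    #   mount -t nfs -o vers=3 server:/gpfs/fs1 /mnt/fs1
    # Workaround: mount with NFSv4, which allows larger file handles
    mount -t nfs -o vers=4 server:/gpfs/fs1 /mnt/fs1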
IJ47119 |
Suggested |
If the name of a file system contains one or more underscore characters ('_') and Clustered Watch Folder is enabled on that file system, then events that are supposed to be delivered to the sink are never sent.
Symptom |
Unexpected Results/Behavior |
Environment |
ALL Operating System environments |
Trigger |
A file system name that contains one or more underscore characters. |
Workaround |
Remove underscore characters from file system names where Clustered Watch Folder is to be used. |
|
5.1.8.1 |
Watch Folder |
IJ47299 |
High Importance
|
Scale daemon assert going off: Assert exp(regP->isOwnerLocal() == 0) in file allocR.C, resulting in the Scale mmfsd daemon process going down.
Symptom |
Abend/Crash |
Environment |
All Operating Systems |
Trigger |
Heavy block space allocation and deallocation in the cluster. |
Workaround |
None |
|
5.1.8.1 |
All Scale Users |
IJ45892 |
High Importance
|
Two client nodes are working on the same two regions for block deallocations; each client node owns one of the two regions and is flushing the region it owns. Meanwhile, the DeallocHelperThread on each client node is also requesting ownership of the region owned by the other client node. The revoke-ownership requests then block on each other, because the two regions are in the flushing state while each has an ownership request pending from the other node, thus forming a deadlock.
Symptom |
Deadlock |
Environment |
All Operating Systems |
Trigger |
User file data block deallocations from at least two different client nodes. |
Workaround |
Restart GPFS on the client node showing a long waiter on the allocMsgTypeRequestOwnership RPC message from DeallocHelperThread. |
|
5.1.8.1 |
All Scale Users |
IJ47044 |
High Importance
|
If a message needs to be queued and handled later, the receiver thread will just put it into a queue. However, there is a chance that the handler is faster than the receiver, which results in a Signal 11 if the receiver references the message data buffer after it has been freed by the handler.
Symptom |
Abend/Crash |
Environment |
ALL Operating System environments |
Trigger |
Busy workload |
Workaround |
None |
|
5.1.8.1 |
All Scale Users |
IJ47296 |
High Importance
|
A change to nsdRAIDDefaultIoTimeout is reset to the default after a GPFS restart.
Symptom |
Unexpected Results/Behavior |
Environment |
ALL Linux OS environments |
Trigger |
Restart gpfs daemon |
Workaround |
Use mmchconfig nsdRAIDDefaultIoTimeout=xxx -i after gpfs is restarted. |
|
5.1.8.1 |
ESS/GNR |
IJ47297 |
High Importance
|
AFM fails to transfer EAs and ACLs if home is running RHEL 9.2 with kernel NFS server. This is due to an error copying the user space attributes to kernel space.
Symptom |
Unexpected results |
Environment |
Linux Only |
Trigger |
AFM replication with RHEL 9.2 kernel NFS server |
Workaround |
None |
|
5.1.8.1 |
AFM |
IJ47298 |
High Importance
|
AFM object replication does not progress due to open files. If a file is opened by the application, AFM does not replicate it to object storage until it is closed.
Symptom |
Unexpected results |
Environment |
Linux Only |
Trigger |
AFM+Object replication. |
Workaround |
None |
|
5.1.8.1 |
AFM |
IJ47423 |
High Importance
|
FCM3: Unable to create a RecoveryGroup in a hybrid system
Symptom |
Unexpected Results/Behavior |
Environment |
x86_64-linux only |
Trigger |
It occurs only when the FCM device size is larger than the HDD device size.
|
Workaround |
None |
|
5.1.8.1 |
ESS/GNR |
IJ47277 |
Suggested |
Rename is getting requeued because it tries to replace an object in COS that has already been replaced. This happens when a resync operation is queued first and the rename operation for the same inode is later executed from the normal queue.
Symptom |
Replication stuck |
Environment |
Supported platform for AFM COS |
Trigger |
Replication is stuck and fails to progress |
Workaround |
Drop the requeued message from the queue |
|
5.1.8.1 |
AFMCOS |
IJ47422 |
High Importance
|
Log assert thisTrailerP->isInUse() going off in kMalloc-kx.C, resulting in a kernel panic. This is caused by a call to getxattr to retrieve the system.nfs4_acl extended attribute with too small a buffer size. The getxattr call can be invoked by running ls or ls -l.
Symptom |
Abend/Crash |
Environment |
Linux Only |
Trigger |
- Have a file with an NFSv4 ACL and invoke getxattr() with a small buffer size, e.g. buffer size = 1. Or - Perform an ls -l on a directory that contains files with NFSv4 ACLs |
Workaround |
Avoid calling getxattr to retrieve the system.nfs4_acl extended attribute. By extension, avoid invoking ls or ls -l on directories that contain files with NFSv4 ACLs. |
|
5.1.8.1 |
All Scale Users
|
IJ47424 |
High Importance
|
AFM doesn't check the state of a message when dropping it using the "mmfsadm afm msgdrop" option. It is better to leave inflight messages alone and drop only messages in any other state. Dropping inflight messages has long-term implications for the queue: it either hits a safety assertion or a Signal 11/6 somewhere and loses the queue.
Symptom |
Crash |
Environment |
All Linux OS Environments (Acting as AFM Gateway nodes) |
Trigger |
Dropping a message in the AFM queue that is inflight. |
Workaround |
The user has to carefully put the queue into a suspended state and then drop messages. |
|
5.1.8.1 |
AFM |
IJ47421 |
Suggested |
CES warning messages generated by mmcesnetworkmonitor are flooding the log on a multi-network cluster
Symptom |
Log flooding with messages like the following -
2023-03-21_03:56:33.785-0400: [W] mmcesnetworkmonitor: handleNetworkProblem with lock held: disableIP 10.20.12.25 1 interface not active
2023-03-21_03:56:33.792-0400: [W] mmcesnetworkmonitor: Taking down 10.20.12.25 because it is assigned to another node or has no interface
2023-03-21_03:56:33.803-0400: [I] mmcesnetworkmonitor: disableCesIP: 10.20.12.25 has no interface |
Environment |
Linux Only |
Trigger |
Multi-network CES IPs where some CES IPs cannot be hosted on some nodes. |
Workaround |
Ignore the log warnings. No functional impact. |
|
5.1.8.1 |
No functional impact. Only log flooding. |
IJ47120 |
High Importance
|
The newer lscpu command lists the CPU family after the model name. This causes the code that detects and automatically applies a workaround for the GSKIT hang issue to not work as expected. Commands like mmcrcluster or mmaddnode may hang in the GSKIT layer on AMD EPYC family 23 and 25 processors.
Symptom |
Installation and admin commands hang. |
Environment |
Linux OS environments |
Trigger |
This problem affects AMD EPYC family 23 and 25 processors running with a newer version of the lscpu command. |
Workaround |
Add "ICC_SHIFT=3" line in /usr/lpp/mmfs/lib/gsk8/C/icc/icclib/ICCSIG.txt file on problem nodes. |
|
5.1.8.1 |
Admin Commands, gskit |
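A minimal sketch of the workaround above, run on each problem node:

    # Append the workaround line if it is not already present
    grep -q '^ICC_SHIFT=3' /usr/lpp/mmfs/lib/gsk8/C/icc/icclib/ICCSIG.txt || \
        echo 'ICC_SHIFT=3' >> /usr/lpp/mmfs/lib/gsk8/C/icc/icclib/ICCSIG.txt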
IJ47544 |
High Importance
|
GCP now supports the listv2 operation; use listv2 during readdir even though the --gcs option is used. This improves performance of the readdir operation.
Symptom |
Unexpected results. |
Environment |
All Linux OS environments |
Trigger |
Readdir on AFM+S3+GCP filesets |
Workaround |
None |
|
5.1.8.1 |
AFM |
IJ47546 |
High Importance
|
Readdir and lookups are sent to the COS during eviction on the AFM+S3 fileset. This impacts eviction performance and also pulls entries from the COS unnecessarily.
Symptom |
Unexpected results. |
Environment |
All Linux OS environments |
Trigger |
Eviction on AFM+S3 filesets |
Workaround |
None |
|
5.1.8.1 |
AFM |
IJ47545 |
High Importance
|
If a file is truncated to zero and rewritten continuously within a very short interval, AFM will not replicate data to home until the file is closed.
Symptom |
Unexpected results. |
Environment |
All Linux OS environments |
Trigger |
AFM caching with continuous rewriting to the files. |
Workaround |
None |
|
5.1.8.1 |
AFM |
IJ46715 |
High Importance
|
Kernel crash with assert: nPrefetchedBuffers > 0. This can happen when an application uses multiple threads to perform sequential reads or writes of more than 65535 blocks on the same open file. The starting offset of the reads/writes must not be on a GPFS block boundary.
Symptom |
Abend/Crash |
Environment |
ALL Operating System environments |
Trigger |
Performing sequential reads/writes on the same file using multiple threads, where the starting offset of each read/write is not on a GPFS block boundary. |
Workaround |
Close and reopen the file before performing more than 65535 sequential reads/writes on the same file using multiple threads. |
|
5.1.8.1 |
All Scale Users |
IJ47464 |
High Importance
|
On an error path a resource leak can happen, leading to the reported waiter
Symptom |
Deadlock |
Environment |
All platforms |
Trigger |
A race condition can reach the error path |
Workaround |
Kill the GPFS daemon on the node with the waiter |
|
5.1.8.1 |
Core |
IJ47466 |
High Importance
|
ESS 3K customers using ESS 6.1.5 and above are seeing pems going offline, causing pems to hang. If the customer runs scsi-rescan, there is a potential exposure to a crash at pemsSlaveDestroy+0x46.
Symptom |
If pems goes offline, some commands such as tsplatformstat, sginfo, and others will fail. dmesg will show:
scsi 11:0:0:0: Device offlined - not ready after error recovery
scsi 11:0:0:0: rejecting I/O to offline device
INFO: task pemsCliCmdQueue:11928 blocked for more than 120 seconds.
and if a scsi-rescan is done, the crash stack trace will contain pemsSlaveDestroy+0x46. |
Environment |
ESS 3K |
Trigger |
It is not clear why this issue is triggered by ESS 6.1.5 and above; it could not be recreated in the lab. |
Workaround |
If pems goes offline, avoid scsi-rescan to avoid the crash. To recover from the offline state, restart the pemsmod module until the fix is applied |
|
5.1.8.1 |
ESS 3K |
IJ47438 |
Suggested |
Using the mmcachectl command to display information about files and directories in the local page pool cache does not work if the argument of the --inode-num option is greater than the maximum integer value.
Symptom |
Unexpected Output |
Environment |
ALL Operating System environments |
Trigger |
Using the mmcachectl command with an --inode-num argument larger than the maximum integer value. |
Workaround |
Run mmcachectl show --show-filename, then use grep to display information for a particular inode number. Ex: mmcachectl show --show-filename | grep -w 4529741824 |
|
5.1.8.1 |
Admin Commands |
IJ47425 |
Medium Importance |
This APAR addresses two issues related to NFS-Ganesha that can cause crashes. Here are the details:
Issue 1:
NFS-Ganesha may crash with the following stack trace:
(gdb) bt
#0 0x00003fffa73e52e8 in raise () from /lib64/libpthread.so.0
#1 0x00003fffa7954628 in crash_handler (signo=6, info=0x3ffefac4a468, ctx=0x3ffefac496f0)
at /usr/src/debug/nfs-ganesha-3.5-ibm071.21.308708/MainNFSD/nfs_init.c:247
#2 <signal handler called>
#3 0x00003fffa717fcb0 in raise () from /lib64/libc.so.6
#4 0x00003fffa718200c in abort () from /lib64/libc.so.6
#5 0x00003fffa79b9fd4 in free_client_record (record=0x3fff200ed130) at /usr/src/debug/nfs-ganesha-3.5-ibm071.21.308708/SAL/nfs4_clientid.c:1381
#6 0x00003fffa79ba3d8 in dec_client_record_ref (record=0x3fff200ed130) at /usr/src/debug/nfs-ganesha-3.5-ibm071.21.308708/SAL/nfs4_clientid.c:1461
#7 0x00003fffa79b825c in nfs_client_id_expire (clientid=0x3fff200edbd0, make_stale=false)
at /usr/src/debug/nfs-ganesha-3.5-ibm071.21.308708/SAL/nfs4_clientid.c:914
#8 0x00003fffa79c7820 in reserve_lease_or_expire (clientid=0x3fff200edbd0, update=true)
at /usr/src/debug/nfs-ganesha-3.5-ibm071.21.308708/SAL/nfs4_lease.c:181
#9 0x00003fffa7a59db4 in nfs4_op_renew (op=0x3fff029152d0, data=0x3fff0320d9c0, resp=0x3ffee960cab0)
at /usr/src/debug/nfs-ganesha-3.5-ibm071.21.308708/Protocols/NFS/nfs4_op_renew.c:91
#10 0x00003fffa7a2ed80 in process_one_op (data=0x3fff0320d9c0, status=0x3ffefac4cfd0)
at /usr/src/debug/nfs-ganesha-3.5-ibm071.21.308708/Protocols/NFS/nfs4_Compound.c:920
#11 0x00003fffa7a30010 in nfs4_Compound (arg=0x3ffeeabd84a0, req=0x3ffeeabd7c90, res=0x3ffee9854f60)
at /usr/src/debug/nfs-ganesha-3.5-ibm071.21.308708/Protocols/NFS/nfs4_Compound.c:1327
#12 0x00003fffa794dae4 in nfs_rpc_process_request (reqdata=0x3ffeeabd7c90)
Issue 2:
NFS-Ganesha may crash with the following stack trace:
#0 0x00007f27f0a984fb in raise () from /lib64/libpthread.so.0
#1 0x00007f27f2775d7b in crash_handler (signo=11, info=0x7f20e337e930, ctx=0x7f20e337e800) at /usr/src/debug/nfs-ganesha-3.5-ibm071.21/MainNFSD/nfs_init.c:247
#2 <signal handler called>
#3 0x00007f27f28a3cf5 in nlm_granted_callback (obj=0x7f2430001378, lock_entry=0x7f2204302c20) at /usr/src/debug/nfs-ganesha-3.5-ibm071.21/Protocols/NLM/nlm_util.c:609
#4 0x00007f27f27b133b in try_to_grant_lock (lock_entry=0x7f2204302c20) at /usr/src/debug/nfs-ganesha-3.5-ibm071.21/SAL/state_lock.c:1732
#5 0x00007f27f27b177b in process_blocked_lock_upcall (block_data=0x7f2204305510) at /usr/src/debug/nfs-ganesha-3.5-ibm071.21/SAL/state_lock.c:1780
#6 0x00007f27f27ac19c in state_blocked_lock_caller (ctx=0x7f21c8408650) at /usr/src/debug/nfs-ganesha-3.5-ibm071.21/SAL/state_async.c:81
#7 0x00007f27f27f62bd in fridgethr_start_routine (arg=0x7f21c8408650) at /usr/src/debug/nfs-ganesha-3.5-ibm071.21/support/fridgethr.c:556
#8 0x00007f27f0a90ea5 in start_thread () from /lib64/libpthread.so.0
#9 0x00007f27f018fb0d in clone () from /lib64/libc.so.6
Symptom |
Crash |
Environment |
Linux Only |
Trigger |
- For Issue 1, the crash is related to the NFSv4 lease period and can occur due to timing issues, such as delays in lease renewal or a heavily loaded server with multiple client requests.
- For Issue 2, the crash is related to blocking lock requests and lock upgrades on the same file by multiple threads, which can lead to timing issues. |
Workaround |
None |
|
5.1.8.1 |
NFS-Ganesha crash followed by CES-IP failover. |
IJ47445 |
Suggested |
The systemhealth monitor reports ib_rdma_port_speed_low even when the physical configuration was correctly set up for high-speed data transfer.
Symptom |
Error output/message |
Environment |
ALL Operating System environments with RDMA |
Trigger |
The new InfiniBand controller Mellanox Technologies MT2910 Family [ConnectX-7] driver produces slightly different status output about its data transfer speed configuration when running the ibportstate command. This modified text was not correctly interpreted by the systemhealth monitor, so it assumed a low-speed configuration and gave a warning. |
Workaround |
None |
|
5.1.8.1 |
System Health |
IJ47577 |
High Importance
|
mmsdrfs contains '03_COMMENT' entries for executed config-changing mm-commands. If such commands include characters that cannot be decoded in the customer's default locale (expected to be UTF8, but not guaranteed), this breaks the mmsdrfs parser in the mmsysmon.py daemon.
Symptom |
- Error output/message
- Component Level Outage |
Environment |
ALL Linux OS environments + AIX |
Trigger |
Choosing an exotic system locale that is incompatible with UTF8 and then running mmchconfig with at least one symbol that does not map to UTF8. |
Workaround |
None |
|
5.1.8.1 |
- System Health - Callhome - perfmon (Zimon) |
IJ47491 |
Suggested |
In MU mode, download doesn't work if the file is cached: the file is not validated against COS for recent changes, and the download fails.
Symptom |
Download is not working |
Environment |
Supported platform for AFM COS |
Trigger |
Download doesn't work on a cached file |
Workaround |
Don't download the file's content from COS |
|
5.1.8.1 |
AFMCOS |
IJ47492 |
Critical |
If the locale LC_ALL is not set, an exception is triggered that prevents the start of the daemon, as the Python process tries to use the setting.
Symptom |
-Error output/message -Abend/Crash |
Environment |
AIX/Power only |
Trigger |
LC_ALL or lang setting not set on the server (or potentially set to a missing language package) |
Workaround |
Set the LC_ALL locale setting and also lang (see the sketch after this entry), or delete the line containing 'locale.setlocale(locale.LC_ALL, "")' from mmsysmon.py |
|
5.1.8.1 |
System Health |
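A sketch of the workaround; en_US.UTF-8 is an example value — any locale installed on the server works:

    # Verify the locale is installed, then set it for the daemon environment
    locale -a | grep -i 'en_US'
    export LC_ALL=en_US.UTF-8
    export LANG=en_US.UTF-8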
IJ47578 |
Suggested |
There was a longwaiters_found systemhealth event for one cycle, but no deadlock detection was logged in mmfs.log
Symptom |
Unexpected Results/Behavior |
Environment |
ALL Operating System environments |
Trigger |
The monitoring executes mmdiag --waiters, and if some entries match "^waiting.*thread.*", the event is raised (see the sketch after this entry). This state might be volatile and go away in the next monitoring cycle. |
Workaround |
None |
|
5.1.8.1 |
System Health |
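A sketch of the check the monitor performs, per the trigger description above; any matching entries raise the longwaiters_found event:

    mmdiag --waiters | grep -E '^waiting.*thread.*'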
IJ47579 |
Suggested |
"mmcallhome run SendFile" does not allow to upload files with a size over 50GB, outputting the following message: Not enough free space to begin a transfer Failed while executing a query to sysmonitor
(show details)
Symptom |
Error output/message |
Environment |
Linux Only |
Trigger |
Whenever the customer attempts to upload a file with a size over 50GB via the mmcallhome command (see the sketch after this entry) |
Workaround |
None |
|
5.1.8.1 |
Callhome |
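A sketch of the affected command; the file path is a placeholder:

    # Fails with the above message for files over 50GB before this fix
    mmcallhome run SendFile --file /tmp/debug-data.tar.gz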
IJ47580 |
High Importance
|
For some ESS5000 systems with bootdrive errors, the iprconfig output changed, so the errors are not detected and service tickets for drive replacement are not created as expected.
Symptom |
Unexpected Results/Behavior |
Environment |
Linux Only |
Trigger |
"iprconfig -c show-config" listing the bootdrives not as "RAID 10 Array Member" but as "RAID 10 4K Array Member" |
Workaround |
None |
|
5.1.8.1 |
- System Health - Callhome |
IJ47547 |
Suggested |
On non-ESS nodes, mmsysmonitor.log has many error messages "HAL request failed"
Symptom |
Misleading information in mmsysmonitor.log |
Environment |
Linux Only |
Trigger |
Persistent problem. |
Workaround |
None |
|
5.1.8.1 |
Health monitoring. |
IJ47548 |
High Importance
|
If the GPFS daemon mmfsd is down, i.e. GPFS is down, "mmhealth cluster show" does not work. In this case, the customer has no support for debugging problems.
Symptom |
- Abend/Crash - mmfsd is down because of maintenance or problem determination. |
Environment |
Linux Only |
Trigger |
A remount failure or unmounts on a busy system. |
Workaround |
None |
|
5.1.8.1 |
Health monitoring |
IJ47549 |
Suggested |
If a GPFS snap is created, HAL data is missing.
Symptom |
Missing data in GPFS snap. |
Environment |
Linux Only |
Trigger |
Any time a GPFS snap is created. |
Workaround |
None |
|
5.1.8.1 |
All Scale Users |
IJ47551 |
Medium Importance |
On utility node, the command mmhealth node show does not provide data under the component "Native Raid".
Symptom |
Some system information not available. |
Environment |
Linux Only |
Trigger |
Persistent problem. It becomes visible if the command "mmhealth node show" is executed. |
Workaround |
None |
|
5.1.8.1 |
Health monitoring. |
IJ47553 |
High Importance
|
Starting with 5.1.4, under massive file deletion and fileset deletion workloads, the system may suffer an outage or low performance because the file system manager node becomes almost unresponsive, caused by a tremendous number of quota share relinquish RPCs.
Symptom |
Cluster/File System Outage |
Environment |
All |
Trigger |
The 5.1.4 release changed the quota relinquish RPC mode |
Workaround |
None |
|
5.1.8.1 |
Quotas |
IJ47585 |
High Importance
|
System monitoring on the CNSA sidecar is broken.
Symptom |
Install CNSA 5.1.9 |
Environment |
Linux Only |
Trigger |
mmsysmonc noderoles list output |
Workaround |
None |
|
5.1.8.1 |
All |
IJ47586 |
Suggested |
Node role for ESSUtilityNode is shown with a trailing colon.
Symptom |
mmsysmonc noderoles list shows ESSUtilityNode: instead of ESSUtilityNode |
Environment |
Linux Only |
Trigger |
mmsysmonc noderoles list output |
Workaround |
None |
|
5.1.8.1 |
All |
IJ47633 |
High Importance
|
Kernel soft lockup potential on negative dentry lookup
Symptom |
Abend/Crash |
Environment |
Linux Only |
Trigger |
The issue can be seen when running workloads that try to access files that do not exist in the file system. Looking up files that don't exist results in negative dentries being added to the lookup/dentry cache. The problem is triggered when a negative dentry is added to the dentry cache while the operation is not in lookup mode. |
Workaround |
None |
|
5.1.8.1 |
All |
IJ47639 |
Suggested |
Enable collecting data by callhome and gpfs.snap for analysis of FCM drives.
Symptom |
No Data in a snap, FTDC or callhome upload. |
Environment |
Linux Only |
Trigger |
Automated data collection won't work without this fix. |
Workaround |
None |
|
5.1.8.1 |
MAPS data collection |
INFO001 |
Suggested |
The release of v5.1.8.0 aligned with the release of the v5.1.7.1 PTF. Please refer to v5.1.7.1 for the list of APARs.
Symptom |
|
Environment |
|
Trigger |
|
Workaround |
|
|
5.1.8.0 |
INFO |