IJ53151 |
High Importance
|
AFM getOutbandList fails to get the changed files, so users may not be able to detect the changes and run the prefetch command later.
Symptom |
Unexpected Results |
Environment |
All OS environments |
Trigger |
Running mmafmctl getOutbandList command |
Workaround |
None |
|
5.2.2.1 |
AFM |
IJ52694 |
High Importance
|
GPFS can't start on AIX even though the node has over 1TB of RAM and a small pagepool.
Symptom |
Abend/Crash |
Environment |
AIX Only |
Trigger |
The max shared segment size was increased to a value that caused GPFS to request more than 1TB using 4K pages, which goes beyond what AIX can accept. |
Workaround |
Reduce pagepoolMaxPhysMemPct to 50 on the node where GPFS can't start:
# mmchconfig pagepoolMaxPhysMemPct=50
Then restart GPFS. |
|
5.2.2.1 |
All Scale Users |
IJ53183 |
High Importance
|
On gateway node shutdown, the gateway node forcefully returns EIO to the application node, which promptly passes it on to the application that triggered the read operation.
Symptom |
IO Error |
Environment |
Linux Only |
Trigger |
Trigger a read on a large (2GB) file from the app node and, while the read is in progress, run mmshutdown on the gateway node. |
Workaround |
None |
|
5.2.2.1 |
AFM |
IJ53213 |
Suggested |
Remove the kernel version dependency for afmNFSNconnect.
Symptom |
Unexpected Results |
Environment |
Linux Only |
Trigger |
On some kernel versions lower than 5.3, it is possible to enable the nconnect option. |
Workaround |
None |
|
5.2.2.1 |
AFM |
IJ53214 |
High Importance
|
With FAL and NFS Ganesha enabled, running workloads with a path to an NFS export for long periods of time could result in NFS client IPs not being logged in the audit log.
Symptom |
Unexpected Results/Behavior |
Environment |
Linux |
Trigger |
With FAL and NFS Ganesha enabled, run workloads with a path to the NFS mount point for long periods of time |
Workaround |
Restart NFS Ganesha if NFS client IPs are not being logged |
|
5.2.2.1 |
File Audit Logging, NFS |
IJ53324 |
Critical |
In an extremely rare case, a directory entry with the wrong length could be created, leading to a file system panic on the client node and a log recovery failure on the file system manager node. This could eventually lead to the file system being unmounted everywhere.
Symptom |
Cluster/File System Outage |
Environment |
ALL Operating System environments |
Trigger |
Creating a new directory entry via file/link create. |
Workaround |
None |
|
5.2.2.1 |
All Scale Users |
IJ53325 |
High Importance
|
The tsapolicy server process waits until all client processes terminate before it exits. However, the client process calls the wait function incorrectly, which delays the exit.
Symptom |
Performance Impact/Degradation |
Environment |
all platforms that support mmapplypolicy |
Trigger |
This problem could occur randomly when mmapplypolicy is run with multiple nodes |
Workaround |
none |
|
5.2.2.1 |
mmapplypolicy |
IJ52584 |
Suggested |
The sdrServ was not able to initialize due to the hostname resolution failure of the legacy server-based configuration server. This prevents GPFS daemon from coming up.
Symptom |
Startup failure. Hostname resolution failure messages found in mmfs.log. |
Environment |
All |
Trigger |
Startup GPFS |
Workaround |
Temporarily fix the hostname resolution. |
|
5.2.2.1 |
admin command |
IJ53364 |
Suggested |
In an ESS cluster, mmhealth shows "scale_ptf_update_available" for some cluster members that do not have the specified PTF update available.
Symptom |
Performance Impact/Degradation |
Environment |
Linux Only |
Trigger |
auto ptf update checker |
Workaround |
None |
|
5.2.2.1 |
Callhome |
IJ53332 |
High Importance
|
The mmbackup command internally communicates with the tsbuhelper process using a formatted string to get the backup result, and the format was changed in Spectrum Scale 5.1.9.0.
mmbackup should accept both the old and the new format but fails to handle the old format properly. As a result, the backup count from a node using the old format is not added up correctly.
Symptom |
Error output/message |
Environment |
all platforms that support mmbackup. |
Trigger |
This problem could occur if one of the remote helper nodes has Spectrum Scale 5.1.8 or an older version installed while the master node has Spectrum Scale 5.1.9 or a higher version installed. |
Workaround |
Run mmbackup on a node where Spectrum Scale 5.1.8 or an older version is installed |
|
5.2.2.1 |
mmbackup |
IJ53333 |
High Importance
|
Add an option in mmafmctl to checkDeleted files and dirs which might be hogging the usedInodes count on the fileset.
Symptom |
Unexpected Behavior. |
Environment |
Linux Only |
Trigger |
A file or directory has been deleted at the cache/primary site, but the deletion has not been replicated to the remote site. The fileset was stopped or AFM was disabled on it before replication completed, leading to a permanent hold on the deleted inodes. |
Workaround |
Run a policy manually to see NLINK 0 inodes in the AFM fileset. |
|
5.2.2.1 |
AFM |
IJ53420 |
High Importance
|
The GPFS daemon could fail unexpectedly with an assert after the file system is unmounted due to a panic.
Symptom |
Abend/Crash |
Environment |
ALL Operating System environments |
Trigger |
File system panic |
Workaround |
None |
|
5.2.2.1 |
All Scale Users |
IJ53421 |
High Importance
|
The error "Failed to register with GPFS: Bad file descriptor" is reported when SMB attempts a tree connect.
Symptom |
Crash |
Environment |
Linux Only |
Trigger |
A Samba process calls gpfs_register_cifs_export. That results in the process being registered in a table. This interface calls alloc_file() which triggers the issue. |
Workaround |
None |
|
5.2.2.1 |
GPFS core |
IJ53480 |
High Importance
|
Assertion `*errP != E_OK || amP == NULL || amP->getPoolId() == poolId' in mmreclaimspace. Due to a coding error, instead of using poolIndex, poolId was used to access an internal array of disks in the storage pool, causing the assertion. The assertion is hit when the file system has thin disks with more than one storage pool.
Symptom |
Daemon fails with an assertion |
Environment |
All |
Trigger |
During mmreclaimspace, reclaimReservedThinSpaceInPool() invoked accessAllocMapById wrongly using 'poolIndex' instead of 'poolId', leading to the assertion. The assertion is hit when the file system has thin disks with more than one storage pool. |
Workaround |
None |
|
5.2.2.1 |
thin provisioning |
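The distinction between a pool's identifier and its position, which IJ53480 turns on, can be sketched as follows. This is a minimal illustration and not GPFS code: the pool names and ids are taken from the 'mmfsadm dump all' excerpt under IJ53561 in this list, while POOLS, pool_by_index and pool_by_id are hypothetical names. It takes no position on which of the two values was passed where. It only shows that the two coincide for the first pool (id 0, index 0) and diverge as soon as a second pool exists, which is consistent with the assertion appearing only with more than one storage pool.

```python
# Pool ids as printed in the 'mmfsadm dump all' excerpt under IJ53561.
# The list position (0, 1, 2) plays the role of the pool index.
POOLS = [
    {"name": "system", "id": 0},
    {"name": "data", "id": 65537},
    {"name": "flash", "id": 65538},
]


def pool_by_index(index):
    """Positional lookup: valid only for 0 <= index < len(POOLS)."""
    return POOLS[index]


def pool_by_id(pool_id):
    """Lookup by identifier: returns None when no pool has that id."""
    for pool in POOLS:
        if pool["id"] == pool_id:
            return pool
    return None


def lookups_agree(value):
    """True if treating `value` as an index or as an id finds the same pool."""
    try:
        return pool_by_index(value) is pool_by_id(value)
    except IndexError:
        return False
```

With these definitions, lookups_agree(0) holds for the 'system' pool, whereas passing 1 (an index) to pool_by_id finds nothing and passing 65537 (an id) to pool_by_index is out of range.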
IJ53481 |
High Importance
|
When monitoring is stopped and restarted due to the movement of pods, an error occurs that prevents that change from being communicated to the cluster manager. This is more likely in CNSA than in a classic Scale deployment.
Symptom |
Error output/message |
Environment |
ALL Linux OS environments |
Trigger |
When monitoring is stopped and restarted due to the movement of pods, an error occurs that prevents that change from being communicated to the cluster manager. |
Workaround |
As a possible workaround, run mmhealth node show --resync -a. |
|
5.2.2.1 |
System Health |
IJ53482 |
Suggested |
The event points to a large pagepool combined with a relatively slow network. Depending on the usage, this might not be an issue, so the event is raised as a tip, which customers should be able to hide (acknowledge) so that it no longer appears in mmhealth or the GUI.
Symptom |
Error output/message |
Environment |
ALL Linux OS environments |
Trigger |
Incorrect classification of event. |
Workaround |
As a possible workaround, change the trigger value pagepoolNetworkRatioPercent to a value lower than the default 10%. |
|
5.2.2.1 |
System Health |
IJ53372 |
High Importance
|
GPFS leaks kernel memory every time a user that is a member in more than 32 groups tries to access an inode that denies access to that user through simple modebits (no ACL). This might go unnoticed, but if these conditions occur repeatedly, the kernel memory leak can affect the node operations, requiring a reboot to avoid outages.
Symptom |
Abend/Crash (in the worst case that the kernel memory leak goes undetected, leading to OOM kills and node outage) |
Environment |
ALL Linux OS environments |
Trigger |
All Scale Users |
Workaround |
The only workaround would be reducing the number of groups to ensure that no user is a member in more than 32 groups. |
|
5.2.2.1 |
All Scale Users |
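To apply the IJ53372 workaround, an administrator first needs to know which users exceed the 32-group threshold. The following is a minimal sketch under stated assumptions: users_over_limit and load_memberships are hypothetical helper names, the data comes from whatever the node's passwd and group databases enumerate through Python's standard pwd and grp modules, and counting the primary group toward the limit is an assumption that the APAR text does not confirm.

```python
from collections import defaultdict

GROUP_LIMIT = 32  # threshold named in the APAR description


def users_over_limit(memberships, limit=GROUP_LIMIT):
    """Return {user: group_count} for users in more than `limit` groups.

    `memberships` maps a user name to the set of group names it belongs to.
    """
    return {user: len(groups)
            for user, groups in memberships.items()
            if len(groups) > limit}


def load_memberships():
    """Build the user -> groups map from the node's passwd/group databases."""
    import grp
    import pwd

    memberships = defaultdict(set)
    gid_to_name = {}
    for group in grp.getgrall():
        gid_to_name[group.gr_gid] = group.gr_name
        for member in group.gr_mem:          # supplementary memberships
            memberships[member].add(group.gr_name)
    for user in pwd.getpwall():              # primary group (assumed to count)
        name = gid_to_name.get(user.pw_gid, str(user.pw_gid))
        memberships[user.pw_name].add(name)
    return memberships
```

Calling users_over_limit(load_memberships()) on a node lists the accounts whose group count would need to be reduced. Directory services that do not support enumeration may not be fully covered by this approach.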
IJ53490 |
High Importance
|
The timeout test result is not consistent on AMD EPYC-Genoa processors. If the test passes, the workaround for GSKit hangs is not applied, which causes problems later.
Symptom |
Installation and admin commands hang. |
Environment |
Linux OS environments |
Trigger |
This problem affects AMD EPYC-Genoa. |
Workaround |
Manually apply the workaround |
|
5.2.2.1 |
Admin Commands, gskit |
IJ53548 |
Suggested |
Attempting to set a timestamp in GPFS to a time before Jan 1 1970 results in an unexpected timestamp being stored. GPFS currently stores timestamps as a 32-bit unsigned integer, and thus can store timestamps from 1 January 1970 00:00:00 UTC to 7 February 2106 06:28:15 UTC. Setting a timestamp before 1970 was silently accepted.
Symptom |
Unexpected Results/Behavior |
Environment |
ALL Linux OS environments |
Trigger |
Attempt to set timestamp on a GPFS inode before 1970, e.g.: touch -m -t 196001010000 testfile |
Workaround |
Avoid setting timestamps outside the supported range in GPFS. |
|
5.2.2.1 |
All Scale Users |
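The range quoted in IJ53548 follows directly from counting seconds since the epoch in a 32-bit unsigned integer, and the arithmetic can be checked with a short sketch. The function names are hypothetical. The wrapped_u32 helper shows what a plain modulo 2**32 truncation would produce for a pre-1970 time; whether that is the exact value GPFS stored is an assumption, since the APAR only says the stored timestamp was unexpected.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)
U32_MAX = 2**32 - 1  # largest value a 32-bit unsigned integer can hold


def latest_representable():
    """Latest instant that fits in an unsigned 32-bit seconds counter."""
    return EPOCH + timedelta(seconds=U32_MAX)


def fits_u32(when):
    """True if `when` lies inside the unsigned 32-bit range."""
    seconds = int((when - EPOCH).total_seconds())
    return 0 <= seconds <= U32_MAX


def wrapped_u32(when):
    """Instant that a modulo-2**32 truncation of the seconds value decodes to
    (illustrative assumption, not confirmed GPFS behavior)."""
    seconds = int((when - EPOCH).total_seconds())
    return EPOCH + timedelta(seconds=seconds % 2**32)
```

A caller that wants to follow the workaround can test a timestamp with fits_u32 before setting it. The date used in the trigger, 1 January 1960, is outside the range, and under the wraparound assumption it would decode to a date late in the supported range instead of failing.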
IJ53549 |
High Importance
|
Concurrently issuing read system calls while mmap writeback is running for the same region in the same file can result in an assert being hit. This is due to a coordination problem between the mmap writeback and the handling of read system calls.
Symptom |
Abend/Crash |
Environment |
ALL Linux OS environments |
Trigger |
mmap a file and write to mmap region. Issue regular read calls concurrently to the background mmap writeback. Enabling HAWC seems to increase the likelihood of hitting this problem. |
Workaround |
The race condition leading to this assert always exists. During tests it was only hit with HAWC enabled, so disabling HAWC might help to make this less likely to hit. |
|
5.2.2.1 |
All Scale Users |
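The access pattern described in the IJ53549 trigger can be sketched as below. This is an illustration of the pattern only, not a confirmed reproducer: the function name, the 1 MiB region size and the round count are arbitrary choices, and the sketch forces writeback with an explicit flush (msync) instead of waiting for the background writeback that the APAR mentions.

```python
import mmap
import os
import threading

REGION = 1 << 20  # 1 MiB; size chosen only for illustration


def mmap_write_with_concurrent_reads(path, rounds=50):
    """Dirty an mmap'ed region repeatedly while a second thread issues
    regular read calls on the same byte range. Returns the read count."""
    with open(path, "wb") as f:
        f.truncate(REGION)                  # create a zero-filled file

    reads_done = 0
    stop = threading.Event()

    def reader():
        nonlocal reads_done
        fd = os.open(path, os.O_RDONLY)
        try:
            while not stop.is_set():
                os.pread(fd, REGION, 0)     # read system call on the mapped range
                reads_done += 1
        finally:
            os.close(fd)

    thread = threading.Thread(target=reader)
    with open(path, "r+b") as f, mmap.mmap(f.fileno(), REGION) as mm:
        thread.start()
        try:
            for i in range(rounds):
                mm[:] = bytes([i % 256]) * REGION   # write through the mapping
                mm.flush()                          # msync: push dirty pages out
        finally:
            stop.set()
            thread.join()
    return reads_done
```

On a file system without the defect the function simply returns the number of reads that completed while the writer was active.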
IJ53550 |
High Importance
|
On AIX, GPFS commands may fail with sed: illegal option -i. This may occur if the cluster key has expired and GPFS commands try to regenerate a temporary certificate to get things going.
Symptom |
Error output/message, Upgrade/Install failure |
Environment |
AIX/Power only |
Trigger |
This issue affects AIX nodes whose cluster keys have expired. |
Workaround |
If the cluster key has expired, generate new keys and commit the new keys before upgrading. Run GPFS commands on nodes that support the -i option of the sed command. |
|
5.2.2.1 |
Admin Commands |
IJ53560 |
High Importance
|
Upon hitting SGPanic (due to an Out Of Space (OOS) condition at the physical devices) in a file system with FCMs, there is a high possibility that the diskSpaceState stays in DSS_OOS after the emergency space is released.
Symptom |
After the file system hits SGPanic, it is unmounted. For recovery, the file system is mounted in 'restricted' mode and the emergency space is released (with the 'mmreclaimspace --emergency-reclaim' command). However, after the emergency space is released, the file system stays in DSS_OOS, contrary to expectation. Because of this, the file system can't be mounted in 'space-reclaim' mode, which is a step required for the recovery. |
Environment |
Linux |
Trigger |
On a file system with FCM 4, fill up the file system until all physical capacity is used so that it can trigger SGPanic due to OOS (Out Of Space) condition. |
Workaround |
None |
|
5.2.2.1 |
thin-provisioning |
IJ53561 |
Medium Importance |
When a file system has multiple storage pools and not all of them are thin-provisioning enabled, the storage pools that are not thin-provisioning enabled end up reserving the emergency space. This is unnecessary and is a bug to be fixed.
Symptom |
After a file system is mounted, grep 'thin inode' in the internal dump... it will show the following message for the storage pool that is not thin-provisioning enabled.
[root@lothal-qa6-1 ~]# mmfsadm dump all | grep "thin inode"
0: name 'system' Valid nDisks 32 nInUse 32 id 0 poolFlags 2 thin inode 41 nBlocks 4124
1: name 'data' Valid nDisks 4 nInUse 4 id 65537 poolFlags 0 thin inode 42 nBlocks 4131 <<<<<<<
2: name 'flash' Valid nDisks 8 nInUse 8 id 65538 poolFlags 2 thin inode 46 nBlocks 4131
|
Environment |
Linux/AIX |
Trigger |
Mount a file system and check 'thin inode' in the internal dump (mmfsadm dump all). |
Workaround |
None |
|
5.2.2.1 |
thin-provisioning |
IJ53600 |
Suggested |
A Linux kernel change caused GPFS to break disk I/O into many small requests.
Symptom |
Performance Impact/Degradation |
Environment |
ALL Linux OS environments with kernel version >= 5.1 |
Trigger |
N/A |
Workaround |
None |
|
5.2.2.1 |
All Scale Users |
IJ53595 |
High Importance
|
The AFM gateway node becomes unresponsive during startup due to numerous filesystem mount requests triggered by active I/O to multiple filesets.
Symptom |
Performance impact |
Environment |
Linux Only |
Trigger |
Gateway node startup with multiple AFM filesets starting the recovery. |
Workaround |
None |
|
5.2.2.1 |
AFM |
IJ53593 |
High Importance
|
Logging of failures to the failed list file causes a deadlock within the mmafmcosctl binary.
Symptom |
Deadlock. |
Environment |
Linux Only |
Trigger |
Having failures to log in the download/upload sub command of mmafmcosctl. |
Workaround |
Run download/upload without --enable-failed-list-file to avoid this problem. |
|
5.2.2.1 |
AFM |
IJ53594 |
High Importance
|
An earlier fix for the same issue returned RESTART between the gateway node and the app node only when the queue is dropped. However, there can be cases where the gateway node is being shut down without the queue being in the dropped state.
Symptom |
IO Failure |
Environment |
Linux Only |
Trigger |
Trigger a read on a single large file from COS to cache and meanwhile shut down the fileset's gateway or start a gateway node that was already shut down. |
Workaround |
None |
|
5.2.2.1 |
AFM |
IJ53592 |
High Importance
|
If it is the first or only operation on the list and it is queued through startMarker, the escaped path is used instead of the unescaped path, so the file name is not queued in the proper format.
Symptom |
Unexpected behavior |
Environment |
Linux Only |
Trigger |
AFM |
Workaround |
In the list file given for download, place files without escape sequences in their names ahead of file names with special characters that require escaping. |
|
5.2.2.1 |
AFM |
IJ52948 |
High Importance
|
Kernel crash in Scale 5.2.1.1: general protection fault and system crash. The crash happens due to a memory corruption after mounting a GPFS file system. Sometimes this happens during a file system mount and sometimes a little while after.
Symptom |
Memory corruption and subsequent crash |
Environment |
Linux Only |
Trigger |
No particular kernel version is needed. For example, the customer that hit this issue was running 4.18.0-553.16.1.el8_10.x86_64, and the problem was also reproduced on a 6.4 kernel. The length of the fstab entry must be in a sweet spot: the memory is allocated from the slab cache, which has fixed object sizes, so there may be some extra room in the allocated memory up to the object boundary, and no corruption occurs until that boundary is crossed. The kernel slabs have object sizes of 8, 16, 32, 64, 96, 128, 192, 256, 512 and so on. For the problem to appear, the fstab entry needs a sizeable number of characters and options after the gpfsdev="fsname" option. This leads to writing more bytes than were requested. |
Workaround |
None |
|
5.2.2.1 |
Scale core |
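The "sweet spot" in the IJ52948 trigger can be made concrete with a small sketch. It assumes only the object sizes quoted in the trigger text (the real list continues beyond 512), and the function names are hypothetical. A request is rounded up to the next listed object size, so an overrun stays unnoticed as long as the bytes written still fit inside that object and becomes a corruption once they cross its boundary.

```python
# Object sizes quoted in the trigger text; the real list continues ("and so on").
SLAB_SIZES = [8, 16, 32, 64, 96, 128, 192, 256, 512]


def slab_object_size(requested):
    """Smallest listed slab object size that can hold `requested` bytes."""
    for size in SLAB_SIZES:
        if requested <= size:
            return size
    raise ValueError("request larger than the sizes listed here")


def slack(requested):
    """Unused bytes between the request and the end of its slab object."""
    return slab_object_size(requested) - requested


def overrun_is_silent(requested, written):
    """True if writing `written` bytes stays inside the slab object."""
    return written <= slab_object_size(requested)
```

For example, a 100-byte request is served from the 128-byte slab, so writing 120 bytes stays inside the object while writing 130 bytes crosses into the neighboring one. A request of exactly 96 bytes has no slack at all, so any overrun crosses the boundary immediately.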
IJ53426 |
Critical |
When a new file system manager takes over after the old file system manager loses quorum, it is possible for the new file system manager to read the stripe group descriptor too early, which can cause stripe group descriptor updates to be lost.
Symptom |
Unexpected Results/Behavior |
Environment |
ALL Operating System environments |
Trigger |
The file system manager loses quorum while running a command that updates the stripe group descriptor. |
Workaround |
None |
|
5.2.2.0 |
All Scale Users |