IBM Storage Scale APARs Resolved in 5.2.2.x

IJ53151

AFM getOutbandList fails to get the changed files, so users may not be able to detect the changes and run the prefetch command later.

Symptom	Unexpected Results
Environment	All OS environments
Trigger	Running mmafmctl getOutbandList command
Workaround	None

5.2.2.1

AFM

IJ52694

GPFS cannot start on AIX even though the node has over 1 TB of RAM and a small pagepool.

Symptom	Abend/Crash
Environment	AIX Only
Trigger	The max shared segment size was increased to a value that caused GPFS to request more than 1 TB using 4K pages, which goes beyond what AIX can accept.
Workaround	Reduce pagepoolMaxPhysMemPct to 50 on the node where GPFS cannot start by running: mmchconfig pagepoolMaxPhysMemPct=50. Then restart GPFS.

5.2.2.1

All Scale Users

IJ53183

On gateway node shutdown, the gateway node forcefully returns EIO to the application node, which promptly passes it on to the application that triggered the read operation.

Symptom	IO Error
Environment	Linux Only
Trigger	Start a read of a large (2 GB) file from the application node and, while the read is in progress, run mmshutdown on the gateway node.
Workaround	None

5.2.2.1

AFM

IJ53213

Remove the dependency on kernel version for afmNFSNconnect.

Symptom	Unexpected Results
Environment	Linux Only
Trigger	On some kernel versions earlier than 5.3, it is possible to enable the nconnect option.
Workaround	None

5.2.2.1

AFM

IJ53214

With FAL and NFS Ganesha enabled, running workloads with a path to an NFS export for long periods of time could result in NFS client IPs not being logged in the audit log.

Symptom	Unexpected Results/Behavior
Environment	Linux
Trigger	With FAL and NFS Ganesha enabled, run workloads with a path to the NFS mount point for long periods of time.
Workaround	Restart NFS Ganesha if NFS client IPs are not being logged.

5.2.2.1

File Audit Logging, NFS

IJ53324

In extremely rare cases, a directory entry with the wrong length could be created, leading to a file system panic on the client node and a log recovery failure on the file system manager node. This could eventually lead to the file system being unmounted everywhere.

Symptom	Cluster/File System Outage
Environment	ALL Operating System environments
Trigger	Creating new directory entry via file/link create.
Workaround	None

5.2.2.1

All Scale Users

IJ53325

The tsapolicy server process waits until all client processes terminate before it exits, but the client process calls the wait function incorrectly, which delays the exit.

Symptom	Performance Impact/Degradation
Environment	All platforms that support mmapplypolicy
Trigger	This problem could occur randomly when mmapplypolicy is run with multiple nodes.
Workaround	None

5.2.2.1

mmapplypolicy

IJ52584

The sdrServ was not able to initialize due to a hostname resolution failure for the legacy server-based configuration server. This prevents the GPFS daemon from coming up.

Symptom	Startup failure. Hostname resolution failure messages found in mmfs.log.
Environment	All
Trigger	Startup GPFS
Workaround	Temporarily fix the hostname resolution.

5.2.2.1

admin command

IJ53364

In an ESS cluster, mmhealth shows "scale_ptf_update_available" for some cluster members that do not have the specified PTF update available.

Symptom	Performance Impact/Degradation
Environment	Linux Only
Trigger	auto ptf update checker
Workaround	None

5.2.2.1

Callhome

IJ53332

The mmbackup command internally communicates with the tsbuhelper process using a formatted string to get the backup result, and the format was changed in Spectrum Scale 5.1.9.0. mmbackup should accept both the old and the new format, but it fails to handle the old format properly. As a result, the backup count from a node using the old format is not added up correctly.

Symptom	Error output/message
Environment	All platforms that support mmbackup
Trigger	This problem could occur if one of the remote helper nodes has Spectrum Scale 5.1.8 or an older version installed while the master node has Spectrum Scale 5.1.9 or a higher version installed.
Workaround	Run mmbackup on the node where Spectrum Scale 5.1.8 or an older version is installed.

5.2.2.1

mmbackup

IJ53333

Add an option in mmafmctl to checkDeleted files and dirs which might be hogging the usedInodes count on the fileset.

Symptom	Unexpected Behavior.
Environment	Linux Only
Trigger	A file or directory was deleted at the cache/primary site, but the deletion was not yet replicated to the remote site. The fileset was stopped, or AFM was disabled on it, before replication completed, leaving a permanent hold on the deleted inodes.
Workaround	Run a policy manually to see NLINK 0 inodes in the AFM fileset.

5.2.2.1

AFM

IJ53420

The GPFS daemon could fail unexpectedly with an assert after a file system is unmounted due to a panic.

Symptom	Abend/Crash
Environment	ALL Operating System environments
Trigger	File system panic
Workaround	None

5.2.2.1

All Scale Users

IJ53421

The error "Failed to register with GPFS: Bad file descriptor" occurs when SMB attempts a tree connect.

Symptom	Crash
Environment	Linux Only
Trigger	A Samba process calls gpfs_register_cifs_export. That results in the process being registered in a table. This interface calls alloc_file() which triggers the issue.
Workaround	None

5.2.2.1

GPFS core

IJ53480

Assertion `*errP != E_OK || amP == NULL || amP->getPoolId() == poolId' in mmreclaimspace. Due to a coding error, poolId was used instead of poolIndex to access an internal array of disks in the storage pool, causing the assertion. The assertion is hit when the file system has thin disks and more than one storage pool.

Symptom	Daemon fails with an assertion
Environment	All
Trigger	During mmreclaimspace, reclaimReservedThinSpaceInPool() invoked accessAllocMapById wrongly using 'poolIndex' instead of 'poolId', leading to the assertion. The assertion is hit when the file system has thin disks and more than one storage pool.
Workaround	None

5.2.2.1

thin provisioning
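
The following Python sketch illustrates why confusing a pool id with a pool index only shows up when a file system has more than one storage pool. The pool names and ids are taken from the mmfsadm output quoted under IJ53561; the list layout and both lookup helpers are hypothetical and are not a description of GPFS internals.

```python
from typing import Optional

# (name, pool id) in array order; the array position is the pool index.
# Names and ids come from the dump quoted under IJ53561; the layout is assumed.
POOLS = [("system", 0), ("data", 65537), ("flash", 65538)]


def lookup_by_index(pool_index: int) -> str:
    """Hypothetical helper: return the pool stored at an array position."""
    return POOLS[pool_index][0]


def lookup_by_id(pool_id: int) -> Optional[str]:
    """Hypothetical helper: return the pool whose id matches, or None."""
    for name, pid in POOLS:
        if pid == pool_id:
            return name
    return None


# For the first pool, id and index are both 0, so a mix-up goes unnoticed.
same_for_first_pool = lookup_by_id(0) == lookup_by_index(0)

# For the second pool, index 1 and id 65537 differ: passing the index
# to an id-based lookup finds no matching pool.
mixed_up = lookup_by_id(1)
correct = lookup_by_id(65537)
```

With a single pool the two values coincide, which is consistent with the APAR's note that more than one storage pool is needed to hit the assertion.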

IJ53481

When monitoring is stopped and restarted due to pod movement, an error occurred that prevented the change from being communicated to the cluster manager. This is more likely in CNSA than in a classic Scale deployment.

Symptom	Error output/message
Environment	ALL Linux OS environments
Trigger	When monitoring is stopped and restarted due to pod movement, an error occurred that prevented the change from being communicated to the cluster manager.
Workaround	As a possible workaround, run mmhealth node show --resync -a.

5.2.2.1

System Health

IJ53482

The event points to a large pagepool combined with a relatively slow network. Depending on the usage, this might not be an issue, so the event is raised as a tip, which customers should be able to hide (acknowledge) so that it no longer appears in mmhealth or the GUI.

Symptom	Error output/message
Environment	ALL Linux OS environments
Trigger	Incorrect classification of event.
Workaround	As a possible workaround, change the trigger value pagepoolNetworkRatioPercent to a value lower than the default of 10%.

5.2.2.1

System Health

IJ53372

GPFS leaks kernel memory every time a user who is a member of more than 32 groups tries to access an inode that denies access to that user through simple mode bits (no ACL). This might go unnoticed, but if these conditions occur repeatedly, the kernel memory leak can affect node operations, requiring a reboot to avoid outages.

Symptom	Abend/Crash (in the worst case that the kernel memory leak goes undetected, leading to OOM kills and node outage)
Environment	ALL Linux OS environments
Trigger	All Scale Users
Workaround	The only workaround is to reduce the number of groups so that no user is a member of more than 32 groups.

5.2.2.1

All Scale Users
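
To apply the workaround for IJ53372, it helps to know which users exceed the limit. The Python sketch below lists them. It is only an illustration under stated assumptions: it counts every distinct group returned for a user, including the primary group (the APAR does not say whether the primary group counts toward the 32), and `local_memberships()` reads the node's own user database through the standard `pwd` module and `os.getgrouplist`, which may not match how group membership is resolved in a given cluster.

```python
import os
from typing import Dict, List

GROUP_LIMIT = 32  # threshold named in the APAR


def users_over_limit(memberships: Dict[str, List[int]],
                     limit: int = GROUP_LIMIT) -> List[str]:
    """Return, sorted, the users that belong to more than `limit` distinct groups."""
    return sorted(user for user, gids in memberships.items()
                  if len(set(gids)) > limit)


def local_memberships() -> Dict[str, List[int]]:
    """Collect group ids for every user known to this node (Unix only)."""
    import pwd  # imported here because the module exists only on Unix
    return {entry.pw_name: os.getgrouplist(entry.pw_name, entry.pw_gid)
            for entry in pwd.getpwall()}
```

On a Linux node, `users_over_limit(local_memberships())` returns the candidates whose group lists would need to be trimmed.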

IJ53490

The timeout test result is not consistent on AMD EPYC-Genoa processors. If the test passes, the workaround for GSKit hangs is not applied, which causes problems later.

Symptom	Installation and admin commands hang.
Environment	Linux OS environments
Trigger	This problem affects AMD EPYC-Genoa.
Workaround	Manually apply the workaround

5.2.2.1

Admin Commands, gskit

IJ53548

Attempting to set a timestamp in GPFS to a time before 1 January 1970 results in an unexpected timestamp being stored. GPFS currently stores timestamps as a 32-bit unsigned integer, and thus can store timestamps from 1 January 1970 00:00:00 UTC to 7 February 2106 06:28:15 UTC. Setting a timestamp before 1970 was silently accepted.

Symptom	Unexpected Results/Behavior
Environment	ALL Linux OS environments
Trigger	Attempt to set timestamp on a GPFS inode before 1970, e.g.: touch -m -t 196001010000 testfile
Workaround	Avoid setting timestamps outside the supported range in GPFS.

5.2.2.1

All Scale Users
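
The range stated in IJ53548 follows directly from an unsigned 32-bit count of seconds since the epoch. The Python sketch below derives the upper bound and offers a simple range check that a script could run before setting a timestamp. The helper name `fits_u32_timestamp` is hypothetical, and the check reflects only the range given in the APAR, not any GPFS API.

```python
from datetime import datetime, timezone

# Bounds of an unsigned 32-bit seconds-since-epoch counter.
U32_MIN = 0
U32_MAX = 2**32 - 1

earliest = datetime.fromtimestamp(U32_MIN, tz=timezone.utc)
latest = datetime.fromtimestamp(U32_MAX, tz=timezone.utc)


def fits_u32_timestamp(dt: datetime) -> bool:
    """Return True if a timezone-aware datetime fits in unsigned 32-bit epoch seconds."""
    seconds = int(dt.timestamp())
    return U32_MIN <= seconds <= U32_MAX


# The date used in the APAR trigger (touch -m -t 196001010000) is out of range.
trigger_date_ok = fits_u32_timestamp(datetime(1960, 1, 1, tzinfo=timezone.utc))
```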

IJ53549

Concurrently issuing read system calls while mmap writeback is running for the same region in the same file can result in an assert being hit. This is due to a coordination problem between the mmap writeback and the handling of read system calls.

Symptom	Abend/Crash
Environment	ALL Linux OS environments
Trigger	mmap a file and write to the mmap region. Issue regular read calls concurrently with the background mmap writeback. Enabling HAWC seems to increase the likelihood of hitting this problem.
Workaround	The race condition leading to this assert always exists. During tests it was only hit with HAWC enabled, so disabling HAWC might make it less likely to be hit.

5.2.2.1

All Scale Users

IJ53550

On AIX, GPFS commands may fail with sed: illegal option -i. This may occur if the cluster key has expired and GPFS commands try to regenerate a temporary certificate to get things going.

Symptom	Error output/message, Upgrade/Install failure
Environment	AIX/Power only
Trigger	This issue affects AIX nodes whose cluster keys have expired.
Workaround	If the cluster key has expired, generate new keys and commit the new keys before the upgrade. Run GPFS commands on nodes whose sed command supports the -i option.

5.2.2.1

Admin Commands

IJ53560

After an SGPanic caused by an out-of-space (OOS) condition on the physical device(s) in a file system with FCMs, there is a high possibility that diskSpaceState stays in DSS_OOS after the emergency space is released.

Symptom	After the file system hits an SGPanic, it is unmounted. For recovery, the file system is mounted in 'restricted' mode and the emergency space is released (with the 'mmreclaimspace --emergency-reclaim' command). However, after the emergency space is released, the file system stays in DSS_OOS, contrary to expectation. Because of this, the file system can't be mounted in 'space-reclaim' mode, which is a step required for the recovery.
Environment	Linux
Trigger	On a file system with FCM 4, fill up the file system until all physical capacity is used so that it triggers an SGPanic due to the OOS (Out Of Space) condition.
Workaround	None

5.2.2.1

thin-provisioning

IJ53561

When a file system has multiple storage pools and not all of them are thin-provisioning enabled, the storage pools that are not thin-provisioning enabled end up reserving the emergency space. This is unnecessary and is a bug to be fixed.

Symptom	After a file system is mounted, grep 'thin inode' in the internal dump... it will show the following message for the storage pool that is not thin-provisioning enabled.
[root@lothal-qa6-1 ~]# mmfsadm dump all | grep "thin inode"
0: name 'system' Valid nDisks 32 nInUse 32 id 0 poolFlags 2 thin inode 41 nBlocks 4124
1: name 'data' Valid nDisks 4 nInUse 4 id 65537 poolFlags 0 thin inode 42 nBlocks 4131 <<<<<<<
2: name 'flash' Valid nDisks 8 nInUse 8 id 65538 poolFlags 2 thin inode 46 nBlocks 4131
Environment	Linux/AIX
Trigger	Mount a file system and check 'thin inode' in the internal dump (mmfsadm dump all).
Workaround	None

5.2.2.1

thin-provisioning

IJ53600

A Linux kernel change caused GPFS to break disk I/O into many small requests.

Symptom	Performance Impact/Degradation
Environment	ALL Linux OS environments with kernel version >= 5.1
Trigger	N/A
Workaround	None

5.2.2.1

All Scale Users
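
To gauge whether a node falls in the environment listed for IJ53600 (kernel version >= 5.1), the release string can be compared against that threshold. The Python sketch below does this with a hypothetical helper, `in_affected_range`. It compares only the leading major.minor numbers, which is an assumption: distribution kernels often backport changes, so the result is a rough indicator rather than proof that a node is or is not affected.

```python
import re
from typing import Tuple


def kernel_version(release: str) -> Tuple[int, int]:
    """Extract (major, minor) from a kernel release string such as '6.4.0-150600'."""
    match = re.match(r"(\d+)\.(\d+)", release)
    if match is None:
        raise ValueError(f"unrecognized kernel release: {release!r}")
    return int(match.group(1)), int(match.group(2))


def in_affected_range(release: str, minimum: Tuple[int, int] = (5, 1)) -> bool:
    """Return True if the release is at or above the minimum (major, minor)."""
    return kernel_version(release) >= minimum
```

On a live node, `platform.release()` from the Python standard library supplies the release string to pass in.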

IJ53595

The AFM gateway node becomes unresponsive during startup due to numerous filesystem mount requests triggered by active I/O to multiple filesets.

Symptom	Performance impact
Environment	Linux Only
Trigger	Gateway node startup with multiple AFM filesets starting the recovery.
Workaround	None

5.2.2.1

AFM

IJ53593

Logging failures to the failed list file causes a deadlock within the mmafmcosctl binary.

Symptom	Deadlock.
Environment	Linux Only
Trigger	Failures that need to be logged occur in the download/upload subcommand of mmafmcosctl.
Workaround	Run download/upload without --enable-failed-list-file; the problem should not occur.

5.2.2.1

AFM

IJ53594

An earlier fix for the same issue returned RESTART between the gateway node and the application node only when the queue is dropped. However, there can be cases where the gateway node is being shut down without the queue being in the dropped state.

Symptom	IO Failure
Environment	Linux Only
Trigger	Trigger a read on a single large file from COS to cache and meanwhile shut down the fileset's gateway or start a gateway node that was already shut down.
Workaround	None

5.2.2.1

AFM

IJ53592

If it is the first or only operation on the list and we attempt to queue it through startMarker, the escaped path is used instead of the unescaped path, which causes a failure to queue the file name in the proper format.

Symptom	Unexpected behavior
Environment	Linux Only
Trigger	AFM
Workaround	In the list file given for download, place files without escape sequences in their names ahead of file names with special characters that require escaping.

5.2.2.1

AFM

IJ52948

Kernel crash in Scale 5.2.1.1: general protection fault and system crash. The crash happens due to memory corruption after mounting a GPFS file system. Sometimes this happens during a file system mount and sometimes a little while after.

Symptom	Memory corruption and subsequent crash
Environment	Linux Only
Trigger	No particular kernel version is required. For example, the customer who hit this issue was running 4.18.0-553.16.1.el8_10.x86_64, and the problem was also reproduced on a 6.4 kernel. The length of the fstab entry must fall in a particular range. The memory is allocated from the slab cache, whose objects have fixed sizes, so the allocation may have some extra room up to the object boundary, and no corruption occurs until that boundary is crossed. The kernel slabs have object sizes of 8, 16, 32, 64, 96, 128, 192, 256, 512, and so on. For the problem to appear, the fstab entry must have a sizeable number of characters and options after the gpfsdev="fsname" option, which leads to writing more bytes than were requested.
Workaround	None

5.2.2.1

Scale core
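
The slab behavior described in the IJ52948 trigger can be illustrated with a small Python model. It uses only the object sizes listed in the APAR text; the helper names are hypothetical, and the model ignores everything else about the kernel allocator. It shows why an overrun can stay unnoticed until the written length crosses the object boundary.

```python
# Object sizes as listed in the APAR text; the real list continues past 512.
SLAB_SIZES = [8, 16, 32, 64, 96, 128, 192, 256, 512]


def slab_object_size(requested: int) -> int:
    """Return the smallest listed slab object size that can hold `requested` bytes."""
    for size in SLAB_SIZES:
        if requested <= size:
            return size
    raise ValueError("request exceeds the sizes modeled in this sketch")


def overrun_stays_in_object(requested: int, written: int) -> bool:
    """True if writing `written` bytes into a `requested`-byte allocation
    stays inside the slab object and therefore corrupts no neighbor."""
    return written <= slab_object_size(requested)


# A 70-byte request is served from the 96-byte slab, leaving 26 bytes of slack.
slack = slab_object_size(70) - 70
```

In this model, writing 90 bytes into a 70-byte request stays inside the 96-byte object, while writing 100 bytes crosses the boundary, which corresponds to the case the APAR describes as corrupting memory.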

IJ53426

When a new file system manager takes over after the old file system manager loses quorum, the new file system manager can read the stripe group descriptor too early, which can cause stripe group descriptor updates to be lost.

Symptom	Unexpected Results/Behavior
Environment	ALL Operating System environments
Trigger	The file system manager loses quorum while running a command that updates the stripe group descriptor.
Workaround	None

5.2.2.0

All Scale Users