IJ34882 |
High Importance |
Assert: SGNotQuiesced sgmrpc.C
Symptom |
Scale mmfsd daemon process crash |
Environment |
All |
Trigger |
Snapshot create or delete operations |
Workaround |
None |
|
5.1.1.4 |
Core GPFS |
IJ34886 |
Suggested |
When multiple nodes are creating files in the same directory, creates can slow down during recovery.
Symptom |
Long Waiters |
Environment |
All |
Trigger |
File system crash |
Workaround |
None |
|
5.1.1.4 |
Core GPFS |
IJ34346 |
High Importance |
If FIPS is enabled, call home uploads fail; manual call home uploads crash with an error mentioning FIPS.
Symptom |
Component Level Outage |
Environment |
Linux |
Trigger |
Enabling FIPS |
Workaround |
Disable FIPS. |
|
5.1.1.4 |
Call home |
IJ34917 |
High Importance |
AFM gateway node crashes if the home is not responding while mounting the fileset target path.
Symptom |
Crash |
Environment |
Linux |
Trigger |
AFM caching with unresponsive home |
Workaround |
None |
|
5.1.1.4 |
AFM |
IJ34927 |
High Importance |
logAssertFailed: exclLockWord == 0
Symptom |
Assert |
Environment |
POWER |
Trigger |
NA |
Workaround |
Disable the assert. |
|
5.1.1.4 |
Core GPFS |
IJ34928 |
HIPER |
Data loss may happen when an application uses the direct I/O mode to write to a pre-allocated file block.
Symptom |
Unexpected Results/Behavior |
Environment |
All |
Trigger |
Application uses the direct I/O mode to write to a pre-allocated file block. |
Workaround |
None |
|
5.1.1.4 |
Core GPFS |
IJ34931 |
High Importance |
Drives on an ESS3k may not show up after a boot or reboot of a canister. You can detect these errors using: lspci -s 0x87 | grep DpcSta | grep Trigger+ or lspci -s 0x3c | grep DpcSta | grep Trigger+
Symptom |
Component Level Outage |
Environment |
Linux (x86_64) |
Trigger |
ESS 3000 boot or reboot canister (very rare) |
Workaround |
You can use the setpci utility to manually clear the DPC error flag of the 0x87 and 0x3c busses. This forces the devices to attempt to retrain. If a drive still does not train, there is some other issue. |
|
5.1.1.4 |
ESS, GNR |
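The detection step above can be sketched as a small loop. The DpcSta sample line below is an illustrative stand-in for real output (on a live canister you would read it from `lspci -s <bus> -vvv`); the bus numbers are the ones named in the entry.

```shell
# Check both PCIe buses named in the entry for a tripped DPC
# (Downstream Port Containment) flag. The sample line stands in for
# real `lspci -vvv` output so this sketch runs anywhere.
sample='DpcSta: Trigger+ Reason:Uncorrectable SwTrigger- RP PIO ErrPtr: 1f'
for bus in 0x87 0x3c; do
  if printf '%s\n' "$sample" | grep 'DpcSta' | grep -q 'Trigger+'; then
    echo "bus $bus: DPC triggered - downstream drives may be missing"
  fi
done
```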
IJ33911 |
Suggested |
The mmhealth encryption component shows "checking" instead of "healthy". Unlike other components, this check is not refreshed by a timer but only by incoming events, so it must start with the "healthy" status.
Symptom |
Unexpected Results/Behavior |
Environment |
Linux |
Trigger |
Encryption active |
Workaround |
For the encryption component, read "checking" as "healthy" as no error or warning events have happened. |
|
5.1.1.3 |
System health |
IJ33948 |
Suggested |
If a file system is set to maintenance mode, it is listed as 'SUSPENDED', but only an 'unmounted_fs_check' event is shown as the reason. It should say 'maintenance state' instead.
Symptom |
Error output/message |
Environment |
All |
Trigger |
The 'fs_maintenance_mode' event is only at info-level, since it is a user intended state. Info-level events are in general not reported by 'mmhealth node show' since they do not indicate an issue or error state. A code change was done to allow the 'fs_maintenance_mode' event to be listed as a reason. |
Workaround |
None |
|
5.1.1.3 |
System health |
IJ33949 |
High Importance |
On a cluster with two quorum nodes and tiebreaker disks, an unexpected quorum loss can be seen on the challenger node when the current cluster manager is shut down (mmshutdown) or its node reboots.
Symptom |
File System Outage (unexpected GPFS file system unmount for about 30 seconds) |
Environment |
All |
Trigger |
GPFS shutdown (mmshutdown) or node reboot of current cluster manager |
Workaround |
Move the cluster manager role by using the 'mmchmgr -c ' command. |
|
5.1.1.3 |
Cluster Manager |
IJ33997 |
Suggested |
AFM prefetch derives the actual mount path from a given directory path by using the character count of the fileset path. If an unrelated directory path matches up to that count, prefetch starts processing it successfully even though the path does not belong to the fileset.
Symptom |
Prefetch processes an invalid directory path that does not belong to the fileset. |
Environment |
All |
Trigger |
Prefetch starts working on an invalid path where the dir path doesn't belong to the same fileset. |
Workaround |
None |
|
5.1.1.3 |
AFM |
IJ34000 |
Suggested |
GPFS has fileset-level permission flags which can deny setting the mode or EAs on fileset entities, depending on which mode the operation targets. AFM doesn't consider this flag on the fileset, which results in E_PERM from the home and causes the queue to stall. The normal queue is fine, but mostly the recovery or resync queue hits this issue.
Symptom |
Unexpected Results |
Environment |
Linux |
Trigger |
Set the same fileset level permissions flag (setAclOnly, ChmodOnly, chmodAndUpdateAcl, etc.) on both Cache/Primary and/or Home/Secondary sites and perform IO to the fileset and then run recovery or resync. |
Workaround |
Drop the operations that stall the queue when the fileset-level permissions are enabled at only one of the two sites. |
|
5.1.1.3 |
AFM |
IJ34001 |
High Importance |
mmkeyserv client register, deregister or rkm change command will fail if the new RKM.conf contains expired certificates.
Symptom |
Error output/message Unexpected Results/Behavior |
Environment |
Linux, Windows (x86_64) |
Trigger |
This occurs when there is a client that is registered to multiple tenants and the certificate has expired, or when there are multiple clients that are registered to at least one tenant and their certificates have expired. |
Workaround |
Use the mmkeyserv client update command to update the client certificates. Otherwise, shut down GPFS and retry the command. |
|
5.1.1.3 |
Admin commands, Encryption |
IJ34002 |
High Importance |
When a file system has a high number of block allocation regions, the processing of allocation manager RPC could be slower than expected.
Symptom |
Hang/Deadlock/Unresponsiveness/Long Waiters |
Environment |
All |
Trigger |
Running mmdf |
Workaround |
Avoid running mmdf. |
|
5.1.1.3 |
Core GPFS |
IJ34136 |
High Importance |
With thousands of client nodes mounted in the file system, adding some more disks serviced by ESS 3000 nodes can cause long waiters trying to get NSD disk information on each client node.
Symptom |
Stuck mmadddisk command |
Environment |
Linux |
Trigger |
Create new NSD disks from an ESS 3000 or an ECE cluster and add them to a file system before starting the GPFS service. |
Workaround |
Restart the GPFS service on the ESS 3000 or ECE nodes, or fail over the RG master from one node to the other one. |
|
5.1.1.3 |
ESS 3000, ECE, Admin commands |
IJ34142 |
High Importance |
The automatic restart of NFS (remedy action) is blocked by an open unmounted_fs_check event which is not relevant for NFS/SMB exports.
Symptom |
Performance Impact/Degradation |
Environment |
Linux (CES nodes running NFS) |
Trigger |
File systems with the automount flag set and an unmounted file system |
Workaround |
Remove the "automount" flag from the affected file system. |
|
5.1.1.3 |
System health |
IJ34144 |
Suggested |
The RAS event dir_sharedroot_perm_problem was sometimes raised by mmhealth without need; when it was warranted, the event description did not describe what is wrong with the permissions and which permissions should be provided.
Symptom |
Error output |
Environment |
Linux |
Trigger |
cesSharedRoot does not have permissions 'rx' for 'group' and 'others'. |
Workaround |
Provide the necessary permissions for cesSharedRoot ('rx' for 'group' and 'others'). |
|
5.1.1.3 |
System health |
IJ34145 |
High Importance |
The Mellanox firmware manager was called frequently (around every minute) by the system health monitor. That caused a high CPU load.
Symptom |
Performance Impact/Degradation |
Environment |
All |
Trigger |
The Mellanox firmware check is executed too frequently by the system health monitor. There is no need for so much checking. |
Workaround |
None |
|
5.1.1.3 |
System health |
IJ34151 |
Suggested |
The timestamps displayed in the output of "mmdiag --iohist" on Windows nodes may show incorrect values, especially for the decimal part of the seconds. This may also cause incorrect duration reporting of the affected I/O operations.
Symptom |
Unexpected Results/Behavior |
Environment |
Windows (x86_64) |
Trigger |
None |
Workaround |
None |
|
5.1.1.3 |
Admin commands |
IJ34152 |
Critical |
mmsysmon daemon does not start and mmhealth does not work on AIX.
Symptom |
Component Level Outage |
Environment |
AIX |
Trigger |
Installing Scale 5.1.1.0-5.1.1.2 on an AIX node. |
Workaround |
1. In the file /usr/lpp/mmfs/lib/mmsysmon/CallhomeUpdateRequest.py, remove the line "import requests". 2. Restart Sysmonitor on this node: mmsysmoncontrol restart |
|
5.1.1.3 |
System health, Call home, GUI |
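The two workaround steps above can be rehearsed as a short shell sketch. It operates on a scratch copy rather than the real /usr/lpp/mmfs/lib/mmsysmon/CallhomeUpdateRequest.py, GNU sed's `-i` is assumed, and the restart step is shown only as a comment.

```shell
# Rehearse step 1 of the workaround on a scratch copy of the file.
f=$(mktemp)
printf 'import os\nimport requests\nimport sys\n' > "$f"
sed -i '/^import requests$/d' "$f"   # drop the import that fails on AIX
grep -c '^import' "$f"               # two imports remain (os, sys)
rm -f "$f"
# Step 2 on the real node: mmsysmoncontrol restart
```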
IJ34190 |
Suggested |
Ganesha fails to open files when over 1 million files are open.
Symptom |
Check for logs "Futility count exceeded. Client load is opening FDs faster than the LRU thread can close them." and values of current_open and former_open. |
Environment |
Linux |
Trigger |
Whenever a client opens more than 1 million files. |
Workaround |
None |
|
5.1.1.3 |
CES NFS |
IJ34194 |
High Importance |
When an application reads with an I/O size that is a multiple of the GPFS block size, prefetching doesn't start until the application issues a second read request, unless the read starts at the beginning of the file or prefetchAggressiveness is set to prefetchOnFirstAccess. This can cause slow read performance when the read I/O size is very large.
Symptom |
Performance Impact/Degradation |
Environment |
All |
Trigger |
An application issues reads with an I/O size that is much bigger than the GPFS block size. |
Workaround |
Set prefetchAggressiveness configuration to prefetchOnFirstAccess or reduce the read IO size to the GPFS block size. |
|
5.1.1.3 |
Core GPFS |
IJ34200 |
Suggested |
When the mmchmgr command is used to assign a new file system manager, it could fail with a "No log available" message after the current file system panics with a "No log available" error. This can happen if the file system is not externally mounted on any node.
Symptom |
Error output/message |
Environment |
All |
Trigger |
Using the mmchmgr command to assign a new file system manager. |
Workaround |
Mount the file system before issuing the mmchmgr command. |
|
5.1.1.3 |
Core GPFS |
IJ34221 |
High Importance |
Too many slots are reported by tslsenclslot for an LSI enclosure which reports duplicate enclosure ids.
Symptom |
Unexpected Results/Behavior |
Environment |
Linux |
Trigger |
Users with LSI Megaraid enclosures which have repeated eidx values when using the 'storcli /call/eall show all j' command. |
Workaround |
None |
|
5.1.1.3 |
ESS, GNR |
IJ34289 |
Critical |
AFM gateway may assert if the home server is not responding during a prefetch.
Symptom |
Crash |
Environment |
Linux |
Trigger |
AFM prefetch |
Workaround |
Stop prefetch until the efix is installed. |
|
5.1.1.3 |
AFM |
IJ34315 |
Suggested |
After a remote error 2, the fileset goes to the NeedResync state.
Symptom |
Unexpected Results/Behavior |
Environment |
Linux |
Trigger |
The fileset is getting replicated to COS and there is a rename operation in the queue. |
Workaround |
None |
|
5.1.1.3 |
AFM |
IJ34389 |
Critical |
Running online fsck in repair mode (-o -y) can cause it to detect false-positive lost blocks (that is, blocks that are actually assigned to files) and mark them as free, which can lead to duplicate block corruptions.
Symptom |
Data corruption due to duplicate blocks |
Environment |
All |
Trigger |
Running online fsck in repair mode (-o -y) |
Workaround |
Use offline fsck to fix corruptions. |
|
5.1.1.3 |
Online FSCK |
IJ34393 |
Critical |
Hard lockup between 2 pemsmod kernel threads can panic the kernel. Stack trace at vmcore-dmesg.txt will have something like this: [88432.803601] CPU: 27 PID: 14563 Comm: pemsRollUpQueue Kdump: loaded Tainted: G
Symptom |
Kernel crash |
Environment |
Linux (x86_64) |
Trigger |
System running heavy I/O workload |
Workaround |
None |
|
5.1.1.3 |
ESS, GNR |
IJ34170 |
Suggested |
The timestamps displayed in the output of "mmdiag --iohist" on Windows nodes may show incorrect values, especially for the decimal part of the seconds. This may also cause misreporting of the duration of the affected I/O operations.
Symptom |
Unexpected Results/Behavior |
Environment |
Windows (x86_64) |
Trigger |
Running "mmdiag --iohist" on Windows |
Workaround |
None |
|
5.1.1.3 |
Admin commands |
IJ34251 |
High Importance |
Too many slots are reported by tslsenclslot for an LSI enclosure which reports duplicate enclosure ids.
Symptom |
Unexpected Results/Behavior |
Environment |
Linux |
Trigger |
Users with LSI Megaraid enclosures which have repeated eidx values when using the 'storcli /call/eall show all j' command. |
Workaround |
None |
|
5.1.1.3 |
ESS, GNR |
IJ32947 |
High Importance |
On an AIX node, on some occasions, including when the /var file system becomes full, mmfsd is unable to run child processes, resulting in different failures depending on the process that mmfsd attempts to run. Operations that have been seen to fail include mmadddisk and mmauth.
Once the problem is triggered, it remains until the mmfsd daemon is restarted. If the problem is initiated by the /var file system getting full, freeing up space on that file system is not enough to solve the problem. An indication that the problem is taking place is in the output of the /usr/lpp/mmfs/bin/tslsfs nonexistent_FS command (that is, passing the name of a nonexistent file system as the parameter). On a system where the problem is occurring, the output will be "mmcommon getEFOptions nonexistent_FS failed. Return code 1", while on a system without the problem, the output will be "mmcommon: File system nonexistent_FS is not known to the GPFS cluster."
Symptom |
Unexpected Results/Behavior |
Environment |
AIX |
Trigger |
A likely trigger for the problem is the /var file system being filled, possibly around the time an operation is taking place that results in information being produced to the mmfs.log file. |
Workaround |
Once the issue in /var is resolved, restart mmfsd. |
|
5.1.1.2 |
Core GPFS |
IJ33003 |
Suggested |
While using IBM Spectrum Scale Erasure Code Edition running on LSI MegaRaid adapters, if the slotmap.yaml file is edited directly, several unintended consequences can arise that would not show up when using the drive mapping utility. This can include allowing several disallowed characters such as the hyphen in the location code name.
Symptom |
Error output/message Unexpected Results/Behavior |
Environment |
Linux |
Trigger |
Users who edit the slotmap.yaml file on LSI MegaRAID systems may be affected. |
Workaround |
Avoid using leading "0" when creating slot location codes. After editing a slotmap.yaml file, run tslsenclslot.lmr --check-slot-map to verify that the mapped location codes are valid and have the expected form. |
|
5.1.1.2 |
ESS, ECE, GNR |
IJ33049 |
High Importance |
In the current implementation of eviction on a file, the eviction program first acquires a DMAPI lock on the file and punches a hole in it. The program can be terminated at any point without the DMAPI lock being released, causing a lock leak; a later DMAPI lock acquire on the file can then deadlock, and the only way out is to restart mmfsd.
Symptom |
Deadlock |
Environment |
Linux, AIX |
Trigger |
Trying to evict a file or list of files, and the eviction getting killed midway through. |
Workaround |
None |
|
5.1.1.2 |
AFM |
IJ33082 |
High Importance |
If a new file is created and renamed before AFM could replicate it to the COS, with parallel IO enabled, an incorrect target path is sent to the worker gateway node, causing remote error 2.
Symptom |
Unexpected Results/Behavior |
Environment |
Linux |
Trigger |
AFM replication with parallel IO enabled |
Workaround |
Disable parallel IO. |
|
5.1.1.2 |
AFM |
IJ33084 |
Suggested |
Mounting a file system can hang.
Symptom |
File system mount hangs. |
Environment |
Linux |
Trigger |
Mounting a file system. |
Workaround |
None |
|
5.1.1.2 |
Core GPFS |
IJ33095 |
Suggested |
Assert "(verify == 0) || (ofP == __null) || (ofP->sgP == __null) || ofP->isRoSnap() || (ofP->metadata.getInodeStatus() != 1) || (!ofP->sgP->isFileIncludedInSnapshot(ofP->getInodeNum(), ofP->getSnapId(), getInodeStatus())) || (ofP->assertInodeWasCopiedToPrevSnapshot()) || (ofP->isBeingRestriped() || ofP->beenRestriped)".
Symptom |
Daemon crash |
Environment |
All |
Trigger |
Operations triggering a statlite call on a node without sufficient stat file token. |
Workaround |
Disable the statlite config parameter with "mmchconfig statliteMaxAttrAge=0 -i". |
|
5.1.1.2 |
Core GPFS |
IJ33103 |
Critical |
The afmParallelMounts option can be enabled at the fileset level, which creates parallel mounts to the different NFS servers. Some dentries created as part of the parallel mounts may not connect to the file system root dentry, causing a VFS busy-inodes issue when the fileset is stopped or unlinked.
Symptom |
Unexpected results |
Environment |
Linux |
Trigger |
AFM replication with afmParallelMounts enabled |
Workaround |
Disable afmParallelMounts. |
|
5.1.1.2 |
AFM, AFM DR |
IJ33163 |
Suggested |
This occurs on a compliant or compliant-plus mode fileset whose immutable files remain as is. When such files are taken up for AFM replication, the Resync/Recovery path can set the immutable attribute at the secondary and also remove the write flag. This ends up being seen as an ACL data mismatch between the sites.
Symptom |
Unexpected Results/Behavior |
Environment |
Linux |
Trigger |
None |
Workaround |
None |
|
5.1.1.2 |
AFM |
IJ33173 |
Critical |
During reconnect in the middle of a write operation, the below error may be reported: 2021-03-30_12:59:35.050-0400: [W] Encountered first checksum error on network I/O from NSD Client 10.10.10.10
Symptom |
IO error |
Environment |
Linux (s390x) |
Trigger |
An unstable network, which can lead to TCP connection reconnects. |
Workaround |
None |
|
5.1.1.2 |
Core GPFS |
IJ33174 |
Suggested |
Compliant and Compliant-Plus fileset modes can stall the queue.
Symptom |
Unexpected Behavior |
Environment |
Linux |
Trigger |
Role reversal in compliant IAM mode, with the filesets having Immutable files with expiration time set on them. |
Workaround |
None |
|
5.1.1.2 |
AFM, AFM DR |
IJ33177 |
Suggested |
When compiling gpfs.gplbin rpm packages on RHEL8 for multiple kernel versions, installing them at the same time might fail due to conflicting build ids in the packages.
Symptom |
Upgrade/Install failure |
Environment |
Red Hat Enterprise Linux 8.x |
Trigger |
RHEL8 RPM builds |
Workaround |
Remove the installed gpfs.gplbin package before installing the new one. |
|
5.1.1.2 |
Core GPFS |
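A general rpmbuild technique, not a Scale-specific option, that avoids build-id file conflicts between kernel-module packages is to disable build-id symlink generation when rebuilding the gpfs.gplbin packages; whether your rpm version (4.14+) honors this macro is an assumption to verify before relying on it.

```
# ~/.rpmmacros on the build host, or pass on the command line:
#   rpmbuild --define "_build_id_links none" ...
%_build_id_links none
```

The documented workaround, removing the installed gpfs.gplbin package before installing the new one, remains the safe path.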
IJ33190 |
High Importance |
IBM Spectrum Scale on an AIX node will crash when trying to put an NFSv4 ACL on a .snapshots directory (e.g. through the "aclput -t nfs4" command).
Symptom |
Abend/Crash |
Environment |
AIX |
Trigger |
Storing NFSv4 ACL on .snapshots directory on an AIX node. |
Workaround |
Do not try this operation. |
|
5.1.1.2 |
Core GPFS |
IJ33365 |
Suggested |
mmnetverify creates temporary test files when validating network functionality. After mmnetverify executes, the temporary test files still exist on the tested nodes of the cluster.
Symptom |
Accumulation of files in /var/mmfs/tmp and /tmp directories |
Environment |
Linux |
Trigger |
Running mmnetverify will cause the test files to be created. |
Workaround |
Run this command: mmdsh -N all rm -rf /var/mmfs/tmp/copy_file.* /tmp/copy_file.* |
|
5.1.1.2 |
mmnetverify |
IJ33366 |
Critical |
A readdir operation fails after a rename operation on an AFM object fileset in independent-writer mode, due to incorrect updates of the remote attributes.
Symptom |
Unexpected Results |
Environment |
All |
Trigger |
Rename on an AFM Object fileset in IW mode. |
Workaround |
None |
|
5.1.1.2 |
AFM |
IJ33367 |
High Importance |
A failover situation was generated by the NFS health monitor while a node was expelled in the cluster. The NFS service monitor detected a potential hung situation. As a result, a failover was triggered even though the system was able to recover itself after several minutes.
Symptom |
Performance Impact/Degradation |
Environment |
Linux (CES nodes) |
Trigger |
The NFS service monitor detected a potential hang, which means that the NFS NULL check failed and the number of internal NFS operations did not increase for a while (around 60 seconds). During that time NFS is in grace mode (allowing previous clients to reclaim their locks) and therefore unable to let new clients start their I/O work. This grace time was not considered by the systemhealth monitor, but it should increase the waiting time. |
Workaround |
The systemhealth monitor can be configured with a configuration option to signal a degraded state (nfs_unresponsive event) instead of triggering a failover (nfs_not_active event, error state). |
|
5.1.1.2 |
System health |
IJ33368 |
Suggested |
The mmces events active object command is failing because object is not a valid option.
Symptom |
Unexpected Behavior |
Environment |
Linux (CES nodes) |
Trigger |
The OBJ option was not included in the mmces events active command. |
Workaround |
None |
|
5.1.1.2 |
CES |
IJ33418 |
High Importance |
When the fileset moves to the unmounted or disconnected state, there is a window where SETXATTR operations from an SW cache can get queued to a non-GPFS home site and remain queued forever.
Symptom |
Unexpected Behavior |
Environment |
Linux (AFM gateway nodes) |
Trigger |
Fileset moving to unmounted or disconnected state and running SETXATTR operations from SW cache |
Workaround |
1. Drop the queued SetXattr operation using the "mmfsadm afm msgdrop" command, or 2. Ensure that the fileset is always active/dirty before performing SETXATTR operations at the SW/IW cache site. |
|
5.1.1.2 |
AFM |
IJ33419 |
Suggested |
The mmafmctl command has a provision to reclaim deleted inodes when resync is run manually. The inodes are reclaimed only under some conditions and not in other cases.
Symptom |
Unexpected Behavior |
Environment |
Linux (AFM gateway nodes) |
Trigger |
Running manual resync on fileset without recovery, with afmSkipResyncRecovery flag set on the fileset level. |
Workaround |
1. A full resync and then recovery needs to be run in order to reclaim deleted inodes. 2. A resync with afmSkipResyncRecovery tuned at the cluster level using the mmchconfig command should be run; fileset-level tuning doesn't work currently. 3. Worst case, an mmfsck needs to be run to reclaim the inodes. |
|
5.1.1.2 |
AFM |
IJ33420 |
Suggested |
When a thread performing shutdown and a thread initiating startup run concurrently, a kernel crash can result.
Symptom |
Abend/Crash |
Environment |
Linux |
Trigger |
Very small race window in GPFS cleanup process |
Workaround |
None |
|
5.1.1.2 |
Core GPFS |
IJ33421 |
Suggested |
If a Linux node is overloaded and the thread cannot be scheduled quickly, a kernel panic can result: RIP list_del_entry_valid.cold.
Symptom |
Abend/Crash |
Environment |
Linux |
Trigger |
mmshutdown on a busy Linux node |
Workaround |
None |
|
5.1.1.2 |
Core GPFS |
IJ33254 |
HIPER |
AFM might incorrectly drop write messages during an AFM recovery, causing a data mismatch between the cache or primary and the home or secondary cluster. AFM recovery is triggered if the in-memory queue is lost, for example, by a gateway node restart. With parallel IO enabled, WriteSplit messages are sent to the worker gateway nodes to write the file in parallel. If a WriteSplit message fails on the worker gateway node, the failed WriteSplit request is retried three times before it is dropped. Since the Write request is dropped without replicating the data to the home or secondary, it results in a data mismatch between the cache or primary and the home or secondary.
Symptom |
Unexpected Results/Behavior |
Environment |
Linux |
Trigger |
AFM recovery with parallel IO enabled |
Workaround |
Disable parallel IO using the command "mmchfileset device fileset -p afmParallelWriteThreshold=disable" |
|
5.1.1.2 |
AFM, AFM DR |
IJ33530 |
Critical |
AFM gateway node crash when the home or secondary is not responding.
Symptom |
Crash |
Environment |
Linux |
Trigger |
AFM replication when the home or secondary is not responding. |
Workaround |
None |
|
5.1.1.2 |
AFM, AFM DR |
IJ33532 |
Critical |
When the mmafmcosctl upload command is used with the --all option, AFM LU (local-updates) mode uploads an incorrect object name (the old name) if the file was already renamed at the cache.
Symptom |
Unexpected results |
Environment |
Linux |
Trigger |
Uploading the renamed objects from the LU mode cache. |
Workaround |
None |
|
5.1.1.2 |
AFM |
IJ33535 |
Critical |
tsenclstat causes a coredump with segmentation fault whenever it runs on a system with only one SAS adapter hooked up to a storage enclosure. This most commonly occurs in a "daisy-chain" configuration.
Symptom |
Error output/message |
Environment |
All |
Trigger |
This issue affects customers with daisy-chain storage enclosure configurations in which only one SAS adapter is connected from the server to a storage enclosure. It occurs whenever tsenclstat runs, which will occur automatically every few minutes as part of the daemon's regular status check. |
Workaround |
Ensure there are two SAS adapters hooked up to each storage enclosure. |
|
5.1.1.2 |
ESS, GNR |
IJ33568 |
High Importance |
cNFS does not work on RHEL8.x. This is due to a change in the pidof command in RHEL8.
Symptom |
Unexpected Results/Behavior, Node Reboot |
Environment |
Red Hat Enterprise Linux 8.x |
Trigger |
Enabling cNFS on RHEL8.x nodes. |
Workaround |
Downgrade or upgrade procps-ng to procps-ng-3.3.15-3.el8 |
|
5.1.1.2 |
cNFS |
IJ33567 |
High Importance |
AFM Prefetch is not generating the prefetch end callback event registered through the afmPrepopEnd event.
Symptom |
Unexpected Results/Behavior |
Environment |
Linux |
Trigger |
Register for afmPrepopEnd callback event, and run AFM prefetch with list file or directory option. |
Workaround |
None |
|
5.1.1.2 |
AFM |
IJ33607 |
Suggested |
[X] logAssertFailed: numaNodesP[node].numaNode != -2 in mmfs.log.latest and daemon will not start
Symptom |
Abend/Crash |
Environment |
Linux |
Trigger |
One or more NUMA nodes without any CPU or memory resources. |
Workaround |
Reallocate the LPAR and ensure there are no NUMA nodes without any CPU or memory resources. |
|
5.1.1.2 |
NUMA Awareness |
IJ32097 |
High Importance |
If the disks for a file system are not ready to be used yet and the command "mmfsadm dump deferreddeletions" is run at the same time, the command will fail with the side effect of causing a long waiter 'waiting for SG cleanup' when the file system is deleted and recreated.
Symptom |
Long Waiters |
Environment |
All |
Trigger |
NA |
Workaround |
None |
|
5.1.1.1 |
Core GPFS |
IJ32159 |
High Importance |
Operations requiring allocation of full metadata blocks are slow. Examples: expanding the number of allocated inodes, creating a new independent fileset.
Symptom |
Performance Impact/Degradation |
Environment |
All |
Trigger |
Operations requiring allocation of full metadata blocks. Examples: expanding the number of allocated inodes, creating a new independent fileset. |
Workaround |
Add more disks to the system pool. |
|
5.1.1.1 |
Core GPFS |
IJ32186 |
High Importance |
There appears to be an issue at the systemd layer that causes the startup service to fail with a connection timeout during reboot. If autoload is set to yes, GPFS may not be able to start up, or it may get stuck waiting for the environment to be initialized.
Symptom |
GPFS does not start after a reboot. |
Environment |
Linux |
Trigger |
This issue affects clusters with autoload set to yes that hit the systemd connection timeout during reboot. |
Workaround |
Manually restart GPFS. |
|
5.1.1.1 |
GPFS startup, CCR, systemd |
IJ31735 |
Suggested |
The gpfs_next_inode and gpfs_stat_inode APIs return inode 0 as the first inode with an invalid state.
Symptom |
Unexpected result |
Environment |
All |
Trigger |
gpfs_next_inode/gpfs_stat_inode APIs |
Workaround |
None |
|
5.1.1.1 |
GPFS APIs |
IJ31841 |
High Importance |
When getting the stats of a file, users could run into the assert: "Assert exp((verify == 0) || (ofP == __null) || (ofP->sgP == __null) || ofP->isRoSnap() || (ofP->metadata.getInodeStatus() != 1) || !ofP->sgP->isFileIncludedInSnapshot(ofP->getInodeNum(), ofP->getSnapId(), getInodeStatus())) || (ofP->assertInodeWasCopiedToPrevSnapshot()) || (ofP->isBeingRestriped() || ofP->beenRestriped)" if there are writes to the same file from other nodes.
Symptom |
Daemon crash |
Environment |
Linux |
Trigger |
Getting the lite stat of a file while writes are in progress from other nodes. |
Workaround |
Run the mmchconfig command to reset the configuration "statliteMaxAttrAge=0", which disables statlite and avoids this problem, but it may also impact write performance on the other nodes. |
|
5.1.1.1 |
gpfs_statlite API |
IJ32218 |
Critical |
AFM prefetch fails with "too many open files" error.
Symptom |
Unexpected results |
Environment |
All |
Trigger |
AFM prefetch |
Workaround |
None |
|
5.1.1.1 |
AFM |
IJ32219 |
High Importance |
AFM logs error 124 (error not supported) when the control file is not available at the home site (a non-GPFS home site).
Symptom |
Daemon crash |
Environment |
Linux gateway nodes |
Trigger |
Try to set EAs on a file when home is a non-GPFS node which doesn't contain the AFM control file. |
Workaround |
None |
|
5.1.1.1 |
AFM |
IJ32223 |
Suggested |
After converting a legacy recovery group to an mmvdisk-managed recovery group, poor write performance is observed from an application, and the GPFS daemon does not come up on some nodes because of an OOM issue.
Symptom |
Abend/Crash Performance Impact/Degradation |
Environment |
Linux |
Trigger |
When converting legacy recovery group to mmvdisk managed recovery group by using the following command: mmvdisk recoverygroup convert --recovery-group RgName[,RgName] --node-class NcName |
Workaround |
Use the following command to reset the pagepool to 60%: mmvdisk server change --node-class NcName --pagepool 60% --recycle one |
|
5.1.1.1 |
ESS, GNR |
IJ32226 |
High Importance |
When users run the mmlsfileset command, it randomly doesn't show the junction paths of some filesets.
Symptom |
Unexpected results |
Environment |
All |
Trigger |
One fileset's root directory has been corrupted for an unknown reason. |
Workaround |
None |
|
5.1.1.1 |
Fileset |
IJ32227 |
High Importance |
logAssertFailed: isNotCached() at ShHashS.C
Symptom |
Abend/Crash |
Environment |
All |
Trigger |
The race occurs between the initialization and release of the indirect block descriptor. |
Workaround |
This assert can be safely ignored by using mmchconfig disableAssert='ShHashS.C:5400-5800:isNotCached()' |
|
5.1.1.1 |
Core GPFS |
IJ31571 |
High Importance |
When mmchattr is issued with "--no-attr-ctime", it should not result in a ctime update.
Symptom |
Unexpected results |
Environment |
All |
Trigger |
mmchattr --no-attr-ctime |
Workaround |
None |
|
5.1.1.1 |
Core GPFS |
IJ32238 |
Suggested |
The system health monitor does not detect all paths for RDMA support (the libibverbs.so library) on Ubuntu machines and therefore reports an "ib_rdma_libs_wrong_path" issue.
(show details)
Symptom |
Error output/messages |
Environment |
Ubuntu Linux |
Trigger | The issue shows up on Ubuntu machines with RDMA in use. |
Workaround |
None |
|
5.1.1.1 |
System health |
IJ32245 |
Suggested |
The error "Command: err 46: tsunlinkfileset -f" occurs after mmunlinkfileset commands are invoked.
(show details)
Symptom |
Unable to unlink or delete the fileset which encountered this error. |
Environment |
All |
Trigger | Invoking the mmunlinkfileset command |
Workaround |
Reboot the node and retry the mmunlinkfileset command. |
|
5.1.1.1 |
Filesets |
IJ32287 |
Critical |
Application performance degradation while running on AFM filesets.
(show details)
Symptom |
Performance Impact/Degradation |
Environment |
Linux |
Trigger |
AFM replication |
Workaround |
None |
|
5.1.1.1 |
AFM, AFM DR |
IJ32344 |
Suggested |
When a fileset is created using mmafmcosconfig with the --perm option, the file entries of this fileset are created with the default 700 permission instead of the value specified with --perm.
(show details)
Symptom |
Permission set to default 700. |
Environment |
Linux |
Trigger | The file entries get listed from fileset root path. |
Workaround |
None |
|
5.1.1.1 |
AFM |
IJ32481 |
HIPER |
AFM recovery may incorrectly delete the files at home or secondary if there are any network issues during the home readdir.
(show details)
Symptom |
Unexpected Results |
Environment |
Linux |
Trigger |
AFM recovery |
Workaround |
Resync the fileset if there are any missing files at home. |
|
5.1.1.1 |
AFM, AFM DR |
IJ32506 |
Critical |
Assert exp(isFastCondvarPrepSignal(fcLockP->ul) && fcLockP->lw.slot < 16384) in line 4570 of file /project/sprelmax511/build/rmax511067B2b/src/avs/fs/mmfs/ts/tasking/dSynch.C
(show details)
Symptom |
Abend/Crash |
Environment |
Linux |
Trigger |
Race condition between two RDMA threads |
Workaround |
None |
|
5.1.1.1 |
RDMA |
IJ32507 |
High Importance
|
When a dependent fileset is created and linked under an AFM independent fileset, ACLs from the home dependent fileset are not fetched and set on the cache dependent fileset. This happens only for the dependent fileset root path.
(show details)
Symptom |
Unexpected Results |
Environment |
Linux |
Trigger |
AFM caching with dependent filesets |
Workaround |
None |
|
5.1.1.1 |
AFM |
IJ32521 |
High Importance
|
If the file system panics while the rapid repair functionality is being enabled or disabled with the mmchfs command, log recovery could fail due to the log records generated by rapid repair.
(show details)
Symptom |
I/O error for log recovery |
Environment |
All |
Trigger |
File system panic happens while the rapid repair is being enabled or disabled. |
Workaround |
None |
|
5.1.1.1 |
Rapid repair |
IJ32560 |
Suggested |
Copying an uncached file from a Samba share fails with the object backend while writing data to the cache. There is another issue: if a setXattr operation is in the queue, a synchronous read for the same file fails to return data to the application.
(show details)
Symptom |
Read operation fails. |
Environment |
Linux |
Trigger | Read uncached file from Samba share of the AFM cache. |
Workaround |
Add node names to the /etc/hosts file. |
|
5.1.1.1 |
AFM |
IJ32553 |
High Importance
|
AFM prefetch fails with error 238 if the prefetch list file contains symlinks and if their target paths do not exist as part of the same fileset.
(show details)
Symptom |
Unexpected results |
Environment |
Linux |
Trigger |
AFM prefetch |
Workaround |
None |
|
5.1.1.1 |
AFM |
IJ32554 |
High Importance
|
Issuing "mmchnode --daemon-interface" attempts to change the cluster configuration repository (CCR). When this mmchnode is issued from a Windows node, the CCR gets committed with invalid IPv4 information, leaving the cluster in a non-working state.
(show details)
Symptom |
The mmchnode command fails with a message trail resembling: 'mmchnode: Unable to commit new changes.' 'mmchnode: [E] The command was unable to reach the CCR service on any quorum node. Ensure the CCR service (mmfsd or mmsdrserv daemon) is running on all quorum nodes and the communication port is not blocked by the firewall.' 'mmchnode: 6027-1271 Unexpected error from function setRunningCommand. Return code: 149' |
Environment |
Windows (x86_64) |
Trigger |
Issuing "mmchnode --daemon-interface" command on a Windows node specifying an alternate IPv4 address. |
Workaround |
None. A manual CCR restore (mmsdrrestore --ccr-repair) may be necessary to restore the cluster to a working state. |
|
5.1.1.1 |
CCR |
IJ32554 |
Suggested |
The Linux fallocate(2) API doesn't work correctly on Spectrum Scale file systems when punching a hole beyond the end of the file.
(show details)
Symptom |
Punching a hole beyond the end of a file fails with EINVAL(22) error. |
Environment |
Linux |
Trigger | Punching a hole through the Linux fallocate(2) API. |
Workaround |
None |
|
5.1.1.1 |
fallocate(2) |
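The failing call path above can be exercised from the shell with the util-linux fallocate(1) wrapper around the fallocate(2) API; a minimal sketch, assuming util-linux is installed (the temporary file and sizes are illustrative, not part of the APAR):

```shell
# Create a 1 MiB sparse file, then punch a 64 KiB hole at offset 0.
# On affected Spectrum Scale levels, a hole beyond EOF failed with EINVAL(22).
f=$(mktemp)
truncate -s 1M "$f"
if fallocate --punch-hole --offset 0 --length 65536 "$f" 2>/dev/null; then
    echo "hole punched"
else
    echo "punch-hole not supported on this filesystem"
fi
rm -f "$f"
```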
IJ32608 |
High Importance
|
With the introduction of 5-level page tables, supported by Intel's Ice Lake processor generation, user-space memory is expanded by a factor of 512. This changes the kernel base address, and as a result GPFS asserts with the message "logAssertFailed: (UIntPtr)(vmallocStart)" while validating kernel addresses.
(show details)
Symptom |
Assert |
Environment |
Linux (x86_64) |
Trigger |
Systems that attempt to install Spectrum Scale on a newer Intel x86_64 processor with 5-level page tables enabled. |
Workaround |
Disable the 5-level page table setting by adding no5lvl to the kernel command line and then rebooting the node. Check the documentation of the Linux distribution used for details on how to apply this change. For example, on RHEL 8:
# grubby --update-kernel=ALL --args="no5lvl"
# cat /proc/cmdline
BOOT_IMAGE=(hd0,msdos1)/vmlinuz-4.18.0-240.10.1.el8_3.x86_64 root=/dev/mapper/rhel-root ro crashkernel=auto resume=/dev/mapper/rhel-swap rd.lvm.lv=rhel/root rd.lvm.lv=rhel/swap rhgb quiet net.ifnames=0 biosdevname=0 no5lvl |
|
5.1.1.1 |
Core GPFS |
IJ32627 |
High Importance
|
When doing preallocation and writes (e.g., Spectrum Protect Plus copy restore), the block usage of the file system is bigger than the total data size of these files.
(show details)
Symptom |
More disk space usage than expected. |
Environment |
All |
Trigger |
Preallocate the data blocks of the file, and then write as much data as the file size. |
Workaround |
Issue this command: mmchattr --compact=fragment |
|
5.1.1.1 |
Disk space preallocation of files |
IJ32628 |
Suggested |
When the mmdf command is run from a current working directory that has become stale (the directory was deleted after changing into it), the command states it was run from an invalid directory.
(show details)
Symptom |
The command states it was run from an invalid directory, and it fails with various additional errors. |
Environment |
All |
Trigger | Running the mmdf command from a directory that is stale (directory was deleted after going to it). |
Workaround |
Only use mm commands in a valid current working directory. Move to a directory that still exists within the node's file systems. |
|
5.1.1.1 |
Core GPFS |
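The workaround above can be automated with a small pre-flight check; a minimal sketch, assuming a POSIX shell (the fallback directory / is illustrative):

```shell
# Verify the current working directory still exists before running
# mm* administration commands; move to / if it has gone stale.
if [ -d "$(pwd)" ]; then
    echo "cwd valid"
else
    cd / && echo "moved to /"
fi
```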
IJ32632 |
Suggested |
Long waiters when running file audit logging or watch folder
(show details)
Symptom |
Long Waiters |
Environment |
All |
Trigger | Heavy stress on audited or watched filesystems or filesets. |
Workaround |
None |
|
5.1.1.1 |
Watch folder, File audit logging |
IJ32648 |
High Importance
|
GPFS allows NSD names of up to 255 characters, and there is no rule requiring them to contain an alphabetic character. Sufficiently long NSD names consisting entirely of digits can be a problem: two such NSDs can incorrectly be identified as the same NSD.
(show details)
Symptom |
Error output/message Unexpected Results/Behavior |
Environment |
All |
Trigger |
Long NSD names of all digits |
Workaround |
Add an alphabetic character to the NSD name. |
|
5.1.1.1 |
Core GPFS, Admin commands |
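NSD names that consist entirely of digits can be screened for before creation; a minimal sketch in POSIX shell, with hypothetical example names:

```shell
# Flag NSD names made up only of digits, which this APAR notes can be
# misidentified; names containing at least one letter are fine.
for name in 1234567890123 nsd001 987; do
    case "$name" in
        *[!0-9]*) echo "$name: ok" ;;
        *)        echo "$name: all digits -- consider renaming" ;;
    esac
done
```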
IJ32649 |
Critical |
On HAWC-enabled file systems, 'down' disks can cause a replica mismatch after the file system repair.
(show details)
Symptom |
Operation failure due to file system corruption |
Environment |
All |
Trigger |
Writes to HAWC-enabled file systems which have 'down' disks |
Workaround |
None |
|
5.1.1.1 |
HAWC |
IJ32651 |
High Importance
|
When a disk is down, the assertion "Assert exp(!addrDirty && !synchedStale) in line 6446 of file bufdesc.C" may be hit during a directory block merge if the block resides on the down disk.
(show details)
Symptom |
Abend/Crash |
Environment |
All |
Trigger |
Disk is down. |
Workaround |
None |
|
5.1.1.1 |
Core GPFS |
IJ32666 |
High Importance
|
logAssertFailed: mdiWorkingIndexP[entryIndex].wSlotAddr == slots line 4956 mdIndex.C when doing recovery group master recovery
(show details)
Symptom |
Abend/Crash |
Environment |
Linux (x86_64, PPC64, PPC64LE) |
Trigger |
RG master failure which causes recovery. |
Workaround |
None |
|
5.1.1.1 |
ESS, GNR |
IJ32667 |
Suggested |
Offline fsck cannot repair all corruptions when using the option of applying a patch file (i.e., mmfsck FSchk -v --patch-file path-towrite-patchfile --patch). When repairing corruption by applying a patch file, the fsck output shows the following messages indicating the issue: ---------------- Invalid BlockType Inode. Skipping patch. ----------------
(show details)
Symptom |
Error output/message and all corruptions not fixed |
Environment |
All |
Trigger | Offline fsck repairing corruptions by applying patch file. |
Workaround |
Run fsck repair with the regular option -y to fix the corruptions. |
|
5.1.1.1 |
FSCK |
IJ32668 |
Suggested |
There is no longer any difference in the stdout format between "mmces state cluster NFS" and "mmces state cluster NFS -Y". In former versions, a readable table was generated when -Y was not used, but the current output of "mmces state cluster NFS" is, for example:
# mmces state cluster NFS
mmcesstatecluster::HEADER:version:reserved:reserved:NODE:COMPONENT:STATE:EVENTS:
mmcesstatecluster::0:1::::nas22ces01-i:NFS:HEALTHY:csm_resync_forced,no_longwaiters_found,ccr_quorum_nodes_ok,service_running,node_resumed,nfs_dbus_ok,node_resumed,dns_found,dns_found:
mmcesstatecluster::0:1::::nas22ces02-i:NFS:HEALTHY:csm_resync_forced,ccr_quorum_nodes_ok,nlockmgr_rpcinfo_ok,mountd_rpcinfo_ok,service_running,node_resumed,nfs_rpcinfo_ok,nfsd_up,nfs_dbus_ok,wnbd_up,node_resumed,ads_up,dns_found,dns_krb_tcp_dc_msdcs_up,dns_found,dns_query_ok:
mmcesstatecluster::0:1::::nas22ces03-i:NFS:HEALTHY:csm_resync_forced,ccr_quorum_nodes_ok,service_running,node_resumed,nfs_dbus_ok,node_resumed,dns_found,dns_found:
mmcesstatecluster::0:1::::nas22ces04-i:NFS:HEALTHY:service_running,ccr_quorum_nodes_ok,service_running,node_resumed,nfs_dbus_ok,service_running,node_resumed,dns_found,dns_found:
mmcesstatecluster::0:1::::nas22ces05-i:NFS:HEALTHY:service_running,ccr_quorum_nodes_ok,service_running,node_resumed,nfs_dbus_ok,service_running,node_resumed,dns_found,dns_found:
mmcesstatecluster::0:1::::nas22ces06-i:NFS:HEALTHY:service_running,nfs_exported_fs_chk,nlockmgr_rpcinfo_ok,mountd_rpcinfo_ok,service_running,nfs_rpcinfo_ok,nfsd_up,service_running,dns_found,dns_found
(show details)
Symptom |
Incorrect output |
Environment |
Linux |
Trigger | Parsing of the machine readable output was not done correctly. |
Workaround |
None |
|
5.1.1.1 |
CES |
INFO001 |
Suggested |
The release of v5.1.1.0 aligned with the release of the v5.1.0.3 PTF. Please refer to v5.1.0.3 for the list of APARs.
|
5.1.1.0 |
INFO |