IJ42140 |
Suggested |
The GPFS daemon logs errors of the form "NUMA mbind failed for pagepool address ..." in the mmfs log.
Symptom |
Error output/message |
Environment |
Linux |
Trigger |
Running on a system with more than one NUMA node. |
Workaround |
set verbsNumaAffinity=disable |
|
5.1.5.1 |
RDMA |
IJ41553 |
Suggested |
GPFS includes kernel modules that are loaded at startup and used by other components. Usage counters were not maintained correctly in the tracedev module, which can lead to the module being unloaded while still in use, resulting in a kernel crash.
One case where this is possible is running the "mmvdisk server configure" and "mmvdisk server unconfigure" commands with the --recycle option.
Symptom |
Abend/Crash |
Environment |
Linux |
Trigger |
Shut down and start up GPFS. This is a rare problem, so running this sequence or the mentioned "mmvdisk server" commands in a loop will be necessary to trigger it. |
Workaround |
Avoid stopping GPFS immediately after starting up. |
|
5.1.5.1 |
Core GPFS |
IJ42141 |
Suggested |
mmafmctl prefetch -Y hits a segfault.
Symptom |
segfault. |
Environment |
Linux |
Trigger |
mmafmctl prefetch command with -Y option |
Workaround |
None |
|
5.1.5.1 |
AFM |
IJ42221 |
High Importance
|
When using mmcrcluster (or likely mmaddnode) with a hostname that resolves to over 64 characters in length, various failures can occur during creation.
Symptom |
Abend/Crash |
Environment |
All |
Trigger |
Long hostnames (e.g. from /etc/hosts) during node creation. |
Workaround |
Force hostnames to be under 64 characters long. |
|
5.1.5.1 |
CCR |
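The hostname-length trigger above can be screened for before cluster creation. Below is a minimal sketch of such a pre-flight check; the helper name check_node_names and the injectable resolve parameter are hypothetical and only illustrate the 64-character limit described in this entry, not any actual Scale tooling:

```python
import socket

HOSTNAME_LIMIT = 64  # resolved names longer than this have caused node-creation failures

def check_node_names(nodes, resolve=socket.getfqdn):
    """Return (node, resolved_name) pairs whose resolved canonical name
    exceeds HOSTNAME_LIMIT characters; hypothetical helper for screening
    nodes before running mmcrcluster or mmaddnode."""
    too_long = []
    for node in nodes:
        resolved = resolve(node)  # canonical name as the resolver (e.g. /etc/hosts) reports it
        if len(resolved) > HOSTNAME_LIMIT:
            too_long.append((node, resolved))
    return too_long
```

Any node the check reports should have its name shortened (for example in /etc/hosts) before the cluster is created.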
IJ42222 |
Suggested |
Node designation for pmcollector is not removed after a pod is rescheduled to another node
Symptom |
Unexpected Results/Behavior |
Environment |
All |
Trigger |
pmcollector pod gets rescheduled to another node. |
Workaround |
Restart system health monitoring on the affected core pod with 'mmsysmoncontrol restart'. |
|
5.1.5.1 |
System Health |
IJ42283 |
High Importance
|
A race between recovery group recovery and pdisk state update broadcasting may make a pdisk appear missing, which prevents log group recovery and blocks I/O.
Symptom |
Stuck IO |
Environment |
Linux |
Trigger |
Recovery group and log group failure caused by too many missing pdisks, which can result from bad disks, nodes, or network problems. |
Workaround |
Restart the gpfs daemons running on the GNR nodes that manage the affected recovery group. |
|
5.1.5.1 |
ESS/GNR |
IJ42294 |
High Importance
|
AFM recovery fails with error 80 due to incorrect checks of the inode attributes. This error causes replication to become stuck.
Symptom |
Unexpected results |
Environment |
Linux |
Trigger |
AFM recovery |
Workaround |
None |
|
5.1.5.1 |
AFM |
IJ42360 |
Suggested |
When one CES node gets rebooted, NFS client lock requests might fail with a "NLM_DENIED" error.
Symptom |
Lock request will fail (NLM_DENIED or NLM_BLOCKED error can be seen in tcpdump reply frame of LOCK Request). |
Environment |
All |
Trigger |
When one of the protocol nodes of the cluster gets rebooted or a failover happens and a lock request is attempted on the same file. |
Workaround |
None in NFSv3. Issue not present in NFSv4. |
|
5.1.5.1 |
NFS-Ganesha |
IJ42361 |
High Importance
|
The AFM gateway node deadlocks during a read operation if prefetch and an application try to read the same file simultaneously.
Symptom |
Deadlock |
Environment |
All |
Trigger |
Read operation on AFM uncached file. |
Workaround |
None |
|
5.1.5.1 |
AFM |
IJ42422 |
Suggested |
If an upgrade is being performed to 5.1.3+ with msgqueue enabled and the IBM Spectrum Scale Knowledge Center instructions are not followed correctly, one can get into a state where 'mmmsgqueue config --remove-msgqueue' can no longer be run to finish the migration off of the msgqueue.
Symptom |
Unexpected Results/Behavior |
Environment |
Linux |
Trigger |
Upgrading cluster to 5.1.3+ without disabling and/or migrating off of the msgqueue beforehand. |
Workaround |
None |
|
5.1.5.1 |
Watch Folder / File audit logging |
IJ42423 |
High Importance
|
When one node in an ESS building block is down, its recovery group is failed over to the remaining server in the building block. To ensure consistency of logged writes, the surviving node must read from the logtip backup device, an SSD that sits in the external storage enclosure.
If there are problems with the SAS fabric and frequent transient I/O errors occur over it, a bug in the ESS log tip logic prevents the read request from being retried, and recovery of the recovery group fails.
This means the failover operation described above will not complete, resulting in a potential outage. The following message occurs in this situation (but is not by itself sufficient to confirm it): [E] Unable to read logTipBackup vdisk RG002LOGTIPBACKUP track 0 due to fatal pdisk IO errors!
Symptom |
Component Level Outage |
Environment |
Linux |
Trigger |
Failing over a recovery group when the SAS fabric is unstable. |
Workaround |
None |
|
5.1.5.1 |
ESS/GNR |
IJ42485 |
High Importance
|
When an NVMe device is becoming active, it is necessary for ESS to poll the device to determine if it is ready for I/O. It does this by polling the final LBA of the device to see if reads are allowed.
This is because the devices become visible to the OS prior to becoming ready to handle read/write requests. The original implementation, however, would incorrectly claim that media errors on the final LBA mean that the device isn't ready.
As a result, it is possible that legitimate media problems on the final LBA of an NVMe will induce ESS to claim that the entire device is not available. This problem can be identified by an NVMe pdisk going missing after seeing unrecovered read errors in the Spectrum Scale RAID recovery group event log (mmvdisk recoverygroup list --events).
Symptom |
Component Level Outage |
Environment |
Linux |
Trigger |
Corrupted physical block mapped to the final logical block within an NVMe namespace. |
Workaround |
None |
|
5.1.5.1 |
ESS/GNR |
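The readiness-polling logic described above can be sketched as a toy model; this is not ESS code, and the probe outcome values are invented for illustration. The point is that a media error on the probed final LBA is still an answer from the device, so the corrected logic treats it as "ready" rather than "device unavailable":

```python
# Possible outcomes of reading the final LBA of an NVMe namespace (illustrative only).
OK, NOT_READY, MEDIA_ERROR = "ok", "not_ready", "media_error"

def device_ready(probe_results):
    """Poll the final LBA until the device answers.

    A media error is still an answer: the device is up and only that
    block is bad, so the device must not be declared unavailable."""
    for result in probe_results:
        if result == OK:
            return True
        if result == MEDIA_ERROR:
            # The original (buggy) behavior treated this as "device not ready".
            # Corrected behavior: the device is ready; the bad block is left
            # to normal I/O error recovery.
            return True
        # NOT_READY: the device is visible to the OS but not yet serving I/O; keep polling.
    return False
```

In the fixed behavior only a run of NOT_READY results (the device never answering) leaves the device marked unavailable.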
IJ42496 |
High Importance
|
The following assert goes off: logAssertFailed: totalReceived == scatteredP->scattered_total_len || (totalReceived == 0 && scatteredIndex == scatteredP->scattered_count)
Symptom |
Abend/Crash |
Environment |
All |
Trigger |
Poor network conditions that lead to TCP connection reconnects. |
Workaround |
None |
|
5.1.5.1 |
Core GPFS |
IJ42501 |
High Importance
|
The pmsensor GPFSVFSX sensor outputs 0 for read and write stats even though read/write operations are occurring. The format of the data provided by mmpmon is not what Zimon expects, which causes the output to be wrong.
Symptom |
Error output/message |
Environment |
All |
Trigger |
Read GPFSVFSX stats |
Workaround |
None |
|
5.1.5.1 |
perfmon (Zimon) |
IJ42615 |
High Importance
|
Some smbtorture tests in the vfs.fruit module fail with errors such as NT_STATUS_OBJECT_NAME_COLLISION, NT_STATUS_INVALID_PARAMETER, or "bad name". The VFS module may unintentionally use filesystem permissions instead of the ACL from the xattr.
Symptom |
Error output/message |
Environment |
All |
Trigger |
This is applicable only to the fruit module, i.e. Mac OS clients. The problem can occur when accessing streams in the VFS module where the vfs_acl_xattr subroutine is used. |
Workaround |
None |
|
5.1.5.1 |
SMB |
IJ42724 |
Suggested |
GEMS trace doesn't print ledState correctly, which might make troubleshooting more difficult.
Symptom |
Error output/message |
Environment |
Linux |
Trigger |
None |
Workaround |
None |
|
5.1.5.1 |
GNR |
IJ40858 |
High Importance
|
ECE can control LEDs for NVMe drives on systems where LED control can be performed via sysfs. However, some systems might ship with a mix of drives with and without LED control. The ECE disk inventory breaks out with an error instead of handling them properly.
Symptom |
Unexpected Results/Behavior |
Environment |
Linux |
Trigger |
Run ECE on an NVMe system with a mix of drives with and without LED control. |
Workaround |
None |
|
5.1.5.1 |
GNR |
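The fixed inventory behavior amounts to handling the missing-LED case per drive instead of aborting the whole scan. A minimal sketch, assuming a read_led_state callable that stands in for the per-drive sysfs read (both names are invented for illustration):

```python
def collect_led_states(drives, read_led_state):
    """Record the LED state of each drive; drives without LED control
    (no sysfs entry) are recorded as None instead of aborting the
    entire inventory run."""
    inventory = {}
    for drive in drives:
        try:
            inventory[drive] = read_led_state(drive)
        except FileNotFoundError:
            # This drive exposes no LED control in sysfs; note it and move on.
            inventory[drive] = None
    return inventory
```

The design point is simply that an unsupported drive yields a placeholder entry rather than terminating the inventory with an error.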
IJ42745 |
High Importance
|
mmbuildgpl fails on SLES 15.3 with new kernel 5.3.18-150300.59.90-default or later with an error like: "No rule to make target 'vmlinux', needed by '/usr/lpp/mmfs/src/gpl-linux/kdump-kern-dummy.ko'"
Symptom |
mmbuildgpl will fail on SLES 15.3 kernel version 5.3.18-150300.59.90-default or later. |
Environment |
Linux |
Trigger |
mmbuildgpl will fail on SLES 15.3 when kernel is upgraded to 5.3.18-150300.59.90-default. |
Workaround |
Clear the KBUILD_BUILTIN macro inside
/usr/lpp/mmfs/src/gpl-linux/Kbuild by adding
KBUILD_BUILTIN :=
This can be done after the surrounding code below:
#For s390x: -pg and -fomit-frame-pointer are incompatible
ifeq ($(ARCH),s390)
ifdef CONFIG_FUNCTION_TRACER
ORIG_CFLAGS := $(KBUILD_CFLAGS)
KBUILD_CFLAGS = $(subst -pg,,$(ORIG_CFLAGS))
endif
endif
KBUILD_BUILTIN := |
|
5.1.5.1 |
Build |
IJ42746 |
High Importance
|
logAssertFailed: fileId.inodeNum > 0 when running AFM Recovery or Resync
Symptom |
Lost Membership |
Environment |
Linux |
Trigger |
Role reversal that makes the old Primary the Secondary and promotes the old Secondary to Primary. |
Workaround |
None |
|
5.1.5.1 |
AFM |
IJ43115 |
High Importance
|
smb2.streams.names2 test in smbtorture failed with error NT_STATUS_CONNECTION_DISCONNECTED
Symptom |
Error output/message |
Environment |
All |
Trigger |
This is applicable only for streams module. |
Workaround |
None |
|
5.1.5.1 |
SMB |
IJ43164 |
Critical |
If the verbsRdmaSend configuration is enabled, and the verbs connection is disconnected and reconnected due to any error other than node shutdown or node failure, some RPC reply messages may be left in the internal table unintentionally.
These messages remain in the internal table forever, as no ack messages can clean them up. Deadlock does not occur immediately, because these RPC messages have been processed correctly.
However, the problem may occur when the 32-bit message IDs wrap around and are reused. Some new messages may be recognized as duplicate RPCs and be rejected by the destination node. These new messages stay in the 'pending' state and result in deadlock.
Symptom |
Hang/Deadlock/Unresponsiveness/Long Waiters |
Environment |
Linux |
Trigger |
For a cluster which has the verbsRdmaSend configuration enabled, this problem may occur if the verbs connection is disconnected and reconnected due to any error other than node shutdown or node failure (for example because of network issue). |
Workaround |
Recycle GPFS daemon. |
|
5.1.5.1 |
RDMA |
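The wraparound failure mode above can be illustrated with a toy model of a per-connection reply table. The class and its fields are invented for illustration and do not reflect the real GPFS data structures; the model only shows how a leaked, never-acked entry makes a later message with the same wrapped 32-bit ID look like a duplicate RPC:

```python
MSG_ID_SPACE = 2 ** 32  # message IDs are 32-bit and wrap around

class ReplyTable:
    """Toy model of a per-connection RPC reply table keyed by message ID."""
    def __init__(self):
        self.entries = set()  # IDs of messages awaiting an ack
        self.next_id = 0

    def new_message(self):
        """Assign the next ID; return (id, accepted)."""
        msg_id = self.next_id % MSG_ID_SPACE
        self.next_id += 1
        if msg_id in self.entries:
            # A stale, never-acked entry (leaked across a reconnect) makes
            # the new message look like a duplicate RPC: it is rejected and
            # stays pending forever -> deadlock.
            return msg_id, False
        self.entries.add(msg_id)
        return msg_id, True

    def ack(self, msg_id):
        """An ack removes the entry; leaked entries never get one."""
        self.entries.discard(msg_id)
```

In this model, acking every entry before the IDs wrap keeps the table clean, which is why the problem only surfaces long after the faulty reconnect.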
IJ43166 |
Suggested |
The system health monitor for GDS (GPUDirect Storage) does not warn if RDMA is not enabled. Enabled RDMA (mmchconfig verbsRdma=enable) is a prerequisite for using GDS.
Symptom |
Unexpected Results/Behavior. GDS does not run when RDMA is not enabled. |
Environment |
All |
Trigger |
GDS monitoring enabled, but verbsRdma not enabled. |
Workaround |
None |
|
5.1.5.1 |
RDMA |
IJ43349 |
High Importance
|
The policy rule generated by mmbackup contains an EXTERNAL LIST rule that instructs mmapplypolicy to call BAexecScript, which is also generated by mmbackup. One of the options passed to the script was -auditlogname, which is deprecated.
Symptom |
Component Level Outage |
Environment |
All |
Trigger |
This problem occurs if user-defined policy is used with -P option, and the policy contains -auditlogname option, and the Scale version is 5.1.4.0, 5.1.4.1, or 5.1.5.0. |
Workaround |
Remove "-auditlogname=" option from EXTERNAL LIST rule. |
|
5.1.5.1 |
mmbackup with -P option |
IJ43362 |
High Importance
|
Samba was failing to contact domain controllers via DNS SRV requests. Also, configuring permissions at the SMB share level (with mmsmb exportacl) appeared to have no effect on file system operations over SMB.
Symptom |
Error output/message |
Environment |
All |
Trigger |
This can happen rarely. |
Workaround |
None |
|
5.1.5.1 |
SMB |
IJ43416 |
High Importance
|
Incorrect permissions are set when the objects are downloaded using the mmafmcosctl command.
Symptom |
Unexpected results |
Environment |
Linux |
Trigger |
Object download with mmafmcosctl command. |
Workaround |
None |
|
5.1.5.1 |
AFM |
IJ34399 |
Medium Importance |
logAssertFailed: (lhsP->size == rhsP->size && lhsP->size == resultP->size). While mmrestripefs with the -b option is running for rebalancing, adding more dataOnly or metadataOnly disks can cause the rebalancing threads to hit this disk bitmap operation assert, because the disk bitmap size increases when disks are added.
Symptom |
mmfsd daemon process died. |
Environment |
All |
Trigger |
Adding dataOnly or metadataOnly disks to the file system while an mmrestripefs -b command is in progress. |
Workaround |
Don't add dataOnly or metadataOnly disks to the file system while mmrestripefs -b option command is in progress. |
|
5.1.5.0 |
mmrestripefs command with -b option |
IJ40965 |
Critical |
When an AFM fileset snapshot is being created or deleted, either through manual snapshot commands or AFM DR periodic snapshot operations, a user's file operations in the AFM fileset could proceed without interlocking with the snapshot operations, triggering this assert.
Symptom |
Daemon crash on AFM gateway node and file system manager node |
Environment |
All |
Trigger |
Doing file operations in an AFM fileset while creating or deleting snapshots for the same AFM fileset at the same time. |
Workaround |
None, other than disabling or stopping snapshot operations in the AFM fileset. |
|
5.1.5.0 |
AFM |
IJ41620 |
High Importance
|
Running mmbuildgpl on x86_64 with Linux kernels that include fixes for the retbleed vulnerability (CVE-2022-29900) results in an error. As a result, GPFS is not usable with these kernel versions. Specifically, this problem is hit with:
-
SLES 15 SP3 kernel update 5.3.18-150300.59.87.1 or higher
-
SLES 15 SP4 kernel update 5.14.21-150400.24.11.1
-
Ubuntu 22.04 kernel update 5.15.0-45.48
It is expected that the same changes will also be backported to RHEL, but no RHEL kernel updates with retbleed fixes have been released yet. The same applies to Ubuntu 20.04; no kernel updates with these changes have been released yet, but this should happen eventually.
The information provided by the Linux distributions is a useful reference:
https://www.suse.com/security/cve/CVE-2022-29900.html
https://ubuntu.com/security/CVE-2022-29900
https://access.redhat.com/security/cve/CVE-2022-29900
Symptom |
Component Level Outage (GPFS will be unusable on the node). |
Environment |
Linux (x86_64) |
Trigger |
This problem occurs when updating the Linux kernel to a version with retbleed patches included. |
Workaround |
The required change can also be applied manually:
-
Edit the file /usr/lpp/mmfs/src/gpl-linux/Kbuild
- Around line 100 there is a line:
$(KBHOSTPROGS) := lxtrace
- Before that line, add a new one with:
CFLAGS_kdump-kern.o += -mfunction-return=keep
- Save the file and run mmbuildgpl again.
|
|
5.1.5.0 |
Core GPFS |
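The manual Kbuild edit in the workaround above can be scripted. The sketch below operates on a local stand-in copy of the file (its content is invented for the demo); before applying the same edit to the real /usr/lpp/mmfs/src/gpl-linux/Kbuild, verify that the anchor line appears exactly as quoted in the workaround:

```python
from pathlib import Path

KBUILD = Path("Kbuild")  # demo copy; real file: /usr/lpp/mmfs/src/gpl-linux/Kbuild
ANCHOR = "$(KBHOSTPROGS) := lxtrace"
FIX = "CFLAGS_kdump-kern.o += -mfunction-return=keep"

# Stand-in file content for demonstration only (invented).
KBUILD.write_text("obj-m += mmfs26.o\n" + ANCHOR + "\n")

lines = KBUILD.read_text().splitlines()
if FIX not in lines:               # make the edit idempotent
    idx = lines.index(ANCHOR)      # raises ValueError if the anchor line moved
    lines.insert(idx, FIX)         # add the override just before the anchor
    KBUILD.write_text("\n".join(lines) + "\n")
```

After the edit, rerunning mmbuildgpl as described in the workaround picks up the new compiler flag.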