| IJ56483 |
High Importance
|
When the number of available file descriptors is exhausted, a logAssertFailed will be triggered.
| Symptom |
Error output/message |
| Environment |
ALL Operating System environments |
| Trigger |
All available file descriptors or sockets exhausted for the process. |
| Workaround |
Restart the mmfsd daemon. |
|
5.2.3.5 |
All Scale Users |
| IJ56498 |
High Importance
|
A race condition can lead to broken secure connections (cipherList set to AES128_SHA256 or AES256_SHA256) during connection reconnect, either when proactive reconnect is enabled or when a network issue triggers a connection disconnect in mmfsd.
| Symptom |
Customers may see error messages like the following in the GPFS message log files:
[N] Close connection to 10.28.2.47 c81f2u17vm1 <c0n2>:[0] (Unexpected error 22)
Unknown error: err = 22, error source = Error sending a message: ERR_SRC_SENDMSG_ERROR, error number = 19 |
| Environment |
ALL Operating System environments |
| Trigger |
Proactive reconnect is enabled, or a network issue triggers a connection disconnect. |
| Workaround |
Disable proactive reconnect (revert to the default setting) and ensure the network connectivity between the GPFS nodes is working correctly. |
|
5.2.3.5 |
All Scale Users |
| IJ51071 |
High Importance
|
The mmdefragfs command runs jobs across helper nodes. Each of those jobs can require finding owners of allocation regions. In some extreme cases, overlapping ownership queries can result in deadlocks.
| Symptom |
Deadlock |
| Environment |
ALL Operating System environments |
| Trigger |
mmdefragfs is running jobs on multiple nodes. If the ownership of allocation regions is unknown, those regions must be queried. The logic for tracking ownership dictates that if a region without a current owner is queried, the node issuing the query must request ownership. If the same regions are queried from multiple nodes while the ownership is initially unknown but then changes, this results in overlapping revoke requests, which leads to a deadlock. |
| Workaround |
Do not run mmdefragfs, or run mmdefragfs on a single node only. |
|
5.2.3.5 |
All Scale Users |
| IJ55679 |
Critical |
File system manager node could fail unexpectedly with assert exp((indIndex & 0xFF00000000000000ULL) == 0) in IndDesc.h. This could happen when expanding the number of allocated inodes on a file system that already has a very high number of allocated inodes.
| Symptom |
Abend/Crash |
| Environment |
ALL Operating System environments |
| Trigger |
Increasing the number of allocated inodes |
| Workaround |
Avoid creating new independent filesets and increasing the number of allocated inodes. |
|
5.2.3.5 |
All Scale Users |
| IJ56253 |
High Importance
|
If a filesystem has quotas enabled and a file is unlinked (its last directory entry removed) before a chown is performed, the chown call will fail with ENOENT, even though the file descriptor remains open and valid (see the C sketch after this entry).
| Symptom |
Unexpected Results/Behavior |
| Environment |
All Operating System environments |
| Trigger |
- Enable quota (file system or fileset level)
- Create a file, unlink it while the file descriptor is still valid.
- Set ownership for this file descriptor.
- Close the file descriptor. |
| Workaround |
Disable quotas |
|
5.2.3.5 |
Quotas |
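A minimal C sketch of the trigger sequence above, assuming a quota-enabled Scale file system; the mount-point path and the uid/gid passed to fchown() are hypothetical.

#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
    const char *path = "/gpfs/fs1/tmpfile";      /* hypothetical mount point */

    /* 1. Create a file. */
    int fd = open(path, O_CREAT | O_RDWR, 0600);
    if (fd < 0) { perror("open"); return 1; }

    /* 2. Unlink it while the descriptor is still valid. */
    if (unlink(path) != 0) { perror("unlink"); return 1; }

    /* 3. Set ownership through the open descriptor. On affected levels
     *    with quotas enabled, this is reported to fail with ENOENT. */
    if (fchown(fd, 1000, 1000) != 0)             /* hypothetical uid/gid */
        fprintf(stderr, "fchown: %s\n", strerror(errno));

    /* 4. Close the descriptor. */
    close(fd);
    return 0;
}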
| IJ56485 |
High Importance
|
Signal 11 in IterInodes::getSortedSnapshots() at filesys.h, resulting in a mmfsd crash.
| Symptom |
Abend/Crash |
| Environment |
ALL Operating System environments |
| Trigger |
Run batch snapshot workloads. |
| Workaround |
None |
|
5.2.3.5 |
Snapshots |
| IJ56486 |
High Importance
|
When mmfsckx finds that it does not have the minimum pagepool needed to run, it exits. On exiting, however, it was not deleting the fsckx snapshot, because the abort happened before it could initialize the internal vector lists it uses to delete the fsckx snapshot on exit.
| Symptom |
Unexpected Results/Behavior |
| Environment |
All |
| Trigger |
mmfsckx exiting due to insufficient pagepool available to run. |
| Workaround |
Use mmdelsnapshot to delete the snapshot. |
|
5.2.3.5 |
mmfsckx |
| IJ56487 |
Medium Importance |
Changes to the perfmon configuration are not propagated to nodes that were down at the time the change was made.
| Symptom |
The perfmon configuration is not updated. |
| Environment |
Linux Only |
| Trigger |
The perfmon configuration is changed while some nodes are down. |
| Workaround |
Reissue the mmperfmon command to update the configuration once all nodes are up, or run mmcommon run invokePerfmonctl update on the perfmon nodes that were down. |
|
5.2.3.5 |
Perfmon |
| IJ56491 |
Suggested |
The sysmon or syslog logs frequently display the following warning message:
statd_wrong WARNING The rpc.statd process is misconfigured.
This message indicates that the rpc.statd process is not configured properly.
| Symptom |
statd_wrong event in cluster |
| Environment |
Linux Only |
| Trigger |
This typically occurs because rpc.statd may spawn short-lived child processes as part of its normal operation. These child processes can briefly appear as multiple running instances. Additionally, sysmon performs health checks during the exit of these child processes, which can lead to the system reporting warnings that the rpc.statd process is misconfigured, even though this behavior is expected. |
| Workaround |
None |
|
5.2.3.5 |
NFS |
| IJ56619 |
Critical |
When running AIO, the thread submitting the I/O request is not the same as the one completing it. There is a race condition where an AIO request that is quickly completed is still accessed from the submitting thread. This results in either a kernel KFENCE warning or a node crash.
| Symptom |
Abend/Crash |
| Environment |
ALL Linux OS environments |
| Trigger |
Run AIO in a way that the requests are completed very quickly. This is workload dependent and might be hard to recreate. |
| Workaround |
There is no workaround; the fix is required to avoid this problem. |
|
5.2.3.5 |
All Scale Users |
| IJ56620 |
Critical |
When mmfsck detects a hole in a reserved file, it fills the hole by allocating a new disk address and adding that address to the file’s indirect block. It also updates its internal block allocation bitmap to mark the new block as in-use.
However, the internal block allocation bitmap is distributed across the scanning nodes. If the newly allocated block falls outside the region of the bitmap owned by the node that performed the allocation, the node may skip updating the bitmap. As a result, the block remains unmarked in the bitmap, which later leads mmfsck to falsely identify the block as lost. In repair mode, it then incorrectly marks the block as free. Later, when the file system is in use, it may reallocate this block to another file, resulting in duplicate block corruption.
| Symptom |
Operation failure due to FS corruption and SGPanic |
| Environment |
ALL Operating System environments |
| Trigger |
This issue can happen when mmfsck detects and repairs holes in reserved files. |
| Workaround |
Run mmfsck in repair mode (-y) again after the first repair run. |
|
5.2.3.5 |
FSCK |
| IJ56698 |
High Importance
|
GPFS daemon could fail unexpectedly with assert: exp(emptySduGroupCommit() || isSGPanicked). This could happen if a disk error causes a read/write of a directory block to fail.
| Symptom |
Abend/Crash |
| Environment |
ALL Operating System environments |
| Trigger |
Disk failure |
| Workaround |
None |
|
5.2.3.5 |
All Scale Users |
| IJ56699 |
High Importance
|
Upon installation, the gpfs.scaleapi rpm changes the ownership of directories and files under /var/mmfs from root to scaleapiadmd. It excludes files under /var/mmfs/ssl/keyServ, but not the RKM.conf file located in /var/mmfs/etc. Directories and files of file systems mounted under /var/mmfs are also affected. If RKM.conf has incorrect permissions or ownership, encryption may fail.
| Symptom |
- Incorrect permission/owner of files under other file systems mounted under /var/mmfs
- mmkeyserv command failure
- Incorrect permissions for the configuration file /var/mmfs/ssl/keyServ/RKM.conf in mmfs.log |
| Environment |
Linux only |
| Trigger |
Installation of the gpfs.scaleapi rpm package; mmkeyserv command failure; incorrect permissions/ownership of files under other file systems mounted under /var/mmfs |
| Workaround |
Manually change the ownership from scaleapiadmd/scaleapiadm back to root/root |
|
5.2.3.5 |
AFM, encryption |
|
Suggested |
When applications simultaneously use a writable, shared memory map (mmap) and perform regular write()/pwrite() operations to the same file that is subject to snapshot Copy-on-Write (COW), the file system can hit a three-way deadlock (see the C sketch after this entry). The cycle involves:
• a page-faulting mmap reader that triggers COW into a previous snapshot,
• a concurrent VMA change (e.g., mremap/munmap) that requires the kernel's mmap write semaphore,
• and a regular write path that holds the inode write lock and then page-faults on its user buffer (which also needs the mmap semaphore).
Once formed, the cycle blocks progress on the affected file and can ultimately lead to automatic deadlock breakup (filesystem panic/unmount) depending on configuration.
| Symptom |
• Application threads or system threads hang on file I/O to the affected file.
• Trace/logs show CopyDataOnWriteHandlerThread waiting on inode rf, a writer holding wa and blocked in a page fault, and a VMA operation holding/waiting on the mmap write semaphore.
• With deadlock breakup enabled, Scale may log multi-phase “deadlock breakup” and unmount/panic the impacted filesystem. |
| Environment |
All supported OS environments. |
| Trigger |
This issue affects customers that:
• use writable, shared mmap on files that may require snapshot COW, and
• perform regular write()/pwrite() to the same file, and
• occasionally execute VMA-altering operations such as mremap/munmap on the mapping.
A deadlock can occur when:
• An mmap page fault (“PF reader”) triggers CopyDataOnWrite for a prior snapshot and needs the inode rf lock.
• A concurrent writer holds the inode wa lock, then page-faults on its user buffer and must acquire the mmap semaphore.
• A concurrent mremap/munmap seeks the mmap semaphore as writer, blocking page-fault progress.
This forms a cycle (PF reader ↔ writer ↔ mremap) that stalls I/O on the file. |
| Workaround |
• Avoid concurrent VMA changes (mremap/munmap) while a file is actively accessed via writable shared mmap and regular writes on a snapshot-eligible file.
• Where feasible, separate write bursts from mmap page-fault activity on the same file, or map readers MAP_PRIVATE if application semantics allow.
(These are operational mitigations only; they do not fully prevent the issue.) |
|
5.2.3.5 |
GPFS/Scale — mmap, snapshot Copy-on-Write, locking. |
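A C sketch of the access pattern in this entry, assuming a hypothetical file path on a snapshot-eligible file system at least 1 MiB in size. It is not a deterministic reproducer: one thread faults pages of a shared writable mapping, one issues regular pwrite() calls, and one repeatedly maps and unmaps a second region (munmap being one of the VMA-altering operations named above, used here instead of mremap for safety).

#include <fcntl.h>
#include <pthread.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

static int fd;
static char *map;
static const size_t len = 1 << 20;   /* 1 MiB; file assumed at least this large */

static void *faulter(void *arg)      /* mmap page-fault path (may trigger COW) */
{
    (void)arg;
    for (int i = 0; i < 100000; i++)
        map[(size_t)(i * 4096) % len] = 1;
    return NULL;
}

static void *writer(void *arg)       /* regular write path */
{
    (void)arg;
    char buf[4096];
    memset(buf, 2, sizeof buf);
    for (int i = 0; i < 100000; i++)
        pwrite(fd, buf, sizeof buf, (off_t)((size_t)(i * 4096) % len));
    return NULL;
}

static void *vma_changer(void *arg)  /* VMA change needing the mmap write semaphore */
{
    (void)arg;
    for (int i = 0; i < 10000; i++) {
        void *m2 = mmap(NULL, len, PROT_READ, MAP_SHARED, fd, 0);
        if (m2 != MAP_FAILED)
            munmap(m2, len);
    }
    return NULL;
}

int main(void)
{
    fd = open("/gpfs/fs1/cowfile", O_RDWR);      /* hypothetical path */
    if (fd < 0) { perror("open"); return 1; }
    map = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (map == MAP_FAILED) { perror("mmap"); return 1; }

    pthread_t t[3];
    pthread_create(&t[0], NULL, faulter, NULL);
    pthread_create(&t[1], NULL, writer, NULL);
    pthread_create(&t[2], NULL, vma_changer, NULL);
    for (int i = 0; i < 3; i++)
        pthread_join(t[i], NULL);
    return 0;
}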
| IJ56680 |
Suggested |
mmbackup verifies directory size as one of the triggers to select objects to be sent to the IBM Storage Protect server. Since directory size is recalculated during restore, there is no need to re-back-up a directory if only its size differs. Hence, mmbackup will no longer verify size during the backup candidate selection process if the object is a directory.
| Symptom |
mmbackup may select unchanged directories as backup candidates |
| Environment |
ALL OS that supports mmbackup |
| Trigger |
run live fs backup and then run snapshot backup |
| Workaround |
none |
|
5.2.3.5 |
mmbackup |
| IJ56142 |
High Importance
|
With workloads that heavily look up or traverse symlinks, contention can occur inside GPFS. The problem is that every symlink lookup request from an application results in the symlink target being queried from the file system, resulting in possible contention on internal locks (see the C sketch after this entry).
| Symptom |
Performance Impact/Degradation |
| Environment |
ALL Linux OS environments |
| Trigger |
The problem is caused by heavily concurrent lookups of the same symlink by many threads. |
| Workaround |
There is no workaround. |
|
5.2.3.5 |
All Scale Users |
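A C sketch of the trigger: many threads resolving the same symlink concurrently, so that every lookup queries the symlink target from the file system. The symlink path and thread count are hypothetical.

#include <pthread.h>
#include <unistd.h>

#define NTHREADS 64                  /* hypothetical degree of concurrency */

static void *worker(void *arg)
{
    (void)arg;
    char buf[4096];
    for (int i = 0; i < 100000; i++) {
        /* Each resolution of the same symlink queries its target. */
        ssize_t n = readlink("/gpfs/fs1/shared-symlink", buf, sizeof buf);
        (void)n;
    }
    return NULL;
}

int main(void)
{
    pthread_t t[NTHREADS];
    for (int i = 0; i < NTHREADS; i++)
        pthread_create(&t[i], NULL, worker, NULL);
    for (int i = 0; i < NTHREADS; i++)
        pthread_join(t[i], NULL);
    return 0;
}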
| IJ56488 |
Suggested |
A hang can occur when three operations hit the same file at once:
a process touches a shared, writable mmap mapping and faults a page,
another thread/process performs mremap (needing the mmap write semaphore),
and a concurrent write()/pwrite() targets the same region.
Under certain timing, the page-fault path must fetch a file lock from the daemon while the writer is also fetching a conflicting lock. The result is a lock/semaphore cycle between the page-fault handler, the writer, and mremap, and I/O to that file can stall indefinitely.
| Symptom |
Threads hang in file operations; GPFS traces show the mmap page-fault path waiting on a fetched lock, a writer stuck on the mmap semaphore after initiating a daemon fetch, and mremap waiting for the semaphore upgrade. No progress until GPFS services are restarted.
Fix description (high level):
Extend the existing mmap uXfer (“borrowed lock”) fast-path into the daemon fetch path. When the kernel’s lock attempt requires a fetch, the daemon can—under safe conditions—temporarily “borrow” a read lock for the page-fault request and signal the kernel to proceed, breaking the cycle while preserving correctness. (Normal lock/token ownership is finalized once the fetch completes; error paths are handled so the kernel falls back safely if borrowing isn't possible.)
|
| Environment |
ALL OS environments |
| Trigger |
File is mmap'd MAP_SHARED|PROT_WRITE (or read-only with faults against the same region) while a concurrent write()/pwrite() targets the same range.
A mremap occurs concurrently, contending on the mmap semaphore.
Lock acquisition in the kernel returns E_NEED_FETCH and both the page-fault path and the writer rely on the daemon to fetch/upgrade the inode lock; specific timing can create a cyclic wait.
|
| Workaround |
None practical. (Avoiding concurrent mmap access and mremap/writes to the same region prevents the issue but is often not feasible.) |
|
5.2.3.5 |
All Scale Users |
| IJ56490 |
Critical |
Creating a new storage pool with allowWriteAffinity set to YES could lead to unexpected FSSTRUCT error if only a single small disk is used during storage pool creation.
| Symptom |
Error output/message |
| Environment |
ALL Operating System environments |
| Trigger |
Creating a new storage pool with allowWriteAffinity set to YES |
| Workaround |
Add multiple disks when creating a new storage pool with allowWriteAffinity set to YES. |
|
5.2.3.5 |
All Scale Users |
| IJ56734 |
High Importance
|
When reading from snapshot files, applications may encounter unexpected non-zero data in blocks that were never written to in the original (root) filesystem. These blocks were part of a pre-allocated file but remained uninitialized, and therefore should logically contain zeros. The error occurs because the snapshot exposes raw, uninitialized disk contents (garbage data) at these locations. This issue is specific to snapshots and does not occur when reading from the root filesystem, where such blocks are correctly interpreted as zero. (A C sketch of block pre-allocation follows this entry.)
| Symptom |
Unexpected Results/Behavior |
| Environment |
ALL Operating System environments |
| Trigger |
If both the snapshot and the root filesystem contain a block that was pre-allocated but never written to, reading from the snapshot may return uninitialized data ("garbage") instead of zeroes. |
| Workaround |
None |
|
5.2.3.5 |
Snapshots |
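As an illustration of how a "pre-allocated but never written" block arises, the following C sketch reserves blocks with fallocate() without writing them; the path and size are hypothetical. Reading the region through the live file system returns zeros, whereas on affected levels reading the same region through a snapshot could return uninitialized disk contents.

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    int fd = open("/gpfs/fs1/prealloc", O_CREAT | O_RDWR, 0644);  /* hypothetical */
    if (fd < 0) { perror("open"); return 1; }

    /* Reserve 1 MiB of disk blocks; the blocks are allocated but never
     * written, so they should logically read back as zeros. */
    if (fallocate(fd, 0, 0, 1 << 20) != 0)
        perror("fallocate");

    close(fd);
    return 0;
}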
| IJ56735 |
High Importance
|
Ganesha crashed at nfs3_create() -> fsal_internal_close():
(gdb) bt
#0 0x00007fcff18b3bbf in raise ()
#1 0x00007fcff37bd193 in crash_handler
#2 <signal handler called>
#3 0x00007fcff10f752f in raise ()
#4 0x00007fcff10cae65 in abort ()
#5 0x00007fcff10cad39 in __assert_fail_base.cold.0 ()
#6 0x00007fcff10efe86 in __assert_fail ()
#7 0x00007fcfebccd6ec in fsal_internal_close (fd=0, owner=0x0, cflags=0)
#8 0x00007fcfebcc240c in gpfs_reopen_func (obj_hdl=0x7fcecc387e30, openflags=67, fsal_fd=0x7fcecc387ee0)
#9 0x00007fcfebcc2954 in open_by_handle (obj_hdl=0x7fcecc387e30, state=0x0, openflags=67, createmode=FSAL_UNCHECKED, verifier=0x7fcf3f7b5e4c "", attrs_out=0x7fcf3f7b5a70)
#10 0x00007fcfebcc3322 in gpfs_open2 (obj_hdl=0x7fcfce7a2080, state=0x0, openflags=67, createmode=FSAL_UNCHECKED, name=0x7fcecc3d5400 "version.txt", attr_set=0x7fcf3f7b5f60, verifier=0x7fcf3f7b5e4c "", new_obj=0x7fcf3f7b5b70, attrs_out=0x7fcf3f7b5a70, caller_perm_check=0x7fcf3f7b5c2f)
#11 0x00007fcff38c6718 in mdcache_open2 (obj_hdl=0x7fcfdad0a8a8, state=0x0, openflags=67, createmode=FSAL_UNCHECKED, name=0x7fcecc3d5400 "version.txt", attrs_in=0x7fcf3f7b5f60, verifier=0x7fcf3f7b5e4c "", new_obj=0x7fcf3f7b6080, attrs_out=0x7fcf3f7b5e60, caller_perm_check=0x7fcf3f7b5c2f)
#12 0x00007fcff37915b6 in open2_by_name (in_obj=0x7fcfdad0a8a8, state=0x0, openflags=67, createmode=FSAL_UNCHECKED, name=0x7fcecc3d5400 "version.txt", attr=0x7fcf3f7b5f60, verifier=0x7fcf3f7b5e4c "", obj=0x7fcf3f7b6080, attrs_out=0x7fcf3f7b5e60)
#13 0x00007fcff3793e7d in fsal_open2 (in_obj=0x7fcfdad0a8a8, state=0x0, openflags=67, createmode=FSAL_UNCHECKED, name=0x7fcecc3d5400 "version.txt", attr=0x7fcf3f7b5f60, verifier=0x7fcf3f7b5e4c "", obj=0x7fcf3f7b6080, attrs_out=0x7fcf3f7b5e60)
#14 0x00007fcff38a6f36 in nfs3_create (arg=0x7fcecca10db0, req=0x7fcecca10580, res=0x7fceccaa03b0)
#15 0x00007fcff37b81c7 in nfs_rpc_process_request (reqdata=0x7fcecca10580, retry=false)
#16 0x00007fcff37b8764 in nfs_rpc_valid_NFS (req=0x7fcecca10580)
#17 0x00007fcff352deee in svc_vc_decode (req=0x7fcecca10580)
#18 0x00007fcff352911a in svc_request (xprt=0x7fcfc536cd70, xdrs=0x7fcfcd737830)
#19 0x00007fcff352ddf3 in svc_vc_recv (xprt=0x7fcfc536cd70)
#20 0x00007fcff352909a in svc_rqst_xprt_task_recv (wpe=0x7fcfc536d038)
#21 0x00007fcff35355bd in work_pool_thread (arg=0x7fcfca4779d0)
#22 0x00007fcff18a91ca in start_thread ()
#23 0x00007fcff10e28d3 in clone ()
| Symptom |
Crash |
| Environment |
Linux Only |
| Trigger |
This issue affects only NFSv3. A crash in the nfs-ganesha service may occur when an NFS client concurrently performs a file creation operation and attempts to fix a broken symbolic link pointing to the same file. |
| Workaround |
None |
|
5.2.3.5 |
NFS-Ganesha |
| IJ56736 |
Suggested |
Starting in 5.2.3.0, gpfs.base required OpenSSL libraries, specifically for mmfsd. Although the binary required the libraries because some symbols were defined, they were unused. This was introduced with the release of the IBM Storage Scale native REST API feature. All communication from scaleadmd to mmfsd is done over a local Unix domain socket, and SSL is not in use.
| Symptom |
Installs a package that is required but unused |
| Environment |
Linux Only |
| Trigger |
Install gpfs.base |
| Workaround |
None |
|
5.2.3.5 |
Linux Only |
| IJ56737 |
Medium Importance |
In version 5.2.3, the gpfs.snap command collects a listing of all files and directories under /var/mmfs. If AFM mount target paths or large filesystems are mounted under /var/mmfs, the command can take an extended amount of time or appear to hang.
| Symptom |
The command appears to hang or takes an unusually long time to complete. |
| Environment |
All |
| Trigger |
Running gpfs.snap on systems with large AFM mount targets. |
| Workaround |
Unmount remote AFM targets or other file systems under /var/mmfs before invoking the gpfs.snap command. |
|
5.2.3.5 |
AFM, gpfs.snap command |
| IJ56738 |
High Importance
|
GPFS daemon thread VdiskMetadataWorkerThread is deadlocked with 'wait for GNR buffers from steal thread'
| Symptom |
Daemon deadlock |
| Environment |
Linux Only |
| Trigger |
Heavy workload with limited amount of buffers. |
| Workaround |
None |
|
5.2.3.5 |
GNR |
| IJ56747 |
Suggested |
The sysmon or syslog logs frequently display the following warning messages:
[I] Event raised: The rpc.statd process is running.
[W] Event raised: The rpc.statd process is running multiple times.
These messages indicate that multiple instances of the rpc.statd process are being detected.
| Symptom |
statd_multiple event in cluster |
| Environment |
Linux Only |
| Trigger |
This typically occurs because rpc.statd may spawn short-lived child processes as part of its normal operation. These child processes can briefly appear as multiple running instances. Additionally, sysmon performs health checks during the exit of these child processes, which can lead to the system reporting warnings about multiple instances of rpc.statd running, even though this behavior is expected. |
| Workaround |
None |
|
5.2.3.5 |
NFS |
| IJ55302 |
High Importance
|
Reading a compressed file could fail unexpectedly with an E_INVAL error. This could happen when reading the last block of a compressed file.
| Symptom |
IO error |
| Environment |
ALL Operating System environments |
| Trigger |
Concurrent reads of a compressed file on the same node |
| Workaround |
Avoid concurrent reads of a compressed file on the same node |
|
5.2.3.4 |
GPFS Native Compression |
| IJ55097 |
High Importance
|
Storage Scale's file system encryption functionality does not allow the use of user-provided certificates that do not strictly conform to RFC 5280. This fix allows the use of certificates with policies that do not conform to RFC 5280.
| Symptom |
The simplified setup for file system encryption will fail; or when using regular setup, retrieving keys from the key server may fail. |
| Environment |
All |
| Trigger |
The problem is triggered by non-RFC 5280-compliant certificates. |
| Workaround |
Use certificates that conform to the RFC 5280 specification. |
|
5.2.3.4 |
Encryption |
| IJ55699 |
High Importance
|
Resync is not able to fix the ACLs for some directories where ACLs were set before enabling AFM at home.
| Symptom |
ACLs are not in sync |
| Environment |
All Linux OS environments |
| Trigger |
AFM caching without enabling AFM at home, with ACLs set on directories. |
| Workaround |
None |
|
5.2.3.4 |
AFM |
| IJ55700 |
High Importance
|
Recovery is not able to create a hard link to a soft link and fails with error 22.
| Symptom |
Recovery is not progressing, with a Link operation in the queue. |
| Environment |
All Linux OS environments |
| Trigger |
AFM caching in recovery with a hard link operation in the queue that was created for a soft link file. |
| Workaround |
None |
|
5.2.3.4 |
AFM |
| IJ55701 |
High Importance
|
mmfsck and mmfsckx cannot detect and repair corruption in a directory that contains CDITTOs.
| Symptom |
FSSTRUCTs |
| Environment |
All |
| Trigger |
Unknown |
| Workaround |
None |
|
5.2.3.4 |
mmfsck and mmfsckx |
| IJ55287 |
Critical |
Scale 5.2.2 and newer can leak kernel memory when running mmap workloads.
| Symptom |
Leak of kernel memory. |
| Environment |
ALL Linux OS environments |
| Trigger |
Running mmap workloads where GPFS has to handle many mmap writeback requests can hit a codepath where an asynchronous queue is full, and a synchronous fallback codepath allocates memory without freeing it. This is more likely hit with heavy mmap workloads that map large file ranges and then only issue partial writes to the mapped area (e.g. write one page, skip one page, write one page, etc.). That results in many small mmap write requests in GPFS, making it more likely to fill the asynchronous queue (see the C sketch after this entry). |
| Workaround |
Reducing the mmap workload might reduce the risk of the leak, but there is no guarantee. |
|
5.2.3.4 |
All Scale Users |
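The trigger pattern can be sketched in C as follows, assuming a hypothetical file at least as large as the mapping: map a large range and dirty only every other page, producing many small mmap writeback requests.

#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
    const size_t len = 256UL << 20;              /* 256 MiB; size hypothetical */
    int fd = open("/gpfs/fs1/bigfile", O_RDWR);  /* hypothetical path */
    if (fd < 0) { perror("open"); return 1; }

    char *map = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (map == MAP_FAILED) { perror("mmap"); return 1; }

    /* Write one page, skip one page: each dirty page becomes a separate
     * small mmap write request inside GPFS. */
    for (size_t off = 0; off < len; off += 2 * 4096)
        map[off] = 1;

    msync(map, len, MS_SYNC);                    /* force the writeback requests */
    munmap(map, len);
    close(fd);
    return 0;
}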
| IJ55169 |
High Importance
|
Typically, mmchmgr should not be run while mmfsck is in progress. This was not being handled in the code, so mmchmgr had to wait for mmfsck to complete, causing long waiters and other commands to queue up behind mmchmgr.
| Symptom |
Hang/Deadlock/Unresponsiveness/Long Waiters |
| Environment |
ALL Operating System environments |
| Trigger |
Run mmchmgr while mmfsck is in progress. |
| Workaround |
Do not run mmchmgr while mmfsck is running. |
|
5.2.3.4 |
mmchmgr |
| IJ55719 |
High Importance
|
GPFS failed the assertion "(quotaFlags & QUOTA_UNKNOWN_FLAGS) == 0" while extending the file audit log file in a quota enabled file system.
| Symptom |
GPFS daemon fails with the subject assertion. |
| Environment |
ALL Operating System environments |
| Trigger |
When extending the audit log file, the quotaFlags were not initialized, leading to the assertion. |
| Workaround |
None |
|
5.2.3.4 |
File audit logging and quotas |
| IJ55709 |
High Importance
|
Customers will see errors from mmuserauth create if the AD password starts with a hyphen and is given on the command line.
| Symptom |
Error output/message |
| Environment |
Linux Only |
| Trigger |
The customer has an AD password starting with a hyphen and tries to run mmuserauth create from the command line. |
| Workaround |
The customer can pass a password that starts with a hyphen to mmuserauth using a password file. |
|
5.2.3.4 |
CES |
| IJ55720 |
High Importance
|
With the simplified setup for file system encryption, when the KMIP client and server certificates are signed by CA certificate chains that have common certificates (e.g., the same CA root and possibly intermediate certificates), the 'mmkeyserv client register' command fails with error 71, as a result of the GKLM server returning a generic error code, instead of the expected one, in its response to the mmkeyserv command.
| Symptom |
Failure to register a mmkeyserv client. |
| Environment |
AIX, Linux |
| Trigger |
The use of KMIP client and server certificates signed by CA certificate chains with shared certificates. |
| Workaround |
Either (1) use self-signed, system-generated KMIP client certificates; or (2) invoke the 'mmkeyserv client register' command again (the GKLM server will return the expected return code in its response to the mmkeyserv command). |
|
5.2.3.4 |
GPFS Core |
| IJ55569 |
Critical |
In rare cases, an encrypted file system may panic and get unmounted as a result of unnecessarily checking a directory inode for an encryption context.
| Symptom |
Cluster/File System Outage |
| Environment |
ALL Operating System environments |
| Trigger |
Users creating many files in directories in encrypted file systems from many nodes in the cluster may trigger a special code path that mishandles the accessing of such directories when a node tries to become a metanode for such directories' inodes. |
| Workaround |
Disable the stat cache by setting maxStatCache=0 |
|
5.2.3.4 |
Encryption |
| IJ55721 |
High Importance
|
Daemon can hit a logAssert, resulting in the daemon recycle.
| Symptom |
frequent inode space expansion messages in mmfs.log |
| Environment |
ALL Operating System environments |
| Trigger |
A file create workload that results in inode space expansion can trigger the problem. |
| Workaround |
Perform manual inode-space expansion of all the inode spaces. |
|
5.2.3.4 |
File creation/Inode allocation. |
| IJ55865 |
HIPER |
The security advisory RHSA-2025:15668 by Red Hat for RHEL 9.4 includes a kernel upgrade to kernel version 5.14.0-427.88.1.el9_4. This update introduces an incompatibility with IBM Storage Scale's kernel modules for existing IBM Storage Scale levels and causes a node to crash on the startup of Scale.
| Symptom |
On startup, IBM Storage Scale causes an access violation in the kernel that results in a node crash and reboot. |
| Environment |
All Linux platforms supporting Scale on RHEL 9.4: x86_64, s390, ppc64le, aarch64 |
| Trigger |
The issue is hit when the RHEL 9.4 kernel is upgraded to 5.14.0-427.88.1.el9_4 and IBM Storage Scale is not at the latest fix level. |
| Workaround |
There is no workaround available other than either installing an efix or reverting the kernel to a supported level < 5.14.0-427.88.1.el9_4. |
|
5.2.3.4 |
Scale as a whole - node crashes on startup. |
| IJ55891 |
Medium Importance |
GPFS daemon assert going off: Assert exp(getInfoGeoExtensionLen>0) at line 8941 of nsdDiskConfig.C, resulting in a GPFS daemon crash.
| Symptom |
Assert |
| Environment |
Linux Only |
| Trigger |
This is triggered by a race condition between a thread that fetches the extended attributes (nsdMsgGetInfoX) and a Resign thread that tries to close the NSD, which results in the disk geometry pointer geoP being set to NULL and causes the assert. |
| Workaround |
None |
|
5.2.3.4 |
GNR |
| IJ55285 |
High Importance
|
File audit logging will start leaking memory, leading to mmfsd causing an OOM event.
| Symptom |
Abend/Crash on OOM |
| Environment |
Linux Only |
| Trigger |
For the memory leak to happen, the customer has to intentionally remove audit logs that are still being used by file audit logging. |
| Workaround |
Prevent deleting audit logs that are still being appended to by file audit logging. |
|
5.2.3.4 |
file audit logging |
| IJ54571 |
High Importance
|
A race between token revoke and buffer steal could lead the GPFS daemon to fail with signal 11 or an assert. It could also lead to an unexpected FSSTRUCT error being issued.
| Symptom |
Abend/Crash |
| Environment |
ALL Operating System environments |
| Trigger |
Concurrent operations on the same directory from multiple nodes |
| Workaround |
None |
|
5.2.3.4 |
All Scale Users |
| IJ55893 |
High Importance
|
When running IOR Hard Write on hundreds of client nodes with data shipping enabled, the server nodes may get into long waiters at the end of the run, preventing the application from ending.
| Symptom |
Long Waiters |
| Environment |
All platforms |
| Trigger |
Many client nodes used in an IOR Hard Write run with data shipping enabled |
| Workaround |
None |
|
5.2.3.4 |
Data Shipping |
| IJ55894 |
High Importance
|
On Linux systems, upgrading Storage Scale to a newer version could partially fail if the GPFS kernel module was not completely unloaded, resulting in the node having different versions of the gpfs.base and gpfs.gpl packages. In such a case, the mmbuildgpl command can still succeed, which results in unexpected behavior. The mmbuildgpl command must check for the same version of the gpfs.base and gpfs.gpl packages before proceeding with the build.
| Symptom |
Unexpected Results/Behavior |
| Environment |
ALL Linux OS environments |
| Trigger |
Upgrade Scale to a new version while the kernel module is not completely unloaded and then run mmbuildgpl to build the portability layer of the newly installed version. |
| Workaround |
Install the correct version of gpfs.base and gpfs.gpl packages. |
|
5.2.3.4 |
All Scale Users |
| IJ56314 |
Suggested |
Cluster quorum loss may occur when re
| Symptom |
Cluster shuts down |
| Environment |
All |
| Trigger |
A non-quorum ESS BB server reporting communication problems with its quorum node partner. |
| Workaround |
None |
|
5.2.3.4 |
GPFS Core |
| IJ55680 |
Medium Importance |
When client nodes leave or rejoin the cluster, lock contention occurs due to unnecessary server disk operations.
| Symptom |
Performance degradation. |
| Environment |
Linux Only |
| Trigger |
Client nodes leave or rejoin a large cluster. |
| Workaround |
None |
|
5.2.3.4 |
GNR/NSD |
| IJ56058 |
High Importance
|
Even when an application closes files created with the O_TMPFILE flag, these files do not get cleaned up from the Scale cache. As a result, when the cache grows beyond the maxFilesToCache value, AsyncStealWorkerThread is triggered to clean up entries from the cache. But since these O_TMPFILE files still have a valid VFS reference, this thread is unable to remove them from the cache. This causes AsyncStealWorkerThread to run continuously, driving up mmfsd CPU usage. Customers may always end up seeing AsyncStealWorkerThread in the 'mmdiag --waiters' output (see the C sketch after this entry).
| Symptom |
Performance Impact/Degradation |
| Environment |
ALL Operating System environments |
| Trigger |
Creating lots of files with the O_TMPFILE flag, exceeding the Scale cache limit (as described in the problem description above) |
| Workaround |
Clean the dentry cache using the command "echo 3 > /proc/sys/vm/drop_caches" |
|
5.2.3.4 |
Core |
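A minimal C sketch of the trigger, assuming a hypothetical directory on a Scale file system and an iteration count large enough to push the cache past maxFilesToCache.

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    for (int i = 0; i < 100000; i++) {           /* hypothetical count */
        /* O_TMPFILE creates an unnamed file in the given directory. */
        int fd = open("/gpfs/fs1/scratch", O_TMPFILE | O_RDWR, 0600);
        if (fd < 0) { perror("open"); return 1; }
        if (write(fd, "x", 1) < 0)
            perror("write");
        /* On affected levels the closed file lingers in the Scale cache. */
        close(fd);
    }
    return 0;
}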
| IJ56059 |
High Importance
|
The mmsysmon daemon wrongly computes the length of multibyte Unicode characters, e.g. "Münster" as a city name.
This leads to receiving more bytes than expected per the mmsysmon UDS protocol if any non-ASCII characters are sent.
This leads to, e.g., SSS call home failover not working at all if non-ASCII characters are part of the call home config.
In mmhealth this could lead to inconsistently reproducible errors if any Scale entities (e.g. fileset names) use non-ASCII characters.
| Symptom |
Component Level Outage |
| Environment |
ALL Linux OS environments |
| Trigger |
Using non-ASCII characters for any settings of Scale. |
| Workaround |
Avoid using non-ASCII characters for any settings of Scale. |
|
5.2.3.4 |
Callhome, System Health |
| IJ56060 |
Suggested |
When applications use both mmap and Direct I/O (DIO) on the same file concurrently, a deadlock can occur due to conflicting byte-range locks (brLock) taken by the two access paths. Once the deadlock occurs, I/O operations on the affected file hang indefinitely and cannot be recovered without restarting GPFS services (see the C sketch after this entry).
| Symptom |
Applications or system threads hang on file operations. GPFS trace logs show page fault handlers, DIO threads, and brLock waiting on each other in a cycle. No progress is made until GPFS services are restarted. |
| Environment |
ALL OS environments |
| Trigger |
This issue affects customers whose applications mix use of memory-mapped I/O (mmap) and Direct I/O (O_DIRECT) to the same file. The problem occurs when:
An application opens a file with mmap (shared, writable) and accesses pages through normal memory operations.
At the same time, another process or thread issues Direct I/O operations (O_DIRECT reads or writes) to the same file.
Both access paths attempt to lock overlapping byte ranges in the file, resulting in a deadlock.
Once the deadlock occurs, all further I/O to the affected file hangs indefinitely until GPFS is restarted. |
| Workaround |
None |
|
5.2.3.4 |
All Scale Users |
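The conflicting access paths can be sketched in C as follows (not a deterministic reproducer); the file path is hypothetical and the file is assumed to be at least 1 MiB.

#define _GNU_SOURCE
#include <fcntl.h>
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>

#define LEN (1 << 20)   /* 1 MiB; file assumed at least this large */

static void *mmap_path(void *arg)    /* page faults take byte-range locks */
{
    char *map = arg;
    for (int i = 0; i < 100000; i++)
        map[(size_t)(i * 4096) % LEN] = 1;
    return NULL;
}

static void *dio_path(void *arg)     /* O_DIRECT reads take conflicting locks */
{
    (void)arg;
    int fd = open("/gpfs/fs1/mixed", O_RDONLY | O_DIRECT);   /* hypothetical */
    if (fd < 0) { perror("open O_DIRECT"); return NULL; }
    void *buf = NULL;
    if (posix_memalign(&buf, 4096, 4096) != 0) { close(fd); return NULL; }
    for (int i = 0; i < 100000; i++)
        pread(fd, buf, 4096, (off_t)((size_t)(i * 4096) % LEN));
    free(buf);
    close(fd);
    return NULL;
}

int main(void)
{
    int fd = open("/gpfs/fs1/mixed", O_RDWR);                /* hypothetical */
    if (fd < 0) { perror("open"); return 1; }
    char *map = mmap(NULL, LEN, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (map == MAP_FAILED) { perror("mmap"); return 1; }

    pthread_t t1, t2;
    pthread_create(&t1, NULL, mmap_path, map);
    pthread_create(&t2, NULL, dio_path, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    return 0;
}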
| IJ56063 |
High Importance
|
QoS statistics reporting did not correctly handle the --seconds parameter. Instead of displaying statistics for the specified number of seconds relative to the current clock time, it was showing data relative to the most recent statistics cached in daemon memory.
| Symptom |
Unexpected Results/Behavior |
| Environment |
ALL Operating System environments. |
| Trigger |
Displaying QoS statistics after a period of inactivity. |
| Workaround |
None |
|
5.2.3.4 |
QoS. |
| IJ56143 |
High Importance
|
Async IO for restricted cases requiring a fallback to compat mode is not possible with GDS, resulting in an error for the IO request.
| Symptom |
Error |
| Environment |
Linux Only |
| Trigger |
Trigger an async IO for one of the restricted cases from a GDS client, e.g., replicated files |
| Workaround |
None |
|
5.2.3.4 |
GDS |
| IJ56144 |
Suggested |
1. Dir-list-file prefetch should spill out bad directories to the failed list when --enable-failed-file-list is passed.
2. Prefetch determines the list file to be a home-list-file because of a logical comparison failure.
| Symptom |
Unexpected behavior. |
| Environment |
All OS environments. |
| Trigger |
1. Dir-list-file prefetch with wrong directories in the list.
2. A case of prefetch using a list file where the cache and home paths are exactly the same. |
| Workaround |
None |
|
5.2.3.4 |
AFM |
| IJ51457 |
High Importance
|
When running file audit logging, audit events are written to Audit Log Files in the Audit Log fileset (.audit_log, by default). If these files are compressed while file audit logging is actively writing to them, the active files in Audit Log that are being written to can become corrupted and unrecoverable. Audit files are compressed automatically after mmfsd is done writing to them, and the audit records are not intended to be compressed before mmfsd is done writing to them. Because file system activity is usually present all the time, it is likely that there will be an active audit log that is being written to on each node at any point in time.
| Symptom |
corruption |
| Environment |
Linux Only |
| Trigger |
A “file system struct error” will be triggered by the issue, and compression or decompression will fail against the affected audit records.
The /var/log/messages file (or the output of the errpt command on AIX) might contain an entry similar to the following:
Error=MMFS_FSSTRUCT, ID=0x94B1F045, Tag=12662454: Invalid disk data structure. Error code 113 |
| Workaround |
Do not compress audit logs manually |
|
5.2.3.4 |
File Audit Logging |
| IJ56160 |
Critical |
File system manager node could fail unexpectedly with assert exp((indIndex & 0xFF00000000000000ULL) == 0) in IndDesc.h. This could happen when expanding the number of allocated inodes on a file system that already has a very high number of allocated inodes.
| Symptom |
Abend/Crash |
| Environment |
ALL Operating System environments |
| Trigger |
Increasing the number of allocated inodes |
| Workaround |
Avoid creating new independent filesets and increasing the number of allocated inodes. |
|
5.2.3.4 |
All Scale Users |
| IJ56251 |
HIPER |
When submitting an aio request, a data structure is still accessed after queueing and potentially completing the aio request. That results in access to already freed memory. This goes unnoticed in many cases, unless the workload is very high and the freed memory is immediately reused; in that case, it results in a kernel crash (see the C sketch after this entry).
| Symptom |
Abend/Crash |
| Environment |
ALL Linux OS environments |
| Trigger |
This problem is hit when running Scale 5.2.3 or higher with a high aio workload. This happens during "goodpath" I/O; no error conditions need to occur. |
| Workaround |
The problem has been introduced in Scale 5.2.3. Unless the fix is applied, one way to avoid this problem is to stay on a Scale level lower than 5.2.3. Reducing the I/O workload might avoid this problem, but this cannot be guaranteed. |
|
5.2.3.4 |
All Scale Users |
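As a rough illustration of the trigger only (not the failing code itself), the following C sketch uses POSIX AIO to submit many small writes that complete almost immediately, the situation in which freed request memory is most likely to be reused quickly. The path is hypothetical; link with -lrt on older glibc.

#include <aio.h>
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
    int fd = open("/gpfs/fs1/aiofile", O_CREAT | O_RDWR, 0644); /* hypothetical */
    if (fd < 0) { perror("open"); return 1; }

    static char buf[4096];
    memset(buf, 'a', sizeof buf);

    for (int i = 0; i < 100000; i++) {
        struct aiocb cb;
        memset(&cb, 0, sizeof cb);
        cb.aio_fildes = fd;
        cb.aio_buf    = buf;
        cb.aio_nbytes = sizeof buf;
        cb.aio_offset = (off_t)(i % 256) * (off_t)sizeof buf;

        if (aio_write(&cb) != 0) { perror("aio_write"); break; }

        /* Small cached writes complete almost immediately, which is the
         * timing window described above. */
        while (aio_error(&cb) == EINPROGRESS)
            ;                        /* busy-wait, for the sketch only */
        aio_return(&cb);
    }
    close(fd);
    return 0;
}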
| IJ56313 |
High Importance
|
mmfsck and mmfsckx cannot detect and repair corruption in a directory that contains CDITTOs.
| Symptom |
FSSTRUCTs |
| Environment |
All |
| Trigger |
Unknown |
| Workaround |
None |
|
5.2.3.4 |
mmfsck and mmfsckx |
| IJ54955 |
High Importance
|
GPFS daemon could fail with assert unexpectedly during file repair. This could happen when there is a race between file repair and indirect block updates on the same file.
| Symptom |
Abend/Crash |
| Environment |
ALL Operating System environments |
| Trigger |
Run file repair via mmrestripefile, mmrestripefs, mmchdisk, etc. |
| Workaround |
None |
|
5.2.3.4 |
All Scale Users |
| IJ55915 |
Medium Importance |
An SGPanic() could be triggered for various reasons. Some of the reasons are:
- when a node failed internally and left the cluster
- when metadata writes fail (with err=5, EIO) or hit an OOS (Out-Of-Space) condition
- other critical failures
As part of S.311931, changes were made in StripeGroup::handleForceUnmount(). The changes were to prevent the file system from being remounted when SGPanic() is triggered for an OOS (Out-Of-Space) condition, but they had a side effect (which is described in this defect).
| Symptom |
File system is not being remounted upon SGPanic() |
| Environment |
Linux and AIX |
| Trigger |
SGPanic() |
| Workaround |
Mount the file system manually with the mmmount command. |
|
5.2.3.4 |
file system mount and {re,un}mount |
| IJ55192 |
Critical |
In rare cases, an encrypted file system may panic and get unmounted as a result of a directory inode being unnecessarily checked for an encryption context.
| Symptom |
Cluster/File System Outage |
| Environment |
ALL Operating System environments |
| Trigger |
Users creating many files in directories in encrypted file systems from many nodes in the cluster may trigger a special code path that mishandles the accessing of such directories when a node tries to become a metanode for such directories' inodes. |
| Workaround |
Disable the stat cache by setting maxStatCache=0 |
|
5.2.3.3 |
Encryption |
| IJ54956 |
High Importance
|
During file sharing, missing node information led to a crash.
| Symptom |
Abend/Crash |
| Environment |
ALL Linux OS environments |
| Trigger |
The file accessed from a remote cluster has an access control list. |
| Workaround |
Adding information to the communication context about the nodes that grant access and access the files. |
|
5.2.3.3 |
Remote cluster mount/UID remapping |
| IJ54792 |
High Importance
|
Unable to add a new disk with thin provisioning when attempting to do so with mmadddisk. The command failed with 'Disk 'PRD_ABITEST14_01' mismatch, it doesn't support 'UNMAP' to reclaim space.'.
| Symptom |
The command will fail with the error message like, "Disk 'PRD_ABITEST14_08' mismatch, it is not allowed to have both thin and non-thin disks in the system pool.". |
| Environment |
Linux Only |
| Trigger |
Run mmadddisk with a stanza file (containing a list of NSDs). The stanza file would have thinDiskType={scsi | auto}. If the sysfs attribute (rotational, typically located under the queue directory, e.g. /sys/devices/virtual/block/dm-1/queue/rotational) for the disk(s) is 0, indicating SSD, a different thinDiskType (i.e., nvme) will be returned, which is a 'mismatch'. |
| Workaround |
None |
|
5.2.3.3 |
thin-provisioning. |
| IJ54254 |
Critical |
Lookup on a directory could get stuck in an endless loop if there is a directory block with corruption.
| Symptom |
Hang/Deadlock/Unresponsiveness/Long Waiters |
| Environment |
ALL Operating System environments |
| Trigger |
Lookup on a directory with corrupted block |
| Workaround |
Run offline mmfsck on the file system to repair any directory corruption. |
|
5.2.3.3 |
All Scale Users |
| IJ55321 |
Critical |
When Direct IO is performed on an AFM uncached file, the DIO path skips the needed AFM caching path if DIO is desired. This causes data to appear corrupted (all 0s).
| Symptom |
Data Corruption |
| Environment |
All OS Environments |
| Trigger |
Direct IO Read on an AFM uncached file. |
| Workaround |
mmchconfig dioDisable=1 -i |
|
5.2.3.3 |
AFM |
| IJ55350 |
High Importance
|
File system level migration, or any file system with multiple filesets, needs checkDirty and checkUncached to be able to run at the file system level for complete checks. An enhancement is also needed to support -s, similar to -g, for all policy invocations of AFM.
| Symptom |
Unexpected Behavior |
| Environment |
All OS Environments |
| Trigger |
Running mmafmctl checkDirty and checkUncached commands at the Filesystem level where AFM filesets are present. |
| Workaround |
Run at the individual fileset level only. |
|
5.2.3.3 |
AFM |
| IJ54084 |
Suggested |
In a file system configured with the (default) "relatime" setting, if nodes only read files (but not write to them), while others stat those files, stat() will not provide an updated value for atime. This will affect applications that count on updated atime to determine whether files have been accessed recently.
| Symptom |
Unexpected Results/Behavior |
| Environment |
ALL Operating System environments |
| Trigger |
The file system is created with -S set to its default value ("relatime"). Applications read the content of files but seldom write to them. Applications that perform 'stat' on the files run on nodes other than the nodes where the reads take place. |
| Workaround |
Set the (undocumented) forceAttributeRefresh configuration parameter, which will force nodes to retrieve updated stat info. For example: mmchconfig forceAttributeRefresh=60 -i |
|
5.2.3.3 |
ALL Operating System environments |
| IJ55304 |
High Importance
|
The assert can sometimes happen due to a token reference count leak.
| Symptom |
Abend/Crash |
| Environment |
ALL Operating System environments |
| Trigger |
When token transfer goes through a certain code path |
| Workaround |
Disabling the assert may be OK in most cases. |
|
5.2.3.3 |
All Scale Users |
| IJ55373 |
High Importance
|
On a large file system, tsapolicy may not free all queue elements processed during the directory scan, which could result in OOM. During the directory scan, tsapolicy allocates queue elements to hold each directory entry for scanning and assigns a unique correlation number to each queue element. The correlation number is used as a watermark for the server process to free queue elements that have been processed when a client completes its assignment. But it is an int32 type and can overflow on a large file system. The number needs to be reset to 1 when it overflows int32.
| Symptom |
Component Level Outage |
| Environment |
all platforms that support mmapplypolicy |
| Trigger |
run mmapplypolicy on a large file system |
| Workaround |
none |
|
5.2.3.3 |
mmapplypolicy |
| IJ55376 |
High Importance
|
A check is called too often, which can be problematic when many disks are checked, leading to waiters that affect IO performance.
Check /var/adm/ras/mmsysmonitor.log for
[I] Timeout RunCmd Command /usr/lpp/mmfs/bin/mmremote getLocalNsdData -X timed out after 42 sec. Sending SIGTERM
and check /var/adm/ras/mmfs.log for "waiters".
| Symptom |
Error output/message Slow IO |
| Environment |
ALL Operating System environments |
| Trigger |
A check is called too often, which can be problematic when many disks are checked |
| Workaround |
As a quick fix, the check can be disabled using mmchconfig mmhealth-disk-check_nsd=False --force; after the update, set the parameter to True to re-enable the check |
|
5.2.3.3 |
System Health |
| IJ55379 |
High Importance
|
The timeout test result is not consistent on AMD EPYC Turin processors. If the test passes, the GSKit hang workaround will not be applied, which causes problems later.
| Symptom |
Installation and admin commands hang. |
| Environment |
Linux OS environments |
| Trigger |
This problem affects AMD EPYC-Turin. |
| Workaround |
Manually apply the workaround. |
|
5.2.3.3 |
Admin Commands, gskit |
| IJ55381 |
Suggested |
mmfs.log: logAssertFailed: maxExpellableQuorumNodes>=0
| Symptom |
Abend/Crash |
| Environment |
All |
| Trigger |
Cluster manager trying to process an expel request after quorum has been lost. |
| Workaround |
None |
|
5.2.3.3 |
GPFS Core |
| IJ55406 |
Suggested |
The 'mmkeyserv tenant delete' command fails to remove the tenant definition from the Storage Scale cluster when the tenant no longer exists on the key server.
| Symptom |
Command failure |
| Environment |
AIX and Linux |
| Trigger |
The tenant was removed from the key server prior to invoking the 'mmkeyserv tenant delete' command. This occurs on newer versions of GKLM. |
| Workaround |
Reissue the command with the --force option. |
|
5.2.3.3 |
Admin Commands Encryption |
| IJ55407 |
High Importance
|
An IO operation from NFS Ganesha may not always have the client IP address information. With FAL and NFS Ganesha enabled, if a previous NFS IO operation for a file system had the client IP address information associated with it, this IP can potentially be used in audit events for an NFS IO operation of another file system, resulting in a mismatch of NFS client IPs in events of different file systems.
| Symptom |
Unexpected Results/Behavior |
| Environment |
Linux |
| Trigger |
- Enable File Audit Logging for at least two file systems
- Have NFS Ganesha running
- Have at least two NFS clients that mount the exports of the file systems to the same CES node.
- Generate IOs from the clients to the exports.
- A small set of events in the audit logs of each file system would contain incorrect/mismatched NFS client IPs. |
| Workaround |
None |
|
5.2.3.3 |
File Audit Logging, NFS |
| IJ55408 |
High Importance
|
During AFM migration, files deleted on the target are not being removed from the local cache if the parent is dirty. "mmafmctl checkUncached" command reports these files are uncached.
| Symptom |
Unexpected results. |
| Environment |
All OS Environments |
| Trigger |
AFM migration with deleted files/dirs at the target |
| Workaround |
None |
|
5.2.3.3 |
AFM |
| IJ55647 |
Critical |
Files from the AFM cache may be incorrectly deleted or moved to the .ptrash directory when using the afmFastCreate option. A file may be incorrectly deleted from the cache when a newly created file is renamed.
| Symptom |
Unexpected Results |
| Environment |
All OS Environments |
| Trigger |
Using afmFastCreate option with AFM caching. |
| Workaround |
Disable afmFastCreate option |
|
5.2.3.3 |
AFM |
| IJ55648 |
Critical |
A deadlock may occur in the AFM environment when afmFastLookup is disabled, due to a lock ordering issue. This can lead to cluster-wide hangs.
| Symptom |
Deadlock |
| Environment |
All OS Environments |
| Trigger |
AFM caching under high workload |
| Workaround |
Enable afmFastLookup option |
|
5.2.3.3 |
AFM |
| IJ55409 |
High Importance
|
Parallel read considers only unique remote-site-mapped gateway nodes for spawning READ_SPLIT messages, except for the GPFS backend. The same should be considered for the object backend as well, because all gateway nodes will be mapped to the same remote target.
| Symptom |
Unexpected behavior |
| Environment |
Linux Only |
| Trigger |
Read of a large object on the AFM COS backend with a mapping target |
| Workaround |
None |
|
5.2.3.3 |
AFM |
| IJ55188 |
Suggested |
If the utimensat() system call (which is used by the "touch" command) is issued on a file shortly after a previous invocation, it may not take effect and fail to update the file's modification time ("mtime"). A C sketch follows this entry.
| Symptom |
Unexpected Results/Behavior |
| Environment |
ALL Linux OS environments |
| Trigger |
Issuing the utimensat() system call on a given file multiple times in quick succession (< 1 second). |
| Workaround |
Whenever feasible, wait at least 1 second between subsequent invocations of utimensat() on the same file. |
|
5.2.3.3 |
All Scale Users |
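A minimal C sketch of the trigger, assuming a hypothetical existing file: two utimensat() calls on the same file within the same second, after which mtime may not reflect the second call on affected levels.

#include <fcntl.h>
#include <stdio.h>
#include <sys/stat.h>

int main(void)
{
    const char *path = "/gpfs/fs1/touched";      /* hypothetical existing file */

    /* A NULL times argument sets atime/mtime to the current time,
     * which is what "touch" does. */
    if (utimensat(AT_FDCWD, path, NULL, 0) != 0) { perror("utimensat"); return 1; }
    if (utimensat(AT_FDCWD, path, NULL, 0) != 0) { perror("utimensat"); return 1; }

    struct stat st;
    if (stat(path, &st) == 0)
        printf("mtime: %ld.%09ld\n", (long)st.st_mtim.tv_sec,
               (long)st.st_mtim.tv_nsec);
    return 0;
}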
| IJ55601 |
Medium Importance |
A previous network issue can lead to a log/trace entry like
totalReceived == scatteredP->scattered_total_len || (totalReceived == 0 && scatteredIndex == scatteredP->scattered_count)
followed by an assert causing the node to stop.
| Symptom |
Assert/Crash |
| Environment |
Linux Only |
| Trigger |
A network failure triggering some state counters/variables in an undefined state |
| Workaround |
None |
|
5.2.3.3 |
Scale |
| IJ55167 |
High Importance
|
IBM has identified a potential security leak or data access loss issue for files created from SMB clients. The issue may appear when SMB clients create files in folders that use ACL inheritance to change ACLs (additional access for groups, reduced access for a user's primary group) from the default access mask.
| Symptom |
incorrect ACL written |
| Environment |
Linux Only |
| Trigger |
File creation via SMB protocol in folders with ACL inheritance |
| Workaround |
None |
|
5.2.3.2 |
CES SMB |
| IJ55170 |
High Importance
|
This issue often shows up when running git clone into an NFS-mounted directory.
Below is an example of the error that may occur:
$ git clone https://github.com/jupp0r/prometheus-cpp
Cloning into 'prometheus-cpp'...
remote: Enumerating objects: 5577, done.
remote: Counting objects: 100% (1562/1562), done.
remote: Compressing objects: 100% (373/373), done.
remote: Total 5577 (delta 1287), reused 1189 (delta 1189), pack-reused 4015 (from 2)
Receiving objects: 100% (5577/5577), 1.32 MiB | 7.94 MiB/s, done.
fatal: could not open '/mnt/nfs4/prometheus-cpp/.git/objects/pack/tmp_pack_fADmRg' for reading: Permission denied
fatal: fetch-pack: invalid index-pack output
| Symptom |
Permission denied error |
| Environment |
Linux Only |
| Trigger |
Permission denied error encountered during git clone |
| Workaround |
None |
|
5.2.3.2 |
NFS |
| IJ55184 |
High Importance
|
Scale 5.2.3 PTF1 and PTF2 contain a code change leading to possible slower read performance.
The problem exists in 5.2.3 PTF1, and the fix is in 5.2.3 PTF2.
| Symptom |
Slower performance than expected. |
| Environment |
ALL Linux OS environments and "Windows/x86_64" |
| Trigger |
Any regular read workload can incur additional overhead. This has been specifically observed with the ior hard read benchmark, but could affect any read workload. |
| Workaround |
Do not use Scale 5.2.3 PTF1 or PTF2. |
|
5.2.3.2 |
All Scale Users |
| IJ54628 |
High Importance
|
Not able to read an uncached file during resync when the AFM queue is in queueOnly state.
| Symptom |
Uncached file read failure during Resync |
| Environment |
Linux Only |
| Trigger |
Read of an uncached file while AFM resync is queueing ops on the gateway node. |
| Workaround |
None |
|
5.2.3.1 |
AFM |
| IJ53214 |
High Importance
|
With FAL and NFS Ganesha enabled, running workloads with a path to an NFS export for long periods of time could result in NFS client IPs not being logged in the audit log.
| Symptom |
Unexpected Results/Behavior |
| Environment |
Linux |
| Trigger |
- With FAL and NFS Ganesha enabled, run workloads with a path to the NFS mount point for long periods of time |
| Workaround |
- Restart NFS Ganesha if NFS client IPs are not being logged |
|
5.2.3.1 |
File Audit Logging, NFS |
| IJ54629 |
High Importance
|
mmrestorefs recreates all files and directories that were deleted after the snapshot was taken. If the deleted file is a special file, mmrestorefs uses the mknod() system call to create the file. But mknod() cannot create a socket file on AIX. Hence, if socket files were deleted after the snapshot was taken, mmrestorefs on AIX will fail while re-creating the socket file.
| Symptom |
Component Level Outage |
| Environment |
AIX only |
| Trigger |
Run mmrestorefs when a socket file was deleted after the snapshot was taken. |
| Workaround |
none |
|
5.2.3.1 |
mmrestorefs |
| IJ54802 |
High Importance
|
If the mmrestripefs command is issued while the mmreclaimspace command is running, an assert can be expected.
| Symptom |
Abort |
| Environment |
Linux / AIX |
| Trigger |
On a file system with thin provisioning (or space reclamation only) enabled, run mmrestripefs while the mmreclaimspace command is running for space reclamation |
| Workaround |
None |
|
5.2.3.1 |
space-reclamation |
| IJ54804 |
High Importance
|
Weighted RGCM Log group rebalance issue.
| Symptom |
Abend |
| Environment |
Linux Only |
| Trigger |
Slightly different log group weights on the same server, which might not balance the heavily weighted log groups. |
| Workaround |
None |
|
5.2.3.1 |
ESS/GNR |
| IJ54783 |
High Importance
|
When trying to install Storage Scale on Windows with latest Cygwin version (3.6.1), the installation can fail due to security issues.
| Symptom |
Upgrade/Install failure. |
| Environment |
Windows/x86_64 only |
| Trigger |
Upgrading Cygwin to version 3.6.1 before trying to install Storage Scale on Windows |
| Workaround |
Downgrade Cygwin to version 3.6.0 or below before attempting to install Storage Scale on Windows |
|
5.2.3.1 |
Install/Upgrade |
| IJ53557 |
High Importance
|
GPFS asserted due to unexpected hold count on events exporter object during destructor.
| Symptom |
Assert |
| Environment |
All platforms |
| Trigger |
A race condition between EventsExporterReceiverThread and EventsExporterListenThread and an error path where the destructor is called |
| Workaround |
None |
|
5.2.3.1 |
All Scale Users |
| IJ54868 |
Suggested |
The '(' character in the undefined value for the default needs to be escaped. Failing that, the propagation of the config to other nodes throws a syntax error.
| Symptom |
Unexpected Behavior |
| Environment |
All OS Environments |
| Trigger |
Tune the afmRecoveryDir back to its default value. |
| Workaround |
None |
|
5.2.3.1 |
AFM |
| IJ54878 |
High Importance
|
If the dependent fileset is created as a non-root user and linked, then the uid/gid are not replicated for the dependent fileset to the remote site.
| Symptom |
Unexpected Behavior |
| Environment |
Linux Only |
| Trigger |
Create and Link dependent fileset inside DR primary fileset as a non-root user. |
| Workaround |
None |
|
5.2.3.1 |
AFM |
| IJ54968 |
High Importance
|
Opening a new file with O_RDWR|O_CREAT fails with EINVAL.
(show details)
| Symptom |
File creation returns an error of EINVAL |
| Environment |
Linux Only |
| Trigger |
Unknown |
| Workaround |
None |
|
5.2.3.1 |
Scale Core |
| IJ54967 |
High Importance
|
Crash during cxiStrcpy in setSecurityXattr.
(show details)
| Symptom |
Crash |
| Environment |
Linux Only |
| Trigger |
File creation with SELinux enabled. |
| Workaround |
None |
|
5.2.3.1 |
Scale core |
| IJ54966 |
High Importance
|
Kernel crash with SELinux enabled
(show details)
| Symptom |
Crash |
| Environment |
Linux Only |
| Trigger |
File creation with SELinux enabled. |
| Workaround |
None |
|
5.2.3.1 |
Scale core |
| IJ54965 |
High Importance
|
NFSv4 ACLs are not replicated with the AFM fileset-level options afmSyncNFSV4ACL and afmNFSV4
(show details)
| Symptom |
Unexpected results |
| Environment |
Linux Only |
| Trigger |
Using options afmSyncNFSV4ACL and afmNFSV4 to replicate NFSv4 ACLs. |
| Workaround |
None |
|
5.2.3.1 |
AFM |
| IJ54963 |
High Importance
|
Symlinks are appended with a null character, which causes the pwd -P command to fail to resolve the real path.
(show details)
| Symptom |
Unexpected results |
| Environment |
Linux Only |
| Trigger |
AFM caching with symlinks |
| Workaround |
None |
|
5.2.3.1 |
AFM |
| IJ54962 |
High Importance
|
Snapshots are not listed under the .snapshots directory when AFM is enabled on the file system
(show details)
| Symptom |
Unexpected results |
| Environment |
All OS environments |
| Trigger |
Listing snapshots when AFM is enabled on the file system |
| Workaround |
None |
|
5.2.3.1 |
AFM |
| IJ54975 |
Suggested |
"mmhealth cluster show" my report an additional GUI pod after upgrade or rebalancing.
(show details)
| Symptom |
Unexpected Results/Behavior |
| Environment |
OpenShift (CNSA) |
| Trigger |
CNSA Upgrade or other rebalancing action of GUI pods. |
| Workaround |
Moving the cluster manager node (mmchmgr) will ensure a resync of the data. "mmhealth node show -a --resend" will do the same |
|
5.2.3.1 |
System Health |
| IJ54976 |
High Importance
|
Nodes accessing the AFM fileset crash when an attempt is made to disable the fileset online with the "mmchfileset -p afmTarget=disable-online" command
(show details)
| Symptom |
Crash |
| Environment |
Linux Only |
| Trigger |
AFM fileset disable-online |
| Workaround |
None |
|
5.2.3.1 |
AFM |
| IJ54983 |
High Importance
|
File Audit Logging uses an internal data structure to keep track of NFS client IP addresses for NFS IOs coming from Ganesha. The CES nodes can crash during garbage collection of this structure due to a use-after-free error caused by a race condition.
(show details)
| Symptom |
Abend/Crash |
| Environment |
Linux |
| Trigger |
- File audit logging is enabled on a file system with NFS Ganesha running.
- A large amount of IO running to NFS exports. |
| Workaround |
- Disable File Audit Logging, or
- Avoid NFS IOs when FAL is enabled |
|
5.2.3.1 |
File Audit Logging, NFS |
| IJ54984 |
High Importance
|
Assert exp(getChildId().isValid()) is hit during a read operation if mmafmtransfer is restarted
(show details)
| Symptom |
Crash |
| Environment |
Linux Only |
| Trigger |
Kill the mmafmtransfer daemon while a read is in the queue |
| Workaround |
None |
|
5.2.3.1 |
AFM |
| IJ54985 |
High Importance
|
When mmchdisk start encounters a corrupted inode that fails inode validation, it does not produce the interesting-inode list showing the bad inode number. Because of this, the user cannot determine the affected inode number and must rely on long-running traces to get this information.
(show details)
| Symptom |
pit.interestingInodes file not generated/populated |
| Environment |
ALL Operating System environments |
| Trigger |
When mmchdisk start encounters a corrupted inode |
| Workaround |
Capture long-running traces and provide them to support to get this information |
|
5.2.3.1 |
PIT |
| IJ54986 |
High Importance
|
When a file is accessed through mmap while the same file is accessed, or other operations are performed on it, from other nodes, there is a small chance of a race condition leading to a logAssert
(show details)
| Symptom |
Abend/Crash |
| Environment |
ALL Linux OS environments |
| Trigger |
Access parts of an mmapped file initially on one node while there is concurrent access to, or concurrent operations on, the same file from other nodes |
| Workaround |
Avoid the concurrent operations on other nodes while the file is accessed on one node |
|
5.2.3.1 |
All Scale Users |
| IJ54593 |
High Importance
|
During token minimization, a deadlock can occur on a client node. With token minimization, a client node is first asked to give up any tokens that are only for cached files. Without the fix, calling this codepath for files that have been deleted, could result in a deadlock.
(show details)
| Symptom |
Hang/Deadlock/Unresponsiveness/Long Waiters |
| Environment |
ALL Linux OS environments |
| Trigger |
Have many files cached on a client node. Delete files. Trigger a token server change, which then uses token minimization. |
| Workaround |
Disable token minimization to avoid the problem: mmchconfig tokenXferMinimization=no. Or restart GPFS on the client node, to get out of the deadlock. |
|
5.2.3.1 |
All Scale Users |
| IJ54987 |
High Importance
|
mmrestoreconfig restores file system configuration information, which includes fileset information. When recreating AFM filesets, mmrestoreconfig tries to restore the afmShowHomeSnapshot attribute, but AFM does not allow setting the afmShowHomeSnapshot attribute for an IW cache mode fileset. Hence mmrestoreconfig will fail if there is an IW cache mode fileset.
(show details)
| Symptom |
Component Level Outage |
| Environment |
all platforms that support mmrestoreconfig |
| Trigger |
Run mmrestoreconfig for a file system that contains an IW cache mode fileset |
| Workaround |
none |
|
5.2.3.1 |
mmrestoreconfig |
| IJ54988 |
Critical |
This APAR minimizes the severity of the issue experienced during the erroneous processing of a DMAPI recall. It does not correct the underlying symptom; however, it reduces the impact for customers who experience this issue. The APAR provides additional diagnostics in trace as well as on the Linux kernel console.
(show details)
| Symptom |
Customers who experienced the logAssert (noted in the APAR title) will now receive a soft I/O error when trying to recall the file with a third-party DMAPI application. |
| Environment |
RHEL8 (x86_64, Power) and RHEL9 (x86_64, Power, Z) |
| Trigger |
The initial problem could not be recreated in the lab. |
| Workaround |
None |
|
5.2.3.1 |
DMAPI |
| IJ54969 |
High Importance
|
Kernel panic: general protection fault / ovl_dentry_revalidate_common / mmfsd, or running lsof /proc on a node crashes the node
(show details)
| Symptom |
Crash |
| Environment |
Linux Only |
| Trigger |
Running lsof /proc on a node crashes the node. |
| Workaround |
None |
|
5.2.3.1 |
Scale core |
| IJ54979 |
High Importance
|
With afmFastCreate enabled, if the Create that tries to push the initial chunk of data fails to complete and gets requeued, the requeued Create replays all of the data when it retries. Later, Write messages starting from the offset at which the Create initially went in flight are also replayed, totaling almost twice the file size in replicated data.
(show details)
| Symptom |
Unexpected Behaviour |
| Environment |
All Linux OS Environments (AFM Gateway nodes) |
| Trigger |
afmFastCreate replication failing initially because of a lock or network error, and replication being retried later. |
| Workaround |
Set a higher value of afmAsyncDelay so that replication is deferred while the file is still being written. |
|
5.2.3.1 |
AFM |
| IJ54655 |
High Importance
|
By default, clusters created with version 5.2.0 or later have the numaMemoryInterleave value set to yes. This should start the Storage Scale daemon with the interleave memory policy, but it does not.
(show details)
| Symptom |
Performance Impact/Degradation, Unexpected Results/Behavior |
| Environment |
ALL Linux OS environments |
| Trigger |
This issue affects customers running Storage Scale in Linux NUMA environment and the Storage Scale clusters created with version 5.2.0 or later. |
| Workaround |
Explicitly set numaMemoryInterleave=yes using the mmchconfig command: # mmchconfig numaMemoryInterleave=yes |
|
5.2.3.1 |
All Scale Users |
| IJ55083 |
HIPER |
mmap data on Windows nodes running Scale 5.1.9 PTF 10 may not be correctly written to disk
(show details)
| Symptom |
Data corruption |
| Environment |
Windows/x86_64 only |
| Trigger |
Write data from mmap applications on Windows. The data may not be written correctly to disk. |
| Workaround |
There is no workaround. The recommendation is to not run 5.1.9 PTF 10 on Windows nodes without this fix. |
|
5.2.3.1 |
All Scale Users |
| IJ55093 |
Critical |
Unexpected GPFS daemon assert could happen when file system has DMAPI enabled for use with DMAPI application
(show details)
| Symptom |
Abend/Crash |
| Environment |
ALL Operating System environments |
| Trigger |
File deletion on a DMAPI-enabled file system triggers a destroy event |
| Workaround |
Disable DMAPI on the file system |
|
5.2.3.1 |
DMAPI/HSM/TSM |
| IJ55094 |
Suggested |
When updating a resource via scalectl with the --url option, the update mask is not set, meaning the field might not get updated, or might result in validations being skipped
(show details)
| Symptom |
Unexpected Results/Behavior |
| Environment |
Linux Only |
| Trigger |
run scalectl <resource> update --url <host>:<port> |
| Workaround |
use scalectl without the --url option, or run a REST API request with the appropriate update mask |
|
5.2.3.1 |
Native Rest API |
| IJ55095 |
High Importance
|
mmafmctl getList subcommand deletes all .* files/directories in the current working directory because of a variable initialization issue in the mmafmctl script.
(show details)
| Symptom |
Unexpected Behavior |
| Environment |
All OS Environments |
| Trigger |
Running the mmafmctl getList subcommand from an important working directory like /root, where important OS-related files might exist. |
| Workaround |
Change the working directory to an empty directory in /tmp and run the mmafmctl getList subcommand from there. |
|
5.2.3.1 |
AFM |
| IJ55119 |
High Importance
|
If an accessing cluster has been authorized to access a list of filesets, updating resources on the owning cluster to remove one fileset is not effective: the original list of filesets can still be accessed. An accessing cluster may also be unable to access remote resources after a resource update that removes and then re-adds the resources.
(show details)
| Symptom |
Unexpected Results/Behavior |
| Environment |
Linux |
| Trigger |
Authorize access to fileset resources on the owning cluster via scalectl cluster remote authorize. Then remove a fileset via scalectl cluster remote update. Perform a series of authorize, remote mount, unauthorize, remote mount actions on file system resources. |
| Workaround |
Use mmauth to update resources |
|
5.2.3.1 |
Native REST API |
| IJ55120 |
Suggested |
If a remote file system definition is added using scalectl filesystem remote add, the Automount value may not be correct when viewing this definition with mmremotefs.
(show details)
| Symptom |
Unexpected Results/Behavior |
| Environment |
Linux |
| Trigger |
Add a remote file system definition via scalectl filesystem remote add. Display this file system definition via mmremotefs show. The value in the Automount column in the output of mmremotefs show shows 'mount = false' instead of 'no'. |
| Workaround |
None |
|
5.2.3.1 |
Native REST API |
| IJ55121 |
High Importance
|
In a resource definition file, it is invalid if fileset resources are specified without a matching file system resource, if the root fileset is not specified when there are fileset resources to authorize, or if a file system's disposition does not match its root fileset's disposition. In these cases, scalectl cluster remote authorize may grant some resources instead of returning an error.
(show details)
| Symptom |
Unexpected Results/Behavior |
| Environment |
Linux |
| Trigger |
No matching file system resource for the fileset resources, no root fileset specified in the fileset resources, or mismatched dispositions between a file system and its root fileset. |
| Workaround |
Ensure the resource definition file grants the correct resources |
|
5.2.3.1 |
Native REST API |
| IJ55141 |
High Importance
|
If replica compare is done on block 0 of a snapshot inode0 file while the same block is being updated, a false positive replica mismatch can happen.
(show details)
| Symptom |
Replica mismatch is reported for block 0 of snapshot inode0 file |
| Environment |
All platforms |
| Trigger |
Doing replica compare and updating snapshot inode0 file at the same time |
| Workaround |
None |
|
5.2.3.1 |
core GPFS |
| IJ53815 |
High Importance
|
During or after upgrade of manager nodes to 5.2.0.0+, deadlock can occur.
(show details)
| Symptom |
Cluster/File System Outage |
| Environment |
All Operating System environments |
| Trigger |
Manager node(s) are on version 5.2.0.0+. Client nodes are running a release prior to 5.1.9.0, or more than one file system is under token migration at the same time. |
| Workaround |
None |
|
5.2.3.0 |
All Scale Users |
| IJ53828 |
High Importance
|
If the customer executes the systemops command, it will allow any command to be executed, as there is no specific command validation in place.
(show details)
| Symptom |
None |
| Environment |
Linux Only |
| Trigger |
No such conditions |
| Workaround |
None |
|
5.2.3.0 |
No Such restriction |
| IJ54043 |
Medium Importance |
When a file system maintenance command (mmrestripefs) or a disk maintenance command (mm{ch,del,rpl}disk) runs, the 'thin inode' is deallocated and the emergency space is deleted. This is unexpected behavior and could be problematic if the file system hits an out-of-space (OOS) condition.
(show details)
| Symptom |
After one of the commands runs, the 'thin inode' in the internal dump resets to -1, indicating that the 'thin inode' is deallocated, and 'nBlocks' becomes 0, indicating that the emergency space is deleted.
[root@c145f11san04b sju]# mmfsadm dump stripe | egrep "State of|thin inode"
State of StripeGroup "test" at 0x18042A6A5B0, uid 7491A8C0:67D9B824, local id 1:
0: name 'system' Valid nDisks 32 nInUse 32 id 0 poolFlags 2 thin inode 41 nBlocks 519
1: name 'data' Valid nDisks 16 nInUse 16 id 65537 poolFlags 2 thin inode 42 nBlocks 526
2: name 'flash' Valid nDisks 8 nInUse 8 id 65538 poolFlags 2 thin inode 43 nBlocks 526
[root@c145f11san04b sju]# mmrestripefs test -R -N nc1
...
[root@c145f11san04b sju]# mmfsadm dump stripe | egrep "State of|thin inode"
State of StripeGroup "test" at 0x18042A6A5B0, uid 7491A8C0:67D9B824, local id 1:
0: name 'system' Valid nDisks 32 nInUse 32 id 0 poolFlags 2 thin inode -1 nBlocks 0
1: name 'data' Valid nDisks 16 nInUse 16 id 65537 poolFlags 2 thin inode -1 nBlocks 0
2: name 'flash' Valid nDisks 8 nInUse 8 id 65538 poolFlags 2 thin inode -1 nBlocks 0 |
| Environment |
Linux/AIX |
| Trigger |
Run one of the commands: mmrestripefs or mm{ch,del,rpl}disk |
| Workaround |
None |
|
5.2.3.0 |
thin-provisioning |
| IJ54044 |
High Importance
|
Because of a limitation in the current implementation of the reserved inode pool management, the 'thin inode' can erroneously be shared with the policy file while it is still assigned to the emergency space on file systems with SSS6K and FCM4 drives. This triggers an 'assert' because the shared inode becomes corrupted, and it can also cause file system metadata corruption, which has consequences of its own.
(show details)
| Symptom |
After mmchpolicy is run multiple times (with a file system manager change in between), the internal dump shows the policy file inode matching a pool's 'thin inode', indicating that the same inode is erroneously shared.
[root@c145f11san04b sju]# mmfsadm dump stripe | egrep "State of|thin inode"
State of StripeGroup "test" at 0x18042A6A5B0, uid 7491A8C0:67D9B824, local id 1:
0: name 'system' Valid nDisks 32 nInUse 32 id 0 poolFlags 2 thin inode 41 nBlocks 519
1: name 'data' Valid nDisks 16 nInUse 16 id 65537 poolFlags 2 thin inode 42 nBlocks 526
2: name 'flash' Valid nDisks 8 nInUse 8 id 65538 poolFlags 2 thin inode 43 nBlocks 526
[root@c145f11san04b sju]# mmchpolicy test /home/sju/policy-default
...
[root@c145f11san04b sju]# mmchmgr test c145f11san04a
[root@c145f11san04b sju]# mmchpolicy test /home/sju/policy-default
[root@c145f11san04b sju]# mmfsadm dump stripe | grep "policy file inode"
policy file inode: 41 |
| Environment |
Linux/AIX |
| Trigger |
Run 'mmchpolicy' command multiple times. |
| Workaround |
None |
|
5.2.3.0 |
thin-provisioning |
| IJ54045 |
Medium Importance |
To help control the issues (refer to D.341360, D.343470, and D.343471) with file systems created on SSS6K and FCM4 drives, a new option, 'thininode', is added to the tsdbfs command. This option is used to reset the 'thin inode'.
(show details)
| Symptom |
With the command, the 'thin inode' is reset to the value given on the tsdbfs command line.
[root@c145f11san04b sju]# mmfsadm dump stripe | egrep "State of|thin inode"
State of StripeGroup "test" at 0x18042A6A5B0, uid 7491A8C0:67D9B824, local id 1:
0: name 'system' Valid nDisks 32 nInUse 32 id 0 poolFlags 2 thin inode 41 nBlocks 519
1: name 'data' Valid nDisks 16 nInUse 16 id 65537 poolFlags 2 thin inode 42 nBlocks 526
2: name 'flash' Valid nDisks 8 nInUse 8 id 65538 poolFlags 2 thin inode 43 nBlocks 526
[root@c145f11san04b sju]# tsdbfs test patch desc thininode 0 -1
[root@c145f11san04b sju]# tsdbfs test patch desc thininode 1 -1
[root@c145f11san04b sju]# tsdbfs test patch desc thininode 2 -1
[root@c145f11san04b sju]# mmfsadm dump stripe | egrep "State of|thin inode"
State of StripeGroup "test" at 0x18042A6A5B0, uid 7491A8C0:67D9B824, local id 1:
0: name 'system' Valid nDisks 32 nInUse 32 id 0 poolFlags 2 thin inode -1 nBlocks 519
1: name 'data' Valid nDisks 16 nInUse 16 id 65537 poolFlags 2 thin inode -1 nBlocks 526
2: name 'flash' Valid nDisks 8 nInUse 8 id 65538 poolFlags 2 thin inode -1 nBlocks 526 |
| Environment |
Linux/AIX |
| Trigger |
Refer to D.343470 and D.343471 for the issues/symptoms and how to trigger them. |
| Workaround |
None |
|
5.2.3.0 |
thin-provisioning |
| IJ54079 |
High Importance
|
An application using the SMB server may invoke the gpfs_stat_x() call (available in libgpfs.so) to retrieve stat information for a file. This call implements "statlite" semantics, meaning that the size information is not assured to be the latest. Other applications, which invoke the standard stat()/fstat() calls, do expect the size information to be up to date. However, due to a problem in the logic, after gpfs_stat_x() is invoked, information is cached inside the kernel, and the cache is not purged even when other nodes change the file size (for example, by appending data to it). The result is that stat() invoked on the node may still retrieve out-of-date file size information as other nodes write to the file.
(show details)
| Symptom |
Unexpected Results/Behavior |
| Environment |
ALL Operating System environments |
| Trigger |
SMB applications invoking gpfs_stat_x() cause wrong file size information to be retrieved by stat()/fstat() invoked by other applications. |
| Workaround |
None |
|
5.2.3.0 |
All Scale Users |
| IJ54328 |
Critical |
Incorrect snapshot data (either stale or uninitialized) may be read while the mmchdisk start command is being executed on file systems with replication enabled.
(show details)
| Symptom |
Data corruption, snapshot data read may not be as expected. |
| Environment |
All platforms |
| Trigger |
The issue may happen if some data replicas are stale or uninitialized, and snapshot data is accessed while running the mmchdisk start command to repair the bad replicas. |
| Workaround |
Avoid accessing snapshot data while running the mmchdisk start command. |
|
5.2.3.0 |
GPFS core |