| IJ56483 |
High Importance
|
When the number of available file descriptors is exhausted, a logAssertFailed will be triggered.
| Symptom |
Error output/message |
| Environment |
ALL Operating System environments |
| Trigger |
All available file descriptors or sockets exhausted for the process. |
| Workaround |
Restart the mmfsd daemon. |
|
5.2.3.5 |
All Scale Users |
| IJ56498 |
High Importance
|
A race condition can lead to broken secure connections (cipherList set to AES128_SHA256 or AES256_SHA256) during connection reconnect, either when proactive reconnect is enabled or when a network issue triggers a connection disconnect in mmfsd.
| Symptom |
Customers may see error messages like the following in the GPFS message log files:
[N] Close connection to 10.28.2.47 c81f2u17vm1 <c0n2>:[0] (Unexpected error 22)
Unknown error: err = 22, error source = Error sending a message: ERR_SRC_SENDMSG_ERROR, error number = 19 |
| Environment |
ALL Operating System environments |
| Trigger |
Proactive reconnect is enabled, or a network issue triggers a connection disconnect. |
| Workaround |
Disable proactive reconnect (revert to the default setting) and ensure the network connectivity between the GPFS nodes is working correctly. |
|
5.2.3.5 |
All Scale Users |
| IJ51071 |
High Importance
|
The mmdefragfs command runs jobs across helper nodes. Each of those jobs can require finding owners of allocation regions. In some extreme cases, overlapping ownership queries can result in deadlocks.
| Symptom |
Deadlock |
| Environment |
ALL Operating System environments |
| Trigger |
mmdefragfs is running jobs on multiple nodes. If the ownership of allocation regions is unknown, those regions must be queried. The logic for tracking ownership dictates that if a region without a current owner is queried, the node issuing the query must request ownership. If the same regions are queried from multiple nodes while the ownership is initially unknown but then changes, this results in overlapping revoke requests, which leads to a deadlock. |
| Workaround |
Do not run mmdefragfs, or run mmdefragfs on a single node only. |
|
5.2.3.5 |
All Scale Users |
| IJ55679 |
Critical |
File system manager node could fail unexpectedly with assert exp((indIndex & 0xFF00000000000000ULL) == 0) in IndDesc.h. This could happen when expanding the number of allocated inodes on a file system that already has a very high number of allocated inodes.
| Symptom |
Abend/Crash |
| Environment |
ALL Operating System environments |
| Trigger |
Increasing the number of allocated inodes |
| Workaround |
Avoid creating new independent filesets and increasing the number of allocated inodes. |
|
5.2.3.5 |
All Scale Users |
| IJ56253 |
High Importance
|
If a filesystem has quotas enabled and a file is unlinked (its last directory entry removed) before a chown is performed, the chown call will fail with ENOENT, even though the file descriptor remains open and valid (see the C sketch after this entry).
| Symptom |
Unexpected Results/Behavior |
| Environment |
All Operating System environments |
| Trigger |
- Enable quota (file system or fileset level)
- Create a file, unlink it while the file descriptor is still valid.
- Set ownership for this file descriptor.
- Close the file descriptor. |
| Workaround |
Disable quotas |
|
5.2.3.5 |
Quotas |
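A minimal C sketch of the trigger sequence above, assuming a quota-enabled Scale file system; the mount-point path and the uid/gid passed to fchown() are hypothetical.

#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
    const char *path = "/gpfs/fs1/tmpfile";      /* hypothetical mount point */

    /* 1. Create a file. */
    int fd = open(path, O_CREAT | O_RDWR, 0600);
    if (fd < 0) { perror("open"); return 1; }

    /* 2. Unlink it while the descriptor is still valid. */
    if (unlink(path) != 0) { perror("unlink"); return 1; }

    /* 3. Set ownership through the open descriptor. On affected levels
     *    with quotas enabled, this is reported to fail with ENOENT. */
    if (fchown(fd, 1000, 1000) != 0)             /* hypothetical uid/gid */
        fprintf(stderr, "fchown: %s\n", strerror(errno));

    /* 4. Close the descriptor. */
    close(fd);
    return 0;
}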
| IJ56485 |
High Importance
|
Signal 11 in IterInodes::getSortedSnapshots() at filesys.h, resulting in a mmfsd crash.
| Symptom |
Abend/Crash |
| Environment |
ALL Operating System environments |
| Trigger |
Run batch snapshot workloads. |
| Workaround |
None |
|
5.2.3.5 |
Snapshots |
| IJ56486 |
High Importance
|
When mmfsckx finds that it does not have the minimum pagepool needed to run, it exits. On exiting, however, it was not deleting the fsckx snapshot, because the abort happened before it could initialize the internal vector lists it uses to delete the fsckx snapshot on exit.
| Symptom |
Unexpected Results/Behavior |
| Environment |
All |
| Trigger |
mmfsckx exiting due to insufficient pagepool available to run. |
| Workaround |
Use mmdelsnapshot to delete the snapshot. |
|
5.2.3.5 |
mmfsckx |
| IJ56487 |
Medium Importance |
Changes to the perfmon configuration are not propagated to nodes that were down at the time the change was made.
| Symptom |
The perfmon configuration is not updated. |
| Environment |
Linux Only |
| Trigger |
The perfmon configuration is changed while some nodes are down. |
| Workaround |
Reissue the mmperfmon command to update the configuration once all nodes are up, or run mmcommon run invokePerfmonctl update on the perfmon nodes that were down. |
|
5.2.3.5 |
Perfmon |
| IJ56491 |
Suggested |
The sysmon or syslog logs frequently display the following warning message:
statd_wrong WARNING The rpc.statd process is misconfigured.
This message indicates that the rpc.statd process is not configured properly.
| Symptom |
statd_wrong event in cluster |
| Environment |
Linux Only |
| Trigger |
This typically occurs because rpc.statd may spawn short-lived child processes as part of its normal operation. These child processes can briefly appear as multiple running instances. Additionally, sysmon performs health checks during the exit of these child processes, which can lead to the system reporting warnings that the rpc.statd process is misconfigured, even though this behavior is expected. |
| Workaround |
None |
|
5.2.3.5 |
NFS |
| IJ56619 |
Critical |
When running AIO, the thread submitting the I/O request is not the same as the one completing it. There is a race condition where an AIO request that is quickly completed is still accessed from the submitting thread. This results in either a kernel KFENCE warning or a node crash.
| Symptom |
Abend/Crash |
| Environment |
ALL Linux OS environments |
| Trigger |
Run AIO in a way that the requests are completed very quickly. This is workload dependent and might be hard to recreate. |
| Workaround |
There is no workaround; the fix is required to avoid this problem. |
|
5.2.3.5 |
All Scale Users |
| IJ56620 |
Critical |
When mmfsck detects a hole in a reserved file, it fills the hole by allocating a new disk address and adding that address to the file’s indirect block. It also updates its internal block allocation bitmap to mark the new block as in-use.
However, the internal block allocation bitmap is distributed across the scanning nodes. If the newly allocated block falls outside the region of the bitmap owned by the node that performed the allocation, the node may skip updating the bitmap. As a result, the block remains unmarked in the bitmap, which later leads mmfsck to falsely identify the block as lost. In repair mode, it then incorrectly marks the block as free. Later, when the file system is in use, it may reallocate this block to another file, resulting in duplicate block corruption.
| Symptom |
Operation failure due to FS corruption and SGPanic |
| Environment |
ALL Operating System environments |
| Trigger |
This issue can happen when mmfsck detects and repairs holes in reserved files. |
| Workaround |
Run mmfsck in repair mode (-y) again after the first repair run. |
|
5.2.3.5 |
FSCK |
| IJ56698 |
High Importance
|
GPFS daemon could fail unexpectedly with assert: exp(emptySduGroupCommit() || isSGPanicked). This could happen if a disk error causes a read/write of a directory block to fail.
| Symptom |
Abend/Crash |
| Environment |
ALL Operating System environments |
| Trigger |
Disk failure |
| Workaround |
None |
|
5.2.3.5 |
All Scale Users |
| IJ56699 |
High Importance
|
Upon installation, the gpfs.scaleapi rpm changes the ownership of directories and files under /var/mmfs from root to scaleapiadmd. It excludes files under /var/mmfs/ssl/keyServ, but not the RKM.conf file located in /var/mmfs/etc. Directories and files of file systems mounted under /var/mmfs are also affected. If RKM.conf has incorrect permissions or ownership, encryption may fail.
| Symptom |
- Incorrect permission/owner of files under other file systems mounted under /var/mmfs
- mmkeyserv command failure
- Incorrect permissions for the configuration file /var/mmfs/ssl/keyServ/RKM.conf in mmfs.log |
| Environment |
Linux only |
| Trigger |
Installation of the gpfs.scaleapi rpm package; mmkeyserv command failure; incorrect permissions/ownership of files under other file systems mounted under /var/mmfs |
| Workaround |
Manually change the ownership from scaleapiadmd/scaleapiadm back to root/root |
|
5.2.3.5 |
AFM, encryption |
|
Suggested |
When applications simultaneously use a writable, shared memory map (mmap) and perform regular write()/pwrite() operations to the same file that is subject to snapshot Copy-on-Write (COW), the file system can hit a three-way deadlock (see the C sketch after this entry). The cycle involves:
• a page-faulting mmap reader that triggers COW into a previous snapshot,
• a concurrent VMA change (e.g., mremap/munmap) that requires the kernel's mmap write semaphore,
• and a regular write path that holds the inode write lock and then page-faults on its user buffer (which also needs the mmap semaphore).
Once formed, the cycle blocks progress on the affected file and can ultimately lead to automatic deadlock breakup (filesystem panic/unmount) depending on configuration.
| Symptom |
• Application threads or system threads hang on file I/O to the affected file.
• Trace/logs show CopyDataOnWriteHandlerThread waiting on inode rf, a writer holding wa and blocked in a page fault, and a VMA operation holding/waiting on the mmap write semaphore.
• With deadlock breakup enabled, Scale may log multi-phase “deadlock breakup” and unmount/panic the impacted filesystem. |
| Environment |
All supported OS environments. |
| Trigger |
This issue affects customers that:
• use writable, shared mmap on files that may require snapshot COW, and
• perform regular write()/pwrite() to the same file, and
• occasionally execute VMA-altering operations such as mremap/munmap on the mapping.
A deadlock can occur when:
• An mmap page fault (“PF reader”) triggers CopyDataOnWrite for a prior snapshot and needs the inode rf lock.
• A concurrent writer holds the inode wa lock, then page-faults on its user buffer and must acquire the mmap semaphore.
• A concurrent mremap/munmap seeks the mmap semaphore as writer, blocking page-fault progress.
This forms a cycle (PF reader ↔ writer ↔ mremap) that stalls I/O on the file. |
| Workaround |
• Avoid concurrent VMA changes (mremap/munmap) while a file is actively accessed via writable shared mmap and regular writes on a snapshot-eligible file.
• Where feasible, separate write bursts from mmap page-fault activity on the same file, or map readers MAP_PRIVATE if application semantics allow.
(These are operational mitigations only; they do not fully prevent the issue.) |
|
5.2.3.5 |
GPFS/Scale — mmap, snapshot Copy-on-Write, locking. |
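A C sketch of the access pattern in this entry, assuming a hypothetical file path on a snapshot-eligible file system at least 1 MiB in size. It is not a deterministic reproducer: one thread faults pages of a shared writable mapping, one issues regular pwrite() calls, and one repeatedly maps and unmaps a second region (munmap being one of the VMA-altering operations named above, used here instead of mremap for safety).

#include <fcntl.h>
#include <pthread.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

static int fd;
static char *map;
static const size_t len = 1 << 20;   /* 1 MiB; file assumed at least this large */

static void *faulter(void *arg)      /* mmap page-fault path (may trigger COW) */
{
    (void)arg;
    for (int i = 0; i < 100000; i++)
        map[(size_t)(i * 4096) % len] = 1;
    return NULL;
}

static void *writer(void *arg)       /* regular write path */
{
    (void)arg;
    char buf[4096];
    memset(buf, 2, sizeof buf);
    for (int i = 0; i < 100000; i++)
        pwrite(fd, buf, sizeof buf, (off_t)((size_t)(i * 4096) % len));
    return NULL;
}

static void *vma_changer(void *arg)  /* VMA change needing the mmap write semaphore */
{
    (void)arg;
    for (int i = 0; i < 10000; i++) {
        void *m2 = mmap(NULL, len, PROT_READ, MAP_SHARED, fd, 0);
        if (m2 != MAP_FAILED)
            munmap(m2, len);
    }
    return NULL;
}

int main(void)
{
    fd = open("/gpfs/fs1/cowfile", O_RDWR);      /* hypothetical path */
    if (fd < 0) { perror("open"); return 1; }
    map = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (map == MAP_FAILED) { perror("mmap"); return 1; }

    pthread_t t[3];
    pthread_create(&t[0], NULL, faulter, NULL);
    pthread_create(&t[1], NULL, writer, NULL);
    pthread_create(&t[2], NULL, vma_changer, NULL);
    for (int i = 0; i < 3; i++)
        pthread_join(t[i], NULL);
    return 0;
}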
| IJ56680 |
Suggested |
mmbackup verifies directory size as one of the triggers to select objects to be sent to the IBM Storage Protect server. Since directory size is recalculated during restore, there is no need to re-back-up a directory if only its size differs. Hence, mmbackup will no longer verify size during the backup candidate selection process if the object is a directory.
| Symptom |
mmbackup may select unchanged directories as backup candidates |
| Environment |
ALL OS that supports mmbackup |
| Trigger |
run live fs backup and then run snapshot backup |
| Workaround |
none |
|
5.2.3.5 |
mmbackup |
| IJ56142 |
High Importance
|
With workloads that heavily look up or traverse symlinks, contention can occur inside GPFS. The problem is that every symlink lookup request from an application results in the symlink target being queried from the file system, resulting in possible contention on internal locks (see the C sketch after this entry).
| Symptom |
Performance Impact/Degradation |
| Environment |
ALL Linux OS environments |
| Trigger |
The problem is caused by heavily concurrent lookups of the same symlink by many threads. |
| Workaround |
There is no workaround. |
|
5.2.3.5 |
All Scale Users |
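A C sketch of the trigger: many threads resolving the same symlink concurrently, so that every lookup queries the symlink target from the file system. The symlink path and thread count are hypothetical.

#include <pthread.h>
#include <unistd.h>

#define NTHREADS 64                  /* hypothetical degree of concurrency */

static void *worker(void *arg)
{
    (void)arg;
    char buf[4096];
    for (int i = 0; i < 100000; i++) {
        /* Each resolution of the same symlink queries its target. */
        ssize_t n = readlink("/gpfs/fs1/shared-symlink", buf, sizeof buf);
        (void)n;
    }
    return NULL;
}

int main(void)
{
    pthread_t t[NTHREADS];
    for (int i = 0; i < NTHREADS; i++)
        pthread_create(&t[i], NULL, worker, NULL);
    for (int i = 0; i < NTHREADS; i++)
        pthread_join(t[i], NULL);
    return 0;
}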
| IJ56488 |
Suggested |
A hang can occur when three operations hit the same file at once:
a process touches a shared, writable mmap mapping and faults a page,
another thread/process performs mremap (needing the mmap write semaphore),
and a concurrent write()/pwrite() targets the same region.
Under certain timing, the page-fault path must fetch a file lock from the daemon while the writer is also fetching a conflicting lock. The result is a lock/semaphore cycle between the page-fault handler, the writer, and mremap, and I/O to that file can stall indefinitely.
| Symptom |
Threads hang in file operations; GPFS traces show the mmap page-fault path waiting on a fetched lock, a writer stuck on the mmap semaphore after initiating a daemon fetch, and mremap waiting for the semaphore upgrade. No progress until GPFS services are restarted.
Fix description (high level):
Extend the existing mmap uXfer (“borrowed lock”) fast-path into the daemon fetch path. When the kernel’s lock attempt requires a fetch, the daemon can—under safe conditions—temporarily “borrow” a read lock for the page-fault request and signal the kernel to proceed, breaking the cycle while preserving correctness. (Normal lock/token ownership is finalized once the fetch completes; error paths are handled so the kernel falls back safely if borrowing isn't possible.)
|
| Environment |
ALL OS environments |
| Trigger |
File is mmap'd MAP_SHARED|PROT_WRITE (or read-only with faults against the same region) while a concurrent write()/pwrite() targets the same range.
A mremap occurs concurrently, contending on the mmap semaphore.
Lock acquisition in the kernel returns E_NEED_FETCH and both the page-fault path and the writer rely on the daemon to fetch/upgrade the inode lock; specific timing can create a cyclic wait.
|
| Workaround |
None practical. (Avoiding concurrent mmap access and mremap/writes to the same region prevents the issue but is often not feasible.) |
|
5.2.3.5 |
All Scale Users |
| IJ56490 |
Critical |
Creating a new storage pool with allowWriteAffinity set to YES could lead to unexpected FSSTRUCT error if only a single small disk is used during storage pool creation.
| Symptom |
Error output/message |
| Environment |
ALL Operating System environments |
| Trigger |
Creating a new storage pool with allowWriteAffinity set to YES |
| Workaround |
Add multiple disks when creating a new storage pool with allowWriteAffinity set to YES. |
|
5.2.3.5 |
All Scale Users |
| IJ56734 |
High Importance
|
When reading from snapshot files, applications may encounter unexpected non-zero data in blocks that were never written to in the original (root) filesystem. These blocks were part of a pre-allocated file but remained uninitialized, and therefore should logically contain zeros. The error occurs because the snapshot exposes raw, uninitialized disk contents (garbage data) at these locations. This issue is specific to snapshots and does not occur when reading from the root filesystem, where such blocks are correctly interpreted as zero. (A C sketch of block pre-allocation follows this entry.)
| Symptom |
Unexpected Results/Behavior |
| Environment |
ALL Operating System environments |
| Trigger |
If both the snapshot and the root filesystem contain a block that was pre-allocated but never written to, reading from the snapshot may return uninitialized data ("garbage") instead of zeroes. |
| Workaround |
None |
|
5.2.3.5 |
Snapshots |
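As an illustration of how a "pre-allocated but never written" block arises, the following C sketch reserves blocks with fallocate() without writing them; the path and size are hypothetical. Reading the region through the live file system returns zeros, whereas on affected levels reading the same region through a snapshot could return uninitialized disk contents.

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    int fd = open("/gpfs/fs1/prealloc", O_CREAT | O_RDWR, 0644);  /* hypothetical */
    if (fd < 0) { perror("open"); return 1; }

    /* Reserve 1 MiB of disk blocks; the blocks are allocated but never
     * written, so they should logically read back as zeros. */
    if (fallocate(fd, 0, 0, 1 << 20) != 0)
        perror("fallocate");

    close(fd);
    return 0;
}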
| IJ56735 |
High Importance
|
Ganesha crashed at nfs3_create() -> fsal_internal_close():
(gdb) bt
#0 0x00007fcff18b3bbf in raise ()
#1 0x00007fcff37bd193 in crash_handler
#2 <signal handler called>
#3 0x00007fcff10f752f in raise ()
#4 0x00007fcff10cae65 in abort ()
#5 0x00007fcff10cad39 in __assert_fail_base.cold.0 ()
#6 0x00007fcff10efe86 in __assert_fail ()
#7 0x00007fcfebccd6ec in fsal_internal_close (fd=0, owner=0x0, cflags=0)
#8 0x00007fcfebcc240c in gpfs_reopen_func (obj_hdl=0x7fcecc387e30, openflags=67, fsal_fd=0x7fcecc387ee0)
#9 0x00007fcfebcc2954 in open_by_handle (obj_hdl=0x7fcecc387e30, state=0x0, openflags=67, createmode=FSAL_UNCHECKED, verifier=0x7fcf3f7b5e4c "", attrs_out=0x7fcf3f7b5a70)
#10 0x00007fcfebcc3322 in gpfs_open2 (obj_hdl=0x7fcfce7a2080, state=0x0, openflags=67, createmode=FSAL_UNCHECKED, name=0x7fcecc3d5400 "version.txt", attr_set=0x7fcf3f7b5f60, verifier=0x7fcf3f7b5e4c "", new_obj=0x7fcf3f7b5b70, attrs_out=0x7fcf3f7b5a70, caller_perm_check=0x7fcf3f7b5c2f)
#11 0x00007fcff38c6718 in mdcache_open2 (obj_hdl=0x7fcfdad0a8a8, state=0x0, openflags=67, createmode=FSAL_UNCHECKED, name=0x7fcecc3d5400 "version.txt", attrs_in=0x7fcf3f7b5f60, verifier=0x7fcf3f7b5e4c "", new_obj=0x7fcf3f7b6080, attrs_out=0x7fcf3f7b5e60, caller_perm_check=0x7fcf3f7b5c2f)
#12 0x00007fcff37915b6 in open2_by_name (in_obj=0x7fcfdad0a8a8, state=0x0, openflags=67, createmode=FSAL_UNCHECKED, name=0x7fcecc3d5400 "version.txt", attr=0x7fcf3f7b5f60, verifier=0x7fcf3f7b5e4c "", obj=0x7fcf3f7b6080, attrs_out=0x7fcf3f7b5e60)
#13 0x00007fcff3793e7d in fsal_open2 (in_obj=0x7fcfdad0a8a8, state=0x0, openflags=67, createmode=FSAL_UNCHECKED, name=0x7fcecc3d5400 "version.txt", attr=0x7fcf3f7b5f60, verifier=0x7fcf3f7b5e4c "", obj=0x7fcf3f7b6080, attrs_out=0x7fcf3f7b5e60)
#14 0x00007fcff38a6f36 in nfs3_create (arg=0x7fcecca10db0, req=0x7fcecca10580, res=0x7fceccaa03b0)
#15 0x00007fcff37b81c7 in nfs_rpc_process_request (reqdata=0x7fcecca10580, retry=false)
#16 0x00007fcff37b8764 in nfs_rpc_valid_NFS (req=0x7fcecca10580)
#17 0x00007fcff352deee in svc_vc_decode (req=0x7fcecca10580)
#18 0x00007fcff352911a in svc_request (xprt=0x7fcfc536cd70, xdrs=0x7fcfcd737830)
#19 0x00007fcff352ddf3 in svc_vc_recv (xprt=0x7fcfc536cd70)
#20 0x00007fcff352909a in svc_rqst_xprt_task_recv (wpe=0x7fcfc536d038)
#21 0x00007fcff35355bd in work_pool_thread (arg=0x7fcfca4779d0)
#22 0x00007fcff18a91ca in start_thread ()
#23 0x00007fcff10e28d3 in clone ()
| Symptom |
Crash |
| Environment |
Linux Only |
| Trigger |
This issue affects only NFSv3. A crash in the nfs-ganesha service may occur when an NFS client concurrently performs a file creation operation and attempts to fix a broken symbolic link pointing to the same file. |
| Workaround |
None |
|
5.2.3.5 |
NFS-Ganesha |
| IJ56736 |
Suggested |
Starting in 5.2.3.0, gpfs.base required OpenSSL libraries, specifically for mmfsd. Although the binary required the libraries because some symbols were defined, they were unused. This was introduced with the release of the IBM Storage Scale native REST API feature. All communication from scaleadmd to mmfsd is done over a local Unix domain socket, and SSL is not in use.
| Symptom |
Installs a package that is required but unused |
| Environment |
Linux Only |
| Trigger |
Install gpfs.base |
| Workaround |
None |
|
5.2.3.5 |
Linux Only |
| IJ56737 |
Medium Importance |
In version 5.2.3, the gpfs.snap command collects a listing of all files and directories under /var/mmfs. If AFM mount target paths or large filesystems are mounted under /var/mmfs, the command can take an extended amount of time or appear to hang.
| Symptom |
The command appears to hang or takes an unusually long time to complete. |
| Environment |
All |
| Trigger |
Running gpfs.snap on systems with large AFM mount targets. |
| Workaround |
Unmount remote AFM targets or other file systems under /var/mmfs before invoking the gpfs.snap command. |
|
5.2.3.5 |
AFM, gpfs.snap command |
| IJ56738 |
High Importance
|
GPFS daemon thread VdiskMetadataWorkerThread is deadlocked with 'wait for GNR buffers from steal thread'
| Symptom |
Daemon deadlock |
| Environment |
Linux Only |
| Trigger |
Heavy workload with limited amount of buffers. |
| Workaround |
None |
|
5.2.3.5 |
GNR |
| IJ56747 |
Suggested |
The sysmon or syslog logs frequently display the following warning messages:
[I] Event raised: The rpc.statd process is running.
[W] Event raised: The rpc.statd process is running multiple times.
These messages indicate that multiple instances of the rpc.statd process are being detected.
| Symptom |
statd_multiple event in cluster |
| Environment |
Linux Only |
| Trigger |
This typically occurs because rpc.statd may spawn short-lived child processes as part of its normal operation. These child processes can briefly appear as multiple running instances. Additionally, sysmon performs health checks during the exit of these child processes, which can lead to the system reporting warnings about multiple instances of rpc.statd running, even though this behavior is expected. |
| Workaround |
None |
|
5.2.3.5 |
NFS |
| IJ55302 |
High Importance
|
Reading a compressed file could fail unexpectedly with an E_INVAL error. This could happen when reading the last block of a compressed file.
| Symptom |
IO error |
| Environment |
ALL Operating System environments |
| Trigger |
Concurrent reads of a compressed file on the same node |
| Workaround |
Avoid concurrent reads of a compressed file on the same node |
|
5.2.3.4 |
GPFS Native Compression |
| IJ55097 |
High Importance
|
Storage Scale's file system encryption functionality does not allow the use of user-provided certificates that do not strictly conform to RFC 5280. This fix allows the use of certificates with policies that do not conform to RFC 5280.
| Symptom |
The simplified setup for file system encryption will fail; or when using regular setup, retrieving keys from the key server may fail. |
| Environment |
All |
| Trigger |
The problem is triggered by non-RFC 5280-compliant certificates. |
| Workaround |
Use certificates that conform to the RFC 5280 specification. |
|
5.2.3.4 |
Encryption |
| IJ55699 |
High Importance
|
Resync is not able to fix the ACLs for some directories where ACLs were set before enabling AFM at home.
| Symptom |
ACLs are not in sync |
| Environment |
All Linux OS environments |
| Trigger |
AFM caching without enabling AFM at home, with ACLs set on directories. |
| Workaround |
None |
|
5.2.3.4 |
AFM |
| IJ55700 |
High Importance
|
Recovery is not able to create a hard link to a soft link and fails with error 22.
| Symptom |
Recovery is not progressing, with a Link operation in the queue. |
| Environment |
All Linux OS environments |
| Trigger |
AFM caching in recovery with a hard link operation in the queue that was created for a soft link file. |
| Workaround |
None |
|
5.2.3.4 |
AFM |
| IJ55701 |
High Importance
|
mmfsck and mmfsckx cannot detect and repair corruption in a directory that contains CDITTOs.
| Symptom |
FSSTRUCTs |
| Environment |
All |
| Trigger |
Unknown |
| Workaround |
None |
|
5.2.3.4 |
mmfsck and mmfsckx |
| IJ55287 |
Critical |
Scale 5.2.2 and newer can leak kernel memory when running mmap workloads.
| Symptom |
Leak of kernel memory. |
| Environment |
ALL Linux OS environments |
| Trigger |
Running mmap workloads where GPFS has to handle many mmap writeback requests can hit a codepath where an asynchronous queue is full, and a synchronous fallback codepath allocates memory without freeing it. This is more likely hit with heavy mmap workloads that map large file ranges and then only issue partial writes to the mapped area (e.g. write one page, skip one page, write one page, etc.). That results in many small mmap write requests in GPFS, making it more likely to fill the asynchronous queue (see the C sketch after this entry). |
| Workaround |
Reducing the mmap workload might reduce the risk of the leak, but there is no guarantee. |
|
5.2.3.4 |
All Scale Users |
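The trigger pattern can be sketched in C as follows, assuming a hypothetical file at least as large as the mapping: map a large range and dirty only every other page, producing many small mmap writeback requests.

#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
    const size_t len = 256UL << 20;              /* 256 MiB; size hypothetical */
    int fd = open("/gpfs/fs1/bigfile", O_RDWR);  /* hypothetical path */
    if (fd < 0) { perror("open"); return 1; }

    char *map = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (map == MAP_FAILED) { perror("mmap"); return 1; }

    /* Write one page, skip one page: each dirty page becomes a separate
     * small mmap write request inside GPFS. */
    for (size_t off = 0; off < len; off += 2 * 4096)
        map[off] = 1;

    msync(map, len, MS_SYNC);                    /* force the writeback requests */
    munmap(map, len);
    close(fd);
    return 0;
}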
| IJ55169 |
High Importance
|
Typically, mmchmgr should not be run while mmfsck is in progress. This was not being handled in the code, so mmchmgr had to wait for mmfsck to complete, causing long waiters and other commands to queue up behind mmchmgr.
| Symptom |
Hang/Deadlock/Unresponsiveness/Long Waiters |
| Environment |
ALL Operating System environments |
| Trigger |
Run mmchmgr while mmfsck is in progress. |
| Workaround |
Do not run mmchmgr while mmfsck is running. |
|
5.2.3.4 |
mmchmgr |
| IJ55719 |
High Importance
|
GPFS failed the assertion "(quotaFlags & QUOTA_UNKNOWN_FLAGS) == 0" while extending the file audit log file in a quota enabled file system.
| Symptom |
GPFS daemon fails with the subject assertion. |
| Environment |
ALL Operating System environments |
| Trigger |
When extending the audit log file, the quotaFlags were not initialized, leading to the assertion. |
| Workaround |
None |
|
5.2.3.4 |
File audit logging and quotas |
| IJ55709 |
High Importance
|
Customers will see errors from mmuserauth create if the AD password starts with a hyphen and is given on the command line.
| Symptom |
Error output/message |
| Environment |
Linux Only |
| Trigger |
The customer has an AD password starting with a hyphen and tries to run mmuserauth create from the command line. |
| Workaround |
The customer can pass a password that starts with a hyphen to mmuserauth using a password file. |
|
5.2.3.4 |
CES |
| IJ55720 |
High Importance
|
With the simplified setup for file system encryption, when the KMIP client and server certificates are signed by CA certificate chains that have common certificates (e.g., the same CA root and possibly intermediate certificates), the 'mmkeyserv client register' command fails with error 71, as a result of the GKLM server returning a generic error code, instead of the expected one, in its response to the mmkeyserv command.
| Symptom |
Failure to register a mmkeyserv client. |
| Environment |
AIX, Linux |
| Trigger |
The use of KMIP client and server certificates signed by CA certificate chains with shared certificates. |
| Workaround |
Either (1) use self-signed, system-generated KMIP client certificates; or (2) invoke the 'mmkeyserv client register' command again (the GKLM server will return the expected return code in its response to the mmkeyserv command). |
|
5.2.3.4 |
GPFS Core |
| IJ55569 |
Critical |
In rare cases, an encrypted file system may panic and get unmounted as a result of unnecessarily checking a directory inode for an encryption context.
| Symptom |
Cluster/File System Outage |
| Environment |
ALL Operating System environments |
| Trigger |
Users creating many files in directories in encrypted file systems from many nodes in the cluster may trigger a special code path that mishandles the accessing of such directories when a node tries to become a metanode for such directories' inodes. |
| Workaround |
Disable the stat cache by setting maxStatCache=0 |
|
5.2.3.4 |
Encryption |
| IJ55721 |
High Importance
|
Daemon can hit a logAssert, resulting in the daemon recycle.
| Symptom |
frequent inode space expansion messages in mmfs.log |
| Environment |
ALL Operating System environments |
| Trigger |
A file create workload that results in inode space expansion can trigger the problem. |
| Workaround |
Perform manual inode-space expansion of all the inode spaces. |
|
5.2.3.4 |
File creation/Inode allocation. |
| IJ55865 |
HIPER |
The security advisory RHSA-2025:15668 by Red Hat for RHEL 9.4 includes a kernel upgrade to kernel version 5.14.0-427.88.1.el9_4. This update introduces an incompatibility with IBM Storage Scale's kernel modules for existing IBM Storage Scale levels and causes a node to crash on the startup of Scale.
| Symptom |
On startup, IBM Storage Scale causes an access violation in the kernel that results in a node crash and reboot. |
| Environment |
All Linux platforms supporting Scale on RHEL 9.4: x86_64, s390, ppc64le, aarch64 |
| Trigger |
The issue is hit when the RHEL 9.4 kernel is upgraded to 5.14.0-427.88.1.el9_4 and IBM Storage Scale is not at the latest fix level. |
| Workaround |
There is no workaround available other than either installing an efix or reverting the kernel to a supported level < 5.14.0-427.88.1.el9_4. |
|
5.2.3.4 |
Scale as a whole - node crashes on startup. |
| IJ55891 |
Medium Importance |
GPFS daemon assert going off: Assert exp(getInfoGeoExtensionLen>0) at line 8941 of nsdDiskConfig.C, resulting in a GPFS daemon crash.
| Symptom |
Assert |
| Environment |
Linux Only |
| Trigger |
This is triggered by a race condition between a thread that fetches the extended attributes (nsdMsgGetInfoX) and a Resign thread that tries to close the NSD, which results in the disk geometry pointer geoP being set to NULL and causes the assert. |
| Workaround |
None |
|
5.2.3.4 |
GNR |
| IJ55285 |
High Importance
|
File audit logging will start leaking memory, leading to mmfsd causing an OOM event.
| Symptom |
Abend/Crash on OOM |
| Environment |
Linux Only |
| Trigger |
For the memory leak to happen, the customer has to intentionally remove audit logs that are still being used by file audit logging. |
| Workaround |
Prevent deleting audit logs that are still being appended to by file audit logging. |
|
5.2.3.4 |
file audit logging |
| IJ54571 |
High Importance
|
A race between token revoke and buffer steal could lead the GPFS daemon to fail with signal 11 or an assert. It could also lead to an unexpected FSSTRUCT error being issued.
| Symptom |
Abend/Crash |
| Environment |
ALL Operating System environments |
| Trigger |
Concurrent operations on the same directory from multiple nodes |
| Workaround |
None |
|
5.2.3.4 |
All Scale Users |
| IJ55893 |
High Importance
|
When running IOR Hard Write on hundreds of client nodes with data shipping enabled, the server nodes may get into long waiters at the end of the run, preventing the application from ending.
| Symptom |
Long Waiters |
| Environment |
All platforms |
| Trigger |
Many client nodes used in an IOR Hard Write run with data shipping enabled |
| Workaround |
None |
|
5.2.3.4 |
Data Shipping |
| IJ55894 |
High Importance
|
On Linux systems, upgrading Storage Scale to a newer version could partially fail if the GPFS kernel module was not completely unloaded, resulting in the node having different versions of the gpfs.base and gpfs.gpl packages. In such a case, the mmbuildgpl command can still succeed, which results in unexpected behavior. The mmbuildgpl command must check for the same version of the gpfs.base and gpfs.gpl packages before proceeding with the build.
| Symptom |
Unexpected Results/Behavior |
| Environment |
ALL Linux OS environments |
| Trigger |
Upgrade Scale to a new version while the kernel module is not completely unloaded and then run mmbuildgpl to build the portability layer of the newly installed version. |
| Workaround |
Install the correct version of gpfs.base and gpfs.gpl packages. |
|
5.2.3.4 |
All Scale Users |
| IJ56314 |
Suggested |
Cluster quorum loss may occur when re
| Symptom |
Cluster shuts down |
| Environment |
All |
| Trigger |
A non-quorum ESS BB server reporting communication problems with its quorum node partner. |
| Workaround |
None |
|
5.2.3.4 |
GPFS Core |
| IJ55680 |
Medium Importance |
When client nodes leave or rejoin the cluster, lock contention occurs due to unnecessary server disk operations.
| Symptom |
Performance degradation. |
| Environment |
Linux Only |
| Trigger |
Client nodes leave or rejoin a large cluster. |
| Workaround |
None |
|
5.2.3.4 |
GNR/NSD |
| IJ56058 |
High Importance
|
Even when an application closes files created with the O_TMPFILE flag, these files do not get cleaned up from the Scale cache. As a result, when the cache grows beyond the maxFilesToCache value, AsyncStealWorkerThread is triggered to clean up entries from the cache. But since these O_TMPFILE files still have a valid VFS reference, this thread is unable to remove them from the cache. This causes AsyncStealWorkerThread to run continuously, driving up mmfsd CPU usage. Customers may always end up seeing AsyncStealWorkerThread in the 'mmdiag --waiters' output (see the C sketch after this entry).
| Symptom |
Performance Impact/Degradation |
| Environment |
ALL Operating System environments |
| Trigger |
Creating lots of files with the O_TMPFILE flag, exceeding the Scale cache limit (as described in the problem description above) |
| Workaround |
Clean the dentry cache using the command "echo 3 > /proc/sys/vm/drop_caches" |
|
5.2.3.4 |
Core |
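A minimal C sketch of the trigger, assuming a hypothetical directory on a Scale file system and an iteration count large enough to push the cache past maxFilesToCache.

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    for (int i = 0; i < 100000; i++) {           /* hypothetical count */
        /* O_TMPFILE creates an unnamed file in the given directory. */
        int fd = open("/gpfs/fs1/scratch", O_TMPFILE | O_RDWR, 0600);
        if (fd < 0) { perror("open"); return 1; }
        if (write(fd, "x", 1) < 0)
            perror("write");
        /* On affected levels the closed file lingers in the Scale cache. */
        close(fd);
    }
    return 0;
}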
| IJ56059 |
High Importance
|
The mmsysmon daemon wrongly computes the length of multibyte Unicode characters, e.g. "Münster" as a city name.
This leads to receiving more bytes than expected per the mmsysmon UDS protocol if any non-ASCII characters are sent.
This leads to, e.g., SSS call home failover not working at all if non-ASCII characters are part of the call home config.
In mmhealth this could lead to inconsistently reproducible errors if any Scale entities (e.g. fileset names) use non-ASCII characters.
| Symptom |
Component Level Outage |
| Environment |
ALL Linux OS environments |
| Trigger |
Using non-ASCII characters for any settings of Scale. |
| Workaround |
Avoid using non-ASCII characters for any settings of Scale. |
|
5.2.3.4 |
Callhome, System Health |
| IJ56060 |
Suggested |
When applications use both mmap and Direct I/O (DIO) on the same file concurrently, a deadlock can occur due to conflicting byte-range locks (brLock) taken by the two access paths. Once the deadlock occurs, I/O operations on the affected file hang indefinitely and cannot be recovered without restarting GPFS services (see the C sketch after this entry).
| Symptom |
Applications or system threads hang on file operations. GPFS trace logs show page fault handlers, DIO threads, and brLock waiting on each other in a cycle. No progress is made until GPFS services are restarted. |
| Environment |
ALL OS environments |
| Trigger |
This issue affects customers whose applications mix use of memory-mapped I/O (mmap) and Direct I/O (O_DIRECT) to the same file. The problem occurs when:
An application opens a file with mmap (shared, writable) and accesses pages through normal memory operations.
At the same time, another process or thread issues Direct I/O operations (O_DIRECT reads or writes) to the same file.
Both access paths attempt to lock overlapping byte ranges in the file, resulting in a deadlock.
Once the deadlock occurs, all further I/O to the affected file hangs indefinitely until GPFS is restarted. |
| Workaround |
None |
|
5.2.3.4 |
All Scale Users |
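The conflicting access paths can be sketched in C as follows (not a deterministic reproducer); the file path is hypothetical and the file is assumed to be at least 1 MiB.

#define _GNU_SOURCE
#include <fcntl.h>
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>

#define LEN (1 << 20)   /* 1 MiB; file assumed at least this large */

static void *mmap_path(void *arg)    /* page faults take byte-range locks */
{
    char *map = arg;
    for (int i = 0; i < 100000; i++)
        map[(size_t)(i * 4096) % LEN] = 1;
    return NULL;
}

static void *dio_path(void *arg)     /* O_DIRECT reads take conflicting locks */
{
    (void)arg;
    int fd = open("/gpfs/fs1/mixed", O_RDONLY | O_DIRECT);   /* hypothetical */
    if (fd < 0) { perror("open O_DIRECT"); return NULL; }
    void *buf = NULL;
    if (posix_memalign(&buf, 4096, 4096) != 0) { close(fd); return NULL; }
    for (int i = 0; i < 100000; i++)
        pread(fd, buf, 4096, (off_t)((size_t)(i * 4096) % LEN));
    free(buf);
    close(fd);
    return NULL;
}

int main(void)
{
    int fd = open("/gpfs/fs1/mixed", O_RDWR);                /* hypothetical */
    if (fd < 0) { perror("open"); return 1; }
    char *map = mmap(NULL, LEN, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (map == MAP_FAILED) { perror("mmap"); return 1; }

    pthread_t t1, t2;
    pthread_create(&t1, NULL, mmap_path, map);
    pthread_create(&t2, NULL, dio_path, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    return 0;
}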
| IJ56063 |
High Importance
|
QoS statistics reporting did not correctly handle the --seconds parameter. Instead of displaying statistics for the specified number of seconds relative to the current clock time, it was showing data relative to the most recent statistics cached in daemon memory.
| Symptom |
Unexpected Results/Behavior |
| Environment |
ALL Operating System environments. |
| Trigger |
Displaying QoS statistics after a period of inactivity. |
| Workaround |
None |
|
5.2.3.4 |
QoS. |
| IJ56143 |
High Importance
|
Async IO for restricted cases requiring a fallback to compat mode is not possible with GDS, resulting in an error for the IO request.
| Symptom |
Error |
| Environment |
Linux Only |
| Trigger |
Trigger an async IO for one of the restricted cases from a GDS client, e.g., replicated files |
| Workaround |
None |
|
5.2.3.4 |
GDS |
| IJ56144 |
Suggested |
1. Dir-list-file prefetch should spill out bad directories to the failed list when --enable-failed-file-list is passed.
2. Prefetch determines the list file to be a home-list-file because of a logical comparison failure.
| Symptom |
Unexpected behavior. |
| Environment |
All OS environments. |
| Trigger |
1. Dir-list-file prefetch with wrong directories in the list.
2. A case of prefetch using a list file where the cache and home paths are exactly the same. |
| Workaround |
None |
|
5.2.3.4 |
AFM |
| IJ51457 |
High Importance
|
When running file audit logging, audit events are written to Audit Log Files in the Audit Log fileset (.audit_log, by default). If these files are compressed while file audit logging is actively writing to them, the active files in Audit Log that are being written to can become corrupted and unrecoverable. Audit files are compressed automatically after mmfsd is done writing to them, and the audit records are not intended to be compressed before mmfsd is done writing to them. Because file system activity is usually present all the time, it is likely that there will be an active audit log that is being written to on each node at any point in time.
| Symptom |
corruption |
| Environment |
Linux Only |
| Trigger |
A “file system struct error” will be triggered by the issue, and compression or decompression will fail against the affected audit records.
The /var/log/messages file (or the output of the errpt command on AIX) might contain an entry similar to the following:
Error=MMFS_FSSTRUCT, ID=0x94B1F045, Tag=12662454: Invalid disk data structure. Error code 113 |
| Workaround |
Do not compress audit logs manually |
|
5.2.3.4 |
File Audit Logging |
| IJ56160 |
Critical |
File system manager node could fail unexpectedly with assert exp((indIndex & 0xFF00000000000000ULL) == 0) in IndDesc.h. This could happen when expanding the number of allocated inodes on a file system that already has a very high number of allocated inodes.
| Symptom |
Abend/Crash |
| Environment |
ALL Operating System environments |
| Trigger |
Increasing the number of allocated inodes |
| Workaround |
Avoid creating new independent filesets and increasing the number of allocated inodes. |
|
5.2.3.4 |
All Scale Users |
| IJ56251 |
HIPER |
When submitting an aio request, a data structure is still accessed after queueing and potentially completing the aio request. That results in access to already freed memory. This goes unnoticed in many cases, unless the workload is very high and the freed memory is immediately reused; in that case, it results in a kernel crash (see the C sketch after this entry).
| Symptom |
Abend/Crash |
| Environment |
ALL Linux OS environments |
| Trigger |
This problem is hit when running Scale 5.2.3 or higher with a high aio workload. This happens during "goodpath" I/O; no error conditions need to occur. |
| Workaround |
The problem has been introduced in Scale 5.2.3. Unless the fix is applied, one way to avoid this problem is to stay on a Scale level lower than 5.2.3. Reducing the I/O workload might avoid this problem, but this cannot be guaranteed. |
|
5.2.3.4 |
All Scale Users |
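As a rough illustration of the trigger only (not the failing code itself), the following C sketch uses POSIX AIO to submit many small writes that complete almost immediately, the situation in which freed request memory is most likely to be reused quickly. The path is hypothetical; link with -lrt on older glibc.

#include <aio.h>
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
    int fd = open("/gpfs/fs1/aiofile", O_CREAT | O_RDWR, 0644); /* hypothetical */
    if (fd < 0) { perror("open"); return 1; }

    static char buf[4096];
    memset(buf, 'a', sizeof buf);

    for (int i = 0; i < 100000; i++) {
        struct aiocb cb;
        memset(&cb, 0, sizeof cb);
        cb.aio_fildes = fd;
        cb.aio_buf    = buf;
        cb.aio_nbytes = sizeof buf;
        cb.aio_offset = (off_t)(i % 256) * (off_t)sizeof buf;

        if (aio_write(&cb) != 0) { perror("aio_write"); break; }

        /* Small cached writes complete almost immediately, which is the
         * timing window described above. */
        while (aio_error(&cb) == EINPROGRESS)
            ;                        /* busy-wait, for the sketch only */
        aio_return(&cb);
    }
    close(fd);
    return 0;
}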
| IJ56313 |
High Importance
|
mmfsck and mmfsckx cannot detect and repair corruption in a directory that contains CDITTOs.
| Symptom |
FSSTRUCTs |
| Environment |
All |
| Trigger |
Unknown |
| Workaround |
None |
|
5.2.3.4 |
mmfsck and mmfsckx |
| IJ54955 |
High Importance
|
GPFS daemon could fail with assert unexpectedly during file repair. This could happen when there is a race between file repair and indirect block updates on the same file.
| Symptom |
Abend/Crash |
| Environment |
ALL Operating System environments |
| Trigger |
Run file repair via mmrestripefile, mmrestripefs, mmchdisk, etc. |
| Workaround |
None |
|
5.2.3.4 |
All Scale Users |
| IJ55915 |
Medium Importance |
An SGPanic() could be triggered for various reasons. Some of the reasons are:
- when a node failed internally and left the cluster
- when metadata writes fail (with err=5, EIO) or hit an OOS (Out-Of-Space) condition
- other critical failures
As part of S.311931, changes were made in StripeGroup::handleForceUnmount(). The changes were to prevent the file system from being remounted when SGPanic() is triggered for an OOS (Out-Of-Space) condition, but they had a side effect (which is described in this defect).
| Symptom |
File system is not being remounted upon SGPanic() |
| Environment |
Linux and AIX |
| Trigger |
SGPanic() |
| Workaround |
Mount the file system manually with the mmmount command. |
|
5.2.3.4 |
file system mount and {re,un}mount |
| IJ55192 |
Critical |
In rare cases, an encrypted file system may panic and get unmounted as a result of a directory inode being unnecessarily checked for an encryption context.
| Symptom |
Cluster/File System Outage |
| Environment |
ALL Operating System environments |
| Trigger |
Users creating many files in directories in encrypted file systems from many nodes in the cluster may trigger a special code path that mishandles the accessing of such directories when a node tries to become a metanode for such directories' inodes. |
| Workaround |
Disable the stat cache by setting maxStatCache=0 |
|
5.2.3.3 |
Encryption |
| IJ54956 |
High Importance
|
During file sharing, missing node information led to a crash.
| Symptom |
Abend/Crash |
| Environment |
ALL Linux OS environments |
| Trigger |
The file accessed from a remote cluster has an access control list. |
| Workaround |
Adding information to the communication context about the nodes that grant access and access the files. |
|
5.2.3.3 |
Remote cluster mount/UID remapping |
| IJ54792 |
High Importance
|
Unable to add a new disk with thin provisioning when attempting to do so with mmadddisk. The command failed with 'Disk 'PRD_ABITEST14_01' mismatch, it doesn't support 'UNMAP' to reclaim space.'.
| Symptom |
The command will fail with the error message like, "Disk 'PRD_ABITEST14_08' mismatch, it is not allowed to have both thin and non-thin disks in the system pool.". |
| Environment |
Linux Only |
| Trigger |
Run mmadddisk with a stanza file (containing a list of NSDs). The stanza file would have thinDiskType={scsi | auto}. If the sysfs attribute (rotational, typically located under the queue directory, e.g. /sys/devices/virtual/block/dm-1/queue/rotational) for the disk(s) is 0, indicating SSD, a different thinDiskType (i.e., nvme) will be returned, which is a 'mismatch'. |
| Workaround |
None |
|
5.2.3.3 |
thin-provisioning. |
| IJ54254 |
Critical |
Lookup on a directory could get stuck in an endless loop if there is a directory block with corruption.
| Symptom |
Hang/Deadlock/Unresponsiveness/Long Waiters |
| Environment |
ALL Operating System environments |
| Trigger |
Lookup on a directory with corrupted block |
| Workaround |
Run offline mmfsck on the file system to repair any directory corruption. |
|
5.2.3.3 |
All Scale Users |
| IJ55321 |
Critical |
When Direct IO is performed on an AFM uncached file, the DIO path skips the needed AFM caching path if DIO is desired. This causes data to appear corrupted (all 0s).
| Symptom |
Data Corruption |
| Environment |
All OS Environments |
| Trigger |
Direct IO Read on an AFM uncached file. |
| Workaround |
mmchconfig dioDisable=1 -i |
|
5.2.3.3 |
AFM |
| IJ55350 |
High Importance
|
File system level migration, or any file system with multiple filesets, needs checkDirty and checkUncached to be able to run at the file system level for complete checks. An enhancement is also needed to support -s, similar to -g, for all policy invocations of AFM.
| Symptom |
Unexpected Behavior |
| Environment |
All OS Environments |
| Trigger |
Running mmafmctl checkDirty and checkUncached commands at the Filesystem level where AFM filesets are present. |
| Workaround |
Run at the individual fileset level only. |
|
5.2.3.3 |
AFM |
| IJ54084 |
Suggested |
In a file system configured with the (default) "relatime" setting, if nodes only read files (but not write to them), while others stat those files, stat() will not provide an updated value for atime. This will affect applications that count on updated atime to determine whether files have been accessed recently.
| Symptom |
Unexpected Results/Behavior |
| Environment |
ALL Operating System environments |
| Trigger |
The file system is created with -S set to its default value ("relatime"). Applications read the content of files but seldom write to them. Applications that perform 'stat' on the files run on nodes other than the nodes where the reads take place. |
| Workaround |
Set the (undocumented) forceAttributeRefresh configuration parameter, which will force nodes to retrieve updated stat info. For example: mmchconfig forceAttributeRefresh=60 -i |
|
5.2.3.3 |
ALL Operating System environments |
| IJ55304 |
High Importance
|
The assert can sometimes happen due to a token reference count leak.
| Symptom |
Abend/Crash |
| Environment |
ALL Operating System environments |
| Trigger |
When token transfer goes through a certain code path |
| Workaround |
Disabling the assert may be OK in most cases. |
|
5.2.3.3 |
All Scale Users |
| IJ55373 |
High Importance
|
On a large file system, tsapolicy may not free all queue elements processed during the directory scan, which could result in OOM. During the directory scan, tsapolicy allocates queue elements to hold each directory entry for scanning and assigns a unique correlation number to each queue element. The correlation number is used as a watermark for the server process to free queue elements that have been processed when a client completes its assignment. But it is an int32 type and can overflow on a large file system. The number needs to be reset to 1 when it overflows int32.
| Symptom |
Component Level Outage |
| Environment |
all platforms that support mmapplypolicy |
| Trigger |
run mmapplypolicy on a large file system |
| Workaround |
none |
|
5.2.3.3 |
mmapplypolicy |
| IJ55376 |
High Importance
|
A check is called too often, which can be problematic when many disks are checked, leading to waiters that affect IO performance.
Check /var/adm/ras/mmsysmonitor.log for
[I] Timeout RunCmd Command /usr/lpp/mmfs/bin/mmremote getLocalNsdData -X timed out after 42 sec. Sending SIGTERM
and check /var/adm/ras/mmfs.log for "waiters".
| Symptom |
Error output/message Slow IO |
| Environment |
ALL Operating System environments |
| Trigger |
A check is called too often, which can be problematic when many disks are checked |
| Workaround |
As a quick fix, the check can be disabled using mmchconfig mmhealth-disk-check_nsd=False --force; after the update, set the parameter to True to re-enable the check |
|
5.2.3.3 |
System Health |
| IJ55379 |
High Importance
|
The timeout test result is not consistent on AMD EPYC Turin processors. If the test passes, the GSKit hang workaround will not be applied, which causes problems later.
| Symptom |
Installation and admin commands hang. |
| Environment |
Linux OS environments |
| Trigger |
This problem affects AMD EPYC-Turin. |
| Workaround |
Manually apply the workaround. |
|
5.2.3.3 |
Admin Commands, gskit |
| IJ55381 |
Suggested |
mmfs.log: logAssertFailed: maxExpellableQuorumNodes>=0
| Symptom |
Abend/Crash |
| Environment |
All |
| Trigger |
Cluster manager trying to process an expel request after quorum has been lost. |
| Workaround |
None |
|
5.2.3.3 |
GPFS Core |
| IJ55406 |
Suggested |
The 'mmkeyserv tenant delete' command fails to remove the tenant definition from the Storage Scale cluster when the tenant no longer exists on the key server.
| Symptom |
Command failure |
| Environment |
AIX and Linux |
| Trigger |
The tenant was removed from the key server prior to invoking the 'mmkeyserv tenant delete' command. This occurs on newer versions of GKLM. |
| Workaround |
Reissue the command with the --force option. |
|
5.2.3.3 |
Admin Commands Encryption |
| IJ55407 |
High Importance
|
An IO operation from NFS Ganesha may not always have the client IP address information. With FAL and NFS Ganesha enabled, if a previous NFS IO operation for a file system had the client IP address information associated with it, this IP can potentially be used in audit events for an NFS IO operation of another file system, resulting in a mismatch of NFS client IPs in events of different file systems.
| Symptom |
Unexpected Results/Behavior |
| Environment |
Linux |
| Trigger |
- Enable File Audit Logging for at least two file systems
- Have NFS Ganesha running
- Have at least two NFS clients that mount the exports of the file systems to the same CES node.
- Generate IOs from the clients to the exports.
- A small set of events in the audit logs of each file system would contain incorrect/mismatched NFS client IPs. |
| Workaround |
None |
|
5.2.3.3 |
File Audit Logging, NFS |
| IJ55408 |
High Importance
|
During AFM migration, files deleted on the target are not being removed from the local cache if the parent is dirty. "mmafmctl checkUncached" command reports these files are uncached.
| Symptom |
Unexpected results. |
| Environment |
All OS Environments |
| Trigger |
AFM migration with deleted files/dirs at the target |
| Workaround |
None |
|
5.2.3.3 |
AFM |
| IJ55647 |
Critical |
Files from the AFM cache may be incorrectly deleted or moved to the .ptrash directory when using the afmFastCreate option. A file may be incorrectly deleted from the cache when a newly created file is renamed.
| Symptom |
Unexpected Results |
| Environment |
All OS Environments |
| Trigger |
Using afmFastCreate option with AFM caching. |
| Workaround |
Disable afmFastCreate option |
|
5.2.3.3 |
AFM |
| IJ55648 |
Critical |
A deadlock may occur in the AFM environment when afmFastLookup is disabled, due to a lock ordering issue. This can lead to cluster-wide hangs.
| Symptom |
Deadlock |
| Environment |
All OS Environments |
| Trigger |
AFM caching under high workload |
| Workaround |
Enable afmFastLookup option |
|
5.2.3.3 |
AFM |
| IJ55409 |
High Importance
|
Parallel read considers only unique remote-site-mapped gateway nodes for spawning READ_SPLIT messages, except for the GPFS backend. The same should be considered for the object backend as well, because all gateway nodes will be mapped to the same remote target.
| Symptom |
Unexpected behavior |
| Environment |
Linux Only |
| Trigger |
Read of a large object on the AFM COS backend with a mapping target |
| Workaround |
None |
|
5.2.3.3 |
AFM |
| IJ55188 |
Suggested |
If the utimensat() system call (which is used by the "touch" command) is issued on a file shortly after a previous invocation, it may not take effect and fail to update the file's modification time ("mtime"). A C sketch follows this entry.
| Symptom |
Unexpected Results/Behavior |
| Environment |
ALL Linux OS environments |
| Trigger |
Issuing the utimensat() system call on a given file multiple times in quick succession (< 1 second). |
| Workaround |
Whenever feasible, wait at least 1 second between subsequent invocations of utimensat() on the same file. |
|
5.2.3.3 |
All Scale Users |
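A minimal C sketch of the trigger, assuming a hypothetical existing file: two utimensat() calls on the same file within the same second, after which mtime may not reflect the second call on affected levels.

#include <fcntl.h>
#include <stdio.h>
#include <sys/stat.h>

int main(void)
{
    const char *path = "/gpfs/fs1/touched";      /* hypothetical existing file */

    /* A NULL times argument sets atime/mtime to the current time,
     * which is what "touch" does. */
    if (utimensat(AT_FDCWD, path, NULL, 0) != 0) { perror("utimensat"); return 1; }
    if (utimensat(AT_FDCWD, path, NULL, 0) != 0) { perror("utimensat"); return 1; }

    struct stat st;
    if (stat(path, &st) == 0)
        printf("mtime: %ld.%09ld\n", (long)st.st_mtim.tv_sec,
               (long)st.st_mtim.tv_nsec);
    return 0;
}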
| IJ55601 |
Medium Importance |
A previous network issue can lead to a log/trace entry like
totalReceived == scatteredP->scattered_total_len || (totalReceived == 0 && scatteredIndex == scatteredP->scattered_count)
followed by an assert causing the node to stop.
| Symptom |
Assert/Crash |
| Environment |
Linux Only |
| Trigger |
A network failure triggering some state counters/variables in an undefined state |
| Workaround |
None |
|
5.2.3.3 |
Scale |
| IJ55167 |
High Importance
|
IBM has identified a potential security leak or data access loss issue for files created from SMB clients. The issue may appear when SMB clients create files in folders that use ACL inheritance to change ACLs (additional access for groups, reduced access for a user's primary group) from the default access mask.
| Symptom |
incorrect ACL written |
| Environment |
Linux Only |
| Trigger |
File creation via SMB protocol in folders with ACL inheritance |
| Workaround |
None |
|
5.2.3.2 |
CES SMB |
| IJ55170 |
High Importance
|
This issue often shows up when running git clone into an NFS-mounted directory.
Below is an example of the error that may occur:
$ git clone https://github.com/jupp0r/prometheus-cpp
Cloning into 'prometheus-cpp'...
remote: Enumerating objects: 5577, done.
remote: Counting objects: 100% (1562/1562), done.
remote: Compressing objects: 100% (373/373), done.
remote: Total 5577 (delta 1287), reused 1189 (delta 1189), pack-reused 4015 (from 2)
Receiving objects: 100% (5577/5577), 1.32 MiB | 7.94 MiB/s, done.
fatal: could not open '/mnt/nfs4/prometheus-cpp/.git/objects/pack/tmp_pack_fADmRg' for reading: Permission denied
fatal: fetch-pack: invalid index-pack output
| Symptom |
Permission denied error |
| Environment |
Linux Only |
| Trigger |
Permission denied error encountered during git clone |
| Workaround |
None |
|
5.2.3.2 |
NFS |
| IJ55184 |
High Importance
|
Scale 5.2.3 PTF1 and PTF2 contain a code change leading to possible slower read performance.
The problem exists in 5.2.3 PTF1, and the fix is in 5.2.3 PTF2.
| Symptom |
Slower performance than expected. |
| Environment |
ALL Linux OS environments and "Windows/x86_64" |
| Trigger |
Any regular read workload can incur additional overhead. This has been specifically observed with the ior hard read benchmark, but could affect any read workload. |
| Workaround |
Do not use Scale 5.2.3 PTF1 or PTF2. |
|
5.2.3.2 |
All Scale Users |
| IJ54628 |
High Importance
|
Not able to read an uncached file during resync when the AFM queue is in queueOnly state.
| Symptom |
Uncached file read failure during Resync |
| Environment |
Linux Only |
| Trigger |
Read of an uncached file while AFM resync is queueing ops on the gateway node. |
| Workaround |
None |
|
5.2.3.1 |
AFM |
| IJ53214 |
High Importance
|
With FAL and NFS Ganesha enabled, running workloads with a path to an NFS export for long periods of time could result in NFS client IPs not being logged in the audit log.
| Symptom |
Unexpected Results/Behavior |
| Environment |
Linux |
| Trigger |
- With FAL and NFS Ganesha enabled, run workloads with a path to the NFS mount point for long periods of time |
| Workaround |
- Restart NFS Ganesha if NFS client IPs are not being logged |
|
5.2.3.1 |
File Audit Logging, NFS |
| IJ54629 |
High Importance
|
mmrestorefs recreates all files and directories that were deleted after the snapshot was taken. If the deleted file is a special file, mmrestorefs uses the mknod() system call to create the file. But mknod() cannot create a socket file on AIX. Hence, if socket files were deleted after the snapshot was taken, mmrestorefs on AIX will fail while re-creating the socket file.
| Symptom |
Component Level Outage |
| Environment |
AIX only |
| Trigger |
Run mmrestorefs when a socket file was deleted after the snapshot was taken. |
| Workaround |
none |
|
5.2.3.1 |
mmrestorefs |
| IJ54802 |
High Importance
|
If the mmrestripefs command is issued while the mmreclaimspace command is running, an assert can be expected.
| Symptom |
Abort |
| Environment |
Linux / AIX |
| Trigger |
On a file system with thin provisioning (or space reclamation only) enabled, run mmrestripefs while the mmreclaimspace command is running for space reclamation |
| Workaround |
None |
|
5.2.3.1 |
space-reclamation |
| IJ54804 |
High Importance
|
Weighted RGCM Log group rebalance issue.
| Symptom |
Abend |
| Environment |
Linux Only |
| Trigger |
Slightly different log group weights on the same server, which might not balance the heavily weighted log groups. |
| Workaround |
None |
|
5.2.3.1 |
ESS/GNR |
| IJ54783 |
High Importance
|
When trying to install Storage Scale on Windows with latest Cygwin version (3.6.1), the installation can fail due to security issues.
| Symptom |
Upgrade/Install failure. |
| Environment |
Windows/x86_64 only |
| Trigger |
Upgrading Cygwin to version 3.6.1 before trying to install Storage Scale on Windows |
| Workaround |
Downgrade Cygwin to version 3.6.0 or below before attempting to install Storage Scale on Windows |
|
5.2.3.1 |
Install/Upgrade |
| IJ53557 |
High Importance
|
GPFS asserted due to unexpected hold count on events exporter object during destructor.
| Symptom |
Assert |
| Environment |
All platforms |
| Trigger |
A race condition between EventsExporterReceiverThread and EventsExporterListenThread and an error path where the destructor is called |
| Workaround |
None |
|
5.2.3.1 |
All Scale Users |
| IJ54868 |
Suggested |
The '(' character in the undefined value for the default needs to be escaped. Failing that, the propagation of the config to other nodes throws a syntax error.
| Symptom |
Unexpected Behavior |
| Environment |
All OS Environments |
| Trigger |
Tune the afmRecoveryDir back to its default value. |
| Workaround |
None |
|
5.2.3.1 |
AFM |
| IJ54878 |
High Importance
|
If the dependent fileset is created as a non-root user and linked, then the uid/gid are not replicated for the dependent fileset to the remote site.
| Symptom |
Unexpected Behavior |
| Environment |
Linux Only |
| Trigger |
Create and Link dependent fileset inside DR primary fileset as a non-root user. |
| Workaround |
None |
|
5.2.3.1 |
AFM |
| IJ54968 |
High Importance
|
Opening a new file with O_RDWR|O_CREAT fails with EINVAL.
(show details)
| Symptom |
File creation returns an error of EINVAL |
| Environment |
Linux Only |
| Trigger |
Unknown |
| Workaround |
None |
|
5.2.3.1 |
Scale Core |
| IJ54967 |
High Importance
|
Crash during cxiStrcpy in setSecurityXattr.
(show details)
| Symptom |
Crash |
| Environment |
Linux Only |
| Trigger |
File creation with SELinux enabled. |
| Workaround |
None |
|
5.2.3.1 |
Scale core |
| IJ54966 |
High Importance
|
Kernel crash with SELinux enabled
(show details)
| Symptom |
Crash |
| Environment |
Linux Only |
| Trigger |
File creation with SELinux enabled. |
| Workaround |
None |
|
5.2.3.1 |
Scale core |
| IJ54965 |
High Importance
|
NFSv4 ACLs are not replicated with the AFM fileset-level options afmSyncNFSV4ACL and afmNFSV4
(show details)
| Symptom |
Unexpected results |
| Environment |
Linux Only |
| Trigger |
Using options afmSyncNFSV4ACL and afmNFSV4 to replicate NFSv4 ACLs. |
| Workaround |
None |
|
5.2.3.1 |
AFM |
| IJ54963 |
High Importance
|
Symlinks are appended with a null character, which causes the pwd -P command to fail to resolve the real path.
(show details)
| Symptom |
Unexpected results |
| Environment |
Linux Only |
| Trigger |
AFM caching with symlinks |
| Workaround |
None |
|
5.2.3.1 |
AFM |
| IJ54962 |
High Importance
|
Snapshots are not listed under the .snapshots directory when AFM is enabled on the file system
(show details)
| Symptom |
Unexpected results |
| Environment |
All OS environments |
| Trigger |
Listing snapshots when AFM is enabled on the file system |
| Workaround |
None |
|
5.2.3.1 |
AFM |
| IJ54975 |
Suggested |
"mmhealth cluster show" my report an additional GUI pod after upgrade or rebalancing.
(show details)
| Symptom |
Unexpected Results/Behavior |
| Environment |
OpenShift (CNSA) |
| Trigger |
CNSA Upgrade or other rebalancing action of GUI pods. |
| Workaround |
Moving the cluster manager node (mmchmgr) will ensure a resync of the data. "mmhealth node show -a --resend" will do the same |
|
5.2.3.1 |
System Health |
| IJ54976 |
High Importance
|
Nodes accessing the AFM fileset crash when an attempt is made to disable the fileset online with the "mmchfileset -p afmTarget=disable-online" command
(show details)
| Symptom |
Crash |
| Environment |
Linux Only |
| Trigger |
AFM fileset disable-online |
| Workaround |
None |
|
5.2.3.1 |
AFM |
| IJ54983 |
High Importance
|
File Audit Logging uses an internal data structure to keep track of NFS client IP addresses for NFS IOs coming from Ganesha. The CES nodes can crash during garbage collection of this structure due to a use-after-free error caused by a race condition.
(show details)
| Symptom |
Abend/Crash |
| Environment |
Linux |
| Trigger |
- File audit logging is enabled on a file system with NFS Ganesha running.
- A large amount of IO running to NFS exports. |
| Workaround |
- Disable File Audit Logging, or
- Avoid NFS IOs when FAL is enabled |
|
5.2.3.1 |
File Audit Logging, NFS |
| IJ54984 |
High Importance
|
Assert exp(getChildId().isValid()) is hit during a read operation if mmafmtransfer is restarted
(show details)
| Symptom |
Crash |
| Environment |
Linux Only |
| Trigger |
Kill the mmafmtransfer daemon while a read is in the queue |
| Workaround |
None |
|
5.2.3.1 |
AFM |
| IJ54985 |
High Importance
|
When mmchdisk start encounters a corrupted inode that fails inode validation, it does not produce the interesting-inode list showing the bad inode number. Because of this, the user cannot determine the affected inode number and must rely on long-running traces to get this information.
(show details)
| Symptom |
pit.interestingInodes file not generated/populated |
| Environment |
ALL Operating System environments |
| Trigger |
When mmchdisk start encounters a corrupted inode |
| Workaround |
Capture long-running traces and provide them to support to get this information |
|
5.2.3.1 |
PIT |
| IJ54986 |
High Importance
|
When a file is accessed through mmap while the same file is accessed, or other operations are performed on it, from other nodes, there is a small chance of a race condition leading to a logAssert
(show details)
| Symptom |
Abend/Crash |
| Environment |
ALL Linux OS environments |
| Trigger |
Access parts of an mmapped file initially on one node while there is concurrent access to, or concurrent operations on, the same file from other nodes |
| Workaround |
Avoid the concurrent operations on other nodes while the file is accessed on one node |
|
5.2.3.1 |
All Scale Users |
| IJ54593 |
High Importance
|
During token minimization, a deadlock can occur on a client node. With token minimization, a client node is first asked to give up any tokens that are only for cached files. Without the fix, calling this codepath for files that have been deleted, could result in a deadlock.
(show details)
| Symptom |
Hang/Deadlock/Unresponsiveness/Long Waiters |
| Environment |
ALL Linux OS environments |
| Trigger |
Have many files cached on a client node. Delete files. Trigger a token server change, which then uses token minimization. |
| Workaround |
Disable token minimization to avoid the problem: mmchconfig tokenXferMinimization=no. Or restart GPFS on the client node, to get out of the deadlock. |
|
5.2.3.1 |
All Scale Users |
| IJ54987 |
High Importance
|
mmrestoreconfig restores file system configuration information, which includes fileset information. When recreating AFM filesets, mmrestoreconfig tries to restore the afmShowHomeSnapshot attribute, but AFM does not allow setting the afmShowHomeSnapshot attribute for an IW cache mode fileset. Hence mmrestoreconfig will fail if there is an IW cache mode fileset.
(show details)
| Symptom |
Component Level Outage |
| Environment |
all platforms that support mmrestoreconfig |
| Trigger |
Run mmrestoreconfig for a file system that contains an IW cache mode fileset |
| Workaround |
none |
|
5.2.3.1 |
mmrestoreconfig |
| IJ54988 |
Critical |
This APAR minimizes the severity of the issue experienced during the erroneous processing of a DMAPI recall. It does not correct the underlying symptom; however, it reduces the impact for customers who experience this issue. The APAR provides additional diagnostics in trace as well as on the Linux kernel console.
(show details)
| Symptom |
Customers who experienced the logAssert (noted in the APAR title) will now receive a soft I/O error when trying to recall the file with a third-party DMAPI application. |
| Environment |
RHEL8 (x86_64, Power) and RHEL9 (x86_64, Power, Z) |
| Trigger |
The initial problem could not be recreated in the lab. |
| Workaround |
None |
|
5.2.3.1 |
DMAPI |
| IJ54969 |
High Importance
|
Kernel panic: general protection fault / ovl_dentry_revalidate_common / mmfsd, or running lsof /proc on a node crashes the node
(show details)
| Symptom |
Crash |
| Environment |
Linux Only |
| Trigger |
Running lsof /proc on a node crashes the node. |
| Workaround |
None |
|
5.2.3.1 |
Scale core |
| IJ54979 |
High Importance
|
With afmFastCreate enabled, if the Create that tries to push the initial chunk of data fails to complete and gets requeued, the requeued Create replays all of the data when it retries. Later, Write messages starting from the offset at which the Create initially went in flight are also replayed, totaling almost twice the file size in replicated data.
(show details)
| Symptom |
Unexpected Behaviour |
| Environment |
All Linux OS Environments (AFM Gateway nodes) |
| Trigger |
afmFastCreate replication failing initially because of a lock or network error, and replication being retried later. |
| Workaround |
Set a higher value of afmAsyncDelay so that replication is deferred while the file is still being written. |
|
5.2.3.1 |
AFM |
| IJ54655 |
High Importance
|
By default, clusters created with version 5.2.0 or later have the numaMemoryInterleave value set to yes. This should start the Storage Scale daemon with the interleave memory policy, but it does not.
(show details)
| Symptom |
Performance Impact/Degradation, Unexpected Results/Behavior |
| Environment |
ALL Linux OS environments |
| Trigger |
This issue affects customers running Storage Scale in Linux NUMA environment and the Storage Scale clusters created with version 5.2.0 or later. |
| Workaround |
Explicitly set numaMemoryInterleave=yes using the mmchconfig command: # mmchconfig numaMemoryInterleave=yes |
|
5.2.3.1 |
All Scale Users |
| IJ55083 |
HIPER |
mmap data on Windows nodes running Scale 5.1.9 PTF 10 may not be correctly written to disk
(show details)
| Symptom |
Data corruption |
| Environment |
Windows/x86_64 only |
| Trigger |
Write data from mmap applications on Windows. The data may not be written correctly to disk. |
| Workaround |
There is no workaround. The recommendation is to not run 5.1.9 PTF 10 on Windows nodes without this fix. |
|
5.2.3.1 |
All Scale Users |
| IJ55093 |
Critical |
Unexpected GPFS daemon assert could happen when file system has DMAPI enabled for use with DMAPI application
(show details)
| Symptom |
Abend/Crash |
| Environment |
ALL Operating System environments |
| Trigger |
File deletion on a DMAPI-enabled file system triggers a destroy event |
| Workaround |
Disable DMAPI on the file system |
|
5.2.3.1 |
DMAPI/HSM/TSM |
| IJ55094 |
Suggested |
When updating a resource via scalectl with the --url option, the update mask is not set, meaning the field might not get updated, or might result in validations being skipped
(show details)
| Symptom |
Unexpected Results/Behavior |
| Environment |
Linux Only |
| Trigger |
run scalectl <resource> update --url <host>:<port> |
| Workaround |
use scalectl without the --url option, or run a REST API request with the appropriate update mask |
|
5.2.3.1 |
Native Rest API |
| IJ55095 |
High Importance
|
mmafmctl getList subcommand deletes all .* files/directories in the current working directory because of a variable initialization issue in the mmafmctl script.
(show details)
| Symptom |
Unexpected Behavior |
| Environment |
All OS Environments |
| Trigger |
Running the mmafmctl getList subcommand from an important working directory like /root, where important OS-related files might exist. |
| Workaround |
Change the working directory to an empty directory in /tmp and run the mmafmctl getList subcommand from there. |
|
5.2.3.1 |
AFM |
| IJ55119 |
High Importance
|
If an accessing cluster has been authorized to access a list of filesets, updating resources on the owning cluster to remove one fileset is not effective: the original list of filesets can still be accessed. An accessing cluster may also be unable to access remote resources after a resource update that removes and then re-adds the resources.
(show details)
| Symptom |
Unexpected Results/Behavior |
| Environment |
Linux |
| Trigger |
Authorize access to fileset resources on the owning cluster via scalectl cluster remote authorize. Then remove a fileset via scalectl cluster remote update. Perform a series of authorize, remote mount, unauthorize, remote mount actions on file system resources. |
| Workaround |
Use mmauth to update resources |
|
5.2.3.1 |
Native REST API |
| IJ55120 |
Suggested |
If a remote file system definition is added using scalectl filesystem remote add, the Automount value may not be correct when viewing this definition with mmremotefs.
(show details)
| Symptom |
Unexpected Results/Behavior |
| Environment |
Linux |
| Trigger |
Add a remote file system definition via scalectl filesystem remote add. Display this file system definition via mmremotefs show. The value in the Automount column in the output of mmremotefs show shows 'mount = false' instead of 'no'. |
| Workaround |
None |
|
5.2.3.1 |
Native REST API |
| IJ55121 |
High Importance
|
In a resource definition file, it is invalid if fileset resources are specified without a matching file system resource, if the root fileset is not specified when there are fileset resources to authorize, or if a file system's disposition does not match its root fileset's disposition. In these cases, scalectl cluster remote authorize may grant some resources instead of returning an error.
(show details)
| Symptom |
Unexpected Results/Behavior |
| Environment |
Linux |
| Trigger |
No matching file system resource for the fileset resources, no root fileset specified in the fileset resources, or mismatched dispositions between a file system and its root fileset. |
| Workaround |
Ensure the resource definition file grants the correct resources |
|
5.2.3.1 |
Native REST API |
| IJ55141 |
High Importance
|
If replica compare is done on block 0 of a snapshot inode0 file while the same block is being updated, a false positive replica mismatch can happen.
(show details)
| Symptom |
Replica mismatch is reported for block 0 of snapshot inode0 file |
| Environment |
All platforms |
| Trigger |
Doing replica compare and updating snapshot inode0 file at the same time |
| Workaround |
None |
|
5.2.3.1 |
core GPFS |
| IJ53815 |
High Importance
|
During or after upgrade of manager nodes to 5.2.0.0+, deadlock can occur.
(show details)
| Symptom |
Cluster/File System Outage |
| Environment |
All Operating System environments |
| Trigger |
Manager node(s) are on version 5.2.0.0+. Client nodes are running a release prior to 5.1.9.0, or more than one file system is under token migration at the same time. |
| Workaround |
None |
|
5.2.3.0 |
All Scale Users |
| IJ53828 |
High Importance
|
If the customer executes the systemops command, it will allow any command to be executed, as there is no specific command validation in place.
(show details)
| Symptom |
None |
| Environment |
Linux Only |
| Trigger |
No such conditions |
| Workaround |
None |
|
5.2.3.0 |
No Such restriction |
| IJ54043 |
Medium Importance |
When a file system maintenance command (mmrestripefs) or a disk maintenance command (mm{ch,del,rpl}disk) runs, the 'thin inode' is deallocated and the emergency space is deleted. This is unexpected behavior and could be problematic if the file system hits an out-of-space (OOS) condition.
(show details)
| Symptom |
After one of the commands runs, the 'thin inode' in the internal dump resets to -1, indicating that the 'thin inode' is deallocated, and 'nBlocks' becomes 0, indicating that the emergency space is deleted.
[root@c145f11san04b sju]# mmfsadm dump stripe | egrep "State of|thin inode"
State of StripeGroup "test" at 0x18042A6A5B0, uid 7491A8C0:67D9B824, local id 1:
0: name 'system' Valid nDisks 32 nInUse 32 id 0 poolFlags 2 thin inode 41 nBlocks 519
1: name 'data' Valid nDisks 16 nInUse 16 id 65537 poolFlags 2 thin inode 42 nBlocks 526
2: name 'flash' Valid nDisks 8 nInUse 8 id 65538 poolFlags 2 thin inode 43 nBlocks 526
[root@c145f11san04b sju]# mmrestripefs test -R -N nc1
...
[root@c145f11san04b sju]# mmfsadm dump stripe | egrep "State of|thin inode"
State of StripeGroup "test" at 0x18042A6A5B0, uid 7491A8C0:67D9B824, local id 1:
0: name 'system' Valid nDisks 32 nInUse 32 id 0 poolFlags 2 thin inode -1 nBlocks 0
1: name 'data' Valid nDisks 16 nInUse 16 id 65537 poolFlags 2 thin inode -1 nBlocks 0
2: name 'flash' Valid nDisks 8 nInUse 8 id 65538 poolFlags 2 thin inode -1 nBlocks 0 |
| Environment |
Linux/AIX |
| Trigger |
Run one of the commands: mmrestripefs or mm{ch,del,rpl}disk |
| Workaround |
None |
|
5.2.3.0 |
thin-provisioning |
| IJ54044 |
High Importance
|
Because of a limitation in the current implementation of the reserved inode pool management, the 'thin inode' can erroneously be shared with the policy file while it is still assigned to the emergency space on file systems with SSS6K and FCM4 drives. This triggers an 'assert' because the shared inode becomes corrupted, and it can also cause file system metadata corruption, which has consequences of its own.
(show details)
| Symptom |
After mmchpolicy is run multiple times (with a file system manager change in between), the internal dump shows the policy file inode matching a pool's 'thin inode', indicating that the same inode is erroneously shared.
[root@c145f11san04b sju]# mmfsadm dump stripe | egrep "State of|thin inode"
State of StripeGroup "test" at 0x18042A6A5B0, uid 7491A8C0:67D9B824, local id 1:
0: name 'system' Valid nDisks 32 nInUse 32 id 0 poolFlags 2 thin inode 41 nBlocks 519
1: name 'data' Valid nDisks 16 nInUse 16 id 65537 poolFlags 2 thin inode 42 nBlocks 526
2: name 'flash' Valid nDisks 8 nInUse 8 id 65538 poolFlags 2 thin inode 43 nBlocks 526
[root@c145f11san04b sju]# mmchpolicy test /home/sju/policy-default
...
[root@c145f11san04b sju]# mmchmgr test c145f11san04a
[root@c145f11san04b sju]# mmchpolicy test /home/sju/policy-default
[root@c145f11san04b sju]# mmfsadm dump stripe | grep "policy file inode"
policy file inode: 41 |
| Environment |
Linux/AIX |
| Trigger |
Run 'mmchpolicy' command multiple times. |
| Workaround |
None |
|
5.2.3.0 |
thin-provisioning |
| IJ54045 |
Medium Importance |
To help control the issues (refer to D.341360, D.343470, and D.343471) with file systems created on SSS6K and FCM4 drives, a new option, 'thininode', is added to the tsdbfs command. This option is used to reset the 'thin inode'.
(show details)
| Symptom |
With the command, the 'thin inode' is reset to the value given on the tsdbfs command line.
[root@c145f11san04b sju]# mmfsadm dump stripe | egrep "State of|thin inode"
State of StripeGroup "test" at 0x18042A6A5B0, uid 7491A8C0:67D9B824, local id 1:
0: name 'system' Valid nDisks 32 nInUse 32 id 0 poolFlags 2 thin inode 41 nBlocks 519
1: name 'data' Valid nDisks 16 nInUse 16 id 65537 poolFlags 2 thin inode 42 nBlocks 526
2: name 'flash' Valid nDisks 8 nInUse 8 id 65538 poolFlags 2 thin inode 43 nBlocks 526
[root@c145f11san04b sju]# tsdbfs test patch desc thininode 0 -1
[root@c145f11san04b sju]# tsdbfs test patch desc thininode 1 -1
[root@c145f11san04b sju]# tsdbfs test patch desc thininode 2 -1
[root@c145f11san04b sju]# mmfsadm dump stripe | egrep "State of|thin inode"
State of StripeGroup "test" at 0x18042A6A5B0, uid 7491A8C0:67D9B824, local id 1:
0: name 'system' Valid nDisks 32 nInUse 32 id 0 poolFlags 2 thin inode -1 nBlocks 519
1: name 'data' Valid nDisks 16 nInUse 16 id 65537 poolFlags 2 thin inode -1 nBlocks 526
2: name 'flash' Valid nDisks 8 nInUse 8 id 65538 poolFlags 2 thin inode -1 nBlocks 526 |
| Environment |
Linux/AIX |
| Trigger |
Refer to D.343470 and D.343471 for the issues/symptoms and how to trigger them. |
| Workaround |
None |
|
5.2.3.0 |
thin-provisioning |
| IJ54079 |
High Importance
|
An application using the SMB server may invoke the gpfs_stat_x() call (available in libgpfs.so) to retrieve stat information for a file. This call implements "statlite" semantics, meaning that the size information is not assured to be the latest. Other applications, which invoke the standard stat()/fstat() calls, do expect the size information to be up to date. However, due to a problem in the logic, after gpfs_stat_x() is invoked, information is cached inside the kernel, and the cache is not purged even when other nodes change the file size (for example, by appending data to it). The result is that stat() invoked on the node may still retrieve out-of-date file size information as other nodes write to the file.
(show details)
| Symptom |
Unexpected Results/Behavior |
| Environment |
ALL Operating System environments |
| Trigger |
SMB applications invoking gpfs_stat_x() cause wrong file size information to be retrieved by stat()/fstat() invoked by other applications. |
| Workaround |
None |
|
5.2.3.0 |
All Scale Users |
| IJ54328 |
Critical |
Incorrect snapshot data (either stale or uninitialized) may be read while the mmchdisk start command is being executed on file systems with replication enabled.
(show details)
| Symptom |
Data corruption, snapshot data read may not be as expected. |
| Environment |
All platforms |
| Trigger |
The issue may happen if some data replicas are stale or uninitialized, and snapshot data is accessed while running the mmchdisk start command to repair the bad replicas. |
| Workaround |
Avoid accessing snapshot data while running the mmchdisk start command. |
|
5.2.3.0 |
GPFS core |