| IJ57030 |
Suggested |
The mmlslicense command still displays the discontinued FPO license designation even though no nodes are designated for the FPO license. The mmchlicense and mmaddnode commands still allow a node to be designated as FPO license.
| Symptom |
Admin Commands |
| Environment |
All |
| Trigger |
Execute mmlslicense command. |
| Workaround |
None |
|
6.0.0.2 |
Admin |
| IJ55681 |
HIPER |
Kernel crash in gpfsGetWinBasicInfo may happen in a build with the fix to the "stat deadlock with token transfer" issue.
| Symptom |
Kernel crash on accessing a bad pointer, and with both gpfsGetWinBasicInfo and BaseMutexClass::release in the backtrace. |
| Environment |
ALL Operating System environments |
| Trigger |
Executing stat or an equivalent operation on the .snapshots or .fsetsnapshots directories, or performing a stat under certain error conditions such as a file system panic. |
| Workaround |
N/A |
|
6.0.0.2 |
All Scale Users |
| IJ57183 |
High Importance
|
The system rebooted unexpectedly while a driver was being removed during maintenance due to a timing issue in the driver cleanup process. No data loss occurred.
| Symptom |
Upgrade/Install failure |
| Environment |
Linux Only |
| Trigger |
While unloading pemsmod, an interrupt handler ran and tried to read a register using ioread32() after the device / MMIO region was already freed or unmapped. This is a rare timing condition and does not affect normal system operation. |
| Workaround |
Do not manually unload pemsmod. |
|
6.0.0.2 |
ESS/GNR |
| IJ57024 |
High Importance
|
When using secure connections (cipherList=AES{128|256}-SHA256), the GPFS daemon may send data while the mutex lock protecting access to the secure connection is not held by the sending thread, resulting in the daemon asserting with the following message, in mmfs.log:
"[X] logAssertFailed: sconnP != __null"
| Symptom |
GPFS daemon asserting. |
| Environment |
ALL Operating System environments |
| Trigger |
The problem may occur when secure connections are restarted as a result of spontaneous communication errors. |
| Workaround |
None |
|
6.0.0.2 |
All Scale Users |
| IJ57182 |
High Importance
|
Failed to download data and metadata of subdirectories in an MU fileset after evicting the uploaded directory.
| Symptom |
Unexpected Results |
| Environment |
Linux Only |
| Trigger |
1. Create a directory inside an MU mode fileset and add some files with data to the directory.
2. Run a reconcile on the MU fileset.
3. Run eviction on the MU fileset.
4. Run download on the newly created directory. |
| Workaround |
Reset the create bit on the directory using the command:
mmafmctl fsName resetattr -j fsetName |
|
6.0.0.2 |
AFM |
| IJ57184 |
Suggested |
Incorrect statistics for BytesToWrite, BytesToRead, and Used Q-Memory may wrongly suggest that writes are stuck. Writes are not stuck; only the statistics are incorrect.
| Symptom |
Unexpected Stats interpretation |
| Environment |
All OS environments |
| Trigger |
Reset of afm statistics during ongoing IOs. |
| Workaround |
Reset the afm statistics |
|
6.0.0.2 |
AFM |
| IJ57227 |
High Importance
|
Outband download fails to download files from non-gateway nodes or from nodes other than the node where the command is executed.
| Symptom |
COS objects/files download failure |
| Environment |
All OS environments |
| Trigger |
Outband download from a node other than the node where the command is executed. |
| Workaround |
Copy the object list to the same node that is specified with the -N argument. |
|
6.0.0.2 |
AFM |
| IJ52571 |
Medium Importance |
When the mmcrfs command is invoked without the -B (block size) option, and the block size is specified in the pool stanza but not for all data pools, the command creates the missing data pool descriptor using the default 4M data block size. If any existing pool stanza specifies a block size other than the default 4M, the file system will be created with inconsistent data block sizes. This can lead to logAssertFailed: rdwr.C line 1069: checkRangesValue >= 0
| Symptom |
logAssertFailed. |
| Environment |
All |
| Trigger |
Creating a file system with multiple data storage pools that have different block sizes. |
| Workaround |
Add the missing pool stanza to the stanza file or use the -B option when creating the file system. |
|
6.0.0.2 |
File System |
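A stanza file can be checked for this condition before mmcrfs is run. The following is a minimal sketch, not part of the product: it assumes pool stanzas of the form `%pool: pool=<name> blockSize=<size> usage=<usage>`, treats a pool with no `usage` field as holding data, and takes the list of pools referenced elsewhere in the stanza file (`pools_in_use`, a hypothetical parameter) from the caller. Pools without a block size are counted with the 4M default described above.

```python
import re

def parse_size(text):
    """Convert a block size such as '256K', '4M' or '1048576' to bytes."""
    match = re.fullmatch(r"(\d+)([KkMm]?)", text)
    if match is None:
        raise ValueError(f"unrecognized block size: {text!r}")
    number, suffix = int(match.group(1)), match.group(2).upper()
    return number * {"": 1, "K": 1024, "M": 1024 * 1024}[suffix]

def data_pool_block_sizes(stanza_text, pools_in_use=(), default="4M"):
    """Return {pool name: block size in bytes} for pools that hold data.

    Pools without a blockSize field, and pools named in pools_in_use that
    have no %pool stanza at all, are reported with the default size.
    """
    sizes = {}
    metadata_only = set()
    for line in stanza_text.splitlines():
        line = line.strip()
        if not line.startswith("%pool:"):
            continue
        fields = dict(item.split("=", 1) for item in line[len("%pool:"):].split())
        # Assumption: a pool stanza without 'usage' is treated as holding data.
        if fields.get("usage", "dataAndMetadata") == "metadataOnly":
            metadata_only.add(fields["pool"])
            continue
        sizes[fields["pool"]] = parse_size(fields.get("blockSize", default))
    for name in pools_in_use:
        if name not in metadata_only:
            sizes.setdefault(name, parse_size(default))
    return sizes

def is_consistent(stanza_text, pools_in_use=()):
    """True when every data pool ends up with the same block size."""
    return len(set(data_pool_block_sizes(stanza_text, pools_in_use).values())) <= 1

example = """
%pool: pool=system blockSize=1M usage=metadataOnly
%pool: pool=data1 blockSize=2M usage=dataOnly
"""
# data2 is used by the file system but has no pool stanza of its own.
print(data_pool_block_sizes(example, pools_in_use=["data1", "data2"]))
print(is_consistent(example, pools_in_use=["data1", "data2"]))
```

If the check reports an inconsistency, either workaround listed above applies: add the missing pool stanza or pass -B to mmcrfs.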
| IJ57031 |
Suggested |
The health status of a fan in a JBOF extension is not shown.
| Symptom |
If in a JBOF extension the coin battery has a problem e.g. low voltage or missing, no event is raised. |
| Environment |
All |
| Trigger |
Health state of JBOF - fans shown in mmlsenclosure are permanently suppressed by mmsysmon code. |
| Workaround |
None |
|
6.0.0.2 |
Health monitoring |
| IJ57233 |
HIPER |
If a user with execute permission on a directory performs an operation requiring execute permission (‘cd’ or ‘ls -d’) on that directory, it will then incorrectly grant execute permission to all users for an unspecified amount of time.
| Symptom |
Unexpected Results/Behavior |
| Environment |
ALL Operating System environments |
| Trigger |
A directory with any number of extended entries needs to have all extended entries grant execute permission. Then, a user with execute permission on this directory needs to perform an operation requiring execute permission (e.g. 'cd' on the directory or 'ls -d' on a subdirectory). After this, any user will be able to perform an operation on the directory or subdirectory requiring execute permission on said directory for a period of time. |
| Workaround |
Add any extended entry WITHOUT execute permission to the ACL. |
|
6.0.0.2 |
All Scale Users |
| IJ57341 |
Suggested |
The IPv4 ToS value for RDMA network traffic can be controlled via the configuration parameter verbsRdmaRoCEToS. Due to an incorrect implementation, the configured value does not affect the RDMA network traffic.
| Symptom |
Unexpected Results/Behavior |
| Environment |
ALL Linux OS environments |
| Trigger |
Setting the verbsRdmaRoCEToS configuration parameter to control the RDMA ToS value. |
| Workaround |
On Mellanox RDMA HCAs the cma_roce_tos tool can be used instead. This tool is part of the mlnx-tools package. |
|
6.0.0.2 |
RDMA |
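When choosing a value for verbsRdmaRoCEToS, it can help to recall how the IPv4 ToS byte is laid out: the upper six bits carry the DSCP and the lower two bits carry ECN. The sketch below only illustrates that arithmetic; the DSCP value 26 is an arbitrary example, not a recommendation from this fix list.

```python
def tos_from_dscp(dscp, ecn=0):
    """Build an IPv4 ToS byte from a 6-bit DSCP and a 2-bit ECN field."""
    if not 0 <= dscp <= 63:
        raise ValueError("DSCP must be in the range 0..63")
    if not 0 <= ecn <= 3:
        raise ValueError("ECN must be in the range 0..3")
    return (dscp << 2) | ecn

def dscp_from_tos(tos):
    """Extract the DSCP from an IPv4 ToS byte."""
    if not 0 <= tos <= 255:
        raise ValueError("ToS must be in the range 0..255")
    return tos >> 2

print(tos_from_dscp(26))   # prints 104
print(dscp_from_tos(104))  # prints 26
```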
| IJ57342 |
High Importance
|
Read performance is degraded for files larger than afmParallelReadThreshold.
| Symptom |
Slow performance |
| Environment |
All OS environments |
| Trigger |
Any read operation on a file larger than afmParallelReadThreshold. |
| Workaround |
Disable parallel I/O. |
|
6.0.0.2 |
AFM |
| IJ56142 |
High Importance
|
With workloads that heavily look up or traverse symlinks, contention can occur inside GPFS. The problem is that every symlink lookup request from an application results in the symlink target being queried from the file system, resulting in possible contention on internal locks.
| Symptom |
Performance Impact/Degradation |
| Environment |
ALL Linux OS environments |
| Trigger |
The problem is caused by heavily concurrent lookups of the same symlink by many threads. |
| Workaround |
There is no workaround. |
|
6.0.0.2 |
All Scale Users |
| IJ56891 |
High Importance
|
Block allocation can get stuck in a loop trying to allocate a block on remote nodes. This can happen when disks are running out of free space.
| Symptom |
Hang/Deadlock/Unresponsiveness/Long Waiters |
| Environment |
All Scale Users |
| Trigger |
A disk is very close to 100% full. |
| Workaround |
Avoid running restripe and suspend disks which are close to 100% full |
|
6.0.0.2 |
All Scale Users |
| IJ56682 |
High Importance
|
GPFS skipped rediscovering active NSDs that had a local device name change.
| Symptom |
Unexpected Results/Behavior |
| Environment |
ALL Operating System environments |
| Trigger |
The underlying device name change is the trigger of this problem. |
| Workaround |
Use mmchnsd with the local device name change but do not specify the server list; this prompts removal of the server disk and a reopen. |
|
6.0.0.2 |
All Scale Users |
| IJ57357 |
High Importance
|
When using AFM to Cloud Object Storage (COS), the mmafmtransfer daemon remains running as long as the fileset is active, even when there are no pending transfer requests. This can result in the transfer process staying active unnecessarily instead of terminating after the transfer queues become empty. This is not applicable for SQS enabled RO filesets.
| Symptom |
The mmafmtransfer daemon continues to run even when there are no pending AFM transfer requests, leading to unnecessary background activity. |
| Environment |
All OS environments |
| Trigger |
AFM to Cloud Object Storage (COS) is configured and fileset is active. |
| Workaround |
Stop the AFM to COS filesets when not in use except for SQS enabled RO filesets. |
|
6.0.0.2 |
AFM |
| IJ56487 |
Medium Importance |
Changes to the perfmon configuration are not applied on nodes that were down at the time the change was made.
| Symptom |
Perfmon configuration is not updated. |
| Environment |
Linux Only |
| Trigger |
Perfmon configuration is not updated on nodes that were down. |
| Workaround |
Reissue the mmperfmon command to update the configuration once all nodes are up, or run mmcommon run invokePerfmonctl update on the perfmon nodes that were down. |
|
6.0.0.1 |
Perfmon |
| IJ56488 |
Suggested |
A hang can occur when three operations hit the same file at once:
a process touches a shared, writable mmap mapping and faults a page,
another thread/process performs mremap (needing the mmap write semaphore),
and a concurrent write()/pwrite() targets the same region.
Under certain timing, the page-fault path must fetch a file lock from the daemon, while the writer is also fetching a conflicting lock. The result is a lock/semaphore cycle between the page-fault handler, the writer, and mremap, and I/O to that file can stall indefinitely.
| Symptom |
Threads hang in file operations; GPFS traces show the mmap page-fault path waiting on a fetched lock, a writer stuck on the mmap semaphore after initiating a daemon fetch, and mremap waiting for the semaphore upgrade. No progress until GPFS services are restarted.
Fix description (high level):
Extend the existing mmap uXfer (“borrowed lock”) fast-path into the daemon fetch path. When the kernel’s lock attempt requires a fetch, the daemon can—under safe conditions—temporarily “borrow” a read lock for the page-fault request and signal the kernel to proceed, breaking the cycle while preserving correctness. (Normal lock/token ownership is finalized once the fetch completes; error paths are handled so the kernel falls back safely if borrowing isn't possible.)
|
| Environment |
ALL OS environments |
| Trigger |
File is mmap'd MAP_SHARED|PROT_WRITE (or read-only with faults against the same region) while a concurrent write()/pwrite() targets the same range.
A mremap occurs concurrently, contending on the mmap semaphore.
Lock acquisition in the kernel returns E_NEED_FETCH and both the page-fault path and the writer rely on the daemon to fetch/upgrade the inode lock; specific timing can create a cyclic wait.
|
| Workaround |
None practical. (Avoiding concurrent mmap access and mremap/writes to the same region prevents the issue but is often not feasible.) |
|
6.0.0.1 |
All Scale Users |
| IJ56619 |
Critical |
When running AIO, the thread submitting the I/O request is not the same as the one completing the I/O request. There is a race condition where an AIO request that completes quickly is still accessed by the submitting thread. This results in either a kernel KFENCE warning or a node crash.
| Symptom |
Abend/Crash |
| Environment |
ALL Linux OS environments |
| Trigger |
Run AIO in a way that the requests are completed very quickly. This is workload dependent and might be hard to recreate. |
| Workaround |
There is no workaround; the fix is required to avoid this problem. |
|
6.0.0.1 |
All Scale Users |
|
Suggested |
When applications simultaneously use a writable, shared memory map (mmap) and perform regular write()/pwrite() operations to the same file that is subject to snapshot Copy-on-Write (COW), the file system can hit a three-way deadlock. The cycle involves:
• a page-faulting mmap reader that triggers COW into a previous snapshot,
• a concurrent VMA change (e.g., mremap/munmap) that requires the kernel's mmap write semaphore,
• and a regular write path that holds the inode write lock and then page-faults on its user buffer (which also needs the mmap semaphore).
Once formed, the cycle blocks progress on the affected file and can ultimately lead to automatic deadlock breakup (filesystem panic/unmount) depending on configuration.
| Symptom |
• Application threads or system threads hang on file I/O to the affected file.
• Trace/logs show CopyDataOnWriteHandlerThread waiting on inode rf, a writer holding wa and blocked in a page fault, and a VMA operation holding or waiting on the mmap write semaphore.
• With deadlock breakup enabled, Scale may log multi-phase “deadlock breakup” and unmount/panic the impacted filesystem. |
| Environment |
All supported OS environments. |
| Trigger |
This issue affects customers that:
• Use writable, shared mmap on files that may require snapshot COW, and
• Perform regular write()/pwrite() to the same file, and
• Occasionally execute VMA-altering operations such as mremap/munmap on the mapping.
A deadlock can occur when:
• An mmap page fault (“PF reader”) triggers CopyDataOnWrite for a prior snapshot and needs the inode rf lock.
• A concurrent writer holds the inode wa lock, then page-faults on its user buffer and must acquire the mmap semaphore.
• A concurrent mremap/munmap seeks the mmap semaphore as writer, blocking page-fault progress.
This forms a cycle (PF reader ↔ writer ↔ mremap) that stalls I/O on the file. |
| Workaround |
• Avoid concurrent VMA changes (mremap/munmap) while a file is actively accessed via writable shared mmap and regular writes on a snapshot-eligible file.
• Where feasible, separate write bursts from mmap page-fault activity on the same file, or map readers MAP_PRIVATE if application semantics allow.
(These are operational mitigations only; they do not fully prevent the issue.) |
|
6.0.0.1 |
GPFS/Scale: mmap, snapshot Copy-on-Write, locking |
| IJ56620 |
Critical |
When mmfsck detects a hole in a reserved file, it fills the hole by allocating a new disk address and adding that address to the file’s indirect block. It also updates its internal block allocation bitmap to mark the new block as in-use.
However, the internal block allocation bitmap is distributed across the scanning nodes. If the newly allocated block falls outside the region of the bitmap owned by the node that performed the allocation, the node may skip updating the bitmap. As a result, the block remains unmarked in the bitmap. This later leads mmfsck to falsely identify the block as lost. In repair mode, it then incorrectly marks the block as free. Later, when the file system is in use, it may reallocate this block to another file, resulting in duplicate block corruption.
| Symptom |
Operation failure due to FS corruption and SGPanic |
| Environment |
ALL Operating System environments |
| Trigger |
This issue can happen when mmfsck detects and repairs holes in reserved files. |
| Workaround |
Run mmfsck in repair mode (-y) again after the first repair run. |
|
6.0.0.1 |
FSCK |
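The failure mode described for IJ56620 can be pictured with a toy model. The sketch below is an illustration only and makes several assumptions: the class and function names are hypothetical, each scanning node owns one contiguous slice of the bitmap, a "lost" block is modeled as one that is allocated on disk but not marked in any slice, and forwarding the update to the owning node is just one plausible way to keep the bitmap complete (the fix list does not describe the actual change).

```python
class ScanNode:
    """One scanning node that owns a contiguous slice of the in-use bitmap."""

    def __init__(self, first_block, last_block):
        self.first_block = first_block
        self.last_block = last_block
        self.in_use = set()

    def owns(self, block):
        return self.first_block <= block <= self.last_block


def mark_in_use(nodes, allocating_node, block, forward_to_owner):
    """Record a newly allocated block; return True if some slice was updated."""
    if allocating_node.owns(block):
        allocating_node.in_use.add(block)
        return True
    if forward_to_owner:
        for node in nodes:
            if node.owns(block):
                node.in_use.add(block)
                return True
    return False  # the update is skipped and the block stays unmarked


def lost_blocks(nodes, allocated_on_disk):
    """Blocks allocated on disk that no node has marked as in use."""
    marked = set().union(*(node.in_use for node in nodes))
    return sorted(set(allocated_on_disk) - marked)


def demo(forward_to_owner):
    nodes = [ScanNode(0, 99), ScanNode(100, 199)]
    # Node 0 fills a hole with block 150, which lies in node 1's slice.
    mark_in_use(nodes, nodes[0], 150, forward_to_owner)
    return lost_blocks(nodes, allocated_on_disk=[150])


print(demo(forward_to_owner=False))  # block 150 is wrongly reported as lost
print(demo(forward_to_owner=True))   # no lost blocks
```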
| IJ56253 |
High Importance
|
If a filesystem has quotas enabled and a file is unlinked (its last directory entry removed) before a chown is performed, the chown call will fail with ENOENT, even though the file descriptor remains open and valid.
| Symptom |
Unexpected Results/Behavior |
| Environment |
All Operating System environments |
| Trigger |
- Enable quota (file system or fileset level)
- Create a file, unlink it while the file descriptor is still valid.
- Set ownership for this file descriptor.
- Close the file descriptor. |
| Workaround |
Disable quotas |
|
6.0.0.1 |
Quotas |
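The trigger steps for IJ56253 can be exercised from a small program. This is a minimal sketch for Unix-like systems: the function name is hypothetical, the target directory is whatever the caller passes in (to reproduce the problem it would have to be on a quota-enabled file system), and ownership is set to the caller's own effective IDs so that no special privileges are needed.

```python
import os
import tempfile

def chown_after_unlink(directory):
    """Create a file, unlink it, then change ownership through the open descriptor.

    Returns the link count seen through the descriptor after the unlink.
    An OSError (for example ENOENT) propagates to the caller.
    """
    fd, path = tempfile.mkstemp(dir=directory)
    try:
        os.unlink(path)  # last directory entry removed; the descriptor stays valid
        os.fchown(fd, os.geteuid(), os.getegid())  # expected to succeed
        return os.fstat(fd).st_nlink
    finally:
        os.close(fd)

print(chown_after_unlink(tempfile.gettempdir()))
```

On an affected system, the `os.fchown` call is where the ENOENT error would surface.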
| IJ56680 |
Suggested |
mmbackup verifies directory size as one of the triggers to select objects to be sent to IBM Storage Protect Server. Since directory size is recalculated during restore, there is no need to back up a directory again if only its size differs. Hence, mmbackup will not verify size during the backup candidate selection process if the object is a directory.
| Symptom |
mmbackup may select unchanged directories as backup candidates |
| Environment |
All OS environments that support mmbackup |
| Trigger |
Run a live file system backup and then run a snapshot backup. |
| Workaround |
None |
|
6.0.0.1 |
mmbackup |
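The selection rule described for IJ56680 can be sketched as follows. This is an illustration under stated assumptions, not mmbackup's actual logic: the record type and field names are hypothetical, and modification time stands in for whatever other attributes mmbackup compares.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ObjectRecord:
    path: str
    is_directory: bool
    size: int
    mtime: int

def needs_backup(previous, current):
    """Decide whether an object is a backup candidate."""
    if previous is None:
        return True  # never backed up before
    if current.mtime != previous.mtime:
        return True
    if current.is_directory:
        return False  # directory size is recalculated on restore, so it is ignored
    return current.size != previous.size

old_dir = ObjectRecord("/fs/projects", True, 4096, 1000)
new_dir = ObjectRecord("/fs/projects", True, 8192, 1000)
old_file = ObjectRecord("/fs/projects/a.dat", False, 10, 1000)
new_file = ObjectRecord("/fs/projects/a.dat", False, 20, 1000)

print(needs_backup(old_dir, new_dir))    # prints False
print(needs_backup(old_file, new_file))  # prints True
```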
| IJ56142 |
High Importance
|
With workloads that heavily look up or traverse symlinks, contention can occur inside GPFS. The problem is that every symlink lookup request from an application results in the symlink target being queried from the file system, resulting in possible contention on internal locks.
| Symptom |
Performance Impact/Degradation |
| Environment |
ALL Linux OS environments |
| Trigger |
The problem is caused by heavily concurrent lookups of the same symlink by many threads. |
| Workaround |
There is no workaround. |
|
6.0.0.1 |
All Scale Users |
| IJ52020 |
High Importance
|
Background sync could be blocked while reducing the allocation region. This could cause other operations, such as creating or deleting a snapshot, to be blocked.
| Symptom |
Performance Impact/Degradation |
| Environment |
ALL Operating System environments |
| Trigger |
Running applications on a client that require new disk space to be allocated. |
| Workaround |
None |
|
6.0.0.1 |
All Scale Users |
| IJ56690 |
High Importance
|
There is a small window of opportunity for the assert to go off.
| Symptom |
Abend/Crash |
| Environment |
All platforms |
| Trigger |
DIO workload on a file system of 6.0.0.0+ |
| Workaround |
Disable the assert |
|
6.0.0.1 |
UStore |
| IJ56736 |
Suggested |
Starting in 5.2.3.0 gpfs.base required openssl libraries, specifically for mmfsd. Although the binary required the libraries as some symbols were defined, they were unused. This was introduced with the release of the IBM Storage Scale native REST API feature. All communication from scaleadmd to mmfsd is done over a local Unix Domain Socket and ssl is not in use.
| Symptom |
Installs package that is required, but unused |
| Environment |
Linux Only |
| Trigger |
Install gpfs.base |
| Workaround |
None |
|
6.0.0.1 |
Linux Only |
| IJ56734 |
High Importance
|
When reading from snapshot files, applications may encounter unexpected non-zero data in blocks that were never written to in the original (root) filesystem. These blocks were part of a pre-allocated file but remained uninitialized, and therefore should logically contain zeros. The error occurs because the snapshot exposes raw, uninitialized disk contents—garbage data—at these locations. This issue is specific to snapshots and does not occur when reading from the root filesystem, where such blocks are correctly interpreted as zero.
| Symptom |
Unexpected Results/Behavior |
| Environment |
ALL Operating System environments |
| Trigger |
If both the snapshot and the root filesystem contain a block that was pre-allocated but never written to, reading from the snapshot may return uninitialized data ("garbage") instead of zeroes. |
| Workaround |
None |
|
6.0.0.1 |
Snapshots |
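One way to check whether a file is affected by IJ56734 is to scan the snapshot copy of a pre-allocated, never-written file for non-zero bytes. The sketch below is a minimal example with hypothetical function names. It uses `os.posix_fallocate` (available on Unix-like systems) to create a pre-allocated file in a temporary directory; on a real system the scan would be pointed at the corresponding file under the snapshot directory instead.

```python
import os
import tempfile

def preallocate(path, length):
    """Create a file and reserve `length` bytes without writing any data."""
    fd = os.open(path, os.O_CREAT | os.O_WRONLY, 0o600)
    try:
        os.posix_fallocate(fd, 0, length)
    finally:
        os.close(fd)

def first_nonzero_offset(path, chunk_size=1 << 20):
    """Return the offset of the first non-zero byte in `path`, or None if all zero."""
    offset = 0
    with open(path, "rb") as handle:
        while True:
            chunk = handle.read(chunk_size)
            if not chunk:
                return None
            stripped = chunk.lstrip(b"\x00")
            if stripped:
                return offset + len(chunk) - len(stripped)
            offset += len(chunk)

with tempfile.TemporaryDirectory() as workdir:
    sample = os.path.join(workdir, "preallocated.dat")
    preallocate(sample, 1 << 20)
    print(first_nonzero_offset(sample))  # prints None when the file reads as zeros
```

A result other than None for a block range that was never written indicates that uninitialized data is being returned.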
| IJ56765 |
Suggested |
When an object is created in, copied out of with the '-p' flag, or moved into a GPFS file system on AIX, the “extended entries” flag is always turned on. This will incorrectly show an ACL as always having extended entries, regardless of its contents. This will cause 'ls -e' to display a “+” representing the existence of extended entries once the object is moved out of the GPFS file system, and for 'aclget' to always display that extended entries are enabled.
| Symptom |
Unexpected Results/Behavior |
| Environment |
AIX only |
| Trigger |
Create an object in a GPFS file system, move an object into a GPFS file system, or copy an object out of a GPFS file system with the '-p' flag. The problem can be observed by running 'aclget' on that object, or moving the object out of the GPFS file system and running 'ls -e' |
| Workaround |
Use 'aclget' to correctly identify existence of extended entries, ignoring “+” in 'ls -e' |
|
6.0.0.1 |
All Scale Users |
| IJ56564 |
Suggested |
The node ID in the component listing can show a blank value. All node IDs are expected to show a non-zero integer value.
| Symptom |
A missing node ID when displaying component information is the only symptom. |
| Environment |
Linux Only |
| Trigger |
Issuing a discover component command via the GUI is the only known way to induce this error. |
| Workaround |
The mmchcomp command-line utility can be used to set a blank node ID to a specified value. |
|
6.0.0.1 |
GUI, ESS/GNR |
| IJ56781 |
Suggested |
Setting stat-poll-interval and stat-slot-time to zero does not restore the automatic adjustment of QoS statistics.
| Symptom |
The mmqos command behavior is not consistent with what is documented in the man page. |
| Environment |
All |
| Trigger |
Setting stat-poll-interval and stat-slot-time to zero. |
| Workaround |
Use a null string instead of a zero value. |
|
6.0.0.1 |
QoS |
| IJ56797 |
Suggested |
File system creation fails when creating a file system with a file system version at PTF level. For example, issuing the command "scalectl filesystem create -n fs0 -d disk1 --version 5.2.3.4", the creation will error out with the following error: "filesystem creation failed: rpc error: code = InvalidArgument desc = specified file system version is outside the supported range 5.2.3.0-5.2.3.0 for the native REST API"
| Symptom |
Error output/message |
| Environment |
All Linux |
| Trigger |
File system creation with a PTF version as an argument. |
| Workaround |
Specify "--version 5.2.3.0" instead of a PTF version or the default. |
|
6.0.0.1 |
Native REST API |
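The error message for IJ56797 reports a supported range of 5.2.3.0-5.2.3.0, so any PTF-level version falls outside it. The sketch below illustrates how a strict four-part comparison produces that result. It is an assumption about the kind of check involved, not the actual scalectl code, and the function names are hypothetical.

```python
def parse_version(text):
    """Turn a dotted version such as '5.2.3.4' into a tuple of integers."""
    return tuple(int(part) for part in text.split("."))

def in_supported_range(requested, lowest, highest):
    """True when lowest <= requested <= highest, compared part by part."""
    return parse_version(lowest) <= parse_version(requested) <= parse_version(highest)

print(in_supported_range("5.2.3.4", "5.2.3.0", "5.2.3.0"))  # prints False
print(in_supported_range("5.2.3.0", "5.2.3.0", "5.2.3.0"))  # prints True
```

This matches the listed workaround: specifying 5.2.3.0 passes the range check, while 5.2.3.4 does not.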
| IJ56798 |
High Importance
|
Outband download fails when an export map is used as the target and gateway node IPs are used as part of the mapping.
| Symptom |
Unexpected Results |
| Environment |
Linux Only |
| Trigger |
Create a mapping with the GW's IP and use this export map as the target for a fileset. Outband download on such a fileset fails. |
| Workaround |
Use the GW hostname instead of the IP in the mapping. |
|
6.0.0.1 |
AFM |
| IJ55722 |
High Importance
|
mmaddpdisk --replace fails with error 905 due to stale block device information in the PDMaster object.
| Symptom |
Error output/message |
| Environment |
Linux Only |
| Trigger |
While the device is in replace state, wipe the drive and then try to run mmaddpdisk --replace. |
| Workaround |
Fail over the root LG to a different node, after which the command can be run. This is not a proper solution, however. |
|
6.0.0.1 |
GNR |
| IJ56679 |
High Importance
|
mmfsd hits signal 11 on readSGDesc
| Symptom |
Abend/Crash |
| Environment |
All platforms |
| Trigger |
After the file system manager is restarted to break up a deadlock during mmadddisk, some nodes might hit signal 11. The problem is that the file system manager processes a more recent SG descriptor read from disk before the data structures associated with the new SG descriptor are populated. |
| Workaround |
There is no workaround; Scale will crash and restart on its own. |
|
6.0.0.1 |
All Scale Users |